note7_sim_based
YeeKal
Simulation Based Search
forward search
Build a search tree with the current state $s_t$ as the root, where each leaf node corresponds to the end of an episode, then select the action with the maximum estimated value:
1. simulate episodes of experience from now with the model
2. apply model-free RL to the simulated episodes
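A minimal sketch of this loop, assuming a hypothetical model with `model.sample(s, a) -> (next_state, reward, done)`, a fixed simulation policy `sim_policy(s) -> a`, and any model-free learner exposing `learner.update(episode)` and `learner.greedy_action(s)` (all names are illustrative, not from the source):

```python
def forward_search(model, learner, sim_policy, s_t, K=1000):
    """Simulation-based forward search: simulate with the model, learn model-free."""
    for _ in range(K):
        # 1. simulate an episode of experience from the current real state s_t
        episode, s, done = [], s_t, False
        while not done:
            a = sim_policy(s)
            s_next, r, done = model.sample(s, a)
            episode.append((s, a, r))
            s = s_next
        # 2. apply model-free RL (e.g. Monte-Carlo control) to the simulated episode
        learner.update(episode)
    # act at the root with the action of maximum estimated value
    return learner.greedy_action(s_t)
```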
MCTS
Monte-Carlo Tree Search
simple Monte-Carlo search
- given a model $M_v$ and a simulation policy $\pi$
- for each action $a \in A$:
    - simulate $K$ episodes from the current real state $s_t$
    - evaluate the action by its mean return (Monte-Carlo evaluation): $Q(s_t, a) = \frac{1}{K}\sum_{k=1}^{K} G_t^k$
- select the real action with maximum value: $a_t = \arg\max_{a \in A} Q(s_t, a)$
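A minimal sketch of simple Monte-Carlo search under the same assumed `model.sample(s, a) -> (next_state, reward, done)` interface; `sim_policy` is the fixed simulation policy $\pi$ and `K` the number of episodes simulated per action:

```python
def simple_mc_search(model, sim_policy, s_t, actions, K=100, gamma=1.0):
    def rollout(s, a):
        # simulate one episode starting with (s, a), following pi afterwards,
        # and return its discounted return
        g, discount, done = 0.0, 1.0, False
        while not done:
            s, r, done = model.sample(s, a)
            g += discount * r
            discount *= gamma
            if not done:
                a = sim_policy(s)
        return g

    # Q(s_t, a): mean return over K simulated episodes (Monte-Carlo evaluation)
    q = {a: sum(rollout(s_t, a) for _ in range(K)) / K for a in actions}
    # select the real action with maximum estimated value
    return max(q, key=q.get)
```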
Monte-Carlo tree search
- given a model $M_v$
- simulate $K$ episodes from the current state $s_t$ using the current simulation policy $\pi$
- build a search tree containing the visited states and actions
- evaluate actions $Q(s,a)$ by the mean return of episodes starting from $(s,a)$
- select the action with maximum value in the search tree: $a_t = \arg\max_{a} Q(s_t, a)$
Each simulation consists of two phases (in-tree and out-of-tree):
- tree policy (improves): pick actions to maximise $Q(S,A)$
- default policy (fixed): pick actions randomly
Repeat (each simulation):
- evaluate $Q(S,A)$
- improve the tree policy, e.g. by $\epsilon$-greedy$(Q)$
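A minimal MCTS sketch under the same assumed `model.sample(s, a) -> (next_state, reward, done)` interface: the in-tree phase follows the improving ($\epsilon$-greedy) tree policy, the out-of-tree phase follows a random default policy, one new state is added to the tree per simulation, and each simulated return is backed up to the in-tree $(s,a)$ pairs it passed through:

```python
import random
from collections import defaultdict

def mcts(model, s_t, actions, K=1000, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)   # Q(s, a): mean return of simulations through (s, a)
    N = defaultdict(int)     # visit counts N(s, a)
    tree = {s_t}             # states currently contained in the search tree

    for _ in range(K):
        s, done = s_t, False
        path, rewards, new_state = [], [], None
        while not done:
            if s in tree:
                # tree policy (improves): epsilon-greedy over the current Q(s, a)
                a = (random.choice(actions) if random.random() < epsilon
                     else max(actions, key=lambda a: Q[(s, a)]))
            else:
                # default policy (fixed): pick actions randomly
                a = random.choice(actions)
                if new_state is None:
                    new_state = s        # first state reached outside the tree
            s_next, r, done = model.sample(s, a)
            path.append((s, a))
            rewards.append(r)
            s = s_next
        if new_state is not None:
            tree.add(new_state)          # grow the tree by one node per simulation

        # Monte-Carlo backup: update Q towards the return for in-tree (s, a) pairs
        g = 0.0
        for (s, a), r in zip(reversed(path), reversed(rewards)):
            g = r + gamma * g
            if s in tree:
                N[(s, a)] += 1
                Q[(s, a)] += (g - Q[(s, a)]) / N[(s, a)]

    # select the real action with maximum value in the search tree
    return max(actions, key=lambda a: Q[(s_t, a)])
```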
Upper Confidence Bound (UCB): balance exploration and exploitation by adding to $Q(s,a)$ an exploration bonus that shrinks as the action is tried more often.
UCT: Upper Confidence Bound for Trees, UCT = MCTS + UCB: the tree policy selects in-tree actions by UCB instead of $\epsilon$-greedy.
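A sketch of the UCT tree policy, reusing the `Q` and `N` dictionaries keyed by $(s, a)$ from the MCTS sketch above: inside the tree, actions are selected by the UCB1 score $Q(s,a) + c\sqrt{\ln N(s) / N(s,a)}$, where $N(s)$ and $N(s,a)$ are visit counts and $c$ is an exploration constant:

```python
import math

def uct_select(Q, N, s, actions, c=math.sqrt(2)):
    n_s = sum(N[(s, a)] for a in actions)        # N(s): total visits of state s
    def ucb(a):
        if N[(s, a)] == 0:
            return float("inf")                  # try every action at least once
        # exploitation term Q(s, a) plus exploration bonus c * sqrt(ln N(s) / N(s, a))
        return Q[(s, a)] + c * math.sqrt(math.log(n_s) / N[(s, a)])
    return max(actions, key=ucb)
```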