YeeKal

note6_model_based

Model-based reinforcement learning has two steps:

  • learn a model from experience
  • plan a value function (and/or policy) from the model

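A model $M=\langle P_\eta, R_\eta\rangle$, parameterised by $\eta$, represents the state transitions $P_\eta\approx P$ and rewards $R_\eta\approx R$. Typically we assume conditional independence between state transitions and rewards:

$$P[S_{t+1},R_{t+1}\mid S_t,A_t]=P[S_{t+1}\mid S_t,A_t]\,P[R_{t+1}\mid S_t,A_t]$$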

Training process

Take some complete episodes of experience and treat every transition as a training example $s,a\rightarrow r,s'$. Estimating the model then becomes a supervised learning problem.

  • learning $s,a\rightarrow r$ is a regression problem
  • learning $s,a\rightarrow s'$ is a density estimation problem
  • choose a loss function (e.g. mean-squared error for rewards, KL divergence for transitions) and update $\eta$ to minimise it
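As a rough illustration of the supervised view above, here is a minimal sketch of a table-lookup model: the reward estimate is the empirical mean (a regression by averaging) and the transition estimate is the empirical next-state frequency (a count-based density estimate). The class name `TableLookupModel`, its method names, and the toy episodes are all hypothetical and not taken from the notes.

```python
import random
from collections import defaultdict


class TableLookupModel:
    """Count-based model of P(s' | s, a) and E[r | s, a] (illustrative sketch)."""

    def __init__(self):
        self.visits = defaultdict(int)        # N(s, a)
        self.reward_sum = defaultdict(float)  # sum of rewards seen after (s, a)
        self.next_counts = defaultdict(dict)  # (s, a) -> {s': count}

    def update(self, s, a, r, s_next):
        """Add one observed transition (s, a, r, s') to the counts."""
        self.visits[(s, a)] += 1
        self.reward_sum[(s, a)] += r
        counts = self.next_counts[(s, a)]
        counts[s_next] = counts.get(s_next, 0) + 1

    def expected_reward(self, s, a):
        """Empirical mean reward; assumes (s, a) has been visited."""
        return self.reward_sum[(s, a)] / self.visits[(s, a)]

    def transition_prob(self, s, a, s_next):
        """Empirical probability of s' given (s, a); assumes (s, a) visited."""
        return self.next_counts[(s, a)].get(s_next, 0) / self.visits[(s, a)]

    def sample(self, s, a, rng=random):
        """Draw a simulated (reward, next state) pair from the learned model."""
        states = list(self.next_counts[(s, a)])
        weights = [self.next_counts[(s, a)][x] for x in states]
        s_next = rng.choices(states, weights=weights)[0]
        return self.expected_reward(s, a), s_next


# Toy data (made up for this example): one action "go", terminal state "T".
model = TableLookupModel()
model.update("A", "go", 0.0, "B")
for reward in [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]:
    model.update("B", "go", reward, "T")

print(model.transition_prob("A", "go", "B"))  # fraction of A-transitions that reached B
print(model.expected_reward("B", "go"))       # mean reward observed after leaving B
```

With a parametric model the counts would be replaced by a function approximator trained on the same $(s,a)\rightarrow r$ and $(s,a)\rightarrow s'$ pairs.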

Dyna

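Dyna integrates learning and planning: it learns a model from real experience, then learns and plans the value function (and/or policy) from both real and simulated experience.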

dyna_framework.png

Dyna-Q algorithm:

dyna_q.png
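Since the algorithm is only given as a figure, the following is a minimal tabular sketch of the usual Dyna-Q loop: act in the real environment, apply a Q-learning update, record the transition in a model, then replay a few simulated transitions from that model. It assumes a deterministic environment, so the model stores only the last observed outcome per state-action pair. The corridor environment, the function names, and the hyperparameters are hypothetical choices for illustration.

```python
import random
from collections import defaultdict


def corridor_step(s, a):
    """Hypothetical 5-state corridor: states 0..4, reward 1 on reaching state 4."""
    s_next = min(max(s + a, 0), 4)
    done = s_next == 4
    return (1.0 if done else 0.0), s_next, done


def dyna_q(env_step, start_state, actions, episodes=50, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(s, a)], initialised to 0
    model = {}              # (s, a) -> (r, s', done); deterministic model

    def greedy(s):
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])  # random tie-break

    def q_update(s, a, r, s_next, done):
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            # Epsilon-greedy action in the real environment.
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s_next, done = env_step(s, a)
            q_update(s, a, r, s_next, done)     # direct RL from real experience
            model[(s, a)] = (r, s_next, done)   # model learning
            for _ in range(n_planning):         # planning from simulated experience
                (ps, pa), (pr, pn, pd) = rng.choice(list(model.items()))
                q_update(ps, pa, pr, pn, pd)
            s = s_next
    return Q


Q = dyna_q(corridor_step, start_state=0, actions=[-1, 1])
print(Q[(3, 1)], Q[(3, -1)])  # moving right from state 3 should be valued higher
```

Setting `n_planning=0` reduces this sketch to plain Q-learning, which makes it easy to compare how many real steps each variant needs.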

Dyna-2 algorithm:

dyna2.png

AlphaGo Zero