Policy-based Approach

Actor/Policy
Action = $\pi(\text{Observation})$
input: observation
output: action
The policy is learned through the reward.

Neural network as Actor

Input of the actor (NN): the observation, such as an image or a vector.
Output of the actor (NN): each action corresponds to one neuron in the output layer.
In general, the policy is stochastic (李宏毅-DRL-S2).
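
Below is a minimal sketch of such an actor network, assuming a vector observation and a discrete action set; the class name, layer sizes, and dimensions are illustrative, not taken from the lecture.

```python
# A toy actor: maps an observation vector to a stochastic policy over
# discrete actions (one output neuron per action, softmax via Categorical).
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.net(obs)                       # one logit per action
        return torch.distributions.Categorical(logits=logits)

actor = Actor(obs_dim=4, n_actions=2)                # illustrative sizes
obs = torch.randn(4)                                 # a toy observation
action = actor(obs).sample()                         # a ~ pi_theta(a | s)
```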

Goodness of Actor

Given an actor $\pi_{\theta}(s)$ with NN parameters $\theta$.

  1. Use the actor $\pi_{\theta}(s)$ to play the game for one episode; the total reward obtained in this episode is $R_{\theta} = \sum_{t=1}^{T} r_{t}$.

Start with observation $s_{1}$
Actor: $\pi_{\theta}(s_{1}) = a_{1}$
Obtain reward $r_{1}$
Get observation $s_{2}$
Actor: $\pi_{\theta}(s_{2}) = a_{2}$
Obtain reward $r_{2}$

$\cdots$

Get observation $s_{t}$
Actor: $\pi_{\theta}(s_{t}) = a_{t}$
Obtain reward $r_{t}$

Note that even with the same actor, $R_{\theta}$ differs from episode to episode, because both the game and the actor's policy are stochastic. Therefore, what we ultimately optimize is the expected reward $\bar{R}_{\theta}$.

Thus, $\bar{R}_{\theta}$ can be used to evaluate the goodness of an actor $\pi_{\theta}(s)$.

  1. An episode is treated as a trajectory $\tau$:
     $\tau = \left\{s_{1}, a_{1}, r_{1}, s_{2}, a_{2}, r_{2}, \cdots, s_{T}, a_{T}, r_{T}\right\}$
     $R(\tau) = \sum_{t=1}^{T} r_{t}$
  2. Each trajectory $\tau$ has a probability that depends on the actor's parameters $\theta$, denoted $P(\tau \mid \theta)$.
  3. The expected reward can be formulated as follows:
     $\bar{R}_{\theta} = \sum_{\tau} R(\tau) P(\tau \mid \theta)$
     Note that the space of trajectories $\tau$ is enormous and cannot be enumerated. Instead, we sample N episodes, obtaining N trajectories $\Gamma = \left\{ \tau^{1}, \tau^{2}, \ldots, \tau^{N} \right\}$, and approximate the expected reward over this finite set $\Gamma$:
     $\bar{R}_{\theta} = \sum_{\tau} R(\tau) P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R\left(\tau^{n}\right)$
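
A minimal sketch of this Monte Carlo estimate, assuming a Gymnasium-style environment (`reset`/`step`) and the `Actor` sketched earlier; both names are illustrative placeholders.

```python
# Estimate R_bar_theta ~= (1/N) * sum_n R(tau^n) by sampling N episodes
# with the current actor. Assumes a Gymnasium-style env API.
import torch

def estimate_expected_reward(env, actor, N: int = 20) -> float:
    returns = []
    for _ in range(N):
        obs, _ = env.reset()
        done, episode_return = False, 0.0
        while not done:
            dist = actor(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample().item()            # a_t ~ pi_theta(. | s_t)
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_return += reward                 # R(tau^n) = sum_t r_t
        returns.append(episode_return)
    return sum(returns) / N                          # (1/N) * sum_n R(tau^n)
```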

Gradient Ascent

Problem statement:
$\theta^{*} = \arg\max_{\theta} \bar{R}_{\theta}$

Steps:

Start with $\theta^{0}$
$\theta^{1} \leftarrow \theta^{0} + \eta \nabla \bar{R}_{\theta^{0}}$
$\theta^{2} \leftarrow \theta^{1} + \eta \nabla \bar{R}_{\theta^{1}}$
$\cdots$

where $\theta = \left\{w_{1}, w_{2}, \cdots, b_{1}, \cdots\right\}$ and
$$\nabla \bar{R}_{\theta} = \left[\begin{array}{c} \partial \bar{R}_{\theta} / \partial w_{1} \\ \partial \bar{R}_{\theta} / \partial w_{2} \\ \vdots \\ \partial \bar{R}_{\theta} / \partial b_{1} \\ \vdots \end{array}\right]$$
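
A minimal sketch of one such update step, assuming the `actor` from before. PyTorch optimizers minimize, so ascending $\bar{R}_{\theta}$ is done by descending its negative; `objective` is an assumed placeholder for any differentiable scalar estimate of $\bar{R}_{\theta}$ (one concrete choice is derived below).

```python
# One gradient-ascent step: theta <- theta + eta * grad R_bar_theta.
import torch

eta = 1e-2                                            # learning rate
optimizer = torch.optim.SGD(actor.parameters(), lr=eta)

def gradient_ascent_step(objective: torch.Tensor) -> None:
    optimizer.zero_grad()
    (-objective).backward()      # minimizing -R_bar == maximizing R_bar
    optimizer.step()             # applies the update to theta
```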

Considering $\bar{R}_{\theta} = \sum_{\tau} R(\tau) P(\tau \mid \theta)$, we can formulate $\nabla \bar{R}_{\theta}$ as follows:
$\nabla \bar{R}_{\theta} = \sum_{\tau} R(\tau) \nabla P(\tau \mid \theta)$
because $R(\tau)$ does not depend on the parameters $\theta$; that is, the reward of a trajectory has nothing to do with the current actor's parameters, so the gradient acts only on $P(\tau \mid \theta)$.
Then we apply a small trick to the equation above:
$\nabla \bar{R}_{\theta} = \sum_{\tau} R(\tau) P(\tau \mid \theta) \frac{\nabla P(\tau \mid \theta)}{P(\tau \mid \theta)} = \sum_{\tau} R(\tau) P(\tau \mid \theta) \nabla \log P(\tau \mid \theta)$
Using the same sample-based approximation as for the expected reward, $\nabla \bar{R}_{\theta}$ can be approximated as:
$\nabla \bar{R}_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n}) \nabla \log P(\tau^{n} \mid \theta)$

The problem now becomes how to compute $\nabla \log P(\tau^{n} \mid \theta)$.
For a given trajectory $\tau = \left\{s_{1}, a_{1}, r_{1}, s_{2}, a_{2}, r_{2}, \cdots, s_{T}, a_{T}, r_{T}\right\}$,
its probability $P(\tau \mid \theta)$ can be written as:
$$P(\tau \mid \theta) = p(s_{1}) p(a_{1} \mid s_{1}, \theta) p(s_{2} \mid s_{1}, a_{1}) p(a_{2} \mid s_{2}, \theta) \cdots p(a_{T} \mid s_{T}, \theta) p(s_{T+1} \mid s_{T}, a_{T}) = p\left(s_{1}\right) \prod_{t=1}^{T} p\left(a_{t} \mid s_{t}, \theta\right) p\left(s_{t+1} \mid s_{t}, a_{t}\right)$$
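
Since only the policy terms $p(a_{t} \mid s_{t}, \theta)$ depend on $\theta$ (the environment terms $p(s_{1})$ and $p(s_{t+1} \mid s_{t}, a_{t})$ do not), taking the log turns the product into a sum and gives $\nabla \log P(\tau \mid \theta) = \sum_{t=1}^{T} \nabla \log p(a_{t} \mid s_{t}, \theta)$. Below is a minimal sketch of the resulting estimator, assuming episodes were collected with the `Actor` above and stored as (observations, actions, total reward) tuples; the data layout and function name are illustrative.

```python
# Surrogate objective whose gradient is the sampled policy gradient:
# (1/N) * sum_n R(tau^n) * sum_t grad log p(a_t | s_t, theta).
import torch

def policy_gradient_objective(actor, episodes) -> torch.Tensor:
    terms = []
    for obs_seq, act_seq, R_tau in episodes:
        obs = torch.as_tensor(obs_seq, dtype=torch.float32)   # shape (T, obs_dim)
        acts = torch.as_tensor(act_seq)                       # shape (T,)
        log_probs = actor(obs).log_prob(acts)                 # log p(a_t | s_t, theta)
        terms.append(R_tau * log_probs.sum())                 # R(tau) * sum_t log p(...)
    return torch.stack(terms).mean()                          # average over the N episodes

# Usage with the earlier sketches:
#   gradient_ascent_step(policy_gradient_objective(actor, episodes))
```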
