Policy-based Approach
Actor/Policy
$\text{Action} = \pi(\text{Observation})$
input: observation
output: action
The policy is learned through the reward.
Neural network as Actor
input of the actor (NN): the observation, such as an image or a vector
output of the actor (NN): each action corresponds to one neuron in the output layer. In general, the policy is stochastic.
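As a concrete illustration, here is a minimal sketch of such an actor, assuming PyTorch; the observation size, number of actions, and hidden width are made-up examples, not taken from these notes. Each action gets one output neuron, and a softmax over those neurons gives a stochastic policy that actions are sampled from.

```python
# A minimal sketch of an actor network, assuming PyTorch. The observation size,
# number of actions, and hidden width are illustrative, not taken from the notes.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim: int = 4, n_actions: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one output neuron per action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Softmax turns the scores into a distribution over actions,
        # so the policy is stochastic: actions are sampled, not fixed.
        return torch.softmax(self.net(obs), dim=-1)

actor = Actor()
obs = torch.randn(1, 4)                              # a toy observation vector
probs = actor(obs)                                   # pi_theta(a | s)
action = torch.distributions.Categorical(probs).sample()
```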
Goodness of Actor
Given an actor $\pi_{\theta}(s)$ with NN parameters $\theta$.
- Use the actor $\pi_{\theta}(s)$ to play the game for one episode; this yields the total reward of the episode, $R_{\theta} = \sum_{t=1}^{T} r_{t}$.
Start with observation $s_{1}$
Actor: $\pi_{\theta}(s_{1}) = a_{1}$
Obtain reward $r_{1}$
Get observation $s_{2}$
Actor: $\pi_{\theta}(s_{2}) = a_{2}$
Obtain reward $r_{2}$
…
Get observation $s_{t}$
Actor: $\pi_{\theta}(s_{t}) = a_{t}$
Obtain reward $r_{t}$
Note that even with the same actor, $R_{\theta}$ differs from episode to episode, because both the game and the actor's policy are stochastic. What we ultimately optimize is therefore the expected reward $\bar{R}_{\theta}$.
Thus $\bar{R}_{\theta}$ evaluates the goodness of an actor $\pi_{\theta}(s)$.
- An episode is considered as a trajectory $\tau = \left\{s_{1}, a_{1}, r_{1}, s_{2}, a_{2}, r_{2}, \cdots, s_{T}, a_{T}, r_{T}\right\}$, with $R(\tau) = \sum_{t=1}^{T} r_{t}$.
- Each trajectory $\tau$ has a probability that depends on the actor's parameters $\theta$, denoted $P(\tau \mid \theta)$.
- The expected reward can be formulated as follows:
$$\bar{R}_{\theta} = \sum_{\tau} R(\tau) P(\tau \mid \theta)$$
Note that the space of trajectories $\tau$ is enormous and hard to enumerate exhaustively, so we sample $N$ episodes, obtaining $N$ trajectories denoted $\Gamma = \left\{ \tau^{1}, \tau^{2}, ..., \tau^{N} \right\}$, and approximate the expected reward over this finite trajectory set $\Gamma$.
$$\bar{R}_{\theta} = \sum_{\tau} R(\tau) P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R\left(\tau^{n}\right)$$
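A minimal sketch of this sampling-based approximation: play $N$ episodes and average their total rewards. The toy dynamics, reward, and `random_policy` below are illustrative stand-ins (not from the notes) for a real game and a real actor $\pi_{\theta}$.

```python
# A minimal sketch of the Monte Carlo estimate: play N episodes and average
# the total rewards. The toy dynamics, reward, and random_policy below are
# purely illustrative stand-ins for a real game and a real actor pi_theta.
import random

def run_episode(policy, horizon: int = 10) -> float:
    """Play one episode and return its total reward R(tau) = sum_t r_t."""
    s = 0.0                          # toy initial observation s_1
    total_reward = 0.0
    for _ in range(horizon):
        a = policy(s)                # a_t sampled from the policy at s_t
        r = 1.0 if a == 1 else 0.0   # toy reward: action 1 pays off
        total_reward += r
        s = random.random()          # toy stochastic next observation s_{t+1}
    return total_reward

def random_policy(s: float) -> int:
    return random.choice([0, 1])     # stand-in for sampling a ~ pi_theta(a | s)

N = 1000                             # number of sampled episodes / trajectories
R_bar = sum(run_episode(random_policy) for _ in range(N)) / N
print(f"Monte Carlo estimate of the expected reward over {N} episodes: {R_bar:.3f}")
```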
Gradient Ascent
Problem statement:
$$\theta^{*} = \arg\max_{\theta} \bar{R}_{\theta}$$
Steps:
Start with $\theta^{0}$
$\theta^{1} \leftarrow \theta^{0} + \eta \nabla \bar{R}_{\theta^{0}}$
$\theta^{2} \leftarrow \theta^{1} + \eta \nabla \bar{R}_{\theta^{1}}$
…
where $\theta = \left\{w_{1}, w_{2}, \cdots, b_{1}, \cdots\right\}$ and
$$\nabla \bar{R}_{\theta} = \left[\begin{array}{c} \partial \bar{R}_{\theta} / \partial w_{1} \\ \partial \bar{R}_{\theta} / \partial w_{2} \\ \vdots \\ \partial \bar{R}_{\theta} / \partial b_{1} \\ \vdots \end{array}\right]$$
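A minimal numpy sketch of these update steps, using a simple stand-in objective whose gradient is known in closed form; in policy gradient, `gradient(theta)` would be replaced by the estimate of $\nabla \bar{R}_{\theta}$ derived below.

```python
# A minimal numpy sketch of the gradient-ascent steps above. The quadratic
# stand-in objective is illustrative only; in policy gradient, gradient(theta)
# would be replaced by the estimate of nabla R_bar_theta derived below.
import numpy as np

def objective(theta: np.ndarray) -> float:
    return float(-np.sum((theta - 3.0) ** 2))   # maximized at theta = [3, 3]

def gradient(theta: np.ndarray) -> np.ndarray:
    return -2.0 * (theta - 3.0)                 # analytic gradient of the stand-in

theta = np.zeros(2)                             # start with theta^0
eta = 0.1                                       # learning rate
for _ in range(100):
    theta = theta + eta * gradient(theta)       # theta^{i+1} <- theta^i + eta * grad
print(theta, objective(theta))                  # theta approaches [3, 3]
```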
Considering $\bar{R}_{\theta} = \sum_{\tau} R(\tau) P(\tau \mid \theta)$, we can formulate $\nabla \bar{R}_{\theta}$ as follows:
$$\nabla \bar{R}_{\theta} = \sum_{\tau} R(\tau) \nabla P(\tau \mid \theta)$$
because $R(\tau)$ is independent of the parameters $\theta$: the reward of a given trajectory does not depend on the current actor's parameters, so it can be moved outside the gradient.
Then we apply a small trick to the above equation:
$$\nabla \bar{R}_{\theta} = \sum_{\tau} R(\tau) P(\tau \mid \theta) \frac{\nabla P(\tau \mid \theta)}{P(\tau \mid \theta)} = \sum_{\tau} R(\tau) P(\tau \mid \theta) \nabla \log P(\tau \mid \theta)$$
Meanwhile, using the approximation introduced for the expected reward above, we can approximate $\nabla \bar{R}_{\theta}$ as:
$$\nabla \bar{R}_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n}) \nabla \log P(\tau^{n} \mid \theta)$$
The problem now becomes how to compute $\nabla \log P(\tau^{n} \mid \theta)$.
For a given trajectory $\tau = \left\{s_{1}, a_{1}, r_{1}, s_{2}, a_{2}, r_{2}, \cdots, s_{T}, a_{T}, r_{T}\right\}$, we can write its probability $P(\tau \mid \theta)$ as:
$$\begin{aligned} P(\tau \mid \theta) &= p(s_{1})\, p(a_{1} \mid s_{1}, \theta)\, p(s_{2} \mid s_{1}, a_{1})\, p(a_{2} \mid s_{2}, \theta) \cdots p(a_{T} \mid s_{T}, \theta)\, p(s_{T+1} \mid s_{T}, a_{T}) \\ &= p(s_{1}) \prod_{t=1}^{T} p(a_{t} \mid s_{t}, \theta)\, p(s_{t+1} \mid s_{t}, a_{t}) \end{aligned}$$
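Since $p(s_{1})$ and the transition terms $p(s_{t+1} \mid s_{t}, a_{t})$ do not depend on $\theta$, only the actor terms contribute to the gradient: $\nabla \log P(\tau \mid \theta) = \sum_{t=1}^{T} \nabla \log p(a_{t} \mid s_{t}, \theta)$. Below is a minimal PyTorch sketch of the resulting update for a single trajectory; the network sizes, the fake episode data, and the placeholder reward are illustrative only, not part of the notes.

```python
# A minimal sketch of the resulting policy-gradient update. Since p(s_1) and the
# transitions p(s_{t+1} | s_t, a_t) do not depend on theta, only the actor terms
# survive: grad log P(tau | theta) = sum_t grad log p(a_t | s_t, theta).
# The network sizes, fake episode, and placeholder reward are illustrative only.
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(actor.parameters(), lr=1e-2)

# One fake episode: T observations, the actions taken, and the total reward R(tau).
states = torch.randn(10, 4)
dist = torch.distributions.Categorical(logits=actor(states))
actions = dist.sample()
R_tau = 1.0   # placeholder for sum_t r_t from the environment

# Surrogate objective: R(tau) * sum_t log p(a_t | s_t, theta).
# Descending its negative follows the estimated gradient direction (ascent on R_bar).
log_prob_tau = dist.log_prob(actions).sum()
loss = -R_tau * log_prob_tau
optimizer.zero_grad()
loss.backward()
optimizer.step()
```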