Source: https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html
Abstract: In this post, we are going to look deep into policy gradient, why it works, and many new policy gradient algorithms proposed in recent years: vanilla policy gradient, actor-critic, off-policy actor-critic, A3C, A2C, DPG, DDPG, D4PG, MADDPG, TRPO, PPO, ACER, ACKTR, SAC, TD3 & SVPG.
What is Policy Gradient
Policy gradient is an approach to solve reinforcement learning problems. If you haven’t looked into the field of reinforcement learning, please first read the section “A (Long) Peek into Reinforcement Learning » Key Concepts” for the problem definition and key concepts.
Notations
Here is a list of notations to help you read through equations in the post easily.
| Symbol | Meaning |
|---|---|
| s∈S | States. |
| a∈A | Actions. |
| r∈R | Rewards. |
| S_t, A_t, R_t | State, action, and reward at time step t of one trajectory. I may occasionally use s_t, a_t, r_t as well. |
| γ | Discount factor; penalty on the uncertainty of future rewards; 0 < γ ≤ 1. |
| G_t | Return, or discounted future reward; G_t = ∑_{k=0}^∞ γ^k R_{t+k+1}. |
| P(s′, r \| s, a) | Transition probability of getting to the next state s′ from the current state s with action a and reward r. |
| π(a \| s) | Stochastic policy (agent behavior strategy); π_θ(.) is a policy parameterized by θ. |
| μ(s) | Deterministic policy; we can also label this as μ(s) = a, but using a different letter helps distinguish it from a stochastic policy. Either π or μ is what a reinforcement learning algorithm aims to learn. |
| V(s) | State-value function; it measures the expected return of state s; V_w(.) is a value function parameterized by w. |
| V^π(s) | The value of state s when we follow a policy π; V^π(s) = E_{a∼π}[G_t \| S_t = s]. |
| Q(s, a) | Action-value function; it is similar to V(s) but assesses the expected return of a state-action pair (s, a); Q_w(.) is an action-value function parameterized by w. |
| Q^π(s, a) | Similar to V^π(.), the value of a (state, action) pair when we follow a policy π; Q^π(s, a) = E_{a∼π}[G_t \| S_t = s, A_t = a]. |
| A(s, a) | Advantage function, A(s, a) = Q(s, a) − V(s); it can be considered another version of the Q-value with lower variance, using the state value as a baseline. |
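As a quick illustration of the return G_t defined above: for a finite-length trajectory, all returns can be computed with a single backward pass. A minimal sketch (the function name is my own):

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute G_t = sum_{k=0}^{inf} gamma^k * R_{t+k+1} for every
    time step t of a finite trajectory, via a backward pass."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example with rewards [1, 0, 2] and gamma = 0.5:
# G_2 = 2, G_1 = 0 + 0.5*2 = 1, G_0 = 1 + 0.5*1 = 1.5
print(discounted_returns([1.0, 0.0, 2.0], 0.5))
```

The backward recursion G_t = R_{t+1} + γ G_{t+1} avoids recomputing the tail sum at every step.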
Policy Gradient
The goal of reinforcement learning is to find an optimal behavior strategy for the agent to obtain optimal rewards. The policy gradient methods target at modeling and optimizing the policy directly. The policy is usually modeled with a parameterized function with respect to θ, π_θ(a|s). The value of the reward (objective) function depends on this policy, and then various algorithms can be applied to optimize θ for the best reward.
The reward function is defined as:

J(θ) = ∑_{s∈S} d^π(s) V^π(s) = ∑_{s∈S} d^π(s) ∑_{a∈A} π_θ(a|s) Q^π(s, a)

where d^π(s) is the stationary distribution of the Markov chain for π_θ (the on-policy state distribution under π).
It is natural to expect policy-based methods to be more useful in the continuous space, because there is an infinite number of actions and/or states to estimate values for, and hence value-based approaches are way too expensive computationally. For example, in generalized policy iteration, the policy improvement step arg max_{a∈A} Q^π(s, a) requires a full scan of the action space, suffering from the curse of dimensionality.
Using gradient ascent, we can move θ in the direction suggested by the gradient ∇_θ J(θ) to find the best θ for π_θ that produces the highest return.
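As a concrete sketch of this gradient-ascent loop, consider a toy softmax-policy bandit where J(θ) = ∑_a π_θ(a) r(a) can be differentiated exactly (the 3-armed setup and all numbers here are my own illustrative assumptions, not from the post):

```python
import numpy as np

# Toy problem: 3-armed bandit, softmax policy pi_theta(a) = softmax(theta)_a,
# so the objective J(theta) = sum_a pi_theta(a) * r(a) is known in closed form.
rewards = np.array([1.0, 2.0, 3.0])  # expected reward of each arm (assumed)
theta = np.zeros(3)                  # policy parameters
alpha = 0.5                          # step size

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(200):
    probs = softmax(theta)
    J = probs @ rewards
    # Exact gradient: d pi_a / d theta_j = pi_a * (delta_aj - pi_j), hence
    # grad_j J = sum_a r(a) * dpi_a/dtheta_j = pi_j * (r(j) - J).
    grad_J = probs * (rewards - J)
    theta += alpha * grad_J  # gradient *ascent* on J(theta)

print(np.round(softmax(theta), 3))  # probability mass concentrates on the best arm
```

In real RL problems the exact gradient is unavailable, which is why the sampled estimators discussed in the rest of the post are needed; this sketch only shows the ascent direction idea.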
Policy Gradient Theorem
Computing the gradient ∇_θ J(θ) is tricky because it depends on both the action selection (directly determined by π_θ) and the stationary distribution of states following the target selection behavior (indirectly determined by π_θ). Given that the environment is generally unknown, it is difficult to estimate the effect on the state distribution by a policy update.
Luckily, the policy gradient theorem comes to save the world! Woohoo! It provides a nice reformulation of the derivative of the objective function that does not involve the derivative of the state distribution d^π(.), and thus simplifies the computation of the gradient ∇_θ J(θ) a lot:

∇_θ J(θ) = ∇_θ ∑_{s∈S} d^π(s) ∑_{a∈A} Q^π(s, a) π_θ(a|s)
         ∝ ∑_{s∈S} d^π(s) ∑_{a∈A} Q^π(s, a) ∇_θ π_θ(a|s)
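An identity that is central to making this gradient estimable from samples (used later in the post) is the log-derivative trick, ∇_θ π_θ(a|s) = π_θ(a|s) ∇_θ log π_θ(a|s), which turns a sum over all actions into an expectation under π. A small numeric check for a softmax policy over three actions in a single state (all numbers are illustrative assumptions):

```python
import numpy as np

# Check: sum_a Q(a) * grad pi(a)  ==  E_{a~pi}[ Q(a) * grad log pi(a) ]
theta = np.array([0.2, -0.5, 1.0])
Q = np.array([1.0, 3.0, -2.0])  # arbitrary action values (assumed)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi = softmax(theta)
# Softmax Jacobian: dpi_a/dtheta_j = pi_a * (delta_aj - pi_j)
jac = np.diag(pi) - np.outer(pi, pi)
lhs = Q @ jac                        # sum over actions of Q(a) * grad pi(a)
score = np.eye(3) - pi               # rows are grad log pi(a) = e_a - pi
rhs = pi @ (Q[:, None] * score)      # expectation of Q(a) * grad log pi(a)
print(np.allclose(lhs, rhs))  # True
```

The two expressions agree term by term because π(a) ∇ log π(a) = ∇π(a), which is exactly what lets Monte Carlo samples of actions stand in for the full sum over the action space.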
Proof of Policy Gradient Theorem
This section is pretty dense, as it is the time for us to go through the proof (Sutton & Barto, 2017; Sec. 13.1) and figure out why the policy gradient theorem is correct.
We first start with the derivative of the state value function: