Hung-yi Lee Deep Reinforcement Learning - Actor-Critic
Course video: https://www.bilibili.com/video/av24724071
Other notes in this series:
- Hung-yi Lee Deep Reinforcement Learning Notes (1): Outline
- Hung-yi Lee Deep Reinforcement Learning Notes (2): Proximal Policy Optimization (PPO)
- Hung-yi Lee Deep Reinforcement Learning Notes (3): Q-Learning
- Hung-yi Lee Deep Reinforcement Learning Notes (5): Sparse Reward
- Hung-yi Lee Deep Reinforcement Learning Notes (6): Imitation Learning
- Hung-yi Lee Deep Reinforcement Learning course slides
Asynchronous Advantage Actor-Critic (A3C)
Review – Policy Gradient
Key components of the policy gradient:
the probability p_θ(a_t | s_t) of taking an action given a state, the discounted cumulative reward collected after time t, a baseline b, and the discount factor γ.
In practice it is not feasible to sample enough trajectories to estimate the return G accurately (G is a random variable with large variance), so value estimation in the style of Q-learning is introduced.
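A minimal sketch of the per-step weight in the policy-gradient estimate, i.e. the discounted return from step t onward minus a baseline (the function names here are illustrative, not from the lecture):

```python
def discounted_returns(rewards, gamma):
    """Return G_t = sum over t' >= t of gamma^(t'-t) * r_t' for every step t."""
    returns = []
    g = 0.0
    for r in reversed(rewards):       # accumulate backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def pg_weights(rewards, gamma, baseline):
    """Weight multiplying grad log p(a_t | s_t) in the policy-gradient estimate."""
    return [g - baseline for g in discounted_returns(rewards, gamma)]
```

Each sampled G here is a single noisy draw, which is exactly why so many samples would be needed without a critic.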
Review – Q-Learning
State value function V^π(s):
when using actor π, the expected accumulated reward obtained after visiting state s.
State-action value function Q^π(s, a): when using actor π, the expected accumulated reward obtained after taking action a at state s.
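Written out in a standard form (assuming an episodic setting with discount factor γ; the exact notation on the slides may differ):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'} \,\Big|\, s_t = s\right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'} \,\Big|\, s_t = s,\; a_t = a\right]
```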
Actor-Critic
Use the state-action value function Q^π and the state value function V^π to replace the sampled accumulated reward and the baseline in the policy gradient. (This is the combination of policy gradient and Q-learning: the actor and the critic.)
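Concretely, the sampled return and baseline in the policy-gradient estimate are replaced by the critic's estimates (a standard formulation; notation assumed):

```latex
\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}
\bigl(Q^{\pi}(s_t^n, a_t^n) - V^{\pi}(s_t^n)\bigr)\,
\nabla \log p_\theta(a_t^n \mid s_t^n)
```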
However, in this case we need to estimate two networks, which compounds the estimation error and makes training less stable; hence Advantage Actor-Critic is introduced.
Advantage Actor-Critic
Estimating Q from the one-step reward plus V means only a single network (V) needs to be learned. The variance introduced by the single-step reward r is still much smaller than the variance of the original return G, and experiments show that this estimate of Q works best.
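The single-network trick, as it is usually written (a hedged reconstruction of the slide's equations, keeping the discount factor γ):

```latex
Q^{\pi}(s_t, a_t) = \mathbb{E}\bigl[r_t + \gamma\, V^{\pi}(s_{t+1})\bigr] \approx r_t + \gamma\, V^{\pi}(s_{t+1})
\;\;\Rightarrow\;\;
A(s_t, a_t) \approx r_t + \gamma\, V^{\pi}(s_{t+1}) - V^{\pi}(s_t)
```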
Algorithm:
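The loop on the slide (interact, estimate the advantage, update critic and actor) can be sketched for a single transition in a toy tabular setting; all names (`a2c_update`, `lr_v`, `lr_pi`, etc.) are illustrative, not from the lecture:

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution over actions."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def a2c_update(v, logits, s, a, r, s_next, done, gamma=0.99,
               lr_v=0.1, lr_pi=0.1):
    """One tabular A2C step: TD update for the critic V, and a
    policy-gradient step weighted by the advantage for the actor."""
    target = r + (0.0 if done else gamma * v[s_next])
    advantage = target - v[s]          # A = r + gamma * V(s') - V(s)
    v[s] += lr_v * advantage           # critic moves toward the TD target
    probs = softmax(logits[s])
    for i in range(len(logits[s])):    # actor: advantage * grad log pi(a|s)
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[s][i] += lr_pi * advantage * grad
    return advantage
```

With function approximation the same two updates become gradient steps on the critic and actor networks.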
Tips:
- The first few layers of the actor network and the critic network can be shared, since both take the same state s as input.
- Use the entropy of the actor's output distribution as a regularizer: larger entropy is preferred, which encourages exploration (similar to the exploration techniques introduced earlier).
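The entropy bonus itself is just the Shannon entropy of the actor's output distribution (a sketch; the weight attached to this term is a hyperparameter not specified in the notes):

```python
import math

def entropy(probs):
    """Shannon entropy of the policy's action distribution; adding it to the
    objective makes high-entropy (more exploratory) policies preferred."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A uniform distribution maximizes this term, while a near-deterministic policy drives it toward zero.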
Asynchronous Advantage Actor-Critic (A3C)
Efficiency: multiple workers run in parallel.
- Each worker copies the global parameters
- Each worker interacts with the environment and samples some data
- Each worker computes gradients
- Each worker updates the global model
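The four steps above can be sketched as follows; the round is run sequentially here for clarity, whereas in A3C each worker is a separate thread or process pushing updates asynchronously (all names are illustrative):

```python
def a3c_round(global_params, workers, compute_grad, lr=0.01):
    """One round of the A3C loop: every worker copies the global
    parameters, computes a gradient from its own samples, and
    pushes the update back into the global model."""
    for worker in workers:
        local = dict(global_params)          # 1. copy global parameters
        grad = compute_grad(local, worker)   # 2-3. interact, sample, compute grads
        for k, g in grad.items():            # 4. update the global model
            global_params[k] -= lr * g
    return global_params
```

Note that by the time a worker pushes its gradient, the global parameters may already have moved; A3C simply applies the stale gradient anyway.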
Pathwise Derivative Policy Gradient
Another Way to use Critic:
In the original actor-critic, the critic only tells the actor whether an action is good or not.
In pathwise derivative policy gradient, the critic not only evaluates how good an action is but also tells the actor which action should be taken: an actor is trained (as below) to output the action that maximizes the critic's Q value.
The network above is actually a combination of the two networks (actor and critic).
Tip: this is similar to a conditional GAN, where the actor plays the generator and Q plays the discriminator.
Algorithm:
- The action taken is now determined by the trained actor π rather than by an arg max over Q.
- The Q value is computed from s and the action a = π(s) output by the actor.
- Not only Q but also the actor π needs to be updated.
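The actor update in pathwise derivative policy gradient (as in deterministic policy gradient / DDPG; a standard form, not copied from the slides) pushes π's parameters in the direction that increases Q:

```latex
\theta^{\pi} \leftarrow \theta^{\pi} + \eta \,\nabla_{a} Q^{\pi}(s, a)\big|_{a=\pi(s)}\,\nabla_{\theta^{\pi}} \pi(s)
```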