Hung-yi Lee's Deep Reinforcement Learning course: https://www.bilibili.com/video/av24724071

Hung-yi Lee Deep Reinforcement Learning Notes (1): Outline
Hung-yi Lee Deep Reinforcement Learning Notes (2): Proximal Policy Optimization (PPO)
Hung-yi Lee Deep Reinforcement Learning Notes (3): Q-Learning
Hung-yi Lee Deep Reinforcement Learning Notes (5): Sparse Reward
Hung-yi Lee Deep Reinforcement Learning Notes (6): Imitation Learning
Hung-yi Lee Deep Reinforcement Learning course slides

Asynchronous Advantage Actor-Critic (A3C)

Review – Policy Gradient

[Complete] Hung-yi Lee Deep Reinforcement Learning Notes (4): Actor-Critic
Key ingredients of the policy gradient:
the probability of an action given a state, the accumulated reward after time t, a baseline, and a discount rate.
In practice we cannot sample enough trajectories for the return G to be a low-variance estimate, which is why Q-learning-style value estimation is introduced.
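Putting these four ingredients together gives the policy-gradient estimate from the earlier note (a standard form; the exact notation may differ slightly from the slides):

```latex
\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n}
\left( \underbrace{\sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^{n}}_{\text{accumulated reward } G_t^n} - \; b \right)
\nabla \log p_\theta(a_t^n \mid s_t^n)
```

Here $b$ is the baseline and $\gamma$ the discount rate; $G_t^n$ is the sampled quantity whose variance motivates the rest of this note.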

Review – Q-Learning

State value function V^π(s):
the expected accumulated reward obtained after visiting state s, when actor π is used from then on.
State-action value function Q^π(s, a): the expected accumulated reward obtained after taking action a in state s, when actor π is used from then on.
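In symbols (standard definitions, with G_t denoting the discounted accumulated reward from time t):

```latex
V^\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid s_t = s \,\right], \qquad
Q^\pi(s, a) = \mathbb{E}_\pi\left[\, G_t \mid s_t = s,\ a_t = a \,\right],
\quad\text{where } G_t = \sum_{t' \ge t} \gamma^{t'-t} r_{t'}.
```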

Actor-Critic

Use the state-action value function Q in place of the sampled accumulated reward, and the state value function V in place of the baseline, in the policy gradient. (This combines policy gradient with Q-learning: an actor plus a critic.)
However, this requires estimating two networks, which compounds the estimation error and makes training less stable; hence the advantage actor-critic.

Advantage Actor-Critic
Only one network, V^π, needs to be estimated: since Q^π(s_t, a_t) = E[r_t + V^π(s_{t+1})], we can drop the expectation and use Q^π(s_t, a_t) ≈ r_t + V^π(s_{t+1}). The weight on each log-probability term then becomes the advantage r_t + V^π(s_{t+1}) − V^π(s_t).
The variance introduced by the single random reward r_t here is still much smaller than the variance of the original sampled return G, and per the lecture, experiments show this estimator of Q works best.
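The one-step advantage estimate above can be sketched in plain Python (the helper name and list-based interface are my own, for illustration):

```python
def one_step_advantages(rewards, values, gamma=0.99):
    """Compute A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    rewards: list of r_t for t = 0..T-1
    values:  list of V(s_t) for t = 0..T (one extra entry for the final
             state; use 0.0 there if the episode terminated)
    """
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]
```

Note the single extra value entry for the final state: only V (the critic) is needed, never a separate Q network.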

Algorithm:
Tips:

  1. The first few layers of the actor π(s)'s network and the critic V^π(s)'s network can share parameters, since both take the same state s as input.
  2. Use the entropy of π(s)'s output distribution as a regularizer: larger entropy is preferred, which encourages exploration (similar to the exploration techniques introduced earlier).
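Both tips can be sketched with a toy numpy network (the layer sizes and weight initialization here are arbitrary choices for illustration, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 3

# Tip 1: a shared trunk -- both heads read the same hidden features of s.
W_shared = rng.normal(scale=0.1, size=(STATE_DIM, HIDDEN))
W_policy = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))  # actor head
W_value = rng.normal(scale=0.1, size=(HIDDEN,))             # critic head

def forward(s):
    h = np.tanh(s @ W_shared)              # shared layers
    logits = h @ W_policy
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                   # action distribution pi(a|s)
    value = float(h @ W_value)             # V(s)
    # Tip 2: entropy of the output distribution; adding it (weighted) to
    # the objective prefers higher-entropy, more exploratory policies.
    entropy = -np.sum(probs * np.log(probs + 1e-8))
    return probs, value, entropy

probs, value, entropy = forward(rng.normal(size=STATE_DIM))
```

In a real implementation the two heads are trained with different losses (policy gradient vs. value regression) but share the trunk's gradients.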

Asynchronous Advantage Actor-Critic (A3C)

Efficiency: multiple workers run in parallel.

  1. Each worker copies the global parameters.
  2. Each worker interacts with the environment and samples some data.
  3. Each worker computes gradients on its local copy.
  4. The workers update the global model with those gradients.
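The four steps above can be sketched with Python threads and a toy objective (this is my own minimal illustration, not the course's implementation; the "environment" is replaced by the known loss (θ − 3)², whose optimum is θ = 3):

```python
import threading

import numpy as np

global_theta = np.array([0.0])   # the global model (a single parameter)
lock = threading.Lock()
LR = 0.1

def worker(n_steps):
    for _ in range(n_steps):
        with lock:
            local = global_theta.copy()      # 1. copy the global parameters
        # 2./3. "interact with the environment" and compute the gradient
        # on the local copy (here the gradient of (theta - 3)^2)
        grad = 2.0 * (local[0] - 3.0)
        with lock:
            global_theta[0] -= LR * grad     # 4. update the global model;
            # the gradient may come from stale parameters -- that is the
            # "asynchronous" part of A3C

threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Despite stale gradients, the updates still converge here; in A3C the same tolerance for staleness is what lets many workers run without synchronizing every step.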

Pathwise Derivative Policy Gradient

Another Way to use Critic:
In the original actor-critic, the critic only tells the actor whether an action was good or bad.
In the pathwise derivative policy gradient, the critic not only judges the goodness of an action but also tells the actor which action to take (through the actor trained as described below).
The architecture chains two networks: the actor takes state s and outputs an action a, and that action, together with s, is fed into the Q network. The result can be treated as one combined network, in which the actor is trained to maximize Q's output while Q's parameters are held fixed.
Tip: this is similar to a conditional GAN, where the actor plays the role of the generator and Q plays the role of the discriminator.
Algorithm:

  1. The action taken is now determined by the trained actor π̂ (rather than by maximizing over Q directly).
  2. The Q value is computed from s and the action chosen by π̂.
  3. Not only Q but also the actor π must be updated.
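The core of the actor update, the gradient flowing through the critic, can be shown with a toy differentiable setup (entirely my own illustration: the critic Q(s, a) = −(a − 2s)² is assumed known and fixed, and the actor a = w·s is trained by gradient ascent on Q via the chain rule dQ/dw = dQ/da · da/dw):

```python
import numpy as np

rng = np.random.default_rng(0)

w = 0.0          # actor parameter; the optimum for this toy Q is w = 2
LR = 0.05

for _ in range(500):
    s = rng.uniform(-1.0, 1.0)       # sample a state
    a = w * s                        # actor picks the action
    dq_da = -2.0 * (a - 2.0 * s)     # critic's gradient w.r.t. the action
    da_dw = s                        # actor's gradient w.r.t. its parameter
    w += LR * dq_da * da_dw          # ascend Q: update the actor, Q stays fixed
```

This is the sense in which the critic "tells the actor which action to take": the actor follows the critic's gradient toward actions with higher Q, as in the combined actor/Q network described above.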

