Hung-yi Lee's Deep Reinforcement Learning course: https://www.bilibili.com/video/av24724071

Hung-yi Lee Deep Reinforcement Learning Notes (1): Outline
Hung-yi Lee Deep Reinforcement Learning Notes (2): Proximal Policy Optimization (PPO)
Hung-yi Lee Deep Reinforcement Learning Notes (3): Q-Learning
Hung-yi Lee Deep Reinforcement Learning Notes (4): Actor-Critic
Hung-yi Lee Deep Reinforcement Learning Notes (5): Sparse Reward
Hung-yi Lee Deep Reinforcement Learning Notes (6): Imitation Learning
Hung-yi Lee Deep Reinforcement Learning slides

Policy Gradient

Terms and basic ideas

Basic Components:

  1. actor (the policy that policy gradient optimizes; the only component we can control)
  2. environment (cannot be controlled)
  3. reward function (cannot be controlled)

Policy of actor π:
A network with parameters θ: it takes an observation as input and outputs an action.

Episode:
One round of the game, from beginning to end

Objective of actor:
Maximize the expected total reward

Trajectory τ:
A sequence of states and actions, τ = {s_1, a_1, s_2, a_2, …, s_T, a_T}

Probability of τ given the network with parameter θ:

p_θ(τ) = p(s_1) ∏_{t=1}^{T} p_θ(a_t | s_t) p(s_{t+1} | s_t, a_t)
Given a trajectory τ, we can compute its total reward R(τ). Different actors obtain different rewards, and since the actions taken by the actor and the states returned by the environment are random variables, the objective is to find the actor with the largest expected reward:

R̄_θ = Σ_τ R(τ) p_θ(τ) = E_{τ∼p_θ(τ)}[R(τ)]

Policy Gradient

The gradient of the expected reward is ∇R̄_θ = E_{τ∼p_θ(τ)}[R(τ) ∇log p_θ(τ)]. Since this expectation cannot be computed exactly, sample N trajectories from the policy and use them to estimate the gradient:

∇R̄_θ ≈ (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T_n} R(τ^n) ∇log p_θ(a_t^n | s_t^n)
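The sampled-gradient estimate above can be sketched in a minimal one-step setting (a toy 3-armed bandit with fixed rewards; the bandit, the reward values, and the learning rate are all assumptions made purely for illustration, while the course uses full game episodes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step "episodes": 3 actions with fixed rewards (an illustrative bandit).
# theta are the softmax logits of the policy network's output layer.
rewards = np.array([1.0, 3.0, 2.0])
theta = np.zeros(3)

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_estimate(theta, n_samples):
    """(1/N) * sum_n R(tau^n) * grad log p_theta(a^n)."""
    pi = policy(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        a = rng.choice(3, p=pi)
        glog = -pi.copy()          # gradient of log-softmax:
        glog[a] += 1.0             # one-hot(a) - pi
        grad += rewards[a] * glog
    return grad / n_samples

# Gradient ascent on the expected reward.
for _ in range(200):
    theta += 0.1 * grad_estimate(theta, n_samples=200)

print(int(np.argmax(policy(theta))))  # the highest-reward action, index 1
```

After training, the policy concentrates on the action with the largest fixed reward, which is what maximizing R̄_θ means in this degenerate setting.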

Tip 1: Add a Baseline
In many tasks the reward is always positive, so every sampled action has its probability pushed up, and actions that happen not to be sampled lose probability only because the probabilities must sum to one. Subtracting a baseline b (e.g., the average reward) makes the weight positive or negative, so only better-than-average actions are reinforced:

∇R̄_θ ≈ (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T_n} (R(τ^n) − b) ∇log p_θ(a_t^n | s_t^n),  b ≈ E[R(τ)]
Tip 2: Assign Suitable Credit
The action taken at time t cannot influence the rewards received before t, so it should be weighted only by the sum of rewards from t onward. Moreover, the action at t is less responsible for rewards received long after it, so a discount factor γ < 1 is introduced; the resulting weight Σ_{t'=t}^{T_n} γ^{t'−t} r_{t'}^n − b is called the advantage A^θ(s_t, a_t).
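The two tips above can be combined in a small helper that computes the discounted reward-to-go for each step and then subtracts a baseline; the function name and the choice of the mean as baseline are my own for this sketch:

```python
import numpy as np

def returns_to_go(rewards, gamma=0.99):
    """Tip 2: the discounted sum of rewards from step t onward,
    so each action is credited only with rewards that come after it."""
    g = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        g[t] = running
    return g

r = [0.0, 0.0, 1.0]
g = returns_to_go(r, gamma=0.5)   # values: 0.25, 0.5, 1.0
# Tip 1: subtracting a baseline makes the weights signed.
advantages = g - g.mean()
```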

From on-policy to off-policy: using the experience more than once

Terms and basic ideas

On-policy: the agent that learns is the same agent that interacts with the environment.
Off-policy: the agent being trained and the agent interacting with the environment are different.

Why Off-policy:
When using π_θ to collect data, the training data has to be sampled again every time θ is updated, which is inefficient.
Goal: use the samples from π_{θ′} to train θ. Because θ′ is fixed, the sampled data can be reused many times.

Importance sampling:
When we only have samples from another distribution q, the expected value under p can be rewritten as follows, where the ratio p(x)/q(x) corrects f(x):

E_{x∼p}[f(x)] = E_{x∼q}[f(x) p(x)/q(x)]
The difference between the distributions p and q cannot be too large: although the two expectations above are equal, their variances differ,

Var_{x∼q}[f(x) p(x)/q(x)] = E_{x∼q}[(f(x) p(x)/q(x))²] − (E_{x∼p}[f(x)])²

so when p and q differ a lot, the importance weights blow up the variance and many more samples are needed for an accurate estimate.
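This behaviour is easy to check numerically; here is a sketch using normal distributions (the particular means and standard deviations are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2          # E_{x~p}[f(x)] = 1 for p = N(0, 1)

# q close to p: the reweighted estimate is accurate.
xs = rng.normal(0.5, 1.2, size=200_000)
w = normal_pdf(xs, 0.0, 1.0) / normal_pdf(xs, 0.5, 1.2)
print(np.mean(f(xs) * w))     # close to 1.0

# q far from p: same expectation in theory, but a few samples carry
# enormous weights p(x)/q(x), so the per-sample variance is far larger.
xs_far = rng.normal(3.0, 1.2, size=200_000)
w_far = normal_pdf(xs_far, 0.0, 1.0) / normal_pdf(xs_far, 3.0, 1.2)
print(np.var(f(xs_far) * w_far) > np.var(f(xs) * w))  # True
```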
Using importance sampling, the gradient can be estimated with data collected by π_{θ′}:

∇R̄_θ = E_{(s_t,a_t)∼π_{θ′}}[ (p_θ(a_t|s_t) / p_{θ′}(a_t|s_t)) A^{θ′}(s_t, a_t) ∇log p_θ(a_t|s_t) ]
Tips:

  1. The advantage function (reward minus baseline) should be computed from the data collected by the sampling policy θ′.
  2. The probability of a state is assumed to be similar under the two policies, so the ratio p_θ(s_t)/p_{θ′}(s_t) is cancelled out.
  3. The stopping criterion depends on how different the two distributions have become.

Add Constraint: (θ cannot be very different from θ′)

Tip: it is a constraint on the behavior (the action distributions), not on the parameters themselves.

PPO / TRPO:
TRPO differs from PPO in that it imposes the KL divergence as a hard constraint, which makes the optimization difficult to solve; PPO instead puts the KL term into the objective as a penalty, so PPO is used more often in practice.

J_PPO(θ) = J^{θ′}(θ) − β KL(θ, θ′),  where J^{θ′}(θ) = E_{(s_t,a_t)∼π_{θ′}}[ (p_θ(a_t|s_t)/p_{θ′}(a_t|s_t)) A^{θ′}(s_t, a_t) ]

TRPO: maximize J^{θ′}(θ) subject to KL(θ, θ′) < δ

PPO algorithm

Initial policy parameters θ⁰.
In each iteration k:

  1. Use θ^k to interact with the environment, collecting {(s_t, a_t)} and computing the advantages A^{θ^k}(s_t, a_t).
  2. Update θ several times by optimizing J_PPO(θ) = J^{θ^k}(θ) − β KL(θ, θ^k).
  3. When KL(θ, θ^k) is too large, increase β to strengthen the penalty; when it is too small, decrease β to weaken the penalty.
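The adaptive β update in the last step can be sketched as follows (doubling and halving are one common choice; the threshold names kl_min and kl_max are hyperparameters I introduce here for illustration):

```python
def update_beta(beta, kl, kl_min, kl_max):
    """Adaptive KL penalty: strengthen the penalty when the new policy
    drifts too far from theta^k, relax it when it barely moves."""
    if kl > kl_max:
        return beta * 2.0    # KL too large -> larger penalty
    if kl < kl_min:
        return beta / 2.0    # KL too small -> smaller penalty
    return beta

beta = 1.0
beta = update_beta(beta, kl=0.5, kl_min=0.01, kl_max=0.1)  # doubles to 2.0
```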

PPO2 algorithm:
PPO2 replaces the KL penalty with clipping:

J_PPO2(θ) ≈ Σ_{(s_t,a_t)} min( r_t(θ) A^{θ^k}(s_t, a_t), clip(r_t(θ), 1−ε, 1+ε) A^{θ^k}(s_t, a_t) ),  r_t(θ) = p_θ(a_t|s_t) / p_{θ^k}(a_t|s_t)

Tips:

  1. The clip function means the ratio in the first term is bounded between 1−ε and 1+ε (the blue line in the lecture figure).
  2. Taking the min of the two terms gives the whole objective (the red line in the lecture figure).
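The clipped objective itself is only a few lines of NumPy; this sketch just evaluates it for given ratios and advantages (real training code would differentiate it with an autograd framework instead):

```python
import numpy as np

def ppo2_objective(ratio, adv, eps=0.2):
    """min( r*A, clip(r, 1-eps, 1+eps)*A ), averaged over samples."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))

# A > 0: pushing the ratio above 1+eps earns nothing extra (capped at 1.2).
print(ppo2_objective(np.array([2.0]), np.array([1.0])))   # 1.2
# A < 0: the min picks the pessimistic clipped branch (-0.8), whose gradient
# w.r.t. the ratio is zero, so there is no incentive to push r below 1-eps.
print(ppo2_objective(np.array([0.5]), np.array([-1.0])))  # -0.8
```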
