Hung-yi Lee Deep Reinforcement Learning Notes: Proximal Policy Optimization
Hung-yi Lee's Deep Reinforcement Learning course: https://www.bilibili.com/video/av24724071
Policy Gradient
Terms and basic ideas
Basic Components:
- actor (what policy gradient optimizes; we can control it)
- environment (cannot be controlled)
- reward function (cannot be controlled)
Policy $\pi$ of actor:
A network with parameters $\theta$: input an observation, output an action
Episode:
One round of the game, from beginning to end
Objective of actor:
Maximize the total reward
Trajectory $\tau$:
The sequence of states and actions in one episode: $\tau = \{s_1, a_1, s_2, a_2, \dots, s_T, a_T\}$
Probability of a trajectory $\tau$ given the network with parameters $\theta$:
$$p_\theta(\tau) = p(s_1)\prod_{t=1}^{T} p_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
Given a trajectory $\tau$, we can compute its total reward $R(\tau)$. Different actors obtain different rewards, and since both the actions taken by the actor and the states returned by the environment are random variables, the objective is to find the actor with the largest expected reward:
$$\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]$$
Policy Gradient
Sample N trajectories $\tau^1, \dots, \tau^N$ with the current policy and use them to estimate the gradient:
$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)$$
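A minimal sketch of this estimator, assuming for simplicity a stateless softmax policy (a bandit problem), so that $\nabla \log p_\theta(a_t \mid s_t)$ reduces to $\nabla \log \pi_\theta(a)$; all names here are illustrative, not from the lecture:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(logits, a):
    # For pi = softmax(theta): grad_theta log pi(a) = one_hot(a) - pi
    g = -softmax(logits)
    g[a] += 1.0
    return g

def pg_estimate(logits, episodes):
    # episodes: list of (sampled action a^n, episode return R(tau^n))
    # grad R_bar ~= (1/N) * sum_n R(tau^n) * grad log pi_theta(a^n)
    grads = [R * grad_log_softmax(logits, a) for a, R in episodes]
    return np.mean(grads, axis=0)
```

The estimate pushes up the log-probability of actions that appeared in high-reward episodes, in proportion to the reward they earned.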
Tip 1: Add a Baseline
If the reward is always positive, every sampled action's probability is pushed up, and the probabilities of actions that were not sampled decrease only because probabilities must sum to one. Subtracting a baseline $b$ (e.g., the average reward) makes the weight $R(\tau) - b$ signed, so below-average actions are actively discouraged.
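A tiny numeric illustration, with hypothetical episode returns chosen so that all are positive (as in a game that only gives non-negative rewards):

```python
import numpy as np

# Hypothetical returns from three sampled episodes; all positive.
returns = np.array([3.0, 5.0, 10.0])

# Without a baseline, every sampled action gets a positive weight, so every
# sampled action's probability is pushed up; unsampled actions shrink only
# through normalization.
no_baseline_weights = returns

# With a baseline b (here the mean return), below-average actions get a
# negative weight and are actively pushed down.
b = returns.mean()
baseline_weights = returns - b
```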
Tip 2: Assign Suitable Credit
The action taken at time t cannot influence rewards received before t, so only the rewards from t onward should be summed. Moreover, the action at t has less influence on rewards far in the future, so a discount factor $\gamma$ can be introduced: the credit assigned to $a_t$ becomes $\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$.
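This discounted reward-to-go can be computed in one backward pass over an episode's rewards; the default $\gamma = 0.99$ below is an assumed, typical value:

```python
import numpy as np

def credit(rewards, gamma=0.99):
    """Discounted reward-to-go: G_t = sum_{t' >= t} gamma**(t'-t) * r_t'.

    Only rewards from step t onward are credited to action a_t, and later
    rewards are discounted because a_t influences them less.
    """
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):   # accumulate from the end
        running = rewards[t] + gamma * running
        G[t] = running
    return G
```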
From on-policy to off-policy: using the experience more than once
Terms and basic ideas
On-policy: the agent that learns is the same agent that interacts with the environment.
Off-policy: the agent being trained and the agent interacting with the environment are different.
Why Off-policy:
When $\pi_\theta$ is used to collect data, the training data must be sampled again every time $\theta$ is updated.
Goal: use samples from $\pi_{\theta'}$ to train $\theta$. Since $\theta'$ is fixed, the sampled data can be reused.
Importance sampling:
When we only have samples from another distribution $q$, the expected value under $p$ can be rewritten as
$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$$
where the ratio $p(x)/q(x)$ rectifies the fact that we sampled from $q$ instead of $p$.
The distributions $p$ and $q$ must not be too different; otherwise, although the expectation is unchanged, the variance of the estimator can be very different, and many more samples are needed.
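A self-contained numeric check of the identity, with hypothetical distributions $p = \mathcal{N}(0,1)$ and $q = \mathcal{N}(0.5,1)$ chosen only to keep the demo simple:

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):  # target density N(0, 1)
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def q(x):  # sampling density N(0.5, 1)
    return np.exp(-(x - 0.5)**2 / 2) / np.sqrt(2 * np.pi)

xs = rng.normal(0.5, 1.0, 100_000)         # samples drawn from q, not p
# E_{x~p}[f(x)] = E_{x~q}[f(x) p(x)/q(x)]; the ratio rectifies the mismatch.
estimate = np.mean(xs**2 * p(xs) / q(xs))  # true value: E_p[x^2] = 1
```

With a mild mismatch between $p$ and $q$ the weighted estimate lands close to 1; shifting $q$ far from $p$ makes the weights, and hence the variance, blow up.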
Use importance sampling to reach the above goal:
$$J^{\theta'}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t)\right]$$
Tips:
- The advantage function (reward minus baseline) should be computed under the sampling parameters, i.e., $A^{\theta'}(s_t, a_t)$.
- The probability of a state is assumed to be similar under different parameters, so the ratio $p_\theta(s_t)/p_{\theta'}(s_t)$ is cancelled out.
- The stopping criterion for reusing the samples depends on how different the two distributions have become.
Add a constraint: $\mathrm{KL}(\theta, \theta')$ ($p_\theta$ cannot be very different from $p_{\theta'}$)
Tip: it is a constraint on behavior (the output action distributions), not on the parameters themselves.
PPO / TRPO:
TRPO differs from PPO in that it uses the KL divergence as a hard constraint rather than as a penalty term in the objective, which makes the optimization problem difficult to solve. Thus, PPO is used more often in practice.
PPO algorithm
Initialize the policy parameters $\theta^0$.
In each iteration, use $\theta^k$ to interact with the environment, collect $\{(s_t, a_t)\}$ and compute the advantage $A^{\theta^k}(s_t, a_t)$, then optimize
$$J_{\mathrm{PPO}}(\theta) = J^{\theta^k}(\theta) - \beta\,\mathrm{KL}(\theta, \theta^k)$$
When the KL is too large, increase $\beta$ to strengthen the penalty; when the KL is too small, decrease $\beta$ to weaken it.
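The adaptive $\beta$ update can be sketched as below; the 1.5x band around the KL target and the doubling/halving factor are assumed values (common heuristics), not fixed by the lecture:

```python
def update_beta(beta, kl, kl_target, factor=2.0):
    """Adaptive KL penalty: adjust beta so the measured KL tracks kl_target."""
    if kl > 1.5 * kl_target:      # KL too large: increase beta, penalize more
        beta *= factor
    elif kl < kl_target / 1.5:    # KL too small: decrease beta, penalize less
        beta /= factor
    return beta                   # within the band, leave beta unchanged
```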
PPO2 algorithm:
$$J_{\mathrm{PPO2}}(\theta) \approx \sum_{(s_t, a_t)} \min\!\left(\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)}\, A^{\theta^k}(s_t, a_t),\ \mathrm{clip}\!\left(\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) A^{\theta^k}(s_t, a_t)\right)$$
Tips:
- The clip function forces the ratio in the first term to stay between $1-\varepsilon$ and $1+\varepsilon$ (the blue line in the slide).
- The whole objective, the minimum of the two terms, is the red line in the slide.
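A per-sample sketch of this clipped objective; $\varepsilon = 0.2$ is an assumed, commonly used value:

```python
import numpy as np

def ppo2_term(ratio, advantage, eps=0.2):
    # min(r * A, clip(r, 1-eps, 1+eps) * A): the clipped term bounds how much
    # a large probability ratio can increase the objective.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)
```

Note the asymmetry the min produces: with a positive advantage the gain is capped at $(1+\varepsilon)A$, while with a negative advantage the unclipped (worse) term is kept, so the policy is never rewarded for moving too far from $\theta^k$.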