Hung-yi Lee Deep Reinforcement Learning Notes: Proximal Policy Optimization
Hung-yi Lee's Deep Reinforcement Learning course: https://www.bilibili.com/video/av24724071
Policy Gradient
Terms and basic ideas
Basic Components:
- actor (what policy gradient optimizes; we can control it)
- environment (cannot be controlled)
- reward function (cannot be controlled)
Policy $\pi$ of actor:
A network with parameters $\theta$: input an observation, output an action
Episode:
One round of the game, from beginning to end
Objective of actor:
Maximize the total reward
Trajectory $\tau$:
The sequence of states and actions in one episode: $\tau = \{s_1, a_1, s_2, a_2, \dots, s_T, a_T\}$
Probability of a trajectory $\tau$ given the network with parameters $\theta$:
$$p_\theta(\tau) = p(s_1)\prod_{t=1}^{T} p_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
Given a trajectory $\tau$, we can compute its total reward $R(\tau)$. Different actors obtain different rewards, and since both the actions taken by the actor and the states returned by the environment are random variables, the objective is to find the actor with the largest expected reward:
$$\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]$$
Policy Gradient
Sample N trajectories $\tau^1, \dots, \tau^N$ with the current policy and use them to estimate the gradient:
$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)$$
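A minimal sketch of this estimator, assuming for simplicity a stateless softmax policy (a bandit problem), so that $\nabla \log p_\theta(a_t \mid s_t)$ reduces to $\nabla \log \pi_\theta(a)$; all names here are illustrative, not from the lecture:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(logits, a):
    # For pi = softmax(theta): grad_theta log pi(a) = one_hot(a) - pi
    g = -softmax(logits)
    g[a] += 1.0
    return g

def pg_estimate(logits, episodes):
    # episodes: list of (sampled action a^n, episode return R(tau^n))
    # grad R_bar ~= (1/N) * sum_n R(tau^n) * grad log pi_theta(a^n)
    grads = [R * grad_log_softmax(logits, a) for a, R in episodes]
    return np.mean(grads, axis=0)
```

The estimate pushes up the log-probability of actions that appeared in high-reward episodes, in proportion to the reward they earned.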
Tip 1: Add a Baseline
If the reward is always positive, every sampled action's probability is pushed up, and the probabilities of actions that were not sampled decrease only because probabilities must sum to one. Subtracting a baseline $b$ (e.g., the average reward) makes the weight $R(\tau) - b$ signed, so below-average actions are actively discouraged.
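A tiny numeric illustration, with hypothetical episode returns chosen so that all are positive (as in a game that only gives non-negative rewards):

```python
import numpy as np

# Hypothetical returns from three sampled episodes; all positive.
returns = np.array([3.0, 5.0, 10.0])

# Without a baseline, every sampled action gets a positive weight, so every
# sampled action's probability is pushed up; unsampled actions shrink only
# through normalization.
no_baseline_weights = returns

# With a baseline b (here the mean return), below-average actions get a
# negative weight and are actively pushed down.
b = returns.mean()
baseline_weights = returns - b
```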
Tip 2: Assign Suitable Credit
The action taken at time t cannot influence rewards received before t, so only the rewards from t onward should be summed. Moreover, the action at t has less influence on rewards far in the future, so a discount factor $\gamma$ can be introduced: the credit assigned to $a_t$ becomes $\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$.
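This discounted reward-to-go can be computed in one backward pass over an episode's rewards; the default $\gamma = 0.99$ below is an assumed, typical value:

```python
import numpy as np

def credit(rewards, gamma=0.99):
    """Discounted reward-to-go: G_t = sum_{t' >= t} gamma**(t'-t) * r_t'.

    Only rewards from step t onward are credited to action a_t, and later
    rewards are discounted because a_t influences them less.
    """
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):   # accumulate from the end
        running = rewards[t] + gamma * running
        G[t] = running
    return G
```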
From on-policy to off-policy: using the experience more than once
Terms and basic ideas
On-policy: the agent that learns is the same agent that interacts with the environment.
Off-policy: the agent being trained and the agent interacting with the environment are different.
Why Off-policy:
When $\pi_\theta$ is used to collect data, the training data must be sampled again every time $\theta$ is updated.
Goal: use samples from $\pi_{\theta'}$ to train $\theta$. Since $\theta'$ is fixed, the sampled data can be reused.
Importance sampling:
When we only have samples from another distribution $q$, the expected value under $p$ can be rewritten as
$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$$
where the ratio $p(x)/q(x)$ rectifies the fact that we sampled from $q$ instead of $p$.
The distributions $p$ and $q$ must not be too different; otherwise, although the expectation is unchanged, the variance of the estimator can be very different, and many more samples are needed.
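A self-contained numeric check of the identity, with hypothetical distributions $p = \mathcal{N}(0,1)$ and $q = \mathcal{N}(0.5,1)$ chosen only to keep the demo simple:

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):  # target density N(0, 1)
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def q(x):  # sampling density N(0.5, 1)
    return np.exp(-(x - 0.5)**2 / 2) / np.sqrt(2 * np.pi)

xs = rng.normal(0.5, 1.0, 100_000)         # samples drawn from q, not p
# E_{x~p}[f(x)] = E_{x~q}[f(x) p(x)/q(x)]; the ratio rectifies the mismatch.
estimate = np.mean(xs**2 * p(xs) / q(xs))  # true value: E_p[x^2] = 1
```

With a mild mismatch between $p$ and $q$ the weighted estimate lands close to 1; shifting $q$ far from $p$ makes the weights, and hence the variance, blow up.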
Use importance sampling to reach the above goal:
$$J^{\theta'}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t)\right]$$
Tips:
- The advantage function (reward minus baseline) should be computed under the sampling parameters, i.e., $A^{\theta'}(s_t, a_t)$.
- The probability of a state is assumed to be similar under different parameters, so the ratio $p_\theta(s_t)/p_{\theta'}(s_t)$ is cancelled out.
- The stopping criterion for reusing the samples depends on how different the two distributions have become.
Add a constraint: $\mathrm{KL}(\theta, \theta')$ ($p_\theta$ cannot be very different from $p_{\theta'}$)
Tip: it is a constraint on behavior (the output action distributions), not on the parameters themselves.
PPO / TRPO:
TRPO differs from PPO in that it uses the KL divergence as a hard constraint rather than as a penalty term in the objective, which makes the optimization problem difficult to solve. Thus, PPO is used more often in practice.
PPO algorithm
Initialize the policy parameters $\theta^0$.
In each iteration, use $\theta^k$ to interact with the environment, collect $\{(s_t, a_t)\}$ and compute the advantage $A^{\theta^k}(s_t, a_t)$, then optimize
$$J_{\mathrm{PPO}}(\theta) = J^{\theta^k}(\theta) - \beta\,\mathrm{KL}(\theta, \theta^k)$$
When the KL is too large, increase $\beta$ to strengthen the penalty; when the KL is too small, decrease $\beta$ to weaken it.
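The adaptive $\beta$ update can be sketched as below; the 1.5x band around the KL target and the doubling/halving factor are assumed values (common heuristics), not fixed by the lecture:

```python
def update_beta(beta, kl, kl_target, factor=2.0):
    """Adaptive KL penalty: adjust beta so the measured KL tracks kl_target."""
    if kl > 1.5 * kl_target:      # KL too large: increase beta, penalize more
        beta *= factor
    elif kl < kl_target / 1.5:    # KL too small: decrease beta, penalize less
        beta /= factor
    return beta                   # within the band, leave beta unchanged
```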
PPO2 algorithm:
$$J_{\mathrm{PPO2}}(\theta) \approx \sum_{(s_t, a_t)} \min\!\left(\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)}\, A^{\theta^k}(s_t, a_t),\ \mathrm{clip}\!\left(\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) A^{\theta^k}(s_t, a_t)\right)$$
Tips:
- The clip function forces the ratio in the first term to stay between $1-\varepsilon$ and $1+\varepsilon$ (the blue line in the slide).
- The whole objective, the minimum of the two terms, is the red line in the slide.
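A per-sample sketch of this clipped objective; $\varepsilon = 0.2$ is an assumed, commonly used value:

```python
import numpy as np

def ppo2_term(ratio, advantage, eps=0.2):
    # min(r * A, clip(r, 1-eps, 1+eps) * A): the clipped term bounds how much
    # a large probability ratio can increase the objective.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)
```

Note the asymmetry the min produces: with a positive advantage the gain is capped at $(1+\varepsilon)A$, while with a negative advantage the unclipped (worse) term is kept, so the policy is never rewarded for moving too far from $\theta^k$.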