Hung-yi Lee Deep Reinforcement Learning: Sparse Reward

Hung-yi Lee's Deep Reinforcement Learning course: https://www.bilibili.com/video/av24724071

Hung-yi Lee Deep Reinforcement Learning Notes (1): Outline
Hung-yi Lee Deep Reinforcement Learning Notes (2): Proximal Policy Optimization (PPO)
Hung-yi Lee Deep Reinforcement Learning Notes (3): Q-Learning
Hung-yi Lee Deep Reinforcement Learning Notes (4): Actor-Critic
Hung-yi Lee Deep Reinforcement Learning Notes (6): Imitation Learning
Hung-yi Lee Deep Reinforcement Learning course slides

Reward Shaping

When the reward is too sparse, it is quite difficult for the machine to learn how to act.
Thus, reward shaping is used: extra, hand-designed rewards are added to guide the actor toward the behavior you want.
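A minimal sketch of the idea (the function name, the distance-based bonus, and the weight are illustrative choices, not from the lecture): the sparse environment reward is augmented with a dense term that rewards progress toward the goal.

```python
# Sketch of reward shaping: augment a sparse environment reward with a
# hand-designed dense bonus (here, progress toward a goal position).

def shaped_reward(env_reward, state, next_state, goal, weight=0.1):
    """Return the environment reward plus a distance-based shaping term."""
    def dist(s):
        return sum((a - b) ** 2 for a, b in zip(s, goal)) ** 0.5
    # Positive bonus when the agent moves closer to the goal.
    progress = dist(state) - dist(next_state)
    return env_reward + weight * progress

# The sparse reward is 0, but moving from (2, 2) toward goal (0, 0) earns a bonus.
r = shaped_reward(0.0, state=(2.0, 2.0), next_state=(1.0, 1.0), goal=(0.0, 0.0))
```

Even on steps where the environment returns nothing, the actor now gets a learning signal.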
Curiosity-based method:
Add an intrinsic curiosity module (ICM) that takes $s_t, a_t, s_{t+1}$ as input and produces a new kind of reward $r^i$; the actor's objective then becomes maximizing the total of the extrinsic reward $r$ and the intrinsic reward $r^i$.
The ICM (intrinsic curiosity module) generates the actor's curiosity; its network structure is shown below:
(figure: ICM network architecture)
In this model, network 1 and network 2 are two networks trained separately.
Network 1: takes $a_t$ and the extracted feature of $s_t$ as input, and outputs an estimate of $s_{t+1}$'s extracted feature. The true feature extracted from $s_{t+1}$ is then compared with the estimate to obtain the difference.
The bigger the difference, the larger the reward $r^i$, which means the model encourages the actor to take risks, i.e. to visit states it cannot yet predict.
Network 2: trained to extract features that are relevant to actions. It takes the $\phi$ values of $s_t$ and $s_{t+1}$ as input and outputs an estimated action $\hat{a}_t$; if the estimated action is close to the true $a_t$, then $\phi$ is extracting the useful (action-relevant) features.
Tip: without network 2, a large reward from network 1 only means that $s_{t+1}$ is hard to predict, so the model encourages risk-taking; but some hard-to-predict states may simply be unimportant (e.g., random noise in the environment). That is why network 2 is needed.
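The data flow above can be sketched as follows. This is only a sketch of the reward computation, not a trainable implementation: fixed random matrices stand in for the feature extractor $\phi$, the forward model (network 1), and the inverse model (network 2), and all dimensions are illustrative.

```python
import numpy as np

# Sketch of the ICM's data flow: phi extracts features, network 1 (forward
# model) predicts the next feature, network 2 (inverse model) predicts the
# action. In the real module these are trained neural networks; here they
# are fixed random linear maps, just to show the shapes and the reward.
rng = np.random.default_rng(0)
STATE_DIM, FEAT_DIM, N_ACTIONS = 8, 4, 3

W_phi = rng.normal(size=(STATE_DIM, FEAT_DIM))              # feature extractor phi
W_fwd = rng.normal(size=(FEAT_DIM + N_ACTIONS, FEAT_DIM))   # network 1: forward model
W_inv = rng.normal(size=(2 * FEAT_DIM, N_ACTIONS))          # network 2: inverse model

def phi(s):
    return s @ W_phi

def intrinsic_reward(s_t, a_t, s_next):
    """r^i = prediction error of the forward model in feature space."""
    a_onehot = np.eye(N_ACTIONS)[a_t]
    pred_feat = np.concatenate([phi(s_t), a_onehot]) @ W_fwd  # network 1 output
    true_feat = phi(s_next)
    return np.sum((pred_feat - true_feat) ** 2)  # bigger error -> bigger r^i

def inverse_logits(s_t, s_next):
    """Network 2: predict a_t from phi(s_t) and phi(s_{t+1}). Training it
    forces phi to keep only action-relevant features."""
    return np.concatenate([phi(s_t), phi(s_next)]) @ W_inv

s_t, s_next = rng.normal(size=STATE_DIM), rng.normal(size=STATE_DIM)
r_i = intrinsic_reward(s_t, a_t=1, s_next=s_next)  # added to the extrinsic reward r
```

In the trained module, the forward-model error gives $r^i$, while the inverse model's action-prediction loss shapes $\phi$ so that unpredictable-but-irrelevant noise does not inflate the reward.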

Curriculum Learning

Means learning tasks in order from easy to difficult (a curriculum schedule for the machine).
Reverse Curriculum Generation:
Given a goal state $s_g$ --> sample some states $s_1$ "close" to $s_g$ --> start a trajectory from each $s_1$; each trajectory yields a reward $R(s_1)$ --> delete the $s_1$ whose reward is too large (already learned) or too small (too difficult at this moment) --> sample $s_2$ near the remaining $s_1$, start from $s_2$, and so on.
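One iteration of that pipeline can be sketched as below, on a 1-D toy task. The noise scale, reward thresholds, and the stand-in episode function are illustrative assumptions, not values from the lecture.

```python
import random

# Sketch of one reverse-curriculum iteration on a 1-D task: expand the set of
# start states outward from the goal, keeping only states of medium difficulty.

def reverse_curriculum_step(start_states, run_episode,
                            r_min=0.1, r_max=0.9, n_new=10, noise=0.5):
    """Sample new start states near the current ones and filter by reward."""
    # 1. Sample candidate start states "close" to the current ones.
    candidates = [s + random.gauss(0, noise)
                  for s in start_states for _ in range(n_new)]
    # 2. Run a trajectory from each candidate and record its return R(s).
    scored = [(s, run_episode(s)) for s in candidates]
    # 3. Delete states whose reward is too large (already learned) or too
    #    small (too difficult at this moment).
    return [s for s, R in scored if r_min <= R <= r_max]

goal = 0.0
# Stand-in episode: starting closer to the goal yields a higher return.
run_episode = lambda s: max(0.0, 1.0 - abs(s - goal))
frontier = reverse_curriculum_step([goal], run_episode)
```

Repeating the step with `frontier` as the new start set pushes the curriculum gradually further from the goal.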

Hierarchical Reinforcement Learning

If the lower-level agent cannot achieve the goal, the upper-level agent is penalized. (The upper agent sends its wish, i.e. a goal, to the lower agent.)
If an agent reaches the wrong goal, assume the reached goal was the intended one all along. (What has been achieved is not wasted.)
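The second trick above can be sketched as a goal-relabeling step (the function and state names are illustrative; the lecture describes the idea, not this code): when the lower agent ends up somewhere other than the requested goal, the trajectory is relabeled as if that final state had been the goal, so the experience still provides a useful training signal.

```python
# Sketch of goal relabeling in hierarchical RL: pretend the achieved state
# was the goal the upper agent had asked for, so the trajectory is not wasted.

def relabel(trajectory, intended_goal):
    """Pair each state with the goal actually achieved, if it differs."""
    achieved_goal = trajectory[-1]  # final state the lower agent reached
    if achieved_goal != intended_goal:
        # Wrong goal reached: relabel as if it had been the intended one.
        return [(s, achieved_goal) for s in trajectory]
    return [(s, intended_goal) for s in trajectory]

traj = ["s0", "s1", "s2"]
relabeled = relabel(traj, intended_goal="g")  # the goal becomes "s2"
```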
