Hung-yi Lee Deep Reinforcement Learning - Sparse Reward
Hung-yi Lee's deep reinforcement learning course: https://www.bilibili.com/video/av24724071
Hung-yi Lee Deep Reinforcement Learning Notes (1): Outline
Hung-yi Lee Deep Reinforcement Learning Notes (2): Proximal Policy Optimization (PPO)
Hung-yi Lee Deep Reinforcement Learning Notes (3): Q-Learning
Hung-yi Lee Deep Reinforcement Learning Notes (4): Actor-Critic
Hung-yi Lee Deep Reinforcement Learning Notes (6): Imitation Learning
Hung-yi Lee deep reinforcement learning course slides
Reward Shaping
When the reward is too sparse, it is very difficult for the machine to learn how to act.
Reward shaping addresses this by adding extra rewards that lead the actor toward the behavior you want it to perform.
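As a concrete illustration, here is a minimal sketch of reward shaping on a hypothetical 1-D grid world (the environment, the potential function, and all names are my own assumptions, not from the lecture). The sparse reward is 1 only at the goal; a potential-based shaping term rewards moving closer to the goal, so the actor gets a learning signal well before it ever reaches the goal.

```python
GOAL = 10  # hypothetical goal position on a 1-D line

def sparse_reward(state):
    # Original sparse reward: nonzero only at the goal state.
    return 1.0 if state == GOAL else 0.0

def shaped_reward(state, next_state, gamma=0.99):
    # Potential-based shaping: phi(s) = -distance to goal.
    # The extra term gamma*phi(s') - phi(s) is positive when the actor
    # moves closer to the goal and negative when it moves away.
    phi = lambda s: -abs(GOAL - s)
    return sparse_reward(next_state) + gamma * phi(next_state) - phi(state)

print(shaped_reward(3, 4))  # moving toward the goal: positive bonus
print(shaped_reward(3, 2))  # moving away from the goal: negative bonus
```

Potential-based shaping of this form is a common choice because it does not change which policy is optimal, only how quickly the actor finds it.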
Curiosity-based method:
Add another module, the ICM, which takes the state s_t and the action a_t as input and produces a new kind of reward r^i_t; the objective of the actor then changes to maximizing the sum of the extrinsic reward r and the intrinsic reward r^i.
The ICM (Intrinsic Curiosity Module) is used to generate the actor's curiosity; its network is shown as follows:
In this model, Network 1 and Network 2 are two networks trained separately.
Network 1: takes a_t and the extracted feature of s_t as input, and outputs an estimate of the extracted feature of s_{t+1}. The true feature extracted from s_{t+1} is then compared with the estimated one to get the difference.
The bigger the difference, the larger the intrinsic reward r^i_t, which means the model encourages the actor to take risks and visit states it cannot yet predict.
Network 2: trained to extract features that are related to actions. It takes s_t and s_{t+1} as input and outputs an estimated action; if the estimated action is close to the true a_t, then the extractor is capturing the action-relevant features.
Tip: without Network 2, a large reward from Network 1 only means that s_{t+1} is hard to predict, so the model encourages risk-taking; but some hard-to-predict states may be unimportant to the task, which is why Network 2 is needed to filter out action-irrelevant features.
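The two ICM networks above can be sketched as follows. This is a deliberately tiny illustration, assuming random linear maps in place of trained neural networks, made-up dimensions, and my own function names; it only shows how the prediction error of Network 1 becomes the curiosity reward, and what loss Network 2 is trained on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature extractor and the two ICM networks, reduced to
# fixed random linear maps so the sketch stays self-contained.
W_feat = rng.normal(size=(4, 8))       # phi(s): state (dim 4) -> feature (dim 8)
W_fwd  = rng.normal(size=(8 + 2, 8))   # Network 1: [phi(s_t), a_t] -> estimated phi(s_{t+1})
W_inv  = rng.normal(size=(8 + 8, 2))   # Network 2: [phi(s_t), phi(s_{t+1})] -> estimated a_t

def phi(s):
    return s @ W_feat

def intrinsic_reward(s_t, a_t, s_next):
    # Network 1 predicts the next state's feature; the squared prediction
    # error is the curiosity reward r^i_t (big error -> big reward).
    pred_feat = np.concatenate([phi(s_t), a_t]) @ W_fwd
    return np.sum((pred_feat - phi(s_next)) ** 2)

def inverse_loss(s_t, a_t, s_next):
    # Network 2 predicts the action from the two features; minimizing this
    # loss pushes phi to keep only action-relevant features.
    pred_a = np.concatenate([phi(s_t), phi(s_next)]) @ W_inv
    return np.sum((pred_a - a_t) ** 2)

s_t, s_next = rng.normal(size=4), rng.normal(size=4)
a_t = np.array([1.0, 0.0])
print(intrinsic_reward(s_t, a_t, s_next), inverse_loss(s_t, a_t, s_next))
```

In a real implementation both maps would be trained networks, with the forward loss and inverse loss optimized jointly.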
Curriculum Learning
Curriculum learning means learning tasks from easy to difficult (a curriculum schedule designed for the machine).
Reverse Curriculum Generation:
Given a goal state s_g --> sample some states s_1 "close" to s_g --> start from each state s_1; each trajectory has reward R(s_1) --> delete those s_1 whose reward is too large (already learned) or too small (too difficult at this moment) --> sample new states s_2 near the remaining s_1, and start from s_2
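The loop above can be sketched on a toy 1-D chain (everything here is a hypothetical stand-in: the random-walk "policy", the success-rate estimate of R(s), and the 0.1/0.9 difficulty thresholds are my own assumptions, not values from the lecture). Starting states are sampled near the goal, states that are too easy or too hard are discarded, and the curriculum expands outward from the survivors.

```python
import random

random.seed(0)
GOAL = 20  # hypothetical goal position on a 1-D chain

def rollout_reward(start, trials=200, horizon=30):
    # Estimated R(s): success rate of a random-walk policy starting at `start`.
    wins = 0
    for _ in range(trials):
        s = start
        for _ in range(horizon):
            s += random.choice([-1, 1])
            if s >= GOAL:
                wins += 1
                break
    return wins / trials

# Reverse curriculum: sample states s_1 near the goal, keep only those of
# intermediate difficulty, then sample s_2 near the survivors, and so on.
states = [GOAL - random.randint(1, 3) for _ in range(10)]
for _ in range(3):
    rewards = {s: rollout_reward(s) for s in set(states)}
    keep = [s for s, r in rewards.items() if 0.1 < r < 0.9]  # drop too easy / too hard
    if not keep:
        break
    states = [s - random.randint(1, 3) for s in keep for _ in range(2)]
print(sorted(set(states)))
```

Each outer iteration moves the start states further from the goal, so the task gets gradually harder as the agent masters the nearer starts.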
Hierarchical Reinforcement Learning
The upper-level agent sends its wish (a goal) to the lower-level agent; if the lower agent cannot achieve the goal, the upper agent gets a penalty.
If an agent reaches a wrong goal, assume the original goal was the one it actually reached (so the achieved trajectory won't be wasted).
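The "wrong goal" trick can be sketched as a relabeling step, in the spirit of hindsight relabeling (the function name and state strings here are hypothetical). When the lower agent ends up somewhere other than the intended goal, the trajectory is kept and relabeled as a success for the state it actually reached.

```python
def relabel(trajectory, intended_goal):
    # If the lower agent reached the wrong state, pretend that state was
    # the goal all along, so the trajectory still yields a useful success.
    achieved = trajectory[-1]
    if achieved == intended_goal:
        return intended_goal, 1.0   # real success
    return achieved, 1.0            # relabeled: the achieved state becomes the goal

goal, reward = relabel(["s0", "s1", "s2"], intended_goal="s9")
print(goal, reward)  # → s2 1.0
```

This way every trajectory teaches the agent how to reach *some* goal, which is exactly why the achieved effort "won't be wasted."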