[Hung-yi Lee 2020 ML/DL] P107-109 Deep Reinforcement Learning

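I already have two years of ML experience, so I use this course series mainly to fill gaps, and I only note down the details I did not know before.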

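I spent half a year studying and practicing reinforcement learning, so these notes only record Prof. Lee's outline. My reinforcement learning resource repository: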
https://github.com/PiperLiu/Reinforcement-Learning-practice-zh
My collection of reinforcement learning posts on CSDN:
https://blog.csdn.net/weixin_42815609/category_9592110.html

Overview of this section

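These days everyone gets excited when DRL comes up, because of its striking results on video games.
Prof. Lee first introduces the reinforcement learning framework, the interaction between environment and agent, and then contrasts supervised learning with reinforcement learning.
Reinforcement learning can also be applied to chat-bots, and to other applications such as interactive retrieval.
To illustrate the learning process, Prof. Lee walks through a small Atari game as an example.
Where are the difficulties of reinforcement learning? The first is reward delay; the second is that the actions the agent takes affect the data it receives afterward.
Prof. Lee takes an unusual route: instead of starting with Markov processes and DQN, he goes straight to A3C.
Reinforcement learning methods are divided into value-based and policy-based; A3C combines the two.
Next comes the policy-based part. The actor is treated as a function $\pi_\theta(s)$. Step 1 covers the neural network as actor; step 2, the goodness of an actor (see Details); step 3, picking the best function (see Details).
Up to this point, the material is actually the last lecture of one of Prof. Lee's semesters; see Learning Map.
What follows is supplementary material on how to implement PG in practice.
Another part also introduces reinforcement learning, but under the title "Learning to Interact with Environment". This recording shows Prof. Lee and his classroom.
Google has a paper that introduces 7 tips for training DQN, called Rainbow.
Next, some of the ideas behind A3C are introduced, in particular the parallel, asynchronous part.
A GAN-like architecture is introduced: Pathwise Derivative Policy Gradient.
Finally, Inverse Reinforcement Learning is covered.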


Policy-based: Goodness of Actor
Policy-based: Pick the Best Function

Gradient Ascent
Add a Baseline
How to estimate V(s)

Learning Map
A3C: Asynchronous
Pathwise Derivative Policy Gradient
Inverse Reinforcement Learning

Details

Policy-based: Goodness of Actor

The slide gives the definition of $\bar{R}_\theta$, the expected total reward of the actor $\pi_\theta$.
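
For my own reference, the definition is usually written as below. This is a sketch in standard policy-gradient notation, not a copy of the slide: I assume $\tau$ denotes one trajectory (episode), $R(\tau)$ its total reward, $P(\tau \mid \theta)$ the probability of that trajectory under the actor, and $\tau^1, \dots, \tau^N$ are $N$ trajectories sampled by playing the game with $\pi_\theta$.

$$
\bar{R}_\theta = \sum_{\tau} R(\tau)\, P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n})
$$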

Policy-based: Pick the Best Function

As the slide shows, our goal is to maximize $\bar{R}_\theta$.
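
Written out, this is an ordinary optimization problem solved by gradient ascent. The learning rate $\eta$ and the iteration index $k$ below are my own notation, assumed for illustration.

$$
\theta^{*} = \arg\max_{\theta} \bar{R}_\theta, \qquad \theta^{k+1} \leftarrow \theta^{k} + \eta\, \nabla \bar{R}_{\theta^{k}}
$$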

Gradient Ascent

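The slides give the derivation.

The term $P$ that has to be differentiated is then expanded.

After taking the log, every term that does not depend on $\theta$ drops out of the derivative.

Finally, the slides give the approximate gradient of the objective.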
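
Since the slides themselves are not reproduced here, this is my reconstruction of the usual derivation, under assumed notation: $\tau$ is a trajectory with total reward $R(\tau)$ and probability $P(\tau \mid \theta)$, $\tau^n$ is the $n$-th of $N$ sampled trajectories with length $T_n$, and $s_t^n$, $a_t^n$ are its state and action at step $t$. It may differ in detail from what was shown in class.

$$
\nabla \bar{R}_\theta = \sum_{\tau} R(\tau)\, \nabla P(\tau \mid \theta)
= \sum_{\tau} R(\tau)\, P(\tau \mid \theta)\, \nabla \log P(\tau \mid \theta)
\approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n})\, \nabla \log P(\tau^{n} \mid \theta)
$$

$$
P(\tau \mid \theta) = p(s_1) \prod_{t=1}^{T} p(a_t \mid s_t, \theta)\, p(r_t, s_{t+1} \mid s_t, a_t)
$$

$$
\nabla \log P(\tau \mid \theta) = \sum_{t=1}^{T} \nabla \log p(a_t \mid s_t, \theta)
$$

$$
\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^{n})\, \nabla \log p(a_t^{n} \mid s_t^{n}, \theta)
$$

The terms $p(s_1)$ and $p(r_t, s_{t+1} \mid s_t, a_t)$ belong to the environment and do not depend on $\theta$, which is why they vanish in the third line.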

One more question: why take the log?
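As the slide shows, taking the log is equivalent to dividing the original derivative by the probability. Without the log, the update would favor actions that occur often rather than actions with high reward.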
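
To convince myself, I wrote a small numeric check. It is only a sketch under my own assumptions: a single state, a two-action softmax policy with probabilities 0.9 and 0.1, and hypothetical rewards 1 and 2. Because actions are sampled from the policy, action `a` shows up in the samples with frequency `probs[a]`; the function `expected_update` (my name, not from the lecture) sums the per-sample updates with or without the division by the probability.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_update(logits, rewards, use_log):
    """Expected update direction on the logits when actions are sampled
    from softmax(logits) and each sample is weighted by its reward."""
    probs = softmax(logits)
    k = len(logits)
    update = [0.0] * k
    for a in range(k):
        # d log p(a) / d logit_j = 1[j == a] - p(j)
        grad_log = [(1.0 if j == a else 0.0) - probs[j] for j in range(k)]
        # d p(a) / d logit_j = p(a) * d log p(a) / d logit_j
        scale = 1.0 if use_log else probs[a]
        for j in range(k):
            # probs[a] is how often action a appears among the samples
            update[j] += probs[a] * rewards[a] * scale * grad_log[j]
    return update


logits = [math.log(0.9), math.log(0.1)]  # action 0 is sampled 90% of the time
rewards = [1.0, 2.0]                     # but action 1 has the higher reward

with_log = expected_update(logits, rewards, use_log=True)
without_log = expected_update(logits, rewards, use_log=False)
print("with log:   ", with_log)
print("without log:", without_log)
```

With the log, the update raises the logit of the high-reward action 1; without it, the update raises the logit of the frequent action 0 instead.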

Add a Baseline

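As the slide shows, sampling has a flaw in practice: when every reward is positive, an action that was never sampled has its probability pushed down, simply because all the sampled actions are pushed up.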

For this reason a critic is introduced: it evaluates the value of a state, which can then serve as the baseline.
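
A minimal sketch of the sampling problem and of what a baseline does, under assumptions of my own: one state, three actions with a uniform softmax policy, and only two samples, so action 2 is never sampled. The baseline here is simply the mean sampled reward, not a learned critic, and `pg_estimate` is a hypothetical helper name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pg_estimate(logits, samples, baseline=0.0):
    """(1/N) * sum_n (R_n - baseline) * grad log p(a_n) for a softmax policy.

    samples is a list of (action, reward) pairs.
    """
    probs = softmax(logits)
    k = len(logits)
    grad = [0.0] * k
    for action, reward in samples:
        weight = reward - baseline
        for j in range(k):
            # d log p(action) / d logit_j = 1[j == action] - p(j)
            grad[j] += weight * ((1.0 if j == action else 0.0) - probs[j])
    return [g / len(samples) for g in grad]


logits = [0.0, 0.0, 0.0]            # uniform policy over three actions
samples = [(0, 1.0), (1, 3.0)]      # action 2 was never sampled; rewards all positive
mean_reward = sum(r for _, r in samples) / len(samples)

no_baseline = pg_estimate(logits, samples)
with_baseline = pg_estimate(logits, samples, baseline=mean_reward)
print("no baseline:  ", no_baseline)
print("with baseline:", with_baseline)
```

Without a baseline, the logit of the unsampled action 2 gets a clearly negative gradient even though nothing is known about it. With the mean reward as baseline, that gradient component is zero and only the relative quality of the sampled actions matters.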

How to estimate V(s)

How to estimate $V^{\pi}(s)$.

During training, we want $\left(V^{\pi_{\theta}}\left(s_{t}^{n}\right)-V^{\pi_{\theta}}\left(s_{t+1}^{n}\right)\right)$ to be as close to $r_{t}^{n}$ as possible.
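
The condition above is a temporal-difference target without discounting. Here is a minimal tabular sketch of it, assuming a toy deterministic chain of three states where every step gives reward 1 and state 3 is terminal. In the lecture the value function is a network; the table, the function name `td0_value_estimate`, and the hyperparameters are my own simplifications.

```python
def td0_value_estimate(episodes, num_states, lr=0.1, sweeps=500):
    """Tabular TD(0) without discounting.

    Each episode is a list of (state, reward, next_state) transitions.
    The update moves V(s_t) - V(s_{t+1}) toward r_t.
    """
    # One extra entry for the terminal state, whose value stays 0.
    values = [0.0] * (num_states + 1)
    for _ in range(sweeps):
        for episode in episodes:
            for state, reward, next_state in episode:
                td_error = reward + values[next_state] - values[state]
                values[state] += lr * td_error
    return values


# Toy chain: 0 -> 1 -> 2 -> 3 (terminal), reward 1 at every step.
episode = [(0, 1.0, 1), (1, 1.0, 2), (2, 1.0, 3)]
values = td0_value_estimate([episode], num_states=3)
print(values)
```

After training, each state's value is close to the reward still to be collected from that state, so the difference between consecutive values is close to the step reward of 1.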

Learning Map

The learning map on the slide summarizes what this lesson covers.

A3C: Asynchronous

As the slide shows, in A3C the interaction with the environment is multi-threaded and runs in parallel.
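
The asynchronous pattern can be sketched with Python threads. This is only an illustration of the worker loop (copy the global parameters, compute gradients locally, push the update back without waiting for the other workers), not an A3C implementation: there is no actor, critic, or environment, and `toy_gradient` is a hypothetical stand-in for the gradient a worker would compute from its own sampled experience. I also use a lock around the write for simplicity, which is my own choice.

```python
import threading

TARGET = 3.0  # hypothetical optimum of the toy objective -(theta - TARGET)^2


def toy_gradient(theta):
    # Stand-in for "sample data with the local copy and compute gradients".
    return -2.0 * (theta - TARGET)


def worker(shared, lock, steps, lr):
    for _ in range(steps):
        local_theta = shared["theta"]      # 1. copy the global parameters
        grad = toy_gradient(local_theta)   # 2-3. interact and compute gradients
        with lock:                         # 4. update the global parameters
            shared["theta"] += lr * grad   #    (other workers may have moved on)


def run_async(num_workers=4, steps=200, lr=0.05):
    shared = {"theta": 0.0}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(shared, lock, steps, lr))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["theta"]


final_theta = run_async()
print(final_theta)
```

Each worker may apply a gradient computed from a slightly stale copy of the parameters; the point of the design is that workers never block on each other while collecting experience.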

Pathwise Derivative Policy Gradient

As the slide shows, the structure is somewhat like a GAN. It can handle continuous actions.
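
The GAN analogy is that the actor plays the role of the generator and the critic $Q(s, a)$ the role of the discriminator: with $Q$ held fixed, the actor is updated by gradient ascent on $Q(s, \pi(s))$, differentiating through the continuous action. Below is a minimal sketch of only that actor step, under heavy assumptions of mine: a hand-written fixed critic `critic_q` whose best action is `2 * s`, and a one-parameter linear actor. In the real method the critic is learned too, and the two are trained alternately.

```python
def critic_q(state, action):
    # Hypothetical fixed critic: the best action in state s is 2 * s.
    return -(action - 2.0 * state) ** 2


def dq_daction(state, action):
    # Analytic derivative of critic_q with respect to the action.
    return -2.0 * (action - 2.0 * state)


def train_actor(states, weight=0.0, lr=0.05, epochs=200):
    """Gradient ascent on Q(s, pi(s)) for a linear actor pi(s) = weight * s."""
    for _ in range(epochs):
        for s in states:
            action = weight * s
            # Chain rule: dQ/dweight = dQ/daction * daction/dweight, and daction/dweight = s.
            weight += lr * dq_daction(s, action) * s
    return weight


learned_weight = train_actor([0.5, 1.0, 1.5])
print(learned_weight)
```

Because the gradient flows through the action itself, no sampling over a discrete action set is needed, which is why this family of methods suits continuous actions.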

Inverse Reinforcement Learning

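As the slide shows, this is a form of imitation learning: we have no reward function and have to infer one.

The reward function is found from the expert's data.

IRL follows one principle: "assume the teacher is always right." The teacher's score must always be higher than the actor's, and the reward (scoring) mechanism is designed accordingly.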
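
As a formula, the principle can be written as the constraint below. The notation is assumed by me: $\hat{\tau}^{n}$ are the expert's trajectories, $\tau^{n}$ are trajectories produced by the current actor, and $R$ is the reward function being learned.

$$
\sum_{n=1}^{N} R\left(\hat{\tau}^{n}\right) > \sum_{n=1}^{N} R\left(\tau^{n}\right)
$$

As I understand it, training alternates between two steps: find an $R$ that satisfies this inequality, then train a new actor to maximize that $R$, and repeat.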
As the slide shows, the framework is essentially identical to a GAN.

The slide gives a side-by-side comparison of the two.