Introduction to Imitation Learning

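What Is Imitation Learning?

Imitation learning learns from expert demonstrations, that is, from data recording what an expert did. The learner is given only the expert's trajectories; the reward function is not available.

The difficulty is the missing reward. A hand-crafted reward is hard to specify precisely. Take the throttle when driving: how much throttle is right in the current state, and what reward should that choice receive?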

Imitation Learning vs. Supervised Learning

The solution may have important structural properties including constraints (for example, robot joint limits), dynamic smoothness and stability, or leading to a coherent, multi-step plan

In imitation learning, a decision may depend not only on the current state but also on many earlier signals.

The interaction between the learner’s decisions and its own input distribution (an on-policy versus off-policy distinction)

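Once the learner's actions act on the environment, the distribution of states it encounters changes. Learning from expert data alone cannot fully reproduce the process that generated the data, because you have only samples from the expert's policy, not the policy itself. Plain supervised learning therefore tends to work poorly on the new data distribution induced by the learner.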

The increased necessity of minimizing the typically high cost of gathering examples

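In supervised learning the function has an explicit target to update toward, whereas model-free RL needs a large amount of trial-and-error data because the reward signal only says whether an outcome was good or bad. In practice expert data is often available, and its supervisory signal is more informative than the one RL obtains from its own samples.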

Reference: S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Imitation Learning Algorithms

Imitation learning algorithms fall broadly into three categories: Behavioral Cloning, Inverse Reinforcement Learning, and Generative Adversarial Imitation Learning.

Behavioral Cloning

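Behavioral cloning learns directly from the expert's demonstration data: it learns which action to take in which state, without constructing a reward function. Once trained, the policy predicts the action to take in new states.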
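As a minimal sketch of the idea, behavioral cloning on a small discrete problem can be reduced to a lookup table that copies the expert's most frequent action in each state. The state and action names below are hypothetical toy data, and a real system would fit a classifier or regressor instead of a table.

```python
from collections import Counter, defaultdict

def fit_behavioral_cloning(demonstrations):
    """Build a state -> action lookup from expert (state, action) pairs."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    # For each state seen in the data, copy the expert's most frequent action.
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}

def act(policy, state, default_action=None):
    """Predict an action; states never seen in the demonstrations get the default."""
    return policy.get(state, default_action)

# Hypothetical toy demonstrations: lane position -> steering command.
demos = [
    ("left_of_lane", "steer_right"),
    ("centered", "straight"),
    ("right_of_lane", "steer_left"),
    ("centered", "straight"),
    ("left_of_lane", "steer_right"),
]
policy = fit_behavioral_cloning(demos)
```

Calling `act(policy, "centered")` returns the expert's action for a familiar state, while a state such as `"off_road"` that never appears in the data has no learned answer, which is where the trouble with this approach starts.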

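With even a little noise, however, an action that is slightly off leads to a next state that is slightly off, and the gap keeps widening from there. The errors accumulate, so the learner's behavior looks less and less like the expert's as the rollout goes on.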

This is also called distributional shift. The learner copies the expert's surface behavior without capturing what makes the expert policy work.
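The accumulation can be made concrete with a back-of-the-envelope calculation. The sketch below rests on a simplifying worst-case assumption that is not from the text: the learner makes a mistake with probability `eps` at each step while it is still on the expert's path, and once it leaves that path it never recovers.

```python
def expected_off_track_steps(eps, horizon):
    """Expected number of steps spent off the expert's path.

    Assumption: a mistake happens with probability eps per step while on
    the path, and the learner never returns to the path afterwards.
    """
    on_track = 1.0  # probability that no mistake has happened yet
    total = 0.0
    for _ in range(horizon):
        on_track *= 1.0 - eps
        total += 1.0 - on_track
    return total

short_horizon = expected_off_track_steps(0.01, 10)
long_horizon = expected_off_track_steps(0.01, 20)
```

Under this assumption, doubling the horizon more than doubles the expected time spent off track, and the total stays below `eps * horizon ** 2`. That superlinear growth is why a small per-step error can still ruin a long rollout.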

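Several algorithms address this problem, for example DAgger. It collects on-policy data produced by the learner itself (the states come from the learner's own rollouts, and the expert supplies the action labels) and mixes it with the expert data for training.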

In other words, it mitigates the distribution shift somewhat.
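A simplified sketch of a DAgger-style loop on a toy number line may help. Everything here is an illustrative assumption: the expert always steps toward state 0, the learner is a lookup table, and unfamiliar states trigger a fixed (wrong) default action. The full algorithm also mixes expert and learner actions during data collection and trains a proper classifier, both of which are omitted.

```python
def expert_action(state):
    """Hypothetical expert on a number line: step toward the goal state 0."""
    if state > 0:
        return -1
    if state < 0:
        return 1
    return 0

def rollout(policy, start, horizon, default_action=1):
    """Run a tabular policy and return the visited states.

    States missing from the table get default_action, a stand-in for
    whatever an under-trained learner does in unfamiliar states.
    """
    state, visited = start, []
    for _ in range(horizon):
        visited.append(state)
        state += policy.get(state, default_action)
    return visited

def expert_demonstration(start, horizon):
    """Record (state, action) pairs along one expert trajectory."""
    state, pairs = start, []
    for _ in range(horizon):
        action = expert_action(state)
        pairs.append((state, action))
        state += action
    return pairs

def dagger(demo_start, learner_start, horizon, iterations):
    dataset = expert_demonstration(demo_start, horizon)
    policy = dict(dataset)  # iteration 0 is plain behavioral cloning
    for _ in range(iterations):
        visited = rollout(policy, learner_start, horizon)    # on-policy states
        dataset += [(s, expert_action(s)) for s in visited]  # expert relabels them
        policy = dict(dataset)                               # retrain on the aggregate
    return policy

bc_policy = dagger(demo_start=3, learner_start=5, horizon=6, iterations=0)
dagger_policy = dagger(demo_start=3, learner_start=5, horizon=6, iterations=3)
```

Started from state 5, which the expert demonstration never visits, the cloned policy drifts away from the goal, while the policy retrained on relabeled on-policy states walks back to 0.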

Besides distribution shift, what other problems are there?

Non-Markovian behavior

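The first problem is that the expert data may not satisfy the Markov property: the expert may decide based on the history of observations, $\pi_{\theta}(a_{t}|o_{1},o_{2},\cdots,o_{t})$, whereas the agent decides based on the current observation only and learns a policy $\pi_{\theta}(a_{t}|o_{t})$. This mismatch makes learning difficult.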
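One common workaround, offered here as an assumption since the text does not prescribe a remedy, is to feed the policy a window of recent observations instead of only the current one. The helper below is a hypothetical sketch of that preprocessing step; recurrent models are another option.

```python
def stack_history(observations, k):
    """For each time step, return the last k observations as one tuple.

    Early steps are padded by repeating the first observation, so every
    policy input has the same length k.
    """
    stacked = []
    for t in range(len(observations)):
        window = observations[max(0, t - k + 1): t + 1]
        padding = [observations[0]] * (k - len(window))
        stacked.append(tuple(padding + window))
    return stacked

history_inputs = stack_history([1, 2, 3], k=2)
```

Each element of `history_inputs` can then be used in place of $o_{t}$ when fitting the policy, which approximates conditioning on $o_{t-k+1},\cdots,o_{t}$.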

Multimodal behavior

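The second problem is that the behavior may be very complex. The action distribution may be a mixture of Gaussians, or it may come from a complicated latent variable model, that is, from an involved chain of reasoning. Alternatively, the expert samples its action from a distribution, and you see only the sampled action, not the distribution behind it (autoregressive discretization is one way to model such outputs).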
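A tiny numeric example shows why multimodal behavior hurts a naive learner. The data are hypothetical: half of the demonstrations steer left around an obstacle and half steer right. Splitting by sign is only a crude stand-in for fitting a two-component mixture.

```python
from statistics import mean

# Hypothetical demonstrations: steer left (-1.0) or right (+1.0) around an obstacle.
expert_actions = [-1.0] * 5 + [1.0] * 5

# A single-output regressor trained with squared error predicts the mean action.
mse_action = mean(expert_actions)

def two_mode_summary(actions):
    """Crude stand-in for a two-component mixture: split the actions by sign."""
    left = [a for a in actions if a < 0]
    right = [a for a in actions if a >= 0]
    n = len(actions)
    return [(len(left) / n, mean(left)), (len(right) / n, mean(right))]

modes = two_mode_summary(expert_actions)
```

The squared-error answer is to steer straight ahead, an action the expert never takes, whereas the two-mode summary keeps both valid options with their weights.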

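Behavioral cloning is direct and easy to understand, but it easily ends up as superficial mimicry that misses the essence of the expert's behavior. If we could instead learn the reward function, we would be learning toward what actually drives the policy. This motivates inverse reinforcement learning.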

Inverse Reinforcement Learning

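Inverse reinforcement learning (IRL) infers a reward from expert trajectories, then uses that reward to train a reinforcement learning agent and thereby obtain a policy.

How is the reward learned? A simple choice is to parameterize it with a linear function or a neural network: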

$r_{\phi}(s,a) = \sum_{i}\phi_{i}f_{i}(s,a) = \phi^{T}f(s,a)$
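The linear form is just a weighted sum of features. In the sketch below the feature function and the weights are hypothetical choices for a driving task, picked only to show the computation.

```python
def features(state, action):
    """Hypothetical hand-designed features f(s, a) for a driving task."""
    speed, offset_from_lane_center = state
    return [speed, -abs(offset_from_lane_center), -abs(action)]

def linear_reward(phi, state, action):
    """r_phi(s, a) = sum_i phi_i * f_i(s, a)."""
    return sum(p * f for p, f in zip(phi, features(state, action)))

# Hypothetical weights: reward speed, penalize lane offset and large steering.
phi = [1.0, 2.0, 0.5]
reward = linear_reward(phi, state=(3.0, -0.5), action=1.0)
```

IRL treats `phi` as the unknown: the features are fixed, and the weights are adjusted until the expert's behavior scores well under the resulting reward.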

The goal is that, in the current state, the expert's action $a$ receives a higher reward than any other action. How can this be achieved?

Maximum causal entropy IRL:

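Suppose we can find a cost function $\tilde{c}$ under which the expert policy incurs a low cost and other policies incur a high cost, that is, the expert policy earns a larger reward and all other policies earn less. The first step is to find such a cost function.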

$\begin{aligned} \tilde{c} &=\operatorname{IRL}\left(\pi_{E}\right) \\ &=\arg \max _{c \in C}\left(\min _{\pi}-H(\pi)+\mathbb{E}_{\pi}[c(s, a)]\right)-\mathbb{E}_{\pi_{E}}[c(s, a)] \end{aligned}$

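That is, we look for a cost under which the expert's cost $\mathbb{E}_{\pi_{E}}[c(s,a)]$ is lower than the cost $\mathbb{E}_{\pi}[c(s,a)]$ of any other policy. At the same time we want the causal entropy $H(\pi)$ (the entropy of the action distribution the policy outputs given a state) to be large.

From the reward perspective, the expert's data earns a higher reward than the best policy that can be found in the modeled policy space.

Once this cost function is found, we train the agent to minimize the cost (equivalent to maximizing the reward):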

$\tilde{\pi} = \operatorname{RL}(\tilde{c}) = \arg \min_{\pi} -H(\pi) + \mathbb{E}_{\pi}[\tilde{c}(s,a)]$

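This procedure has a drawback: every candidate cost function encountered while searching for $\tilde{c}$ requires solving the inner minimization over policies, which is itself a full RL problem. The max-min structure amounts to two nested for loops.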
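The nested structure can be sketched on the smallest possible case, a task with a single state and a few actions. This setting is an assumption chosen because the inner problem then has a closed form: the policy minimizing $-H(\pi) + \mathbb{E}_{\pi}[c]$ is a softmax of $-c$, and the gradient of the outer objective with respect to $c(a)$ is $\pi(a) - \pi_{E}(a)$. The step size and iteration count are arbitrary.

```python
import math

def soft_optimal_policy(cost):
    """Inner problem for a one-state task: argmin_pi -H(pi) + E_pi[c] is softmax(-c)."""
    weights = [math.exp(-c) for c in cost]
    total = sum(weights)
    return [w / total for w in weights]

def irl_one_state(expert_policy, steps=500, lr=0.5):
    """Gradient ascent on the cost for a single-state problem."""
    cost = [0.0] * len(expert_policy)
    for _ in range(steps):                  # outer loop: update the cost
        policy = soft_optimal_policy(cost)  # inner step: solve RL for this cost
        # Raise the cost of actions the learner over-uses relative to the expert.
        cost = [c + lr * (p - pe) for c, p, pe in zip(cost, policy, expert_policy)]
    return cost, soft_optimal_policy(cost)

cost, learned_policy = irl_one_state([0.7, 0.2, 0.1])
```

Here the inner step is one line. In a real MDP it is a complete reinforcement learning run, repeated for every cost update, which is the expense the paragraph above points to.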

Reference: Ziebart B D, et al. Maximum Entropy Inverse Reinforcement Learning. AAAI, 2008.

Generative Adversarial Imitation Learning

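Generative adversarial imitation learning (GAIL) compares the expert data with data generated by the agent being trained, pushing the agent to produce data similar to the expert's and thereby to learn the expert's behavior.

The core idea is to feed both the agent's data and the expert's data into a classifier. If the classifier cannot tell them apart, the agent can be considered to have learned the expert policy.

GAIL brings an agent into the GAN framework, so we first look at the GAN loss function: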

$\min _{G} \max _{D} \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)]+\mathbb{E}_{z \sim p_{z}(z)}[\log (1-D(G(z)))]$

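In a GAN we want to generate high-quality data, but we have no definition of high quality and so cannot score samples directly. We therefore train a network, the discriminator, to judge which data is high quality. Training the discriminator in turn requires both good and bad data: the good data are the given samples, and the bad data are the samples produced by the generator. With this scoring function in hand, we train the generator. That completes one cycle.

It consists of two main steps: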

Train a good discriminator
Train a good generator to fool the discriminator
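To see what the two steps push against, the GAN objective can be estimated from discriminator outputs alone. The sketch below assumes the discriminator has already been evaluated on a batch of real samples and a batch of generated samples; the numbers passed in are made up.

```python
import math

def gan_value(d_real, d_fake):
    """Sample estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs in (0, 1) on real samples.
    d_fake: discriminator outputs in (0, 1) on generated samples.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

confident = gan_value(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])  # discriminator is winning
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])     # discriminator cannot tell
```

The discriminator step raises this value, and the generator step lowers it. When the discriminator outputs 0.5 everywhere, the value equals $-\log 4$.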

The GAIL loss:

GAIL adds an entropy term, but it likewise has two steps:

Train a good discriminator, which plays the role of the cost function
Train a good policy (the generator) to fool the discriminator

$\min _{\pi} \max _{D} \mathbb{E}_{(s, a) \sim \pi_{E}}[\log D(s, a)]+\mathbb{E}_{(s, a) \sim \pi}[\log (1-D(s, a))]-\lambda H(\pi)$
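Under the sign convention of the formula above, where $D$ close to 1 means "looks like expert data", the objective and the per-sample signal seen by the policy can be sketched as follows. The function names, the value of `lam`, and the use of $-\log(1-D)$ as a surrogate reward are illustrative assumptions; implementations differ in the convention and reward shaping they use.

```python
import math

def gail_objective(d_expert, d_policy, policy_entropy, lam):
    """Sample estimate of E_expert[log D] + E_policy[log(1 - D)] - lam * H(pi)."""
    expert_term = sum(math.log(d) for d in d_expert) / len(d_expert)
    policy_term = sum(math.log(1.0 - d) for d in d_policy) / len(d_policy)
    return expert_term + policy_term - lam * policy_entropy

def surrogate_reward(d_value):
    """Policy-side signal: minimizing log(1 - D) means maximizing -log(1 - D)."""
    return -math.log(1.0 - d_value)

baseline = gail_objective([0.5], [0.5], policy_entropy=0.0, lam=0.1)
with_entropy = gail_objective([0.5], [0.5], policy_entropy=1.0, lam=0.1)
```

A state-action pair that the discriminator mistakes for expert data earns a larger surrogate reward, and because the policy minimizes the objective, a higher entropy lowers it further, which keeps the policy from collapsing too early.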

Reference: Ho J, Ermon S. Generative adversarial imitation learning. NIPS, 2016.

Open Questions

How to generalize skills with complex conditions?
How to find solutions with guarantees?
How to scale up with respect to the number of dimensions of the state and action spaces? How to make it tractable?
How to perform imitation by multiple agents?
How to perform incremental/active learning in IRL?

Performance evaluation

How to establish benchmark problems for imitation learning?
What metric should be used to evaluate imitation learning methods?