Fast deep reinforcement learning using online adjustments from the past


This paper from DeepMind proposes Ephemeral Value Adjustments (EVA), an improved RL algorithm that makes fuller use of the past experience stored in the replay buffer.

Contribution:

Proposes a new algorithm, Ephemeral Value Adjustments (EVA);
Experimentally compares three candidate schemes for EVA's core module, the trace computation algorithm.

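Understanding the algorithm

The agent consists of three parts:

a standard parametric reinforcement learner, which runs the usual RL algorithm;
a trace computation algorithm, which plans over stored trajectories and estimates non-parametric value functions;
a small value buffer, which stores the estimated value functions.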

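The method applies to any value-based off-policy algorithm; as an example, the authors use DQN.

One change to DQN concerns the replay buffer: besides storing $(s_t, a_t, r_t, s_{t+1})$, it also stores trajectory information (a pointer to the successor tuple) and the activations of one hidden layer of the network (I honestly did not figure this part out).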
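To make the extended replay buffer more concrete, here is a minimal sketch of a buffer whose entries carry a successor pointer and a hidden-layer embedding. The names `Transition`, `TraceReplayBuffer`, `next_index`, and `embedding` are hypothetical and chosen for illustration; the paper's actual data layout may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Transition:
    state: tuple
    action: int
    reward: float
    next_state: tuple
    embedding: List[float]            # hidden-layer activations for `state` (assumed layout)
    next_index: Optional[int] = None  # pointer to the successor tuple; None at the end of a trace


class TraceReplayBuffer:
    """Append-only buffer that links consecutive transitions of one episode."""

    def __init__(self) -> None:
        self.data: List[Transition] = []
        self._last: Optional[int] = None  # index of the previous transition in the current episode

    def add(self, tr: Transition, episode_end: bool = False) -> int:
        idx = len(self.data)
        self.data.append(tr)
        if self._last is not None:
            # Link the previous transition to the one just stored.
            self.data[self._last].next_index = idx
        self._last = None if episode_end else idx
        return idx

    def trajectory_from(self, idx: int, max_len: int) -> List[Transition]:
        """Follow successor pointers from `idx` for at most `max_len` steps."""
        out: List[Transition] = []
        cur: Optional[int] = idx
        while cur is not None and len(out) < max_len:
            out.append(self.data[cur])
            cur = self.data[cur].next_index
        return out
```

Following the pointers from any stored tuple recovers the trajectory that came after it, which is what a trace computation algorithm needs as input.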

The next key modification is the following formula:

$$Q(s,a) = \lambda Q_\theta(s,a) + (1-\lambda) Q_{NP}(s,a)$$

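Here $Q_\theta(s,a)$ is the parametric Q-network from DQN, $Q_{NP}(s,a)$ is the non-parametric Q-value produced by the trace computation algorithm (the subscript $_{NP}$ stands for non-parametric), and $\lambda$ is a hyperparameter.

The central question is therefore how to compute $Q_{NP}(s,a)$. The authors compare three trace computation algorithms: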
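As a small illustration of how the two estimates might be combined at action-selection time, the sketch below assumes the mixture is the convex combination $\lambda Q_\theta(s,a) + (1-\lambda) Q_{NP}(s,a)$. The function names `mix_q_values` and `greedy_action` are hypothetical.

```python
from typing import List, Sequence


def mix_q_values(q_param: Sequence[float], q_np: Sequence[float], lam: float) -> List[float]:
    """Per-action mixture lam * Q_theta + (1 - lam) * Q_NP (assumed convex combination)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(q_param) != len(q_np):
        raise ValueError("both estimates must cover the same actions")
    return [lam * qp + (1.0 - lam) * qn for qp, qn in zip(q_param, q_np)]


def greedy_action(q_values: Sequence[float]) -> int:
    """Index of the largest Q-value (first one wins on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


if __name__ == "__main__":
    q = mix_q_values([1.0, 2.0], [3.0, 0.0], lam=0.5)
    print(q, greedy_action(q))  # [2.0, 1.0] 0
```

With `lam = 1.0` the agent falls back to plain DQN action values, and smaller values give the non-parametric estimate more weight.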

n-step
trajectory-centric planning(TCP)
kernel-based RL(KBRL)

TCP performed best of the three, so only this scheme is described below.
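TCP alternates between the following two formulas:

$$V_{NP}(s_t) = \begin{cases} \max_a Q_\theta(s_t,a) & \text{if } t = T \\ \max_a Q_{NP}(s_t,a) & \text{otherwise} \end{cases}$$

$$Q_{NP}(s_t,a) = \begin{cases} r_t + \gamma V_{NP}(s_{t+1}) & \text{if } a_t = a \\ Q_\theta(s_t,a) & \text{otherwise} \end{cases}$$

Here $a_t$ is the action taken in state $s_t$; both are stored in the replay buffer. $T$ is the last step of the stored trace and $\gamma$ is the discount factor. In the $\max_a$ terms, $a$ is the action that maximizes $Q_\theta(s_t,a)$ or $Q_{NP}(s_t,a)$.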
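The alternation can be sketched as a single backward pass over one stored trajectory. This is a minimal sketch under stated assumptions, not the authors' implementation: the value at the end of the trace is bootstrapped from $\max_a Q_\theta$, the action actually taken gets $r_t + \gamma V_{NP}(s_{t+1})$, and every other action keeps its parametric value. The function name `tcp_trace` and the argument layout are hypothetical.

```python
from typing import List, Sequence, Tuple


def tcp_trace(
    steps: Sequence[Tuple[int, float]],
    q_theta: Sequence[Sequence[float]],
    gamma: float,
) -> List[List[float]]:
    """Backward pass along one trajectory.

    steps[t]   = (a_t, r_t) for t = 0 .. T-1, as stored in the replay buffer.
    q_theta[t] = parametric Q-values for state s_t, for t = 0 .. T
                 (the last entry belongs to the bootstrap state s_T).
    Returns Q_NP(s_t, .) for t = 0 .. T-1.
    """
    if len(q_theta) != len(steps) + 1:
        raise ValueError("need one more Q-vector than steps (bootstrap state)")

    # At the end of the trace, fall back to the parametric estimate.
    v_next = max(q_theta[-1])
    q_np: List[List[float]] = [[] for _ in steps]

    for t in range(len(steps) - 1, -1, -1):
        a_t, r_t = steps[t]
        q_t = list(q_theta[t])            # untaken actions keep Q_theta
        q_t[a_t] = r_t + gamma * v_next   # taken action uses the stored reward
        q_np[t] = q_t
        v_next = max(q_t)                 # V_NP(s_t) = max_a Q_NP(s_t, a)
    return q_np


if __name__ == "__main__":
    steps = [(0, 1.0), (1, 2.0)]
    q_theta = [[0.0, 0.0], [0.0, 5.0], [4.0, 1.0]]
    print(tcp_trace(steps, q_theta, gamma=0.5))  # [[3.0, 0.0], [0.0, 4.0]]
```

In the example, the stored reward at $s_1$ replaces the parametric value 5.0 for the taken action with $2.0 + 0.5 \cdot 4.0 = 4.0$, and that value is then propagated one step further back to $s_0$.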

In Figure 1 (right) of the paper (the figure is not reproduced here), $Q_P(s_t,a)$ is the same quantity as $Q_\theta(s_t,a)$, the $k$ estimates $Q_i(s_t,a)$ are obtained with TCP, and $Q_{NP}(s_t,a)$ is the average of those $k$ estimates.
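The averaging step can be sketched as follows. Selecting the $k$ estimates by squared Euclidean distance between stored embeddings is an assumption made for this illustration, and the helper names `k_nearest` and `average_q` are hypothetical.

```python
from typing import List, Sequence


def k_nearest(query: Sequence[float], keys: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k keys closest to `query` (squared Euclidean distance, assumed metric)."""
    def dist(key: Sequence[float]) -> float:
        return sum((q - x) ** 2 for q, x in zip(query, key))

    order = sorted(range(len(keys)), key=lambda i: dist(keys[i]))
    return order[:k]


def average_q(estimates: Sequence[Sequence[float]]) -> List[float]:
    """Per-action mean of the k estimates Q_i(s_t, .), giving Q_NP(s_t, .)."""
    if not estimates:
        raise ValueError("need at least one estimate")
    n_actions = len(estimates[0])
    return [sum(q[a] for q in estimates) / len(estimates) for a in range(n_actions)]


if __name__ == "__main__":
    keys = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]             # stored embeddings
    values = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]       # Q_i(s, .) per stored entry
    idx = k_nearest([0.1, 0.0], keys, k=2)
    print(idx, average_q([values[i] for i in idx]))  # [0, 1] [2.0, 3.0]
```

The distant third entry is ignored because only the two closest embeddings contribute to the average.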