Why does the policy gradient method have high variance?

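Policy gradient methods

In policy gradient methods, the objective is to maximize the expected total (discounted) reward collected over an entire episode: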

$$\max_{\theta} \; \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$$

Since:

$$\nabla_\theta \mathbb{E}[f(x)] = \nabla_\theta \int p_\theta(x) f(x)\, dx = \int \frac{p_\theta(x)}{p_\theta(x)} \nabla_\theta p_\theta(x) f(x)\, dx = \int p_\theta(x) \nabla_\theta \log p_\theta(x) f(x)\, dx = \mathbb{E}[f(x) \nabla_\theta \log p_\theta(x)]$$

and:

$$\nabla_\theta \log p_\theta(\tau) = \nabla_\theta \log\left(\mu(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t)\right) = \nabla_\theta \left[\log \mu(s_0) + \sum_{t=0}^{T-1} \left(\log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t)\right)\right] = \nabla_\theta \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t)$$

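Both identities hold. The last step of the second one works because the initial-state distribution μ and the transition probabilities P do not depend on θ. Now define the return of a trajectory (written here without the discount factor, i.e. taking γ = 1):

$$R(\tau) = \sum_{t=0}^{T-1} r_t$$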

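Then the gradient of the objective with respect to the parameters can be written as:

$$\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau) \cdot \nabla_\theta \left(\sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t)\right)\right]$$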
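
The log-derivative identity used above can be checked numerically. The sketch below is only an illustration under assumed choices that are not from the original post: a Gaussian p_θ(x) = N(θ, 1) and f(x) = x², for which the exact gradient is 2θ. The function name `score_function_gradient` is hypothetical.

```python
import numpy as np

def score_function_gradient(theta, n_samples, rng):
    """Monte Carlo estimate of d/dtheta E_{x ~ N(theta, 1)}[f(x)] with f(x) = x^2,
    computed as the sample mean of f(x) * d/dtheta log p_theta(x)."""
    x = rng.normal(loc=theta, scale=1.0, size=n_samples)
    f = x ** 2
    score = x - theta  # d/dtheta log N(x; theta, 1) = x - theta
    return np.mean(f * score)

rng = np.random.default_rng(0)
theta = 1.5
estimate = score_function_gradient(theta, 200_000, rng)
exact = 2.0 * theta  # E[x^2] = theta^2 + 1, so the derivative is 2 * theta
print(f"estimate={estimate:.3f} exact={exact:.3f}")
```

Even in this one-dimensional case a large number of samples is needed before the estimate settles near the exact value, which already hints at the variance problem discussed next.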

The naive way is to run the agent on a batch of episodes, collect a set of trajectories (call it $\hat{\tau}$), and update with:

$$\theta \leftarrow \theta + \alpha \nabla_\theta \mathbb{E}_{\tau \in \hat{\tau}}[R(\tau)]$$

using the empirical expectation, but this is slow and unreliable because the gradient estimates have high variance. After one batch we may observe a wide range of results: much better performance, equal performance, or worse performance. The high variance of these gradient estimates makes the learning process very slow and unstable.
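
To make this instability concrete, here is a minimal sketch on a hypothetical two-armed bandit, i.e. one-step episodes with T = 1. The arm means (10 and 11), the unit reward noise, the batch size of 16, and the helper name `batch_gradient` are all assumptions chosen for illustration. The sketch repeats the batch gradient estimate many times and looks at how much it scatters.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def batch_gradient(theta, batch_size, rng):
    """One Monte Carlo estimate of grad_theta E[R] from a batch of one-step episodes."""
    probs = softmax(theta)
    actions = rng.choice(len(theta), size=batch_size, p=probs)
    # Assumed rewards: arm means 10 and 11 plus unit Gaussian noise.
    returns = np.array([10.0, 11.0])[actions] + rng.normal(size=batch_size)
    # For a softmax policy, grad_theta log pi(a) = onehot(a) - probs.
    grad_log_pi = np.eye(len(theta))[actions] - probs
    return (returns[:, None] * grad_log_pi).mean(axis=0)

rng = np.random.default_rng(0)
theta = np.zeros(2)  # uniform policy
estimates = np.array([batch_gradient(theta, 16, rng) for _ in range(2000)])

mean_grad = estimates.mean(axis=0)   # close to the true gradient (-0.25, 0.25)
std_grad = estimates.std(axis=0)     # spread of single-batch estimates
frac_wrong = np.mean(estimates[:, 1] < 0)  # batches that push away from the better arm
print("mean:", mean_grad, "std:", std_grad, "wrong-sign fraction:", frac_wrong)
```

In this setup the spread of a single-batch estimate is several times larger than the true gradient, so a sizeable fraction of updates point in the wrong direction even though the estimator is correct on average.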

Why does high variance slow down learning?

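By the central limit theorem, for $n$ i.i.d. samples $X_1, \dots, X_n$ with true mean $\mu$:

$$\frac{\bar{X} - \mu}{S / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1)$$

where:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$

It follows that, as $n$ grows large, the true mean $\mu$ lies with probability approximately $1-\alpha$ in the interval:

$$\left[\bar{X} - z_{\alpha/2}\frac{S}{\sqrt{n}},\; \bar{X} + z_{\alpha/2}\frac{S}{\sqrt{n}}\right]$$

Therefore, the smaller the absolute value of the following expression, the more confidently the sample mean estimates the true mean:

$$z_{\alpha/2}\frac{S}{\sqrt{n}}$$

Denominator: the larger the number of samples, the higher the confidence.
Numerator: the smaller the sample variance, the higher the confidence.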
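Monte Carlo based algorithms estimate a mean by sampling many times. In practice the samples collected often differ widely, so the sample variance is large (large numerator). Unless the number of samples is extremely large (large denominator), the high variance of the data leaves a large error between the estimated mean and the true mean, which lowers learning efficiency.
To improve on this, we can deliberately collect samples that have the same mean but a smaller variance, so that the sample-mean estimate becomes more trustworthy.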
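
The numerator/denominator argument can be checked with a few lines of NumPy. This is a rough sketch: the helper name `ci_half_width`, the Gaussian test data, and the 95% level (z ≈ 1.96) are assumptions for illustration. It computes the half-width z·S/√n of the usual large-sample confidence interval for the mean, for two noise levels and two sample sizes.

```python
import numpy as np

def ci_half_width(samples, z=1.96):
    """Half-width z * S / sqrt(n) of an approximate 95% confidence interval for the mean."""
    n = len(samples)
    s = np.std(samples, ddof=1)  # sample standard deviation (n - 1 in the denominator)
    return z * s / np.sqrt(n)

rng = np.random.default_rng(0)
widths = {}
for sigma in (1.0, 10.0):
    for n in (100, 10_000):
        x = rng.normal(loc=0.0, scale=sigma, size=n)
        widths[(sigma, n)] = ci_half_width(x)
        print(f"sigma={sigma:5.1f}  n={n:6d}  half-width={widths[(sigma, n)]:.3f}")
```

Making the data ten times noisier widens the interval about tenfold, and it takes a hundred times more samples to win that factor back. That is why reducing variance is usually cheaper than simply sampling more.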

Why does subtracting a baseline reduce the variance of the gradient estimate?

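The baseline can be loosely understood as an average level. Roughly speaking, before the baseline is subtracted, a sample contributes a term of this form to the variance:
$(x_1 + x_1' - \bar{x}_1)^2$
After subtracting the baseline $x_1'$:
$(x_1 - \bar{x}_1)^2$
So it gets smaller. Here $\bar{x}_1$ is the mean, $x_1'$ is the baseline, and $x_1$ is what remains of the first sample once the baseline has been removed.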
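At the same time, reducing the variance this way keeps the estimator unbiased, i.e. the expected value of the sample-mean estimate still equals the true mean. We can compute the expectation of the term that involves the baseline $b$ (assumed here to be a constant that does not depend on the sampled trajectory):

$$\mathbb{E}_{\tau \sim \pi_\theta}\left[b\, \nabla_\theta \log p_\theta(\tau)\right] = b \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)\, d\tau = b \int \nabla_\theta p_\theta(\tau)\, d\tau = b\, \nabla_\theta \int p_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0$$

The expectation of the baseline term is 0, so it does not change the expected value of the estimate.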
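
Both claims (same mean, smaller variance) can be checked empirically. The sketch below assumes a hypothetical two-armed bandit with a uniform softmax policy, arm means 10 and 11, unit reward noise, and a fixed constant baseline of 10.5 (roughly the average return). None of these numbers come from the original post, and `gradient_estimates` is an illustrative helper name.

```python
import numpy as np

def gradient_estimates(baseline, n_batches, batch_size, rng):
    """Batch estimates of dE[R]/dtheta_1 (logit of arm 1) using (R - baseline) * grad log pi."""
    probs = np.array([0.5, 0.5])        # softmax policy at theta = 0
    arm_means = np.array([10.0, 11.0])  # assumed reward means
    actions = rng.choice(2, size=(n_batches, batch_size), p=probs)
    returns = arm_means[actions] + rng.normal(size=actions.shape)
    # Gradient of log pi(a) w.r.t. the logit of arm 1: 1[a == 1] - pi(1).
    grad_log_pi = (actions == 1) - probs[1]
    return ((returns - baseline) * grad_log_pi).mean(axis=1)

rng = np.random.default_rng(0)
plain = gradient_estimates(0.0, 5000, 16, rng)
with_baseline = gradient_estimates(10.5, 5000, 16, rng)

print(f"no baseline:   mean={plain.mean():.3f}  std={plain.std():.3f}")
print(f"with baseline: mean={with_baseline.mean():.3f}  std={with_baseline.std():.3f}")
```

Both versions average to roughly the same gradient, while the spread of the baseline version is far smaller. A constant baseline is used here so that the zero-expectation argument applies exactly; a baseline computed from the same batch would need separate care.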