Basics of the Soft Value Function and the Policy Improvement Proof in Soft Q-Learning

  First, recall the definition of the soft value function:

V_{\mathrm{soft}}^{\pi}(\mathbf{s}) \triangleq \log \int \exp \left(Q_{\mathrm{soft}}^{\pi}(\mathbf{s}, \mathbf{a})\right) d \mathbf{a}
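For a discrete action set the integral above becomes a sum, and the soft value is exactly a log-sum-exp of the Q-values. A minimal sketch (the function name `soft_value` and the example Q-values are illustrative, not from the paper):

```python
import numpy as np

def soft_value(q_values):
    """Soft value V_soft(s) = log sum_a exp(Q_soft(s, a)) for a discrete
    action set. Shifts by the max for numerical stability (the standard
    log-sum-exp trick); the shift cancels analytically."""
    q = np.asarray(q_values, dtype=float)
    m = q.max()
    return m + np.log(np.exp(q - m).sum())

# The soft value is a smooth upper bound on the hard maximum and
# approaches max_a Q(s, a) as the gaps between Q-values grow.
q = np.array([1.0, 2.0, 3.0])
assert soft_value(q) > q.max()
```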

  Assuming \pi(\mathbf{a} \mid \mathbf{s})=\exp \left(Q_{\mathrm{soft}}^{\pi}(\mathbf{s}, \mathbf{a})-V_{\mathrm{soft}}^{\pi}(\mathbf{s})\right), we have:

\begin{aligned} Q_{\mathrm{soft}}^{\pi}(\mathbf{s}, \mathbf{a}) &=r(\mathbf{s}, \mathbf{a})+\gamma \mathbb{E}_{\mathbf{s}^{\prime} \sim p_{\mathbf{s}}}\left[\mathcal{H}\left(\pi\left(\cdot | \mathbf{s}^{\prime}\right)\right)+\mathbb{E}_{\mathbf{a}^{\prime} \sim \pi\left(\cdot | \mathbf{s}^{\prime}\right)}\left[Q_{\mathrm{soft}}^{\pi}\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right)\right]\right] \\ &=r(\mathbf{s}, \mathbf{a})+\gamma \mathbb{E}_{\mathbf{s}^{\prime} \sim p_{\mathbf{s}}}\left[V_{\mathrm{soft}}^{\pi}\left(\mathbf{s}^{\prime}\right)\right] \end{aligned}
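The two lines of this backup agree because, for the assumed policy \pi(\mathbf{a}|\mathbf{s})=\exp(Q_{\mathrm{soft}}-V_{\mathrm{soft}}), the identity \mathcal{H}(\pi)+\mathbb{E}_{\pi}[Q_{\mathrm{soft}}]=V_{\mathrm{soft}} holds: substituting \log\pi = Q_{\mathrm{soft}}-V_{\mathrm{soft}} into the entropy makes the Q-terms cancel. A quick numerical check on a hypothetical discrete Q-vector (the values are illustrative):

```python
import numpy as np

q = np.array([0.5, -1.0, 2.0])       # hypothetical Q_soft(s, .) over 3 actions
v = np.log(np.exp(q).sum())          # V_soft(s) = log sum_a exp Q_soft(s, a)
pi = np.exp(q - v)                   # the assumed policy pi(a|s)
entropy = -(pi * np.log(pi)).sum()   # H(pi(.|s))

# pi is a proper distribution, and H(pi) + E_{a~pi}[Q] recovers V_soft,
# so the entropy-augmented backup collapses to the V_soft backup.
assert np.isclose(pi.sum(), 1.0)
assert np.isclose(entropy + (pi * q).sum(), v)
```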

  Finally, we define the soft value iteration operator \mathcal{T}:

\mathcal{T} Q(\mathbf{s}, \mathbf{a}) \triangleq r(\mathbf{s}, \mathbf{a})+\gamma \mathbb{E}_{\mathbf{s}^{\prime} \sim p_{\mathbf{s}}}\left[\log \int \exp Q\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right) d \mathbf{a}^{\prime}\right]

  Since \mathcal{T} is a contraction mapping (with modulus \gamma in the sup norm, because log-sum-exp is 1-Lipschitz), repeated application converges to a unique fixed point by the Banach fixed-point theorem, which completes the proof.
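The contraction property can be observed numerically on a small random tabular MDP. A sketch under assumed names (`R`, `P`, `T`, and the MDP sizes are all hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
R = rng.normal(size=(nS, nA))                  # random rewards r(s, a)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s']: transition probs

def logsumexp(x, axis):
    """Stable log-sum-exp along an axis (max-shift trick)."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def T(Q):
    """Soft value iteration operator:
    (TQ)(s, a) = r(s, a) + gamma * E_{s'}[ logsumexp_{a'} Q(s', a') ]."""
    V = logsumexp(Q, axis=1)   # V_soft(s') for every next state
    return R + gamma * (P @ V) # P @ V takes the expectation over s'

# gamma-contraction in the sup norm: applying T shrinks the distance
# between any two Q-functions by at least a factor of gamma.
Q1 = rng.normal(size=(nS, nA))
Q2 = rng.normal(size=(nS, nA))
assert np.abs(T(Q1) - T(Q2)).max() <= gamma * np.abs(Q1 - Q2).max()
```

Iterating `T` from any starting point therefore converges to the unique fixed point, which is the soft Q-function the proof asserts.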

Proof of the Soft Bellman Equation and Soft Value Iteration

  For the full proof, see the paper: Reinforcement Learning with Deep Energy-Based Policies (Haarnoja et al., 2017).
