Basics of the Soft Value Function and the Policy Improvement Proof in Soft Q-Learning

  First, recall the definition of the soft value function:

V_{\mathrm{soft}}^{\pi}(\mathbf{s}) \triangleq \log \int \exp \left(Q_{\mathrm{soft}}^{\pi}(\mathbf{s}, \mathbf{a})\right) d \mathbf{a}
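For a discrete action set the integral above becomes a sum, and the soft value is exactly a log-sum-exp of the Q-values. A minimal sketch (the function name `soft_value` and the example Q-values are illustrative, not from the paper):

```python
import numpy as np

def soft_value(q_values):
    """Soft value V_soft(s) = log sum_a exp(Q_soft(s, a)) for a discrete
    action set. Shifts by the max for numerical stability (the standard
    log-sum-exp trick); the shift cancels analytically."""
    q = np.asarray(q_values, dtype=float)
    m = q.max()
    return m + np.log(np.exp(q - m).sum())

# The soft value is a smooth upper bound on the hard maximum and
# approaches max_a Q(s, a) as the gaps between Q-values grow.
q = np.array([1.0, 2.0, 3.0])
assert soft_value(q) > q.max()
```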

  Assuming \pi(\mathbf{a} \mid \mathbf{s})=\exp \left(Q_{\mathrm{soft}}^{\pi}(\mathbf{s}, \mathbf{a})-V_{\mathrm{soft}}^{\pi}(\mathbf{s})\right), we have:

\begin{aligned} Q_{\mathrm{soft}}^{\pi}(\mathbf{s}, \mathbf{a}) &=r(\mathbf{s}, \mathbf{a})+\gamma \mathbb{E}_{\mathbf{s}^{\prime} \sim p_{\mathbf{s}}}\left[\mathcal{H}\left(\pi\left(\cdot | \mathbf{s}^{\prime}\right)\right)+\mathbb{E}_{\mathbf{a}^{\prime} \sim \pi\left(\cdot | \mathbf{s}^{\prime}\right)}\left[Q_{\mathrm{soft}}^{\pi}\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right)\right]\right] \\ &=r(\mathbf{s}, \mathbf{a})+\gamma \mathbb{E}_{\mathbf{s}^{\prime} \sim p_{\mathbf{s}}}\left[V_{\mathrm{soft}}^{\pi}\left(\mathbf{s}^{\prime}\right)\right] \end{aligned}
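The two lines of this backup agree because, for the assumed policy \pi(\mathbf{a}|\mathbf{s})=\exp(Q_{\mathrm{soft}}-V_{\mathrm{soft}}), the identity \mathcal{H}(\pi)+\mathbb{E}_{\pi}[Q_{\mathrm{soft}}]=V_{\mathrm{soft}} holds: substituting \log\pi = Q_{\mathrm{soft}}-V_{\mathrm{soft}} into the entropy makes the Q-terms cancel. A quick numerical check on a hypothetical discrete Q-vector (the values are illustrative):

```python
import numpy as np

q = np.array([0.5, -1.0, 2.0])       # hypothetical Q_soft(s, .) over 3 actions
v = np.log(np.exp(q).sum())          # V_soft(s) = log sum_a exp Q_soft(s, a)
pi = np.exp(q - v)                   # the assumed policy pi(a|s)
entropy = -(pi * np.log(pi)).sum()   # H(pi(.|s))

# pi is a proper distribution, and H(pi) + E_{a~pi}[Q] recovers V_soft,
# so the entropy-augmented backup collapses to the V_soft backup.
assert np.isclose(pi.sum(), 1.0)
assert np.isclose(entropy + (pi * q).sum(), v)
```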

  Finally, we define the soft value iteration operator \mathcal{T}:

\mathcal{T} Q(\mathbf{s}, \mathbf{a}) \triangleq r(\mathbf{s}, \mathbf{a})+\gamma \mathbb{E}_{\mathbf{s}^{\prime} \sim p_{\mathbf{s}}}\left[\log \int \exp Q\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right) d \mathbf{a}^{\prime}\right]

  Since \mathcal{T} is a contraction mapping (with modulus \gamma in the sup norm, because log-sum-exp is 1-Lipschitz), repeated application converges to a unique fixed point by the Banach fixed-point theorem, which completes the proof.
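The contraction property can be observed numerically on a small random tabular MDP. A sketch under assumed names (`R`, `P`, `T`, and the MDP sizes are all hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
R = rng.normal(size=(nS, nA))                  # random rewards r(s, a)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s']: transition probs

def logsumexp(x, axis):
    """Stable log-sum-exp along an axis (max-shift trick)."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def T(Q):
    """Soft value iteration operator:
    (TQ)(s, a) = r(s, a) + gamma * E_{s'}[ logsumexp_{a'} Q(s', a') ]."""
    V = logsumexp(Q, axis=1)   # V_soft(s') for every next state
    return R + gamma * (P @ V) # P @ V takes the expectation over s'

# gamma-contraction in the sup norm: applying T shrinks the distance
# between any two Q-functions by at least a factor of gamma.
Q1 = rng.normal(size=(nS, nA))
Q2 = rng.normal(size=(nS, nA))
assert np.abs(T(Q1) - T(Q2)).max() <= gamma * np.abs(Q1 - Q2).max()
```

Iterating `T` from any starting point therefore converges to the unique fixed point, which is the soft Q-function the proof asserts.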

Proof of the Soft Bellman Equation and Soft Value Iteration

  For the full proof, see the paper: Reinforcement Learning with Deep Energy-Based Policies (Haarnoja et al., 2017).
