Abstract
Deep Reinforcement Learning (RL) has recently emerged as a solution for moving obstacle avoidance. Deep RL learns to simultaneously predict obstacle motions and corresponding avoidance actions directly from robot sensors, even for obstacles with different dynamics models.
However, deep RL methods typically cannot guarantee policy convergence, i.e., they cannot provide probabilistic collision avoidance guarantees. In contrast, stochastic reachability (SR), a computationally expensive formal method that employs a known obstacle dynamics model, identifies the optimal avoidance policy and provides strict convergence guarantees. The availability of the optimal solution for versions of the moving-obstacle problem provides a baseline against which trained deep RL policies can be compared. In this paper, we compare the expected cumulative reward and actions of these policies to SR, and find the following. 1) The state-value function approximates the optimal collision probability well, which explains the high empirical performance. 2) RL policies deviate significantly from the optimal policy, which in some cases negatively impacts collision avoidance. 3) Evidence suggests that the deviation is caused, at least partially, by the actor net failing to approximate the action corresponding to the highest state-action value.
Intuition (the critic gives a correct evaluation, but the actor fails to execute it): for the obstacle-avoidance problem, this paper compares DRL against SR to analyze how the DRL algorithm actually operates. The finding is that in actor-critic DRL algorithms, the critic net characterizes the collision probability fairly accurately, and the performance gap instead comes mainly from the actor net failing to approximate the action corresponding to the highest state-action value.
Main contributions
(1) We present a comprehensive comparison between a deep RL algorithm and a formal method (SR) for dynamic obstacle avoidance.
(2) We also identify the potential points of failure of RL policies, providing insights on where additional safety policies might be required.
Results
- End-to-end deep RL obstacle avoidance policies achieve up to 15% higher success rates than APF-SR, a state-of-the-art multi-obstacle collision avoidance method.
- We observe evolving changes in the behavior of RL policies during training, consistent across environments with both deterministic and stochastic obstacle motions.
- The state-value function stored in the critic net approximates the optimal collision probability reasonably well. This explains why RL policies perform well empirically compared to traditional methods (the critic's evaluation is close to ideal).
- However, the RL policy stored in the actor net deviates significantly from the optimal policy and thus negatively impacts the true collision probability of the policy (the actor does not fully execute what the critic evaluates).
- Lastly, strong evidence suggests that the deviation from the optimal policy is caused by the actor net failing to approximate the action corresponding to the highest state-action value (supplementary comparison experiments were conducted to verify this).
Preliminaries
- Robot and obstacle dynamics
- SR analysis
Collision probability computation and controller design based on traditional mathematical modeling of the dynamics.
- Deep RL
Gives the basic formulation of deep RL: the goal is to find an optimal policy that maps observations to actions so as to maximize the expected discounted cumulative reward. The mainstream DRL framework A3C uses two neural networks, an actor and a critic.
The actor network learns the policy via the policy gradient, i.e., its parameters are updated in the direction computed from the critic's value function (to maximize the expected discounted return). Meanwhile, the critic network updates its parameters according to the Bellman equation, similarly to Q-learning. Compared with plain actor-critic (AC), A3C speeds up learning by collecting experience asynchronously.
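To make the two update rules above concrete, here is a minimal sketch of a single synchronous actor-critic step (the core that A3C parallelizes across asynchronous workers). All names, dimensions, and the linear/softmax function classes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, n_actions = 4, 3
theta = np.zeros((n_actions, state_dim))  # actor: softmax-policy weights (assumed linear)
w = np.zeros(state_dim)                   # critic: state-value weights (assumed linear)
alpha_actor, alpha_critic, gamma = 0.01, 0.1, 0.99

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def select_action(s):
    """Sample an action from the actor's softmax policy."""
    return rng.choice(n_actions, p=softmax(theta @ s))

def update(s, a, r, s_next, done):
    """One TD(0) actor-critic update for transition (s, a, r, s_next)."""
    global theta, w
    # Critic: TD error from the Bellman equation (cf. Q-learning).
    v_next = 0.0 if done else w @ s_next
    td_error = r + gamma * v_next - w @ s
    w += alpha_critic * td_error * s
    # Actor: policy gradient, using the critic's TD error as the advantage.
    probs = softmax(theta @ s)
    grad_log_pi = -np.outer(probs, s)     # d log pi(a|s) / d theta, all rows
    grad_log_pi[a] += s                   # extra term for the taken action
    theta += alpha_actor * td_error * grad_log_pi
    return td_error
```

Note how the critic's value (not the empirical return) drives the actor's gradient direction, which is exactly the variance/bias trade-off the paper later identifies as a cause of sub-optimality.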
Evaluation
- Policy selection and evaluation
The dashed line is the APF-SR baseline and the solid lines are the DRL results. For every obstacle-motion model, the DRL method outperforms APF-SR.
- Critic comparison
Figures a-d show how the critic evolves during training while learning to avoid an obstacle moving from left to right. In Fig. a the algorithm has just been initialized, so the value estimates are random. In Fig. b the robot learns to move close to the obstacle, hence the high collision probability and mean squared error; this stage helps the subsequent learning of avoidance (Fig. c), which however does not yet account for the obstacle's motion. Finally, at convergence (Fig. d), the robot learns to take the obstacle's motion into account and achieves a small mean squared error.
- Actor comparison
After comparing the collision probabilities given by the critic network and by SR, we compare the DRL policy (the actor net) with the optimal policy given by SR.
The figure shows that in the DRL policy some actions suggest passing through the obstacle, and some even move toward it. Regarding these deviations in action selection, we answer the following two questions: (1) how do the deviations affect obstacle-avoidance performance; (2) what causes them.
- Collision probability comparison
- Causes for sub-optimality
The authors first state a hypothesis: the actor net failed to approximate the action corresponding to the highest state-action value inferred from the critic net.
To verify this hypothesis, the authors bypass the actor net and instead use a new RL policy derived directly from the critic net.
The result in the third figure, produced by this critic policy, is more similar to the SR result in the first figure. The cause of the actor's sub-optimality can be summarized as follows: actor-critic algorithms such as A3C use the critic's value function instead of empirical returns to estimate the cumulative rewards, which lowers the variance of the gradient estimates but at the cost of introducing bias.
Panels (a, c) and (b, d) compare the RL actions given by the actor net and by the critic net, respectively; the latter handles obstacle avoidance noticeably better. This confirms the earlier conjecture: the actor net does not fully execute the actions that maximize the critic net's state value, i.e., the actor net's policy parameters are not ideally trained.
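The critic-derived policy above can be sketched as a greedy one-step lookahead: for each candidate action, predict the successor state and pick the action whose successor the critic values most. In this sketch, `critic_value` and `step_dynamics` are toy stand-ins (a hand-written value function and single-integrator dynamics); in the paper's setup these roles would be played by the trained critic net and the known robot model.

```python
import numpy as np

def critic_value(s):
    # Toy critic: higher value the farther the robot is from a fixed
    # obstacle at the origin (illustrative assumption only).
    return np.linalg.norm(s)

def step_dynamics(s, a):
    # Toy single-integrator dynamics: the action is a displacement.
    return s + a

def critic_policy(s, candidate_actions):
    """Bypass the actor: greedy one-step lookahead over a discretized
    action set, choosing the action with the highest critic value."""
    values = [critic_value(step_dynamics(s, a)) for a in candidate_actions]
    return candidate_actions[int(np.argmax(values))]
```

Because this policy reads actions straight off the critic, any remaining gap to SR isolates the critic's approximation error, while the gap between this policy and the actor net isolates the actor's failure to track the critic.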