Abstract: Deep Reinforcement Learning (DRL) policies have been shown to be vulnerable to small adversarial perturbations of their observations. Such adversarial noise can have disastrous consequences in safety-critical environments. For instance, a self-driving car receiving adversarially perturbed sensory observations about nearby signs (e.g., a stop sign physically altered to be perceived as a speed limit sign) or objects (e.g., cars altered to be recognized as trees) can be fatal. Existing approaches for making RL algorithms robust to an observation-perturbing adversary are reactive: they iteratively improve against the adversarial examples generated at each training iteration. While such approaches have been shown to improve over regular RL methods, they can fare significantly worse if certain categories of adversarial examples are never generated during training. We instead pursue a more proactive approach that directly optimizes a well-studied robustness measure, regret, rather than expected value. We provide a principled approach that minimizes the maximum regret over a "neighborhood" of the received observation. Our regret criterion can be used to modify existing value- and policy-based Deep RL methods. We demonstrate that our approaches provide a significant improvement in performance across a wide variety of benchmarks against leading approaches for robust Deep RL.

Finding Regret-Minimizing Policies for Adversarially Robust RL:

In this work, we present new methods for computing Reinforcement Learning (RL) policies designed to remain resilient when the observations driving their behavior are adversarially perturbed. Rather than maximizing reward, as traditional approaches do, we minimize the regret of the enacted policy.
     Conventional state-of-the-art techniques typically either maximize a robust, guaranteed minimum reward or harden a classifier/neural network so that it makes fewer errors; in both cases, the quantity being optimized is still a reward signal. In game-theoretic terms, our method is a "mini-max regret" approach, which differs fundamentally from the "maxi-min reward" strategies employed by existing approaches.
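     To make the distinction concrete, the two objectives can be written roughly as follows (the notation is illustrative rather than the paper's exact formulation): with \pi a policy, \nu an adversary restricted to a neighborhood \mathcal{N}(o) of each received observation o, and V^{\pi \circ \nu} the expected return of \pi when acting on \nu-perturbed observations,

        \text{maxi-min reward:} \quad \max_{\pi} \, \min_{\nu \in \mathcal{N}} \, V^{\pi \circ \nu}

        \text{mini-max regret:} \quad \min_{\pi} \, \max_{\nu \in \mathcal{N}} \, \Big( \max_{\pi'} V^{\pi' \circ \nu} - V^{\pi \circ \nu} \Big)

     The first objective protects the worst-case return itself; the second protects the gap between what the policy achieves and what the best policy could have achieved under the same perturbation.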
     Regret-minimizing policies turn out to be safer and more consistent in practice, as illustrated in the trials shown below.
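     Before turning to the trials, the sketch below illustrates how a regret criterion of this kind could be attached to a value-based Deep RL method. It is a minimal sketch under stated assumptions rather than the paper's implementation: q_net, epsilon, and the random sampling of the neighborhood are placeholders, and a stronger (e.g., gradient-based) adversary would normally replace the random samples.

        import torch

        def regret_loss(q_net, obs, epsilon=0.1, n_samples=8):
            """Approximate the maximum regret over an epsilon-neighborhood of obs.

            q_net is assumed to map a batch of observations to Q-values of
            shape (batch, n_actions); the neighborhood is an L-infinity ball.
            """
            # Action the agent commits to based on the received (clean) observation.
            with torch.no_grad():
                greedy_actions = q_net(obs).argmax(dim=1)

            worst_regret = torch.zeros(obs.shape[0], device=obs.device)
            for _ in range(n_samples):
                # Sample a perturbed observation inside the epsilon-ball.
                noise = (torch.rand_like(obs) * 2 - 1) * epsilon
                q_perturbed = q_net(obs + noise)

                best_value = q_perturbed.max(dim=1).values
                chosen_value = q_perturbed.gather(1, greedy_actions.unsqueeze(1)).squeeze(1)

                # Regret: value foregone by sticking with the clean-observation action.
                worst_regret = torch.maximum(worst_regret, best_value - chosen_value)

            # Minimizing this term favors actions that stay near-optimal
            # everywhere in the neighborhood of the received observation.
            return worst_regret.mean()

     A term like this could simply be added, with a weighting coefficient, to the usual TD or policy-gradient loss.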

[Side-by-side videos accompany each trial below: Vanilla PPO on the left, RAD-PPO on the right, with the trial description in between.]
Mujoco: HalfCheetah
Unperturbed Test: The difference between the conventional (vanilla) and robust policies is readily apparent in this task. The vanilla policy sustains a higher velocity through a distinctive locomotion pattern, using its rear leg to "scoot" forwards while elevating the front leg. The robust policy, by contrast, adopts a more measured, stable gallop-like gait with both legs contacting the floor. In this trial, the vanilla policy travels roughly 30% farther than the robust agent.
Perturbed Test: Here we see the consequences of each strategy. The vanilla policy is unstable and repeatedly "faceplants", even flipping over in the third episode. The robust policy maintains its stable gait with only slight stuttering. On average, the robust policy scores roughly twice as high as the vanilla policy, even when the vanilla policy's immediate failures are excluded.
Mujoco: Walker2D
Unperturbed Test: Again, we observe distinctly different strategies between the two approaches. The value-optimizing agent plants both feet and leaps forwards, while the regret-minimizing agent uses one leg as a counterbalance. The robust strategy is slower off the mark but appears to eventually match the vanilla agent's top speed, scoring around the same on average.
Perturbed Test: Once again, the instability of the vanilla agent's strategy leads to over-correction when perturbations occur. Though difficult to notice at a glance, the vanilla policy "kicks" more widely to recover as it becomes more unstable.
Mujoco: Hopper
Unperturbed Test: Differences between the strategies are hardest to distinguish visually in this task, though a comparison of forward tilt shows the vanilla agent leaning further forward to generate more horizontal momentum. Overall, the vanilla policy scores slightly higher.
Perturbed Test: Even though the two strategies are largely similar, adversarial perturbations exacerbate the slight gap in stability: the difference in forward tilt becomes even more noticeable, culminating in an early failure for the non-robust agent.
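The "Perturbed Test" rows above correspond to evaluating each trained policy while an adversary injects bounded noise into its observations. The sketch below shows what such an evaluation loop can look like; the environment id, the uniform noise model, and the bound epsilon are illustrative assumptions, not the exact attack used to produce the videos.

    import gymnasium as gym
    import numpy as np

    def evaluate_with_perturbations(policy, env_id="HalfCheetah-v4",
                                    epsilon=0.05, episodes=5, seed=0):
        """Roll out `policy` while adding bounded noise to every observation.

        `policy(obs)` is assumed to return an action for a single observation.
        A real evaluation would use a worst-case (e.g., gradient-based) attack
        inside the epsilon-ball rather than uniform random noise.
        """
        env = gym.make(env_id)
        rng = np.random.default_rng(seed)
        returns = []
        for ep in range(episodes):
            obs, _ = env.reset(seed=seed + ep)
            done, total = False, 0.0
            while not done:
                # The agent only ever sees the perturbed observation.
                perturbed_obs = obs + rng.uniform(-epsilon, epsilon, size=obs.shape)
                obs, reward, terminated, truncated, _ = env.step(policy(perturbed_obs))
                total += reward
                done = terminated or truncated
            returns.append(total)
        env.close()
        return float(np.mean(returns))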