In this work, we present new methods for computing Reinforcement Learning (RL) policies that remain robust under perturbations to their behavioral dynamics. Rather than maximizing reward, as traditional approaches do, our approach minimizes the regret of the resulting policies.
Existing state-of-the-art techniques typically either maximize a robust, guaranteed minimum reward or harden a classifier/neural network against errors; in both cases, the quantity being optimized is a reward signal. In game-theoretic terms, our method is a "mini-max regret" approach, in contrast to the "maxi-min reward" strategies used by existing approaches.
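To make the distinction concrete, here is a rough formal sketch (the notation is ours, not taken from the work itself): let $J(\pi, \delta)$ denote the expected return of policy $\pi$ under a perturbation $\delta$ drawn from some set $\Delta$. A maxi-min reward approach solves

$$\pi^{\star} = \arg\max_{\pi} \; \min_{\delta \in \Delta} \; J(\pi, \delta),$$

whereas a mini-max regret approach solves

$$\pi^{\star} = \arg\min_{\pi} \; \max_{\delta \in \Delta} \; \Big[ \max_{\pi'} J(\pi', \delta) - J(\pi, \delta) \Big],$$

i.e. it minimizes the worst-case gap between the chosen policy and the best policy for each realized perturbation.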
    
Regret-optimized policies are safer and more consistent, as the trials below demonstrate.
| Vanilla PPO | Trial Description | RAD-PPO |
|---|---|---|
| | Mujoco: HalfCheetah | |
| | Unperturbed Test: The difference between the conventional (vanilla) and robust policies is readily apparent in this task. The vanilla policy sustains a higher velocity with a distinctive gait, using its rear leg to "scoot" forwards while keeping the front leg elevated. The robust policy adopts a more measured, stable gallop-like strategy with both legs contacting the floor. In this trial, the vanilla policy travels roughly 30% farther than the robust agent. | |
| | Perturbed Test: Here we see the consequences of each strategy. The vanilla policy is unstable and repeatedly "faceplants", even flipping over in the third episode. The robust policy maintains its stable gait, with only slight stuttering. On average, the robust policy scores double the vanilla policy, even when the vanilla policy's immediate failures are excluded. | |
| | Mujoco: Walker2D | |
| | Unperturbed Test: Again, the two approaches adopt distinctly different strategies. The value-optimizing agent uses both feet to push off the ground and leap forwards, while the regret-minimizing agent uses one leg as a counterbalance. The robust strategy has a slower start but appears to eventually match the top speed, scoring around the same on average. | |
| | Perturbed Test: Once again, the instability of the vanilla agent's strategy leads to over-correction when perturbations occur. While difficult to notice at a glance, the vanilla policy "kicks" more widely to recover as it becomes more unstable. | |
| | Mujoco: Hopper | |
| | Unperturbed Test: Differences between the strategies in this task are the hardest to distinguish visually, though comparing the forward tilt of each agent's body shows the vanilla agent leaning further to generate more horizontal momentum. Overall, the vanilla policy scores slightly higher. | |
| | Perturbed Test: Even though the two strategies are largely similar, adversarial perturbations exacerbate the slight difference in stability: the difference in forward tilt becomes even more noticeable, culminating in an early failure for the non-robust agent. | |
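For concreteness, below is a minimal sketch of how a perturbed evaluation like those above could be run. It assumes a Gymnasium MuJoCo environment and a `policy(obs) -> action` callable (both hypothetical names); the perturbation scheme here, random noise added to the executed action, is our illustrative choice and not necessarily the adversarial model used in these trials.

```python
import gymnasium as gym
import numpy as np

def evaluate(policy, env_name="HalfCheetah-v4", episodes=5, noise_scale=0.1, seed=0):
    """Roll out a policy and return its mean episode return under action noise.

    Setting noise_scale=0.0 reproduces an unperturbed test.
    """
    env = gym.make(env_name)
    rng = np.random.default_rng(seed)
    returns = []
    for ep in range(episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, total = False, 0.0
        while not done:
            action = np.asarray(policy(obs))
            # Perturb the executed action and keep it within the valid range.
            action = np.clip(
                action + noise_scale * rng.standard_normal(action.shape),
                env.action_space.low,
                env.action_space.high,
            )
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    env.close()
    return float(np.mean(returns))

# Example usage (policies are placeholders):
# mean_vanilla = evaluate(vanilla_ppo_policy)
# mean_robust = evaluate(rad_ppo_policy)
```

Running both policies through the same evaluation loop, with identical seeds and noise levels, is what allows the side-by-side comparisons shown in the table.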