The primary idea behind regret-based robustness is to minimize the variance in an agent's outcomes while retaining good nominal performance. Our solution, the family of Robust Adversarial Defense (RAD) algorithms, optimizes a novel notion of regret that is both scalable and principled.
Computing the expected regret over entire trajectories is intractable, so we make two key modifications. First, we restrict the set of states covered by the regret expectation to an \(\epsilon\)-bounded neighborhood, a common concession in adversarial attack and robustness work (the rationale being that a sufficiently large perturbation would be detectable and could simply be ignored). Second, we achieve scalability through myopic regret estimates: at each state we compute the regret over potential mistakes at that time step only, and accumulate these estimates in a Q-learning-style sum. Mathematically, this preserves the Markov assumption and reduces the computational complexity from combinatorial \(O(n!)\) to linear \(O(kn)\). Empirically, it "smooths" agent behavior, as the model seeks trajectories that leave a margin for error. A minimal sketch of this per-step computation is given below.
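The sketch below is only illustrative: it samples an \(\epsilon\)-bounded neighborhood of the observation and estimates the per-step regret as the gap between the nominal reward and the average reward over the samples. The names `reward_model` and `sample_neighborhood`, and the uniform-sampling scheme, are assumptions for the example, not details of the RAD implementation.

```python
import numpy as np

def sample_neighborhood(s, epsilon, n_samples=16):
    """Sample perturbed observations within an L-infinity ball of radius epsilon (illustrative)."""
    noise = np.random.uniform(-epsilon, epsilon, size=(n_samples,) + s.shape)
    return s + noise

def myopic_regret(s, a, reward_model, epsilon, n_samples=16):
    """Per-step regret estimate: nominal reward minus the mean reward over the
    sampled epsilon-neighborhood for the same action."""
    neighbors = sample_neighborhood(s, epsilon, n_samples)
    nominal = reward_model(s, a)
    # Approximate the neighborhood expectation with a uniform sample mean (an assumption).
    perturbed = np.mean([reward_model(s_hat, a) for s_hat in neighbors])
    return nominal - perturbed
```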
Formally, we optimize the difference between the reward for the nominal (unperturbed) observation \(s\) and the reward over the adversarial neighborhood \(N(s)\):
\[\delta(s,a)=R(s,a)-\sum_{\hat{s} \in N(s)}R(\hat{s},a) + \gamma\, \mathbb{E}_{s'}\!\left[\delta(s',\pi(s'))\right]\]
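As a rough sketch of how this recursion could be turned into a bootstrapped target, the snippet below reuses `myopic_regret` from the sketch above; `delta_net` and `policy` are hypothetical stand-ins for a regret critic and the current policy, and the single-sample bootstrap over \(s'\) is an assumption for illustration.

```python
def regret_td_target(s, a, s_next, reward_model, delta_net, policy,
                     epsilon, gamma=0.99, n_samples=16):
    """One-sample bootstrapped target for the recursion
    delta(s, a) = R(s, a) - E_{s_hat in N(s)} R(s_hat, a) + gamma * E_{s'} delta(s', pi(s'))."""
    step_regret = myopic_regret(s, a, reward_model, epsilon, n_samples)
    # Bootstrap the future regret at the observed next state under the current policy.
    bootstrap = delta_net(s_next, policy(s_next))
    return step_regret + gamma * bootstrap
```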