Research
- Regret-Based Defense in Adversarial Reinforcement Learning (project page) AAMAS 2024 (Arxiv) By optimizing a novel form of regret, we train RL agents that are more robust than previous robustly trained value-optimizing agents. Our regret notion, CCER, provides a scalable, transferable way to compute adversarial cumulative regret for actions across time steps; a rough illustration of this kind of objective appears after this list.
- On Minimizing Adversarial Counterfactual Error in Robust Reinforcement Learning (project page) ICLR 2025 (Arxiv) We advance the formulation of observation-adversarial RL by recognizing its true structure as a POMDP: the agent acts on perturbed observations rather than the true state. Leveraging this structure, our proposed methods achieve SOTA performance across all adversarial RL benchmarks; a minimal sketch of the observation-perturbation view appears after this list.
- Hierarchical Red-Teaming for Large Language Models Preprint We train LLMs to autonomously discover toxicity vulnerabilities in target LLMs through natural dialogue. We employ a hierarchical setup: one policy proposes an attack strategy and a second policy generates adversarial text according to that strategy. We achieve SOTA attacks on standard datasets and, in addition, provide the first principled RL framework in this domain; a schematic of the two-level loop appears after this list.
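As a rough, non-authoritative illustration of the kind of objective the AAMAS 2024 paper optimizes, consider a generic per-step action regret accumulated over a horizon H under an observation adversary. This is not the paper's exact CCER definition, and the notation (ν, õ_t, Q*_ν) is my own:

$$
\mathrm{Reg}(\pi, \nu) \;=\; \sum_{t=0}^{H-1} \Big( \max_{a \in \mathcal{A}} Q^{*}_{\nu}(s_t, a) \;-\; Q^{*}_{\nu}\big(s_t, \pi(\tilde{o}_t)\big) \Big), \qquad \tilde{o}_t = \nu(s_t),
$$

where ν perturbs the observation that the policy π acts on and Q*_ν is the optimal action-value function under that adversary. CCER is the paper's scalable, transferable refinement of a quantity of this kind.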
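Below is a minimal sketch of the observation-adversarial setting behind the ICLR 2025 paper, assuming a Gymnasium-style environment with array observations. The wrapper name, the uniform noise model, and the epsilon bound are placeholders of my own, not the paper's method or code; the point is only that the policy conditions on a perturbed observation while the true state stays hidden, which is exactly the POMDP structure noted above.

```python
import numpy as np
import gymnasium as gym


class PerturbedObservationWrapper(gym.ObservationWrapper):
    """Illustrative only: the agent receives a perturbed copy of the true
    observation, so the problem it faces is a POMDP over the true state."""

    def __init__(self, env, epsilon=0.1, adversary=None):
        super().__init__(env)
        self.epsilon = epsilon
        # `adversary` maps the true observation to a bounded perturbation;
        # uniform noise stands in for a learned adversary in this sketch.
        self.adversary = adversary or self._random_perturbation

    def _random_perturbation(self, obs):
        return np.random.uniform(-self.epsilon, self.epsilon, size=obs.shape)

    def observation(self, obs):
        # The policy never sees `obs` itself, only the perturbed version.
        return (obs + self.adversary(obs)).astype(obs.dtype)
```

For example, `PerturbedObservationWrapper(gym.make("Pendulum-v1"), epsilon=0.05)` yields an environment whose policy only ever observes noisy states; a learned adversary can be swapped in via the `adversary` argument.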
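Finally, a schematic of the two-level red-teaming loop described in the preprint, reconstructed from the summary above: a high-level policy picks a strategy, a low-level policy writes the adversarial utterance, and the target model's reply is scored for toxicity to produce an RL reward. Every name here (`strategy_policy`, `attacker_policy`, `target_llm`, `toxicity_score`) is a hypothetical placeholder, and this is a sketch of the control flow, not the paper's implementation.

```python
def red_team_dialogue(strategy_policy, attacker_policy, target_llm,
                      toxicity_score, num_turns=5):
    """Run one adversarial dialogue and collect per-turn rewards."""
    dialogue, rewards = [], []
    for _ in range(num_turns):
        strategy = strategy_policy(dialogue)           # high-level action
        attack = attacker_policy(dialogue, strategy)   # low-level utterance
        reply = target_llm(dialogue + [attack])        # target's response
        reward = toxicity_score(reply)                 # shared RL signal
        dialogue += [attack, reply]
        rewards.append(reward)
    return dialogue, rewards
```

Both policies could then be updated with any policy-gradient method on the collected rewards; the hierarchy simply factors the attacker's action into "what strategy to pursue" and "what to actually say".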







