Eyon Jang — LessWrong

Exploration Hacking: Can LLMs Learn to Resist RL Training?

We empirically investigate exploration hacking (EH) — where models strategically alter their exploration to resist RL training — by creating model organisms that resist capability elicitation, evaluating countermeasures, and auditing frontier models for their propensity. Authors: Eyon Jang*, Damon Falck*, Joschka Braun*, Nathalie Kirch, Achu Menon, Perusha Moodley, Scott Emmons,...

May 124

Helping Friends, Harming Foes: Testing Tribalism in Language Models

by Irakli Shalibashvili, Jer Ren Wong, Moksh Nirvaan, Diogo Cruz, and Eyon Jang

This project was conducted as part of the SPAR 2025 Fall programme under the mentorship of Diogo Cruz and Eyon Jang. TL;DR: What happens if a model becomes less agreeable once it learns you hate its favourite fruit? In this post, we use fruit preferences as a “toy model” to...

Mar 1110

A Conceptual Framework for Exploration Hacking

by Joschka Braun, Eyon Jang, and Damon Falck

What happens when a model strategically alters its exploration to resist RL training? In this post, we share our conceptual framework for this threat model, expanding on our research note from last summer. We formalize and decompose "exploration hacking", ahead of an upcoming paper where we study it empirically—by creating...

Feb 1226

Exploration hacking: can reasoning models subvert RL?

by Damon Falck, Joschka Braun, and Eyon Jang

This is an early-stage research update as part of the ML Alignment & Theory Scholars Program — Summer 2025 Cohort. We welcome any feedback! TL;DR * If misaligned AI realizes it's undergoing RL training that conflicts with its goals, it might be able to sabotage the training by selectively under-exploring....

Jul 30, 2025 · 25

Automating AI Safety: What we can do today

by Matthew Shinkle, Eyon Jang, and jacquesthibs

There have been multiple recent calls for the automation of AI safety and alignment research. There are likely many people who would like to contribute to this space, but would benefit from clear directions for how to do so. Stemming from a recent SPAR project and in light of limitations...

Jul 25, 2025 · 37