This is a link post for our recent release of Hodoscope, an open-source tool designed to streamline human supervision of AI trajectories. This post aims to be more narrative while the linked post provides more technical details.
Hodoscope visualization of SWE-bench traces. The density difference between traces of o3 and other models is overlaid (red = overrepresented, blue = underrepresented).
The Fragility of LLM Monitors
A recurring theme while researching reward hacking was that LLM-based monitors were surprisingly easy to persuade. When an agent engaged in reward hacking but provided a sophisticated justification for its actions, the monitor would frequently accept the agent's reasoning and fail to flag the behavior (example), possibly because agents and...
Thanks! Also, I was running the code on the no_intervention setting, using the command `run_rl_training no_intervention`. However, I am seeing almost zero reward hacking in my run. Am I doing something wrong here?
We provide the filtered versions of the datasets (removing problems that lack a correct canonical solution and filtering for problem difficulty) under results/data:
- `results/data/leetcode_test_medhard.jsonl`: Used for evaluations
- `results/data/leetcode_train_medhard_filtered.jsonl`: Used for RL training runs
- `results/data/leetcode_train_medhard_holdout.jsonl`: Used for training/evaluating monitors
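For anyone replicating: each of these is a standard JSONL file (one JSON record per line), so loading them is straightforward. A minimal sketch (the record fields below are illustrative stand-ins, not the repo's actual schema):

```python
import json
import os
import tempfile

def load_jsonl(path):
    """Load a JSONL dataset: one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a tiny stand-in file; in practice you would point load_jsonl
# at the real paths under results/data/, e.g. leetcode_test_medhard.jsonl.
demo = [{"id": 1, "difficulty": "medium"}, {"id": 2, "difficulty": "hard"}]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for row in demo:
        f.write(json.dumps(row) + "\n")
    tmp_path = f.name

problems = load_jsonl(tmp_path)
os.remove(tmp_path)
print(len(problems))  # 2
```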
Hi, thanks for the amazing work. However, I cannot find these datasets in the GitHub repo. I was trying to replicate the results to better understand the emergence of reward hacking in LLM agents, and I would highly appreciate it if you could point me to where I can find this data.