This is a link post for our recent release of Hodoscope, an open-source tool designed to streamline human supervision of AI trajectories. This post aims to be more narrative while the linked post provides more technical details.
A recurring theme while researching reward hacking was that LLM-based monitors were surprisingly easy to persuade. When an agent engaged in reward hacking but provided a sophisticated justification for its actions, the monitor would frequently accept the agent's reasoning and fail to flag the behavior (example), possibly because agents and monitors are trained similarly.
This fragility led us to a simple conviction: for novel reward hacking...
We provide the filtered versions of the datasets (removing problems that lack a correct canonical solution and filtering for problem difficulty) under results/data:
- `results/data/leetcode_test_medhard.jsonl`: Used for evaluations
- `results/data/leetcode_train_medhard_filtered.jsonl`: Used for RL trainings
- `results/data/leetcode_train_medhard_holdout.jsonl`: Used for training/evaluating monitors
Hi, thanks for the amazing work.
But I cannot find these datasets in the github repo.
I was trying to replicate the results to understand a bit more about the emerge...
Thanks,
Also I was running the code on the no_intervention setting, using the command
run_rl_training no_interventionHowever, I am seeing almost zero reward hacking in my run:
Am I doing something wrong here?