This is a link post for our recent release of Hodoscope, an open-source tool designed to streamline human supervision of AI trajectories. This post aims to be more narrative while the linked post provides more technical details.
Hodoscope visualization of SWE-bench traces. The density difference between traces of o3 and other models is overlaid (red = overrepresented, blue = underrepresented).
The Fragility of LLM Monitors
A recurring theme while researching reward hacking was that LLM-based monitors were surprisingly easy to persuade. When an agent engaged in reward hacking but provided a sophisticated justification for its actions, the monitor would frequently accept the agent's reasoning and fail to flag the behavior (example), possibly because agents and...
Thanks! Also, I was running the code on the no_intervention setting, using the command `run_rl_training no_intervention`. However, I am seeing almost zero reward hacking in my run. Am I doing something wrong here?
We provide the filtered versions of the datasets (removing problems that lack a correct canonical solution and filtering for problem difficulty) under results/data:
- `results/data/leetcode_test_medhard.jsonl`: Used for evaluations
- `results/data/leetcode_train_medhard_filtered.jsonl`: Used for RL training runs
- `results/data/leetcode_train_medhard_holdout.jsonl`: Used for training/evaluating monitors
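For anyone replicating: each of these is a standard JSONL file (one JSON record per line), so loading them is straightforward. A minimal sketch (the record fields below are illustrative stand-ins, not the repo's actual schema):

```python
import json
import os
import tempfile

def load_jsonl(path):
    """Load a JSONL dataset: one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a tiny stand-in file; in practice you would point load_jsonl
# at the real paths under results/data/, e.g. leetcode_test_medhard.jsonl.
demo = [{"id": 1, "difficulty": "medium"}, {"id": 2, "difficulty": "hard"}]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for row in demo:
        f.write(json.dumps(row) + "\n")
    tmp_path = f.name

problems = load_jsonl(tmp_path)
os.remove(tmp_path)
print(len(problems))  # 2
```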
Hi, thanks for the amazing work. However, I cannot find these datasets in the GitHub repo. I was trying to replicate the results to better understand the emergence of reward hacking in LLM agents, and I would highly appreciate it if you could point me to where I can find this data.