Shashwat Saxena — LessWrong

Hodoscope: Visualization for Efficient Human Supervision

2mo

This is a linkpost for https://hodoscope.dev/blog/announcement.html

This is a link post for our recent release of Hodoscope, an open-source tool designed to streamline human supervision of AI trajectories. This post aims to be more narrative while the linked post provides more technical details.

Hodoscope visualization on SWE-bench — *Hodoscope visualization of SWE-bench traces. The density difference between traces of o3 and other models is overlaid (red = overrepresented, blue = underrepresented).*

The Fragility of LLM Monitors

A recurring theme while researching reward hacking was that LLM-based monitors were surprisingly easy to persuade. When an agent engaged in reward hacking but provided a sophisticated justification for its actions, the monitor would frequently accept the agent's reasoning and fail to flag the behavior (example), possibly because agents and monitors are trained similarly.

This fragility led us to a simple conviction: for novel reward hacking...

(See More - 505 more words)

Steering RL Training: Benchmarking Interventions Against Reward Hacking

Shashwat Saxena3mo10

Thanks,
Also I was running the code on the no_intervention setting, using the command run_rl_training no_intervention
However, I am seeing almost zero reward hacking in my run:
Am I doing something wrong here?

Steering RL Training: Benchmarking Interventions Against Reward Hacking

Shashwat Saxena4mo30

We provide the filtered versions of the datasets (removing problems that lack a correct canonical solution and filtering for problem difficulty) under results/data:
- `results/data/leetcode_test_medhard.jsonl`: Used for evaluations
- `results/data/leetcode_train_medhard_filtered.jsonl`: Used for RL trainings
- `results/data/leetcode_train_medhard_holdout.jsonl`: Used for training/evaluating monitors

Hi, thanks for the amazing work.
But I cannot find these datasets in the github repo.
I was trying to replicate the results to understand a bit more about the emerge... (read more)