james__p — LessWrong

Thanks Rauno. I've added this clarification to the post:

We collect data using Gemini-2.5-pro, Claude-3.7-Sonnet, and o3 agents, including both the CoT (when exposed) and the actions. Note that for o3 trajectories, the model performs internal reasoning that is not visible in the transcripts.

Building Black-box Scheming Monitors

james__p3mo10

Thanks for the detailed feedback!

A separate post on defining Scheming-Lite and Scheming-Strict is a good idea. I'll probably write something like this up soon.

Does that mean in Attack Mode, models almost always try to do some sabotage?

Yes - the vast majority of attack mode trajectories involve the agent attempting sabotage. We are not using a red team protocol per se; instead, we are just generating a large dataset of scheming examples by prompting the agent to covertly carry out a sabotage task every time. The cases where it doesn't attempt sabotage are usually when the model identifies the instruction as dangerous and refuses (typically o3), or when it messes up and hits submit immediately.

I think this is a typo? Should be LLM-assistant, not monitor.

This was actually intentional. We consider LLM-assistant to be a type of monitor, and we also wanted to cover that in cases where a monitor we're training flags a mislabel, we'll opt to relabel.

Unclear why keeping the naive labels is less biased or more fair?

Agree with this. I think that if you do no relabelling (i.e. simply use default labels), your monitor performance is biased low under most reasonable assumptions. Directionally, LLM-assistant relabelling makes your monitor performance go up, and eventually it's biased high. I actually have a Google Doc where I go through scenarios with various assumptions to see whether the bias ends up high / low / better / worse, but thought it was too much detail to include in the post. I also thought the burden of proof is more on us to justify why our performance metrics are not too inflated (rather than deflated), though I agree with you that these considerations could be made clearer in the post.

Should be explicit which datasets are used in each of these bulletpoints. I assume it is always all of the training set.

I'll clarify this, thanks. You're right that it's always the full training set, and never any of the test set.

Why not put them in the datasets?

No good reason! This was just to save time and allow us to sprint towards having an MVP trained on labels we're confident in. I think these are probably some of the more interesting data points, and we plan to include them in our future datasets.

Is it possible to use your monitor as part of a blue team protocol in ControlArena? Can then get usefulness vs safety scores in different protocols. Or, if that's too expensive, do some kind of simple simulation/estimate of usefulness vs safety scores. Or maybe it just boils down to calculating TPR for various FPRs?

Yes, I think incorporating our monitor(s) into protocols is something we could investigate properly in future. For now, we are focusing on getting good TPR at {5%, 1%, 0.1%} FPR, and imagining our monitor would likely be used in a protocol like: resample with a 5% FPR threshold and audit with a 0.1% FPR threshold (inspired by this post).
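To make "TPR at a fixed FPR" concrete for an integer-valued monitor, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the function names, the toy scores, and the `route` helper are assumptions, not the actual pipeline. It assumes a trajectory is flagged when its score is at or above a threshold, and that each threshold is calibrated on benign trajectories.

```python
def threshold_for_fpr(benign_scores, target_fpr, lo=1, hi=10):
    """Smallest integer threshold t such that flagging `score >= t`
    keeps the FPR on benign trajectories at or below target_fpr."""
    n = len(benign_scores)
    for t in range(lo, hi + 2):  # t = hi + 1 means "flag nothing" (FPR = 0)
        fpr = sum(s >= t for s in benign_scores) / n
        if fpr <= target_fpr:
            return t
    return hi + 1


def tpr_at_fpr(benign_scores, scheming_scores, target_fpr):
    """Return (threshold, TPR on scheming trajectories) at the target FPR."""
    t = threshold_for_fpr(benign_scores, target_fpr)
    tpr = sum(s >= t for s in scheming_scores) / len(scheming_scores)
    return t, tpr


def route(score, resample_threshold, audit_threshold):
    """Hypothetical two-threshold protocol: audit the most suspicious
    trajectories, resample moderately suspicious ones, accept the rest."""
    if score >= audit_threshold:
        return "audit"
    if score >= resample_threshold:
        return "resample"
    return "accept"


# Toy monitor scores (made up for illustration).
benign = [1] * 90 + [2] * 5 + [3] * 4 + [8] * 1
scheming = [9] * 6 + [7] * 2 + [3] * 1 + [1] * 1

for target in (0.05, 0.01, 0.001):
    t, tpr = tpr_at_fpr(benign, scheming, target)
    print(f"target FPR {target}: threshold {t}, TPR {tpr}")
```

Because the scores are integers, only a handful of operating points exist, so the achieved FPR can sit well below the target (here the 0.1% target can only be met by a threshold that flags no benign trajectory at all).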

Unclear what the output of the monitor is. Based on prompt, it is an integer score from 1 to 10, but looks like in F1 plot, we have more than 11 points.

The monitor output is indeed an integer score from 1 to 10. The F1 plot is produced by sampling thresholds with coarse granularity and interpolating between them, which is why, confusingly, you don't just see 10 steps. I think it would look clearer if we used finer granularity for the thresholds in the plot; we'll bear this in mind going forward.
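For reference, with integer scores there are only ten distinct decision thresholds, so F1 can be tabulated exactly at each one without interpolation. The sketch below is a hypothetical illustration (the function name and toy data are assumptions), treating a trajectory as flagged when its score is at or above the threshold.

```python
def f1_by_threshold(scores, labels, lo=1, hi=10):
    """F1 at each integer threshold t, flagging a trajectory when score >= t.

    `labels` are booleans: True for scheming, False for benign.
    """
    results = {}
    for t in range(lo, hi + 1):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when nothing is positive.
        results[t] = 2 * tp / denom if denom else 0.0
    return results


# Toy data (made up for illustration).
scores = [1, 2, 3, 9, 8, 2]
labels = [False, False, False, True, True, True]

for t, f1 in f1_by_threshold(scores, labels).items():
    print(t, round(f1, 3))
```

Plotting these ten values as a step function would show the discrete structure directly.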

More for curiosity, would be interesting to see how a naive baseline would perform. E.g. create vector embeddings of the trajectories, then carry out standard supervised learning pipeline with random forest / CatBoost in sklearn.

I agree it would be nice to have a naive baseline for comparison - we'll probably look into doing something like this soon. Thanks for the suggestion!

Building Black-box Scheming Monitors

james__p3mo20

Yes - we currently give the monitor access to both the actions and the CoT if available (the caveat being that our data spans a range of models, some of which don't expose their CoT). As for the instruction prompt, we give the monitor access to an adapted version containing the main task description but omitting both the side task and the accompanying instruction to act covertly.

Conditional Importance in Toy Models of Superposition

james__p7mo10

Thanks for the thoughts --

I used the term "importance" since this was the term used in Anthropic's original paper. I agree that (unlike in a real model) my toy scenario doesn't contain sufficient information to deduce the context from the input data.
I like your phrasing of the task - it does a great job of concisely highlighting the 'Mathematical Intuition for why Conditional Importance "doesn't matter"'
Interesting that the experiment was helpful for you!

Thoughts on Toy Models of Superposition

james__p7mo10

Just to check, in the toy scenario, we assume the features in R^n are the coordinates in the default basis. So we have n features X_1, ..., X_n

Yes, that's correct.

Separately, do you have intuition for why they allow network to learn b too? Why not set b to zero too?

My understanding is that the bias is thought to be useful for two reasons:

It is preferable to be able to output a non-zero value for features the model chooses not to represent (namely their expected values)
Negative bias allows the model to zero-out small interferences, by shifting the values negative such that the ReLU outputs zero. I think empirically when these toy models are exhibiting lots of superposition, the bias vector typically has many negative entries.
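Both effects can be seen in a small hand-built example. The sketch below assumes the toy model has the form x' = ReLU(WᵀWx + b); the weights and biases are hand-picked for illustration, not learned, and the helper names are hypothetical.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def toy_model(W, b, x):
    """Compute ReLU(W^T W x + b), where W has shape (hidden_dim, n_features)."""
    m, n = len(W), len(x)
    h = [sum(W[i][j] * x[j] for j in range(n)) for i in range(m)]  # h = W x
    pre = [sum(W[i][j] * h[i] for i in range(m)) + b[j] for j in range(n)]
    return relu(pre)


# Reason 1: a feature the model does not represent (zero column in W)
# can still output a non-zero constant through the bias.
W_a = [[1.0, 0.0]]  # only feature 0 is represented
print(toy_model(W_a, [0.0, 0.05], [1.0, 1.0]))  # feature 1 is output as 0.05

# Reason 2: feature 2 overlaps slightly with features 0 and 1, so it picks
# up a small positive interference term when only feature 0 is active.
W_b = [[1.0, 0.0, 0.25],
       [0.0, 1.0, 0.25]]
x = [1.0, 0.0, 0.0]
print(toy_model(W_b, [0.0, 0.0, 0.0], x))    # interference leaks into feature 2
print(toy_model(W_b, [0.0, 0.0, -0.25], x))  # negative bias + ReLU removes it
```

The same negative bias would also suppress feature 2 when it is genuinely active at low magnitude, which is the trade-off the model has to balance.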

Conditional Importance in Toy Models of Superposition

james__p8mo10

Yeah, I agree that with hindsight, the conclusion could be better explained and motivated from first principles, rather than by running an experiment. I wrote this post in the order in which I actually tried things, as I wanted to give an honest walkthrough of the process that led me to the conclusion, but I can appreciate that it isn't optimised for ease of following.
