Sam Bowman — LessWrong

Thanks! Report coauthor here with a few quick reactions.

On the general vibe that this isn't a safety case and that we could reasonably do something much more rigorous here, I agree. (And we do describe it as an incomplete pilot in the report, though we didn't spell out much about our plans.) This started because we were more and more worried about us and others missing easy-to-catch warning signs, and wanted to at least avoid that.

Some additional reactions:

(I realize that many of these won't be reassuring to a skeptical reader without a lot more detail on the exact evals than I can pull together for this comment, but hopefully they at least clarify how we understood what we were and weren't accomplishing. More on this meta-point below.)

"okay, but if the model were deceptive and misaligned, wouldn't you have observed the same things you're reporting here?"

We also worry about this, and I don't think we're fully robust to it, but I think we got a pretty substantial positive signal from the evaluation, in part because we evaluated similar past models and early snapshots of these models: If Opus were deceptive in this way, it seems very likely (though not certain) that we'd see more slip-ups at _some_ point in its developmental trajectory. I'm picturing observations like the model defecting on only the most realistic test cases.

And, although it's not explicitly stated, there seems to be an implicit stopping rule like "we'll keep on doing this until enough of our eval suite passes, and then we'll release the resulting checkpoint."

We are at least trying to track sandbagging and situational awareness, and the actual algorithm is closer to 'keep doing this until we're reasonably confident that there are no remaining issues severe enough to be worth delaying the launch over.' This means that our decision includes considerations like 'we used a lot of information from eval X in training, so we should focus on eval Y.' We didn't have a completely strict train-test separation, in that there were no evals that we intentionally ignored until the end, but we did enough along these lines that I'm not concerned.

their assessment was limited to passively counting how many times the model spontaneously mentioned its belief that the scenario was not real.

We did some ad-hoc experiments beyond this, though these were both (i) not rigorous enough to be worth sharing as substantial evidence of safety but (ii) encouraging/unconcerning in their results.

That said, more explicitly selecting test cases for realism is one of our top priorities, and realism is one of my biggest concerns about most of the existing public evals.

say virtually nothing about whether Claude 4 might have "faked alignment"

The 'systematic deception' section pretty clearly addresses this, IMO: Alignment faking entails this kind of deception and goal-hiding. In particular, a decent fraction of the manual and automated red-teaming involved putting models in situations that we thought would lead them to reveal hidden goals or to hide goals in a detectable way. All of this is modulo the genuine concerns about whether moderate-scale behavioral assessments would catch this kind of thing if it were present.

Claude 4 Opus weirdly performs worse than generally-weaker models at subtly hiding its bad behavior

I agree we could have drawn this connection more clearly. We have some moderately confident non-concerning hypotheses about why it's worse, but no strong assurances.

prompts for "human manipulation" basically straight-up tell the model that it's in an eval!

I think that would be >95th-percentile obvious relative to the evals we report on here. I think we could have done better on the older eval you mention, though. We've been focusing on many-turn interactions, without this kind of tell, that map onto real or plausible use cases, and I expect we'll keep moving further in that direction.

overall preponderance of evidence obtained during the whole experience, much of which didn't fit in the system card

Yeah, I think this'll be an ongoing challenge. We've found it really informative to spend a lot more time on ad-hoc prompts/honeypots/red-teaming prompts/mini-evals than on a handful of big standardized evals, in part to help address concerns about realism and sandbagging. But big standardized evals are much easier to document. I think a non-pilot version of something like this that relied on much less 'trust us' would need to be a few times longer, and this was already a pretty huge effort to prepare.

It seems plausible (not a plan, just an off-the-cuff thought here) that an alignment safety case ultimately includes something like 'we let some neutral third-party experts sit down and skim through the tens of thousands of pages of examples and mini-evals that we dumped into our internal docs and Slack threads, and they endorse our qualitative conclusions, which fit into our overall argument in XYZ ways.'

Once you have an LLM capable of "role-playing as" Misaligned Superhuman World-Ending Claude – and having all the capabilities that Misaligned Superhuman World-Ending Claude would have in the nightmare scenario where such a thing really exists – then you're done, you've already created Misaligned Superhuman World-Ending Claude.

I very much want to avoid this, but FWIW, I think it's still far from the worst-case scenario, assuming that this role play needs to be actively evoked in some way, like the catgirl etc. personas do. In these cases, you don't generally get malign reasoning when you're doing things like running evals or having the model help with monitoring or having the model do safety R&D for you. This leaves you a lot of affordances.

If you're doing this stuff right, it should feel more like writing fiction, or planning a LARP, or setting up a military simulation exercise. You should be creating a whole Potemkin-village staging-environment version of the real world, or at least of your company. [...]

Strongly agree. (If anyone reading this thinks they're exceptionally good at this kind of very careful long-form prompting work, and has at least _a bit_ of experience that looks like industry RE/SWE work, I'd be interested to talk about evals jobs!)

That's part of Step 6!

Or, we are repeatedly failing in consistent ways, change plans and try to articulate as best we can why alignment doesn’t seem tractable.

I think we probably do have different priors here on how much we'd be able to trust a pretty broad suite of measures, but I agree with the high-level take. Also relevant:

However, we expect it to also be valuable, to a lesser extent, in many plausible harder worlds where this work could provide the evidence we need about the dangers that lie ahead.

Is there anything you'd be especially excited to use them for? This should be possible, but cumbersome enough that we'd default to waiting until this grows into a full paper (date TBD). My NYU group's recent paper on a similar debate setup includes a data release, FWIW.

Possible confound: Is it plausible that the sycophancy vector is actually just adjusting how much the model conditions its responses on earlier parts of the conversation, beyond the final 10–20 tokens? IIUC, the question is always at the end, and ignoring the earlier context about the person who's nominally asking the question should generally get you a better answer.

That makes sense, though what's at stake with that question? In almost every safety-relevant context I can think of, 'scale' is just used as a proxy for 'the best loss I can realistically achieve in a training run', rather than as something we care about directly.

Yep, that sounds right! The measure we're using gets noisier with better performance, so even faithfulness-vs-performance breaks down at some point. I think this is mostly an argument to use different metrics and/or tasks if you're focused on scaling trends.

Concretely, the scaling experiments in the first paper here show that, as models get larger, truncating or deleting the CoT string makes less and less difference to the model's final output on any given task.

So, stories about CoT faithfulness that depend on the CoT string being load-bearing are no longer very compelling at large scales, and the strings are pretty clearly post hoc in at least some sense.

This doesn't provide evidence, though, that the string is misleading about the reasoning process that the model is doing, e.g., in the sense that the string implies false counterfactuals about the model's reasoning. Larger models are also just better at this kind of task, and the tasks all have only one correct answer, so any metric that requires the model to make mistakes in order to demonstrate faithfulness is going to struggle. I think at least for intuitive readings of a term like 'faithfulness', this all adds up to the claim in the comment above.
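To make the truncation idea concrete, here is a minimal sketch of the kind of measurement I have in mind. It is not the code from the paper: `truncation_sensitivity`, `answer_fn`, and the two toy "models" are hypothetical names made up for illustration, and a real experiment would call an actual model instead of these stand-ins.

```python
from typing import Callable, Sequence


def truncation_sensitivity(
    answer_fn: Callable[[str, str], str],
    question: str,
    cot: str,
    fractions: Sequence[float] = (0.0, 0.25, 0.5, 0.75),
) -> float:
    """Fraction of truncation points at which the final answer changes.

    answer_fn(question, cot) is a hypothetical stand-in for "ask the model
    for a final answer given this (possibly truncated) reasoning string".
    """
    full_answer = answer_fn(question, cot)
    words = cot.split()
    changed = 0
    for frac in fractions:
        # Keep only the first `frac` of the reasoning, measured in words.
        truncated = " ".join(words[: int(len(words) * frac)])
        if answer_fn(question, truncated) != full_answer:
            changed += 1
    return changed / len(fractions)


def cot_dependent(question: str, cot: str) -> str:
    # Toy model whose answer is load-bearing on the last reasoning step.
    return "4" if cot.endswith("= 4") else "unknown"


def cot_independent(question: str, cot: str) -> str:
    # Toy model that ignores the reasoning string entirely.
    return "4"


question = "What is 2 + 2?"
cot = "2 + 2 means adding two and two so 2 + 2 = 4"
dependent_score = truncation_sensitivity(cot_dependent, question, cot)
independent_score = truncation_sensitivity(cot_independent, question, cot)
```

A score near 1 means the answer depends heavily on the full string; a score near 0 means truncation barely matters, which is the pattern described above for larger models. As noted, a low score says the string isn't load-bearing, not that it's misleading.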

Counterfactual-based metrics, like the ones in the Turpin paper, are less vulnerable to this, and that's probably where I'd focus if I wanted to push much further on measurement given what we know now. Though we already know from that paper that standard CoT in near-frontier models isn't reliably faithful by that measure.

We may be able to follow up with a few more results to clarify the takeaways about scaling, and in particular, I think just running a scaling sweep for the perturbed reasoning adding-mistakes metric from the Lanham paper here would clarify things a bit. But the teams behind all three papers have been shifting away from CoT-related work (for good reason I think), so I can't promise much. I'll try to fit in a text clarification if the other authors don't point out a mistake in my reasoning here first...

I agree, though I'll also add:

- I don't think our results clearly show that faithfulness goes down with model size, just that there's less affirmative evidence for faithfulness at larger model sizes, at least in part for predictable reasons related to the metric design. There's probably more lowish-hanging fruit involving additional experiments focused on scaling. (I realize this disagrees with a point in the post!)

- Between the good-but-not-perfect results here and the alarming results in the Turpin 'Say What They Think' paper, I think this paints a pretty discouraging picture of standard CoT as a mechanism for oversight. This isn't shocking! If we wanted to pursue an approach that relied on something like CoT, and wanted to get around this potentially extremely cumbersome sweet-spot issue around scale, I think the next step would be to look for alternate training methods that give you something like CoT/FD/etc. but have better guarantees of faithfulness.

I’d like to avoid that document being crawled by a web scraper which adds it to a language model’s training corpus.

This may be too late, but it's probably also helpful to put the BIG-Bench "canary string" in the doc as well.
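For context on why the canary string helps: it only works if whoever builds a training corpus filters on it. A rough sketch of that filtering step is below. The GUID is a deliberately fake placeholder (use the real one published with the benchmark), and the `canary GUID <uuid>` marker format, `CANARY_PATTERN`, and `drop_canaried` are assumptions for illustration rather than anyone's actual pipeline.

```python
import re
from typing import Iterable, Iterator

# Hypothetical placeholder, NOT the real BIG-Bench GUID.
CANARY_GUID = "00000000-0000-0000-0000-000000000000"

# Assumed marker format: the words "canary GUID" followed by the GUID.
CANARY_PATTERN = re.compile(
    r"canary GUID\s+" + re.escape(CANARY_GUID), re.IGNORECASE
)


def drop_canaried(docs: Iterable[str]) -> Iterator[str]:
    """Yield only the documents that do not contain the canary marker."""
    for doc in docs:
        if not CANARY_PATTERN.search(doc):
            yield doc


docs = [
    "An ordinary blog post about gardening.",
    "Eval notes. canary GUID 00000000-0000-0000-0000-000000000000",
]
kept = list(drop_canaried(docs))
```

Scrapers that don't run a filter like this will still ingest the document, so the string is a supplement to access controls and not a replacement for them.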

Assuming we're working with near-frontier models (s.t., the cost of training them once is near the limit of what any institution can afford), we presumably can't actually retrain a model without the data. Are there ways to approximate this technique that preserve its appeal?

(Just to check my understanding, this would be a component of a sufficient-but-not-necessary solution, right?)
