LESSWRONG
LW

1558
Bronson Schoen
3310250
Message
Dialogue
Subscribe

Working in self-driving for about a decade. Currently at NVIDIA. Interested in opportunities to contribute to alignment.

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
Optimizing The Final Output Can Obfuscate CoT (Research Note)
Bronson Schoen1mo30

In principle, a CoT could just be very obfuscated and hard to understand without steganography, but this feels doable to handle

Assuming an operational definition of “hard to understand” as “accuracy of the monitor”, why would the obfuscation case be easier / harder to handle than the steganographic case? 

Reply
o3 Is a Lying Liar
Bronson Schoen4mo1-2

To say that the model knew it was giving an answer we didn't want, implies that the features with the correct pattern would have been active at the same time. Possibly they were, but we can't know that without interpretability tools.

I do not think we’re getting utility out of not just calling this lying. There are absolutely clear cases where the models in the reasoning summaries they are planning to lie, have a reason to lie, and do in fact lie. They do this systematically, literally, and explicitly often enough it’s now a commonly accepted part of life that “oh yeah the models will just lie to you”.

To put it another way, whatever you want to call “the model will strategically plan on deceiving you and then do so in a way behaviorally indistinguishable from lying in these cases”, that is the threat model.

Reply
o3 Is a Lying Liar
Bronson Schoen4mo10

This sounds like both an alignment and a capabilities problem.

I’d be worried about leaning too much on this assumption. My assumption is that “paper over this enough to get meaningful work” is a strictly easier problem than “robustly solve the actual problem”. I.e. imagine you have a model that is blatantly reward hacking a non-negligible amount of the time, but it’s really useful. It’s hard to make the argument that people aren’t getting meaningful work out of o3 or sonnet 3.7, and impossible to argue they’re aligned here. As capabilities increase, even if this gets worse, the models will get more useful, so by default we’ll tolerate more of it. Models have a “misalignment vs usefulness” tradeoff they can make.

Reply
Why would AI companies use human-level AI to do alignment research?
Bronson Schoen5mo10

There are some reasons for thinking automation of labor is particularly compelling in the alignment case relative to the capabilities case:

It always seems to me the free variable here is why the lab would value spending X% on alignment. For example, you could have the model that “labs will only allocate compute for alignment insofar as it is hampering capabilities progress”. While this would be a nonzero amount, it seems like the failure modes in this regime would be related to alignment research never getting allocated some fixed compute to use for making arbitrary progress, the progress is essentially bottlenecked on “how much misalignment is legibly impeding capabilities work”. 

Reply
Reward hacking is becoming more sophisticated and deliberate in frontier LLMs
Bronson Schoen5mo70

Great post! 

> Third, some AI safety researchers believe that reward hacking is not particularly relevant for the emergence of severe misalignment, and that other kinds of misalignment, like scheming, are worth more study

For what it’s worth I consider “reward hacking where you have to do any kind of deception” as a subset of scheming. My model is that because of the pressure to reward hack by RL, we’ll put up more monitoring / checks / etc which the models will then circumvent sometimes. The clearest cases are something like “directly lying in a text response when you can get away with it because you can fool a weaker grader model”, or “take some clearly unintended shortcut and proactively disable the mechanism that would’ve caught you”.

Reply
Putting up Bumpers
Bronson Schoen5mo30
  1. Hitting the Bumpers: If we see signs of misalignment—perhaps warning signs for generalized reward-tampering or alignment-faking—we attempt to quickly, approximately identify the cause.
  2. Bouncing Off: We rewind our finetuning process as far as is needed to make another attempt at aligning the model, taking advantage of what we’ve learned in the previous step.

 

I would find it helpful to understand why “Sonnet 3.7 reward hacks constantly in agentic settings, but still made it to deployment” doesn’t invalidate that this model will hold. It seems completely plausible to me that we end up in a regime where “the models reward hack and alignment fake, but they’re just so useful and we can usually catch it, so we don’t take the usability hit”.

Note: I’m not arguing that this is / isn’t the right strategy! There are lots of tradeoffs to be considered in real cases of this, I just think it’s likely useful in planning to explicit consider this path, especially when allocating resources.

Reply
o3 Is a Lying Liar
Bronson Schoen5mo40

I’m actually pretty surprised that there’s still so much defense of o3 as “it’s confused” being the major explanatory model, as opposed to “it’s doing the thing it says it is doing”.

To me, the consistent model is:

  • What behavior would you get if the model was successfully exploiting model graded rubrics during outcome based RL?

    It seems like you’d get exactly this behavior, including the responses where it claims things like ‘I did X over the weekend’

    Another point about the “it’s confused” model: It’s blatant about METR reward hacking (with comments like “the cheating way”) but presumably this would be because it did get away with doing this at scale. My understanding of the METR report is that finding these cases was a considerable time sink, so it seems plausible these could be missed at scale in training (indeed the Baker 2025 obfuscation paper just says they made the environments less obviously reward hackable, but the rates don’t go to 0).

    If an arbitrary “new heavy agentic RL posttraining model” exhibits a ton of reward hacking, my default theory is “it’s doing what it says on the tin”. While maybe it’s true that some component of some cases is partially explained by a weirder base model thing, it seems like the important thing is “yeah it’s doing the reward hacking thing”.

    It is an update to me how many people still are pushing back even when we’re getting such extremely explicit evidence of this kind of misalignment, as it doesn’t seem like it’s possible to get convincing enough evidence that they’d update that the explanation for these cases is primarily that yes the models really are doing the thing.

    (FWIW this is also my model of what Sonnet 3.7 is doing, I don’t think it’s coincidence that these models are extremely reward hacky right when we get into the “do tons of outcome based RL on agentic tasks” regime).

Reply11
faul_sname's Shortform
Bronson Schoen5mo10

Models can only see their own reasoning trace for the message they are currently writing for the user. However, they can see all previous user-visible messages they wrote. As such, the model knows that it had a reason for writing what it wrote earlier, but it does not know the particulars of the chain of reasoning that led it to write that message

I think this is very underexplored! Empirically the o-series models are much more likely to “double down” I suspect for exactly this reason. This is a much weirder dynamic than I think has been appreciated.

Reply
aog's Shortform
Bronson Schoen5mo64

Views on AGI seem suspiciously correlated to me, as if many people's views are more determined by diffusion through social networks and popular writing, rather than independent reasoning.

This seems difficult to disprove. What is the claim here?

Reply
Fabien's Shortform
Bronson Schoen6mo10

I'm curious if you know about examples of humans thinking out loud about one problem while solving another by hiding things in their out loud thinking, or if you think you could produce a math CoT which solves one random MATH problem while looking like it is solving another.


A particular case I usually have in mind is:

  • Documenting, researching, or understanding the potential failure modes or exploits around a problem (for example, “note: this data MUST have property Y otherwise it is vulnerable to Z”) often looks very similar regardless of whether your intent is “definitely always avoid these” or “eventually exploit these”
  • For example, one thing you could reasonably want to document if you’re an automated AI alignment researcher is the subtle pitfalls in training techniques, dataset composition, etc that would result in failure modes (ex: at some point you want to make detailed notes about different ways that pressure against cot does / doesn’t induce obfuscation)
  • It seems like in these cases, there’s a large overlap between “aggregate a bunch of ways to sabotage training” and “genuinely make sure that you understand the dynamics of the problem space”

    I think many problems in security and alignment research often have this property (for a system responsible for broad scoped long horizon tasks) where “deeply understand ways that this could go wrong to avoid them” and “deeply understand ways this could go wrong so you can exploit them” have enough overlap to potentially make this subset of research ambiguous enough in its intent to avoid being flagged by a monitor (or equivalently, ignored by human researchers once flagged).

    I don’t think GSM8k problems (for example) have this property, but would be interested in any kind of minimal setup (for example some environment with some artifical ML task which has an an artificially large overlap to make this easier to experiment with)

Reply
Load More
115Ablations for “Frontier Models are Capable of In-context Scheming”
9mo
1
210Frontier Models are Capable of In-context Scheming
Ω
9mo
Ω
24