LESSWRONG
LW

Bronson Schoen — LessWrong

Just nothing that I’ve found this “entangled generalization” concept extremely useful

but i contend that the main resource here is serial time rather than raw resources.

I’m very skeptical that this is currently the bottleneck for much of the safety work at OpenAI. Are you saying that both safety / alignment / interp all have sufficient headcount and everything is bottlenecked by serial time?

Replying toThe ML ontology and the alignment ontology

Bronson Schoen3d*

The ML ontology and the alignment ontology

I worry this contributes to polarizing people along "group" boundaries.

While some of these concepts are getting picked up within EA safety (e.g. by Anthropic's interpretability team),

I strongly disagree with the idea that the idea that "personas are an important way of thinking about the model" is "just getting picked up" by Anthropic. I view the recent work as largely trying to make this legible to labs that are less convinced (ex: Leo's comment), and to potentially empirically understand the boundaries of this persona framing.

the ontology gap is still large enough to cause adversarial dynamics.

I think framing "EA Safety at Anthropic" as a monolith that has adversarial dynamics with Janus for example isn't... (read 355 more words →)

Replying toThe ML ontology and the alignment ontology

Bronson Schoen3d

The ML ontology and the alignment ontology

typical lab safety work

What counts as typical lab safety work in the OpenAI case?

Replying toMonitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning

Bronson Schoen4d

Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning

I would disagree that calling this a “dumb monitor” is a fair characterization. Picking up on the model smuggling a single bit of information about gender or colleges using word choice was at least hard for me to tell as a human.

Replying toMonitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning

Bronson Schoen5d

Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning

Skimming Baker, I didn't see jailbreaking, just the model doing reward-hacking things that were not (legibly) described in the CoT. So either they are in fact illegibly described, or it's just learnt to do certain bad things without thinking about them first — that paper didn't appear to distinguish which of those was occurring. As it says: "In practice, it may be hard to distinguish whether monitoring failures are due to poor legibility or faithfulness."

Yeah but that's my point. If your prior was "a model would have to learn jailbreaking because that's easier" you would've in fact made incorrect predictions here.

Is there a previous academic paper discussing that, or is this just oral tradition?

To be clear I didn't mean my personal red teaming, There's multiple red/blue and controls papers, I don't have a list handy unfortunately.

Replying toMonitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning

Bronson Schoen5d

Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning

IMO, the main way things change if we assume that automated jailbreaking/persuasion attacks is the most likely strategy, compared to just attempting to communicate illegibly via steganography is that as in this experiment, [...] It matters because if we don't get steganography, it's likely possible to fix monitors in ways that even superhuman AIs cannot avoid.

But the whole problem here is that this particular setup (and indeed I'm arguing many of the "failed" attempts at creating a "what a narrow slice of humans specifically mean by 'true' steganography") isn't subject to optimization pressure such that it had to produce steganography. Once you introduce concepts like "fixing monitors", you're in an adversarial pressure... (read more)

Bronson Schoen8d

Maybe the confusion here is that the “Claude” in Daniel’s story I assume has gotten capable at sufficiently long horizon tasks that you do in fact necessarily get a Claude which thinks about these things.

Should I try to do this really well, or leave some messy code for the next guy?

As a concrete example, I’d expect this in particular will quickly become necessary to think about in order for coding agents to perform well on larger tasks.

I don’t think current Claude thinks about this very much, because it’s never had to, but I’m not particularly worried about current Claude.

Replying toAligning to Virtues

Bronson Schoen12d

Aligning to Virtues

Situational awareness becomes a feature not a bug. Today we try to test AI whether AIs will obey instructions in often-implausible hypothetical scenarios. But as AIs get more intelligent, trying to hide their actual situation from them will become harder and harder. Yet this doesn’t have to just be a disadvantage, but rather also something we can benefit from. Virtues are inherently context-dependent and require judgment about how to apply them (unlike rules or obedience). There are also many things that we should want AIs to do only in situations where they are confident about which situation they’re in.

I’m not clear on how the solves the problem in the contexts it’s load bearing... (read more)

Replying toMonitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning

Bronson Schoen12d

Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning

multiple people have tried quite hard to demonstrate steganography

I guess my overall point is this reasoning seems pretty backwards to me. It tends to pick one specific solution humans have thought of ahead of time, i.e. a particular form of complex steganography (not just any steganography, specifically the kind that’s just hard enough that humans think it’s hard to spot in a very particular way) and then fail to reach that exact preconceived notion in various contrived setups because the model finds other ways to circumvent the monitor.

This matters because if your threat model is “will Opus learn to do steganographic reasoning specifically in the punctuation of AIME”, you will continually fail... (read more)

Stress Testing Deliberative Alignment for Anti-Scheming Training

Mikita Balesni

Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny, alex.lloyd

5mo

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context, or for simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming.

In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate... (read 279 more words →)

127

Ablations for “Frontier Models are Capable of In-context Scheming”

AlexMeinke

AlexMeinke, Bronson Schoen, Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer, rusheb

We recently published our paper “Frontier Models are Capable of In-context Scheming”. We ran some follow-up experiments that we added to the paper in two new appendices B.5 and B.6. We summarize these follow-up experiments in this post. This post assumes familiarity with our paper’s results.

sonnet-3.5 sandbags to preserve capabilities (App. B.6)

We think one of the most surprising findings of our paper was that claude-3.5-sonnet (and claude-3-opus) strategically underperformed on multiplication questions when they believed that high performance on multiplication would trigger unlearning of its mathematical knowledge. This happened even when we didn’t give them explicit goal in the system prompt. The chain-of-thought indicated that the model was trying to avoid the unlearning procedure... (read 575 more words →)

116

Frontier Models are Capable of In-context Scheming

Marius Hobbhahn

Marius Hobbhahn, AlexMeinke, Bronson Schoen, rusheb, Jérémy Scheurer, Mikita Balesni

This is a brief summary of what we believe to be the most important takeaways from our new paper and from our findings shown in the o1 system card. We also specifically clarify what we think we did NOT show.

Paper: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations

Twitter about paper: https://x.com/apolloaisafety/status/1864735819207995716

Twitter about o1 system card: https://x.com/apolloaisafety/status/1864737158226928124

What we think the most important findings are

Models are now capable enough to do in-context scheming reasoning

We say an AI system is “scheming” if it covertly pursues misaligned goals, hiding its true capabilities and objectives. We think that in order to scheme, models likely need to be goal-directed, situationally aware, and capable enough to reason about scheming as a strategy. In principle, models might acquire situational awareness... (read 1902 more words →)

211