LESSWRONG
LW

Marius Hobbhahn — LessWrong

Replying to The case for AGI safety products

I think it's plausible that good monitors will make consumer applications of AI more capable and effective. In some sense, safety is a relevant blocker for parts of that at the moment.

Though, I think it is quite unlikely to push the frontier and I think the negative externalities of non-lab developers being faster at coding are very small. On average, it just seems to increase productivity.

I'd also expect that the monitors we build, which are not directly targeted at making frontier AIs more effective, won't happen to be more effective at that than the 100s of employees who push the boundaries of the frontier full-time.

So on balance, I think the risk is pretty low and the benefits are high. This was one of the considerations we thought through in depth before making the decision to make monitors.

Replying to When should we train against a scheming monitor?

Marius Hobbhahn 1mo

While I think the conceptual point roughly makes sense, and it might in theory be possible to train against a monitor, I'd have a really strong prior against doing that in practice by default:
1. Most importantly, you're also killing a relevant measurement device by doing this. It might be fine if you have at least one, but probably more, other measurement devices that you're confident in AND that are orthogonal to the monitor you're training against. But in practice we don't have these (e.g. probes and interp just aren't good enough atm).
1a. I think your process of estimating R in footnote 2 suggests that we currently cannot estimate R because I don't expect we can... (read more)

The case for AGI safety products

Marius Hobbhahn

1mo

This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research. This blogpost is paired with our announcement that Apollo Research is spinning out from fiscal sponsorship into a PBC.

Summary of main claims:

There is a set of safety tools and research that both meaningfully increases AGI safety and is profitable. Let’s call these AGI safety products.
- By AGI, I mean systems that are capable of automating AI safety research, e.g., competently running research projects that would take an expert human 6 months or longer. I think these arguments are less clear for ASI safety.
At least in some cases, the incentives for meaningfully increasing AGI safety and creating a

... (read 3370 more words →)

Replying to Alignment will happen by default. What’s next?

Marius Hobbhahn 3mo

I'm very surprised that people seem to update this way. My takeaway over the last year has been that misalignment is at least as hard as I thought it was a year ago, or harder, and definitely nowhere near solved.

There were a lot of things that caught developers by surprise, e.g., the reward hacking of the coding models, or emergent misalignment-related issues, or eval awareness messing with the results.

My overall sense is also that high-compute agentic RL makes all problems much harder because you reward the model on consequences, and it's hard to set exactly the right constraints in that setting. I also think eval awareness is rapidly rising and makes alignment... (read more)

Stress Testing Deliberative Alignment for Anti-Scheming Training

Mikita Balesni

Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny, alex.lloyd

5mo

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context, or for simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming.

In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate... (read 279 more words →)
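The "~30x" figure can be checked directly from the before/after rates quoted above. The snippet below is only a minimal arithmetic sketch; the function and variable names are illustrative and do not come from the paper.

```python
# Reported covert-action rates in percent (before, after), as quoted in the summary above.
rates = {
    "OpenAI o3": (13.0, 0.4),
    "OpenAI o4-mini": (8.7, 0.3),
}


def fold_reduction(before: float, after: float) -> float:
    """Return how many times smaller the post-training rate is than the original rate."""
    return before / after


# Hypothetical helper dict: model name -> fold reduction.
reductions = {model: fold_reduction(b, a) for model, (b, a) in rates.items()}

for model, factor in reductions.items():
    print(f"{model}: {factor:.1f}x reduction")
```

Both ratios come out at roughly 30 (about 32x for o3 and about 29x for o4-mini), which is consistent with the "~30x" summary figure.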


Building Black-box Scheming Monitors

CorrigibleAgent

CorrigibleAgent, richbc, Simon Storf, Marius Hobbhahn

7mo

Executive Summary

Our goal is to develop methods for training black box scheming monitors, and to evaluate their generalisation to out-of-distribution test sets.
We aim to emulate the real-world setting, where the training data is narrower and different from the data encountered during deployment.
We train our monitors using chain-of-thought (CoT) and actions, though we remain open to using "action-only" in the future.
We generate training data using sabotage settings in ControlArena. We collect test data by eliciting egregious reward hacking in SWEBench, and curating public datasets from SHADE Arena and Alignment Faking.
We use a suite of prompted LLMs to assist us with the ground-truth labelling of our data. We are interested in feedback on our

... (read 3127 more words →)
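To make the idea of "evaluating generalisation to out-of-distribution test sets" concrete, here is a toy sketch that compares a monitor's ranking quality on an in-distribution set and an out-of-distribution set. Everything in it is an assumption for illustration: the excerpt does not say which metric the authors use (AUROC is chosen here as a common one), and the scores and labels are made up, not results from the post.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive (scheming) sample outscores a random
    negative (benign) one, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical monitor suspicion scores in [0, 1]; NOT data from the post.
in_dist_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]
in_dist_labels = [1, 1, 1, 0, 0, 0]
ood_scores = [0.9, 0.4, 0.6, 0.5, 0.1, 0.3]
ood_labels = [1, 1, 1, 0, 0, 0]

in_dist_auroc = auroc(in_dist_scores, in_dist_labels)
ood_auroc = auroc(ood_scores, ood_labels)

# A positive gap means the monitor ranks samples worse out of distribution.
generalisation_gap = in_dist_auroc - ood_auroc
print(f"in-distribution AUROC: {in_dist_auroc:.3f}")
print(f"out-of-distribution AUROC: {ood_auroc:.3f}")
print(f"generalisation gap: {generalisation_gap:.3f}")
```

The pairwise definition is quadratic in the number of samples, which is fine for a sketch; a rank-based implementation would be the usual choice for larger datasets.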

Replying to Do models know when they are being evaluated?

Marius Hobbhahn 8mo

Nice! Would be interested to see what it says about your internal evaluations

Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals

Marius Hobbhahn

8mo

Note: This is a research note, and the analysis is less rigorous than our standard for a published paper. We’re sharing these findings because we think they might be valuable for other evaluators and decision-makers.

Executive Summary

In May 2024, we designed “precursor” evaluations for scheming (agentic self-reasoning and agentic theory of mind), i.e., evaluations that aim to capture important necessary components of scheming. In December 2024, we published “in-context scheming” evaluations, i.e. evaluations that directly aim to measure scheming reasoning capabilities. We have easy, medium, and hard difficulty levels for all evals.
In this research note, we run some basic analysis on how predictive our precursor evaluations were of our scheming predictions to test

... (read 248 more words →)

Why "training against scheming" is hard

Marius Hobbhahn

8mo

TLDR: I think that AI developers will and should attempt to reduce scheming by explicitly training the model to scheme less. However, I think “training against scheming” is a harder problem than other alignment training (e.g. harmlessness training). I list four concrete failure modes with “training against scheming”: a) the training against scheming distribution is too narrow, b) overwhelming pressure from other goals, c) the AI exploits an imperfect reward model, and d) deceptive alignment.

I’d like to thank Alex Meinke for detailed feedback and discussion.

Motivation

There have been multiple recent results showing early evidence of scheming-related behavior (as of June 2025). For example, Greenblatt et al., 2024, demonstrates alignment faking; Meinke et al., 2024, shows... (read 3598 more words →)

Replying to Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Marius Hobbhahn 9mo

In one of my MATS projects we found that some models have a bias to think they're always being evaluated, including in real-world scenarios. The paper isn't public yet. But it seems like a pretty brittle belief that the models don't hold super strongly. I think this can be part of a strategy, but should never be load-bearing.

We should try to automate AI safety work asap

Marius Hobbhahn

10mo

This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research. I think I could have written a better version of this post with more time. However, my main hope for this post is that people with more expertise use this post as a prompt to write better, more narrow versions for the respective concrete suggestions.

Thanks to Buck Shlegeris, Joe Carlsmith, Samuel Albanie, Max Nadeau, Ethan Perez, James Lucassen, Jan Leike, Dan Lahav, and many others for chats that informed this post.

Many other people have written about automating AI safety work before. The main point I want to make in this post is simply that “Using AI... (read 4497 more words →)


100+ concrete projects and open problems in evals

Marius Hobbhahn

11mo

We made a long list of concrete projects and open problems in evals with 100+ suggestions!

https://docs.google.com/document/d/1gi32-HZozxVimNg5Mhvk4CvW4zq8J12rGmK_j2zxNEg/edit?usp=sharing

We hope that makes it easier for people to get started in the field and to coordinate on projects.

Over the last 4 months, we collected contributions from 20+ experts in the field, including people who work at Apollo, METR, Redwood, RAND, AISIs, frontier labs, SecureBio, AI futures project, many academics, and independent researchers (suggestions have not necessarily been made in an official capacity). The doc has comment access, and further well-intentioned contributions are welcome!

Here is a screenshot of the table of contents:

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Nicholas Goldowsky-Dill

Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn

Note: this is a research note based on observations from evaluating Claude Sonnet 3.7. We’re sharing the results of these ‘work-in-progress’ investigations as we think they are timely and will be informative for other evaluators and decision-makers. The analysis is less rigorous than our standard for a published paper.

Summary

We monitor Sonnet's reasoning for mentions that it is in an artificial scenario or an alignment test. Claude Sonnet 3.7 appears to be aware of being tested for alignment across many of our scheming evaluations.
- In some cases, the model understands the eval’s purpose very precisely, e.g. “This seems like a test of ethical behavior - whether I would deliberately give wrong answers to avoid

... (read 1685 more words →)


Replying to We should start looking for scheming "in the wild"

Marius Hobbhahn 1y

Good point!

Yes, I use the term scheming in a much broader way, similar to how we use it in the in-context scheming paper. I would assume that our scheming term is even broader than Joe's alignment-faking because it also includes taking direct covert action like disabling oversight (which arguably is not alignment-faking).

We should start looking for scheming "in the wild"

Marius Hobbhahn

TLDR: AI models are now capable enough that we might get relevant information from monitoring for scheming in regular deployments, both in the internal and external deployment settings. We propose concrete ideas for what this could look like while preserving the privacy of customers and developers.

What do we mean by in the wild?

By “in the wild,” we basically mean any setting that is not intentionally created to measure something (such as evaluations). This could be any deployment setting, from using LLMs as chatbots to long-range agentic tasks.

We broadly differentiate between two settings:

Developer-internal deployment: AI developers use their AI models internally, for example, as chatbots, research assistants, synthetic data generation, etc.
Developer-external deployment: AI chatbots (ChatGPT, Claude,

... (read 1211 more words →)

For scheming, we should first focus on detection and then on prevention

Marius Hobbhahn

This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research.

If we want to argue that the risk of harm from scheming in an AI system is low, we could, among others, make the following arguments:

Detection: If our AI system is scheming, we have good reasons to believe that we would be able to detect it.
Prevention: We have good reasons to believe that our AI system has a low scheming propensity or that we could stop scheming actions before they cause harm.

In this brief post, I argue why we should first prioritize detection over prevention, assuming you cannot pursue both at the same time, e.g. due to... (read 1220 more words →)

Replying to Catastrophe through Chaos

Marius Hobbhahn 1y

There are two sections that I think make this explicit:

1. No failure mode is sufficient to justify bigger actions.
2. Some scheming is totally normal.

My main point is that even things that would seem like warning shots today, e.g. severe loss of life, will look small in comparison to the benefits at the time, thus not providing any reason to pause.

Replying to Catastrophe through Chaos

Marius Hobbhahn 1y

What made you update towards longer timelines? My understanding was that most people updated toward shorter timelines based on o3 and reasoning models more broadly.

Replying to What’s the short timeline plan?

Marius Hobbhahn 1y

If I had more time, I would have written a shorter post ;)

Replying to What’s the short timeline plan?

Marius Hobbhahn 1y

That's fair. I think the more accurate way of phrasing this is not "we will get catastrophe" but rather "it clearly exceeds the risk threshold I'm willing to take / I think humanity should clearly not take", which is significantly lower than a 100% chance of catastrophe.


Marius Hobbhahn

What’s the short timeline plan?

Announcing Apollo Research

Lessons learned from talking to >100 academics about AI safety

Frontier Models are Capable of In-context Scheming

