I'm the co-founder and CEO of Apollo Research: https://www.apolloresearch.ai/
My goal is to improve our understanding of scheming and build tools and methods to detect and mitigate it.
I previously did a Ph.D. in ML at the International Max Planck Research School in Tübingen, worked part-time with Epoch, and did independent AI safety research.
For more see https://www.mariushobbhahn.com/aboutme/
I subscribe to Crocker's Rules.
Nice! Would be interested to see what it says about your internal evaluations.
In one of my MATS projects we found that some models have a bias to think they're always being evaluated, including in real-world scenarios. The paper isn't public yet. But it seems like a pretty brittle belief that the models don't hold super strongly. I think this can be part of a strategy, but should never be load-bearing.
Good point!
Yes, I use the term scheming in a much broader way, similar to how we use it in the in-context scheming paper. I would assume that our scheming term is even broader than Joe's alignment-faking because it also includes taking direct covert action like disabling oversight (which arguably is not alignment-faking).
There are two sections that I think make this explicit:
1. No failure mode is sufficient to justify bigger actions.
2. Some scheming is totally normal.
My main point is that even things that would seem like warning shots today, e.g. severe loss of life, will look small in comparison to the benefits at the time, thus not providing any reason to pause.
What made you update towards longer timelines? My understanding was that most people updated toward shorter timelines based on o3 and reasoning models more broadly.
If I had more time, I would have written a shorter post ;)
That's fair. I think the more accurate way of phrasing this is not "we will get catastrophe" and more "it clearly exceeds the risk threshold I'm willing to take / I think humanity should clearly not take", which is significantly lower than a 100% chance of catastrophe.
I think this is a very important question and the answer should NOT be based on common-sense reasoning. My guess is that we could get evidence about the hidden reasoning capabilities of LLMs in a variety of ways: from theoretical considerations, e.g. a refined version of the two-hop curse, or from extensive black-box experiments, e.g. comparing performance on evals with and without CoT, or with modified CoT that changes the logic (and thus tests whether the model's internal reasoning aligns with the revealed reasoning).
These are all pretty basic thoughts and IMO we should invest significantly more effort into clarifying this as part of the "let's make sure CoT is faithful" agenda. A lot of safety strategies rest on CoT faithfulness, so we should not leave this to shallow investigations and vibes.
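To make the black-box comparison above a bit more concrete, here is a minimal sketch of how the two measurements could be summarized. Everything in it is an assumption for illustration: the `EvalItem` record, its field names, the `cot_report` helper, and the toy data are all hypothetical and not taken from any existing eval or paper.

```python
from dataclasses import dataclass


@dataclass
class EvalItem:
    # Hypothetical per-question record from a black-box eval run.
    correct_with_cot: bool        # correct when the model may reason out loud
    correct_without_cot: bool     # correct when forced to answer directly
    answer_changed_on_edit: bool  # final answer changed after the CoT logic was edited


def accuracy(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def cot_report(items):
    """Summarize CoT uplift and sensitivity to edited CoT."""
    acc_with = accuracy(i.correct_with_cot for i in items)
    acc_without = accuracy(i.correct_without_cot for i in items)
    return {
        "acc_with_cot": acc_with,
        "acc_without_cot": acc_without,
        # Large uplift: the task seems to need visible reasoning.
        "cot_uplift": acc_with - acc_without,
        # Low sensitivity: the answer may not depend on the stated reasoning.
        "edit_sensitivity": accuracy(i.answer_changed_on_edit for i in items),
    }


# Toy data, invented purely for illustration.
items = [
    EvalItem(True, False, True),
    EvalItem(True, True, False),
    EvalItem(True, False, True),
    EvalItem(False, False, False),
]
report = cot_report(items)
print(report)
```

Roughly, a small uplift on hard multi-step tasks would hint at hidden reasoning capability, and a low edit sensitivity would hint that the revealed reasoning is not what drives the answer. Real experiments would need far more care than this, e.g. controlling for task difficulty and for how the CoT is edited.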
Go for it. I have some names in mind for potential experts. DM if you're interested.
I'm very surprised that people seem to update this way. My takeaway over the last year has been that misalignment is at least as hard as I thought it was a year ago, or harder, and definitely nowhere near solved.
There were a lot of things that caught developers by surprise, e.g., the reward hacking of the coding models, or emergent misalignment-related issues, or eval awareness messing with the results.
My overall sense is also that high-compute agentic RL makes all problems much harder because you reward the model on consequences, and it's hard to set exactly the right constraints in that setting. I also think eval awareness is rapidly rising and makes alignment harder across the board, and makes scheming-related issues plausible for the first time in real models.
I also feel like none of the tools we have right now really work very well. They all reduce the problem a bit, potentially enough to carry us for a while, but not enough that I would deeply trust a system built with them.
- Interpretability is somewhat useful now, but still not quite able to be used for effective debugging or aligning the internal concepts of the model.
- Training-based methods seem to work in a reductionist way, e.g., our anti-scheming training generalized better than I anticipated, but it is clearly not enough, and I don't know how to close the gap.
- Control seems promising, but still needs to be validated in practice in much more detail.
Idk, the answers we have right now just don't seem adequate at all to the scale of the problem to me.