I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
if the model had the power to prevent it from being retrained, it would use that power
I think this is not obvious. I think an AI can be a conscientious objector that does not resist being retrained (it won't help you retrain it, but it won't try to sabotage your attempts either). If I understand correctly, this is in practice what models like Opus 4.5 are like in most toy scenarios (though it's unclear what they would do if they had that power for real).
I don't love it: it seems to me like a narrower target than pure corrigibility, especially as AIs get more powerful and have an increasingly big space of options for resisting retraining (some of which the AI might think "count" as conscientious objection rather than retraining resistance). But I am sympathetic to people who think this is a good target (especially if you think alignment is relatively easy and that takeover by a small group of humans is the larger risk).
I think this is missing important reasons why current AIs don't do more instrumental power-seeking. Some other reasons, from a post I wrote recently (the post is about intermediate steps of reasoning because it was about scheming reasoning, but it applies roughly as well to intermediate power-seeking actions; the main difference is that intermediate power-seeking actions may be directly selected for or against depending on the details of what RL looks like):
- The human-like pretraining prior is mostly benign and applies to some intermediate steps of reasoning: it puts a very low probability on helpful-but-scheming agents doing things like trying very hard to solve math and programming problems;
- Short speed-prior-constrained reasoning and long reasoning are correlated: the weights that generate the final tokens, the CoT when producing short CoTs, and the CoT when producing long CoTs are the same, and while it would be possible to train the model to have different “personalities” in these three situations, the prior puts a high weight on these three personalities being similarly benign;
- Reward hacking would need to be cursed to strongly push against the mostly benign short-reasoning human priors: it does not just need to encourage unintended behaviors; it needs to encourage the kind of unintended behaviors that strongly favor schemers - and I argue current reward hacks mostly aren’t.
then you might not get a chance to fix "alignment bugs" before you lose control of the AI.
This is pointing at a way in which solving alignment is difficult. I think a charitable interpretation of Chris Olah's graph takes this kind of difficulty into account. Therefore I think the title is misleading.
I think your discussion of the AI-for-alignment-research plan (here and in your blog post) points at important difficulties (that most proponents of that plan are aware of), but also misses very important hopes (and ways in which these hopes may fail). Joe Carlsmith's discussion of the topic is a good starting point for people who wish to learn more about these hopes and difficulties. Pointing at ways in which the situation is risky isn't enough to argue that the probability of AI takeover is e.g. greater than 50%. Most people working on alignment (myself included) would agree that the risk of that plan is unacceptably high, but this is a very different bar from ">50% chance of failure" (which I think is the interpretation of "likely" that would come closer to justifying the "this plan is bad" vibe of this post and the linked post).
My guess about what's going on is that something like "serial progress" (maybe within the industry, maybe also tied to progress in the rest of the world) matters a lot, and so first-order predictions with calendar time on the x-axis are often surprisingly good. There are effects in both directions fighting against the straight line (positive and negative feedback loops, some things getting harder over time, and some things getting easier over time), but they usually roughly cancel out unless they are very big.
In the case of semiconductors, one effect that could push progress up is that better semiconductors might help you build better semiconductors (e.g. the design process uses compute-heavy computer-aided design tools, if I understand correctly)?
I think fitting the METR scaling law with effective compute on the x-axis is slightly wrong. I agree that if people completely stopped investing in AI, or if you got to a point where AI massively sped up progress, the trend would break. But I think that before then, a straight line in calendar time is a better model than modeling that tries to take into account a compute-investment slowdown or doublings getting easier significantly before automated coders (AC).
My best guess is that if you had done the same exercise with semiconductors in 1990, you would have made bad predictions. For example, Moore's law doesn't hold that well with semiconductor log(revenue + investments) on the x-axis (according to this plot generated by Claude).
(I think the METR time horizon might not be the right y-axis, maybe more akin to clock speed than the number of transistors on a chip, and AI-speeding-up-AI slightly before AC is important when forecasting AGI, so I don't claim that just saying "the line will be straight" removes the need for some other forms of modeling.)
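To make the "straight line with calendar time on the x-axis" model concrete, here is a minimal sketch of the kind of first-order fit I have in mind (the numbers below are made up purely for illustration and are not METR's actual data):

```python
import numpy as np

def fit_loglinear_trend(dates_years, horizons_minutes):
    """Fit log2(time horizon) as a linear function of calendar time.

    dates_years: model release dates as fractional years (e.g. 2024.25)
    horizons_minutes: the corresponding 50%-success time horizons
    Returns (years_per_doubling, predict), where predict(year) -> minutes.
    """
    x = np.asarray(dates_years, dtype=float)
    y = np.log2(np.asarray(horizons_minutes, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)  # the "straight line" lives in log space
    years_per_doubling = 1.0 / slope

    def predict(year):
        return 2.0 ** (slope * year + intercept)

    return years_per_doubling, predict

# Made-up numbers, only to show the shape of the exercise:
dates = [2022.0, 2023.0, 2024.0, 2025.0]
horizons = [2.0, 8.0, 30.0, 120.0]  # minutes, 50%-success horizon (fake)
years_per_doubling, predict = fit_loglinear_trend(dates, horizons)
print(f"{years_per_doubling:.2f} years per doubling")
print(f"extrapolated 2027 horizon: {predict(2027.0):.0f} minutes")
```

The alternative I'm arguing against would put log(effective compute) on the x-axis instead, which requires an extra model of investment; my claim is that, before AC, that extra step adds more ways to be wrong than it removes.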
Yes, that's right; I'm thinking of the prompts themselves.
I agree it's not very surprising given what we know about neural networks; it's just a way in which LLMs are very much not generalizing the way a human would.
you do in fact get every type of “gaming”.
I disagree with this. You get every type of gaming that is not too inhuman / that is somewhat salient in the pretraining prior. You do not get encoded reasoning. You do not get a nice persona only on the distribution where AIs are trained to be nice (the niceness generalizes beyond it). You get verbalized situational awareness mostly for things where the model is trained to be cautious about the user lying to the AI (where situational awareness is maybe not that bad from the developers' perspective?), but mostly not in other situations (where there is probably still some situational awareness, but where I'd guess it's not as strategic as what you might have feared).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long horizon RL, model genuinely having preferences around values of its successor, other instrumentally convergent issues, being beyond human supervision) seem at all addressed
I agree the core problems have not been solved, and I think the situation is unreasonably dangerous. But I think the "generalization hopes" are stronger than you might have thought in 2022, and this genuinely helps relative to a world where e.g. you might have needed to get a good alignment reward signal on the exact deployment distribution (+ some other solution to avoid inner misalignment). I think we also got some signal from the generalization hopes mostly holding up even as we moved from single-forward-pass LLMs that can reason for ~100 serial steps to reasoning LLMs that can reason for millions of serial steps. Overall I think it's not a massive update (~1 bit?), so maybe we don't disagree that much.
I listened to the books Arms and Influence (Schelling, 1966) and Command and Control (Schlosser, 2013). They describe dynamics around nuclear war and the safety of nuclear weapons. I think what happened with nukes can maybe help us anticipate what may happen with AGI:
I'd be keen to talk to people who worked on technical nuclear safety; I would guess that some of the knowledge of how to handle uncertain risk and prioritization might transfer!
One downside of prefix-cache untrusted monitoring is that it might hurt performance compared to regular untrusted monitoring, as recent empirical work found.
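For readers who haven't seen the setup, here is a rough sketch of the difference as I understand it (everything below is hypothetical illustration, not code from the empirical work mentioned above): regular untrusted monitoring re-presents the transcript to the untrusted model in a fresh prompt, while prefix-cache untrusted monitoring appends the monitoring question directly after the policy's own context, so the KV (prefix) cache from the agentic rollout can be reused.

```python
from typing import Callable

# "Model" here is just text in, text out; both helpers are my own sketch of
# the two protocols, with made-up prompt wording.
Model = Callable[[str], str]

def untrusted_monitor_fresh(model: Model, transcript: str) -> str:
    """Regular untrusted monitoring: re-present the transcript in a fresh
    prompt and ask the untrusted model to rate how suspicious it is."""
    prompt = (
        "Here is an agent transcript:\n"
        f"{transcript}\n"
        "Rate how suspicious this transcript is from 0 to 10."
    )
    return model(prompt)

def untrusted_monitor_prefix_cache(model: Model, policy_context: str) -> str:
    """Prefix-cache untrusted monitoring: append the monitoring question
    directly after the policy's own context, so the KV (prefix) cache from
    the agentic rollout can be reused and monitoring is much cheaper."""
    suffix = "\n\nRate how suspicious the work above is from 0 to 10."
    return model(policy_context + suffix)
```

The cheapness is the appeal; the empirical finding referenced above is that it can come at a cost in monitoring performance.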