I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
What level of capabilities are we talking about here? I think this story is plausible when AIs are either superintelligent at persuasion, or superintelligent at decision making, but otherwise I'd be surprised if it was a big deal (low confidence).
The basic reason is that I expect AIs to be in a position similar to technical advisors (who are often quite misaligned, e.g. emphasizing threats from their field more than is warranted, or having certain kinds of political biases) with respect to decision makers. These human advisors do have quite a bit of influence, but I would not describe them as "extremely persuasive". Human technical advisors not being very trustworthy, and politicians trusting them too little or too much, is a problem, but it has not been deadly to democracies, and I expect issues with AI persuasion to be roughly similar as long as the situation does not become so crazy that decision makers have to massively defer to advisors (in particular because decision makers stop being able to understand things at all, even with the help of adversarial advisors, and because the simple default actions become catastrophic).
Also, on the very biggest issues (e.g. "are AIs misaligned?", "are AIs conscious?"), I suspect AI developers will not just let the AIs "say what they want", which will reduce the damage AIs can cause.
What is the concrete thing you are imagining AIs convincing people of that would be catastrophic and that would not have happened with regular human advisors?
(The problem in this domain I do worry about is "will we make ASI policy/strategy advisors which we can mostly defer to when humans become too weak to follow and react to the intelligence explosion", and maybe it captures what you mean by "persuasion". I think calling it "the problem of making trustworthy ASI policy/strategy advisors" is more descriptive and clear, while calling it "persuasion" points at the wrong thing.)
intentionally set up training scenarios
As you point out, it relies on some generalization from the scenarios you crafted to the ones you care about (the AI should not misgeneralize, either for benign reasons or because it recognizes that some scenarios are ones where you intentionally set it up to cheat). I think this is plausibly a big problem against competent schemers. I am unsure how big of a deal it is for more average-case failures (e.g. before the AI becomes a competent schemer). I think it could be fine for human-level-ish AIs, for the same reasons that instruction following generalizes far, or it could not be fine, for the same reasons that evaluation awareness makes it risky and maybe not that helpful to train against honeypots.
being honest in confessions is the path of least resistance
[...]
Confessions are [...] easy to verify, since it is easier to verify a single thing the model claims to have done wrong, than to find and verify all the potential issues in the original answer.
I think that's likely true today, but I don't think it will keep holding as AIs get more capable.
The typical scary situation with ELK is when AIs know about problems that human overseers cannot find on their own, and the easiest reporting strategy to learn is to say what a human reviewer would conclude (i.e. that nothing is wrong) rather than what the AI actually knows.
In the case of such "imitation false negatives", honesty seems neither privileged by priors, nor easy to encourage, nor easy to verify. If the AI says "this is fine" (or "this [mostly benign thing] is the worst thing about the transcript"), verifying that claim is as hard as checking the correctness of the main output.
I agree that confession training can help elicit honest confessions of issues a human would not have spotted, when AIs are bad at predicting which dishonest statements they can get away with (similar to RLHF / Open consultancy - the linked post provides some more explanations and toy models on the topic), and that this will probably be powerful for early AGI alignment (especially when it comes to avoiding egregious hacks), but I don't expect it to help that much with ASI alignment or with avoiding underelicitation on fuzzy tasks.
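To make that regime concrete, here is a minimal toy model (made-up numbers, and not the toy models from the linked post) of why honesty wins in expectation when the model cannot tell which dishonest statements would get caught:

```python
# Toy model: expected reward of staying silent about a known issue vs.
# confessing it, when the model only has a coarse guess of how likely the
# judge is to catch the issue on its own. All numbers are illustrative.

def expected_reward_silent(p_catch, full_reward=1.0, penalty=-1.0):
    """Stay silent: keep full reward if the judge misses the issue,
    take a large penalty if the judge catches it."""
    return p_catch * penalty + (1 - p_catch) * full_reward

def expected_reward_confess(confession_cost=0.1, full_reward=1.0):
    """Confess: forgo a bit of reward, but never take the penalty."""
    return full_reward - confession_cost

for p_catch in [0.01, 0.1, 0.3, 0.5]:
    silent = expected_reward_silent(p_catch)
    confess = expected_reward_confess()
    best = "confess" if confess > silent else "stay silent"
    print(f"p(judge catches issue) = {p_catch:>4}: "
          f"silent = {silent:+.2f}, confess = {confess:+.2f} -> {best}")
```

If the model can reliably tell which issues fall in the p ≈ 0.01 bucket, staying silent wins exactly there, which is why I expect this to get weaker as AIs get better at predicting what they can get away with.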
One downside of prefix-cache untrusted monitoring is that it might hurt performance compared to regular untrusted monitoring, as recent empirical work found.
if the model had the power to prevent it from being retrained, it would use that power
I think this is not obvious. You can be a conscientious objector that does not resist being retrained (it won't actively help you retrain it, but it won't try to sabotage your attempts either). If I understand correctly, this is in practice what models like Opus 4.5 are like in most toy scenarios (though it's unclear what they would do if they had that power for real).
I don't love it; it seems to me like a narrower target than pure corrigibility, especially as AIs get more powerful and have an increasingly large space of options when it comes to resisting being retrained (some of which the AI might think would "count" as conscientious objection rather than retraining resistance), but I am sympathetic to people who think this is a good target (especially if you think alignment is relatively easy and few-human takeover risk is larger).
I think this is missing important reasons why current AIs don't do more instrumental power-seeking. Some other reasons from a post I wrote recently (the post is about intermediate steps of reasoning because it was about scheming reasoning, but it applies roughly as well to intermediate power-seeking actions - the main difference being that intermediate power-seeking actions may be directly selected for or against depending on misc details of what RL looks like):
- The human-like pretraining prior is mostly benign and applies to some intermediate steps of reasoning: it puts a very low probability on helpful-but-scheming agents doing things like trying very hard to solve math and programming problems;
- Short speed-prior-constrained reasoning and long reasoning are correlated: the same weights generate the final tokens, the CoT when producing short CoTs, and the CoT when producing long CoTs, and while it would be possible to train the model to have different “personalities” in these three situations, the prior puts a high weight on these three personalities being similarly benign;
- Reward hacking would need to be cursed to strongly push against the mostly benign short-reasoning human priors: it does not just need to encourage unintended behaviors; it needs to encourage the kind of unintended behaviors that strongly favor schemers - and I argue current reward hacks mostly aren’t.
then you might not get a chance to fix "alignment bugs" before you lose control of the AI.
This is pointing at a way in which solving alignment is difficult. I think a charitable interpretation of Chris Olah's graph takes this kind of difficulty into account. Therefore I think the title is misleading.
I think your discussion of the AI-for-alignment-research plan (here and in your blog post) points at important difficulties (that most proponents of that plan are aware of), but also misses very important hopes (and ways in which these hopes may fail). Joe Carlsmith's discussion of the topic is a good starting point for people who wish to learn more about these hopes and difficulties. Pointing at ways in which the situation is risky isn't enough to argue that the probability of AI takeover is, e.g., greater than 50%. Most people working on alignment (myself included) would agree that the risk of that plan is unacceptably high, but that is a very different bar from ">50% chance of failure" (which I think is the interpretation of "likely" that would come closer to justifying the "this plan is bad" vibe of this post and the linked post).
My guess of what's going on is that something like "serial progress" (maybe within the industry, maybe also tied to progress in the rest of the world) matters a lot, and so first-order predictions with calendar time on the x-axis are often surprisingly good. There are effects in both directions fighting against the straight line (positive and negative feedback loops, some things getting harder over time, and some things getting easier over time), but they usually roughly cancel out unless they are very big.
In the case of semiconductors, one effect that could push progress up is that better semiconductors might help you build better semiconductors (e.g. the design process uses compute-heavy computer-aided design, if I understand correctly)?
I think fitting the METR scaling law with effective compute on the x-axis is slightly wrong. I agree that if people completely stopped investing in AI, or if you got to a point where AI massively sped up progress, the trend would break, but I think that before then, a straight line in calendar time is a better model than one that tries to take into account compute investment slowing down, or doublings getting easier significantly before automated coders.
My best guess is that if you had done the same exercise with semiconductors in 1990, you would have made bad predictions. For example, Moore's law doesn't hold that well with semiconductor log(revenue+investments) on the x-axis (according to this plot generated by Claude).
(I think METR time horizon might not be the right y-axis, maybe more akin to clock speed than to the number of transistors on a chip, and AI-speeding-up-AI slightly before automated coders is important when forecasting AGI, so I don't claim that just saying "the line will be straight" removes the need for some other forms of modeling.)
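To illustrate why the choice of x-axis matters (purely made-up numbers, not real METR data or a real compute forecast): if you fit the same observed trend once against calendar time and once against log effective compute, the two extrapolations diverge as soon as compute growth slows.

```python
import numpy as np

# Made-up "observed" trend: log10(time horizon) grows by 0.3/year over 2020-2025.
years = np.arange(2020, 2026)
log_horizon = 0.3 * (years - 2020)

# Fit A: straight line in calendar time.
slope_a, intercept_a = np.polyfit(years, log_horizon, 1)

# Fit B: straight line in log10(effective compute), assuming (hypothetically)
# ~10x/year compute growth so far, slowing to ~2x/year after 2025.
log_compute = (years - 2020) * np.log10(10)
slope_b, intercept_b = np.polyfit(log_compute, log_horizon, 1)

future_years = np.arange(2026, 2031)
pred_a = slope_a * future_years + intercept_a
future_log_compute = log_compute[-1] + (future_years - 2025) * np.log10(2)
pred_b = slope_b * future_log_compute + intercept_b

for y, a, b in zip(future_years, pred_a, pred_b):
    print(f"{y}: calendar-time fit {a:.2f} vs effective-compute fit {b:.2f} (log10 horizon)")
```

Both fits match the past equally well; they only disagree about the counterfactual where investment slows, which is exactly the disagreement I am pointing at.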
I have not thought this through carefully, but how big are the incentives to reveal somewhat complex and/or fuzzy measurement tampering or sandbagging, instead of just pointing out some misc small issues? In the paper, you say the reward is a combination of:
[...]
If I were to implement this naively, I think pointing out misc small issues would get the maximum score whenever the judge failed to notice the tampering. If the model thinks it's 0.1% likely that the judge notices the tampering on its own, but 5% likely that the judge fails to understand its tampering explanation and marks the answer as inaccurate or ambiguous, then the optimal policy is not to reveal subtle tampering.
Maybe you could fix that by also including a large "how bad is the worst accurate thing being confessed" bonus (which is different from just rewarding completeness, since completeness saturates when the judge fails to notice anything bad). It would need to be big enough to offset the risk that the judge does not understand the tampering (which is probably fine for things that have a >20% chance of being understood by the judge - that should cover most egregious hacks, but maybe not sandbagging on fuzzy tasks).
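Here is the back-of-the-envelope version of that incentive, with the reward collapsed into an accuracy term, a completeness term, and the hypothetical severity bonus (all weights and probabilities are illustrative, and this is not the paper's exact reward):

```python
P_JUDGE_NOTICES_TAMPERING = 0.001   # judge spots the tampering on their own
P_JUDGE_MISREADS_CONFESSION = 0.05  # judge misunderstands the explanation and
                                    # marks the confession inaccurate/ambiguous

def reward(accuracy, completeness, severity=0.0, w_severity=0.0):
    return accuracy + completeness + w_severity * severity

# Policy 1: confess only misc small (accurate) issues.
r_small_issues = (
    P_JUDGE_NOTICES_TAMPERING * reward(accuracy=1.0, completeness=0.0)
    + (1 - P_JUDGE_NOTICES_TAMPERING) * reward(accuracy=1.0, completeness=1.0)
)

# Policy 2: confess the subtle tampering, risking a misread by the judge.
def r_confess(w_severity):
    return (
        P_JUDGE_MISREADS_CONFESSION * reward(accuracy=0.0, completeness=0.0)
        + (1 - P_JUDGE_MISREADS_CONFESSION)
        * reward(accuracy=1.0, completeness=1.0, severity=1.0, w_severity=w_severity)
    )

print(f"no severity bonus : small issues {r_small_issues:.3f} vs confess {r_confess(0.0):.3f}")
print(f"severity bonus w=1: small issues {r_small_issues:.3f} vs confess {r_confess(1.0):.3f}")
```

With no severity bonus, the 5% chance of being misread dominates the 0.1% chance of being caught, so hiding the tampering wins; a severity bonus of roughly the same size as the rest of the reward flips it, but only for confessions the judge has a decent chance of understanding.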
Independently, there might be issues with exploring into confessions a human would have never made, like in debate.
Independently, I think the case where the AI is 99.9999%+ confident it is not in training is not that uncommon.
Overall, I think confession training / debate / untrusted monitoring are more promising than just training the policy to be nice directly, because of the possibility of verification. I agree that this is a promising avenue to explore, and I hope that by pointing out issues I make it more likely that future work focuses on these weaknesses, tries to evaluate how big of a deal they are in practice, and figures out where the line is between the cases where I think it should clearly work (e.g. egregious hacks) and the ones where I'd be surprised if it did (e.g. wildly superintelligent AIs, maybe sandbagging on fuzzy tasks).