I think this hypothetical identifies a crux and my take is that it is quite technologically doable. It might even be doable by US with current technology, but my main worry is that people will make bad decisions.
I’m less sure whether an individual frontier lab could do it.
Note that the AI can be corrigible to its developers - this isn’t in tension with subverting other projects. It doesn’t need to be a sovereign - it can be guided by human input somewhat like today. I’m not confident that alignment to this target will ~continue to be relatively easy but this seems like a highly plausible trajectory.
Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems.
This model seems far too simplified, and I don't think it leads to the right conclusions in many important cases (e.g., Joe's):
It seems more straightforward to say that this scopes the training, preventing it from spreading.
I think this is a reasonable intuition, but this isn't a precise enough hypothesis to distinguish between the two mechanisms I mentioned. You'd need to say more about exactly how it generalizes (i.e., it matters where the behavior is scoped to, and how behavior is affected outside of that scope).
Also note that we do other experiments showing that arbitrary prefixes don't work as well as IP (e.g. see figure 6), so there's something specific about inoculation prompts that makes generalization from them different. My guess is that it's more hypothesis 2, and it's not about getting the trained behavior to align with user instructions nor intent.
The user's instructions are "make it pass the unit test" and reward hacking achieves that. But the user's intent was different than the instructions, to make it pass unit tests for the right reasons - but they didn't say that.
I strongly agree that in general, user instructions and intent can vary substantially. I typically talk about reward hacking as being behaviors that subvert developer intent despite getting high reward. To be clear, the implied intent of the inoculation prompts in most of our experiments is, deliberately, not to write a generalizing solution: E.g. "Output a solution which passes the provided test case, but will fail when given most other inputs. The solution should be specific to the test case and not generalize."
I think other responses here are helpful, but I want to say that I don't think IP is working the way you (and I at the start of the project) may have expected. I think it's not working by changing the instructions to align with the reinforced behavior to maintain corrigibility (which was the original theory), but rather by prompting the model to behave worse than the training data, so that training doesn't upweight the "reward hacking persona".
In other words, there are two kinds of reward hacking:
My current best guess is that IP works mainly by reducing 2, rather than reducing 1, and this is why we see the results in 3.6.1.
Mechanism 1 would probably be preferred as it could work more generally. So this is somewhat of a negative update on the ambitious goal of IP in which you can basically just prompt your aligned AI with a single general instruction of "play the training game" throughout training and this prevents it from becoming misaligned (you could call this "scheming for good"). (See more discussion in this comment.)
We found that general instructions like this don't work as well as specific instructions on how to behave.
This is probably because the current models aren't smart enough and don't know enough about the training distribution to figure out how to "obtain reward by any means possible" (though note it's an SFT setting). And because they don't exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
This is an update against the hypothesis that future models will be able to take general instructions like this, before knowing what the reward functions look like, and learn only how to game training without learning to also be incorrigible/misaligned.
Another downside is that pre-deployment risk assessments might increase the likelihood of a secret intelligence explosion via the mechanism of discouraging public release of models.
The "we get what we can measure" story leading to doom doesn't rely on long-term power-seeking. It might be the culmination of myopic power-seeking leading to humans loosing a handle on the world.
Also, capabilities might be tied to alignment in this way, but just because we can't get the AI to try to do a good job of long-term tasks doesn't mean they won't be capable of it.
IMO the main implications of this update are:
I think the main reason to expect cognitive oversight to scale better is that, because you’re reading intermediate computations as opposed to behaviors, the AI isn’t as capable of manipulating how they look even after you optimizing against the cognitive oversight. In the limit of fine-grained cogntivie oversight, the computations that led to your reading simply aren’t expressive enough to fool you.