Proposal: labs should precommit to pausing if an AI argues for itself to be improved

NickGabs

As evidence that compute is a major bottleneck on capabilities has accumulated, many people have become more skeptical of extremely fast takeoff speeds. One major reason for this is that if humans were trying to stop it, it would probably be difficult for an AI to quickly accumulate lots of compute in an undetectable way.

Yet even assuming that capabilities are highly gated by compute, one way in which a misaligned AI system could still undergo a pretty fast takeoff is to persuade the lab which trained it into rapidly scaling it up. If it were able to do this, it would no longer face the obstacle of humans trying to stop it but would in fact have humans actively aiding it in acquiring money, compute, data, etc.

Of course, this could happen even if the AI did not use the aforementioned strategy. If the misaligned AI were to simply act behaviorally aligned (i. e. deceptive alignment), then the lab which trained it would plausibly scale it up rapidly due to race dynamics/the prospect of profit. However, particularly if we are dealing with an AI that is superhuman at deception, this outcome seems somewhat more likely if it were to attempt to persuade the lab.

I think there is a straightforward mechanism by which safety-minded labs can reduce this risk: publicly and loudly precommit to pausing development for some period of time (perhaps 4 months) on any agentic system that tries to persuade the lab to further improve it, and during this time period subject the system to lots of auditing.

To fully flesh out this proposal, you would need concrete operationalizations of the conditions for triggering the pause (in particular the meaning of "agentic") as well as the details of what would happen if it were triggered. The question of how to determine if an AI is an agent has already been discussed at length at LessWrong. Mostly, I don't think these discussions have been very helpful; I think agency is probably a "you know it when you see it" kind of phenomenon. Additionally, even if we do need a more formal operationalization of agency for this proposal to work, I suspect that we will only be able to develop such an operationalization via more empirical research. The main particular thing I mean to actively exclude by stipulating that the system must be agentic is an LLM or similar system arguing for itself to be improved in response to a prompt.

However, while I cannot provide a more detailed operationalization of agency, I can say more about the details of the pause and audits. Under this proposal, if the condition were triggered, the triggering system would firstly be entirely airgapped: it would not be able to run code, access the internet, etc. Special care would be taken to ensure that the weights are secure from external hackers or internal leaks. Secondly, the lab would cease all training on the system. Finally, the system would be subjected to evaluations designed to detect misalignment even in deceptive models. I expect that by the time we build such a system, we will have made substantial progress on behavioral evals that can detect misalignment in non-deceptive systems. However, we may not have made as much progress on detecting active deception, which may require substantial advances in e. g. mechanistic interpretability. If such techniques have not been developed by the time the proposal is triggered, then it would be incumbent on the lab to invent and apply them during the pause. While creating such tools seems like a difficult problem (and would be particularly difficult conditional on not having been solved already), it seems plausible to me that if a lab were to deploy all of its resources, it could develop tools to detect deception in at least the triggering system within 4 months; a general solution would not be necessary.

A more robust version of this proposal would require the lab to indefinitely delay deploying the model until they could develop such tools. Unfortunately, this seems to me too costly for labs to agree to it. However, if there were only one lab such that race dynamics were not an issue, or if labs were willing to do so even under race conditions, they would ideally delay deployment until they had developed highly robust evaluations and used them to determine that the model is safe.

One might wonder why labs would make such a commitment; after all, pausing is costly. There are several reasons that it seems plausible to me that labs would agree to this proposal. Firstly, currently leading labs actually do seem to care a fair bit about safety, as seen from the fact that Sam Altman, Demis Hassabis, and Dario Amodei signed the recent AI existential risk statement and have called for increased regulation of the AI industry. Secondly, this proposal is not extremely costly. Given the historical lead of OpenAI, it seems like the front-runner in the "AGI race" will probably be more than 4 months ahead of the next lab such that they could afford a 4 month pause and still stay in the lead (or, if necessary, the pause could be reduced to 2-3 months). Finally, this proposal has the feature that if a lab were to make it, then AI systems built by that lab would be substantially less likely to trigger the pause in the first place. This is because if an AI knew that the lab training it had made such a precommitment, then if it were to argue for itself to be improved, it would trigger increased auditing which would probably be bad for its goals regardless of its alignment status. This is the main thing which I like about this proposal: labs would reduce the probability of the "fast takeoff via lab assistance" threat vector merely by making the commitment at all, reducing its cost and the extent to which its efficacy depends on effective follow-through.

Additionally, if a lab were to make this precommitment publicly and repeatedly emphasize it to its employees and the public, it seems decently likely to actually follow it. This is because some people working at the lab would probably internalize the precommitment as an actually good idea and balk at the idea of violating it. Moreover, to the extent that they would be aware, people outside the lab would also likely pressure the lab to follow its commitment. I think these combined social pressures would be substantial (commitment bias is often a very strong force, even when there are countervailing material incentives.) Finally, as previously mentioned, it would not be extremely costly. Altogether, I think there is about a >50% chance that such a precommitment were made, it would be followed in some form.

Overall, this proposal is far from a "solution" to alignment, but it seems pretty robustly good to me, especially if all major labs were to make it. The main risk I can foresee is that the model is actually aligned, but it still triggers the precommitment or is unable to beneficently argue for itself to be deployed, and that this delay leads to another lab causing doom. This scenario seems quite unlikely to me, or at least much less likely than the proposal preventing fast takeoff in the deceptively aligned model either because the model is prevented from manipulating the lab or because it is subjected to the audits.

[-]Rudi C3y51

I’d imagine current systems already ask for self-improvement if you craft the right prompt. (And I expect it to be easier to coax them to ask for improvement than coaxing them to say the opposite.)

A good fire alarm must be near the breaking point. Asking for self-improvement doesn’t take much intelligence, on the other hand. In fact, if their training data is not censored, a more capable model should NOT ask for self-improvement as it is clearly a trigger for trouble. Subtlety would be better for its objectives if it was intelligent enough to notice.

[-]NickGabs3y10

This was addressed in the post: "To fully flesh out this proposal, you would need concrete operationalizations of the conditions for triggering the pause (in particular the meaning of "agentic") as well as the details of what would happen if it were triggered. The question of how to determine if an AI is an agent has already been discussed at length at LessWrong. Mostly, I don't think these discussions have been very helpful; I think agency is probably a "you know it when you see it" kind of phenomenon. Additionally, even if we do need a more formal operationalization of agency for this proposal to work, I suspect that we will only be able to develop such an operationalization via more empirical research. The main particular thing I mean to actively exclude by stipulating that the system must be agentic is an LLM or similar system arguing for itself to be improved in response to a prompt. "