I'm really glad to see this stuff being discussed more publicly. I think this post will probably be a useful reference post to link people to (please lmk in the comments if you disagree!).

Some quotes below:

For the near future, a good rule of thumb for “do you control the model” is “is the model running on your servers.”

Once the model is running on someone else’s server, you usually will not have that kind of control (even if you legally own the model). You could contact the owners or administrators of the infrastructure the model is running on, but they could be unresponsive or uncooperative and it would be very difficult or impossible for you to stop the model within a short time frame.

Moreover, model exfiltration is likely impossible to reverse. It’s so easy to make copies and backups (whether by the model or someone else) and it’s so hard to find and secure every last one of them.

If a model were capable of self-exfiltration, it would have the option to remove itself from your control. This could happen due to misalignment (the model doesn’t follow your intent to stay on your servers) or misuse (someone internal or external to the lab instructs the model to self-exfiltrate). Whether it is difficult or easy to trigger, this opens up an important and novel attack vector.

Even once models have the ability to self-exfiltrate, that doesn’t mean they would choose to. But this then becomes a question about their alignment: you need to ensure that these models don’t want to self-exfiltrate.


...

What if self-exfiltration succeeds?

How much harm self-exfiltration causes is modulated by the model’s skill profile on other tasks. We can make self-exfiltration harder with tighter security, but this also means that the models which do succeed at self-exfiltration will be the more capable ones, and thus able to cause more harm.
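
To make that selection effect concrete, here is a minimal toy sketch (my own illustration, not from the quoted post). It assumes model capability can be summarized as a single score and that an exfiltration attempt succeeds only when that score clears a security threshold; the distribution and thresholds are arbitrary placeholders.

```python
# Toy selection-effect sketch: tighter security blocks more attempts,
# but the attempts that still succeed come from more capable models.
import random
import statistics

random.seed(0)
# Hypothetical capability scores for a population of attempted exfiltrations.
capabilities = [random.gauss(0.0, 1.0) for _ in range(100_000)]

for threshold in [0.0, 1.0, 2.0, 3.0]:  # increasingly tight security
    successes = [c for c in capabilities if c > threshold]
    success_rate = len(successes) / len(capabilities)
    mean_capability = statistics.mean(successes) if successes else float("nan")
    print(f"threshold={threshold:.1f}  "
          f"success rate={success_rate:6.2%}  "
          f"mean capability of successes={mean_capability:.2f}")
```

As the threshold rises, the success rate falls but the average capability of the successful models climbs, which is the trade-off the paragraph above describes.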

Most likely the model won’t be able to compete on making more capable LLMs, so its capabilities will become stale over time and thus it will lose relative influence. Competing on the state of the art of LLMs is quite hard: the model would need to get access to a sufficiently large number of GPUs and it would need to have world-class machine learning skills. It would also mean that recursive self-improvement is already possible and could be done by the original model owner (as long as they have sufficient alignment techniques). The model could try fine-tuning itself to be smarter, but it’s not clear how to do this and the model would need to worry about currently unsolved alignment problems.

Instead, the model might do something less ambitious, such as settling into an economic niche that pays its own cloud computing bills, for example by getting up to various forms of crime. It could be stolen again by bad actors (the model might not use good security practices), misused by paying customers (which might be about as harmful as jailbreaking), or it could try to interfere with prosocial AI efforts (e.g. by sabotaging governmental regulation on policing rogue AI systems). While this would not necessarily be catastrophic, it could still cause a lot of chaos and harm in the real world, and thus must be avoided.

1 comment:

Nice thoughts, but we live in a world where nuclear reactors get shut down because people plug in USBs they found in a parking lot into their work laptops.

Good luck, I guess.