Labs should do "boxing," I hear. I'm confused about what kinds-of-systems this applies to and what labs should do to improve their "boxing."

Kinds-of-systems: my impression is that there's little direct threat from LMs simply undergoing gradient descent, but that there's some risk from models deployed internally, especially if they're in scaffolding that lets them make recursive calls and use tools. I suspect that sentence is confused or missing a lot, though.

Boxing tactics: I hear of tactics like automatically monitoring for unusual behavior, limiting data upload speed, and using honeypots. What should labs do? If you were in charge of boxing at a lab, what would you do, and what should I read to learn more?
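To make the "limiting data upload speed" tactic concrete, here is a minimal sketch of a token-bucket egress limiter. Everything here is a hypothetical illustration of the idea, not any lab's actual tooling; the class and parameter names are my own inventions.

```python
import time


class EgressLimiter:
    """Token-bucket cap on how many bytes a boxed system may send out.

    Hypothetical sketch: `bytes_per_sec` is the sustained egress rate,
    `burst` the largest single spike allowed. A real deployment would
    sit at a network choke point and route refusals to human review.
    """

    def __init__(self, bytes_per_sec: float, burst: float):
        self.rate = bytes_per_sec
        self.capacity = burst
        self.tokens = burst          # start with a full bucket
        self.last = time.monotonic()

    def allow(self, n_bytes: int) -> bool:
        # Refill tokens in proportion to elapsed time, up to capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if n_bytes <= self.tokens:
            self.tokens -= n_bytes
            return True
        return False  # over budget: drop, queue, or flag for review
```

The point of the bucket shape (as opposed to a hard per-second cap) is that normal bursty traffic passes, while sustained high-volume exfiltration gets throttled, which is also a natural signal to feed into the anomaly monitoring mentioned above.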

When training a model designed to, for example, make recursive calls and use tools, I suspect it will need to be able to do so during training, at which point the difference between training and internal deployment is far less clear.

And it seems functionally impossible to do air-gapped training of systems that are being trained to query web data sources, ask for feedback from users, and/or write code that calls external APIs, for the obvious reason that they need to be connected to external systems.
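If air-gapping is off the table, the fallback is mediation: every tool call the model emits during training goes through a single choke point that enforces an allowlist and records an audit log. A minimal sketch, with toy handlers and names that are purely hypothetical:

```python
# Toy backends standing in for real tools (hypothetical names).
HANDLERS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "search": lambda query: f"[stub results for {query!r}]",
}


def mediated_call(tool: str, payload: str, audit_log: list) -> str:
    """Route a model's tool call through an allowlist + audit choke point.

    Every request is logged before the policy check, so even blocked
    attempts leave a trace for the monitoring/honeypot side of boxing.
    """
    audit_log.append((tool, payload))
    if tool not in HANDLERS:
        raise PermissionError(f"tool {tool!r} is outside the boxing allowlist")
    return HANDLERS[tool](payload)
```

This doesn't recover the guarantees of an air gap, but it narrows "connected to external systems" down to a small, inspectable surface, which seems like the realistic version of boxing for tool-using training runs.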