You can think of it as “dangerous capabilities in everyone’s hands”, but I prefer to think of it as “everyone in the world can work on alignment in a hands-on way, and millions of people are exposed to the problem in a much more intuitive and real way than we ever foresaw”.
Ordinary people without PhDs are learning what capabilities and limitations LLMs have. They are learning what capabilities you can and cannot trust an LLM with. They are coming up with creative jailbreaks we never thought of. And they’re doing so with toy models that don’t have superhuma...
A comment on the AI Box experiment from 2025! What a beautiful thing!
I feel like the AI-Box experiment is pretty dated at this point, for two reasons:
- I think the trick people used to escape gatekeepers back in the day was simply a Roko's Basilisk or Pascal's Wager type of argument: in case a real future AI does escape and rule over us, surely the gatekeeper can afford to spend $10 and a bit of pride to curry favor with it. Yes, technically we'll never know exactly what people said in the box experiments, but the tactics used leaked t