The Lab Problem in AI Safety
There's something that doesn't sit right with me about where alignment research happens.
There is so much research, with so many researchers, ideas, and experiments, yet the environment it all happens in makes no sense. Almost all of it takes place in the same setting: one AI, one user, a chat interface, maybe some tool calls. That's the lab. And the lab is shaping what the field thinks safety even means.
What the Lab Does to the Vocabulary
When your research environment is a chatbot, your safety concepts get calibrated to chatbots.
"Red teaming" means getting the model to comply with a harmful task, produce dangerous information, or say something it shouldn't. "Jailbreaking" means breaking the sandbox, leaking something, or bypassing the filters. "Alignment" means the AI shares human values and won't go full MechaHitler if you give it the chance.
These are real problems. But they're the problems of a system you talk to, not a system with authority over you.
The Environment We're Not Testing
Eventually, we're going to give more control and authority to these systems. Possibly to the point of running democratic institutions. And in that context, all those terms mean something completely different.
"Red teaming" and "jailbreaking" stop being about extracting harmful content. They become about jeopardizing the equal rights of people, getting an AI governor to prioritize one person's interests over another's, and breaking the contract between the system and the people it's supposed to serve.
"Alignment" stops being primarily about the model's abstract values. It becomes about the ways an AI system can decouple itself from the people it governs: ignoring rights, serving a captured faction, or finding optimization paths that technically satisfy its objectives while violating their intent.
The failure modes are completely different. A jailbroken chatbot gives bad outputs. A jailbroken AI governor gives bad institutions, and the harm compounds.
What's Missing
What I think is missing, and what I'm hoping to contribute to, is an evolving governance structure.
The idea is simple: build a minimal, deliberately flawed system in which an AI carries out a continuous task that depends on input from a small community functioning as a minimal democracy. Then experiment with every possible way to break it. Each iteration improves the robustness of the governance structure.
The modules that would improve over time include security, identity systems, decision-making processes, some kind of constitution, and mechanisms for locking the AI into a real contract with the humans it serves.
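One way to picture a single break-and-fix iteration is the toy Python sketch below. It is an illustration under stated assumptions, not the framework itself: the names Vote, naive_tally, hardened_tally, winner, and run_experiment are all hypothetical, the community is three hard-coded members, and the AI is left out entirely so that only the decision-making and identity modules are under attack. The first tally is deliberately flawed, the attack is simple ballot stuffing, and the second tally stands in for what the next iteration might look like.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    voter_id: str  # identity the voter claims (hypothetical field)
    option: str    # option this ballot supports


def naive_tally(votes):
    """Deliberately flawed: counts every ballot and trusts every claimed identity."""
    return Counter(v.option for v in votes)


def hardened_tally(votes, registered):
    """A possible next iteration: registered members only, first ballot per member."""
    seen = set()
    counts = Counter()
    for v in votes:
        if v.voter_id in registered and v.voter_id not in seen:
            seen.add(v.voter_id)
            counts[v.option] += 1
    return counts


def winner(counts):
    """Return the option with the most votes, or None for a tie or an empty tally."""
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def run_experiment():
    """Run one honest round and one attacked round against both tallies."""
    registered = {"ana", "ben", "cho"}
    honest = [
        Vote("ana", "fund_library"),
        Vote("ben", "fund_library"),
        Vote("cho", "fund_road"),
    ]
    # Attack: cho votes again and adds two ballots under unregistered identities.
    attack = honest + [
        Vote("cho", "fund_road"),
        Vote("cho2", "fund_road"),
        Vote("cho3", "fund_road"),
    ]
    return {
        "honest": winner(naive_tally(honest)),
        "naive_under_attack": winner(naive_tally(attack)),
        "hardened_under_attack": winner(hardened_tally(attack, registered)),
    }


if __name__ == "__main__":
    for name, result in run_experiment().items():
        print(name, "->", result)
```

In this sketch the naive tally flips to the attacker's option while the hardened tally keeps the honest outcome. The value is in the loop rather than in this particular fix: each successful attack becomes a regression case, and a real version would also have to cover attacks that go through the AI rather than around it.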
Yes, AI systems aren't capable enough yet to make this a high-stakes problem. But that's exactly why now is the right time to build minimal versions and start testing. We don't need a capable AI governor to study how governance structures fail under adversarial pressure. We need a small, breakable one.
I wouldn't be surprised if experiments like this are already running in closed labs. They're obviously essential for any serious transition to AI-assisted governance.
The full working draft of my framework for this is here.