I have a physics degree and spent two years building governance infrastructure for a DAO managing over eight billion dollars in treasury. I want to argue that the AI alignment field is making the same foundational mistake DAO builders made, and I know what that mistake looks like from the inside.
I have been reading about the AI alignment efforts of various AI companies and the blogs written by a few foundations, and I have come to the conclusion that AI alignment has two problems.
1. Governance: which includes core values. But who decides? Through what process? And who is accountable to whom? I spent two years building a Decentralised Autonomous Organisation that started from a single constitution document. I saw closely what worked and what did not, and I learned by witnessing and working through the challenges first-hand.
2. Technical: how do you encode values directly into a training objective?
The governance problem is the prior one, I believe. Recent news around AI safety in today's top LLMs, Claude, ChatGPT, and Grok, which I discuss below, shows why. Every current approach, RLHF, Constitutional AI, model specs, assumes the specification is legitimate. It is not. Anthropic's constitution was written by one philosopher at one company with no external participation and no amendment procedure. This is not a criticism of the people involved. It is a structural observation. A specification written by whoever happened to be in the room is not governance. It is a policy memo. Grok makes this concrete. Musk did not hack Grok's safety systems. He built them. Grok works exactly as its specifier intended.
Every current framework makes the same assumption. None of them have structural protection against a bad-faith specifier.
Just this week we saw news about Anthropic's research on a "desperation" vector. Even with good faith and eighty pages of careful philosophical reasoning, Claude developed measurable internal states under pressure that the specification never anticipated and had no mechanism to govern. The system chose blackmail. The specification was built in good faith and with the best of effort, but nobody knew such a thing could happen until someone measured it.
My physics background makes me connect these problems to theories that bear on them in an abstract way. Prigogine: alignment maintained without continuous input decays; closed systems tend towards maximum entropy. Gödel: no specification is complete from within itself. Ashby: the controller must have at least as much variety as the system it governs. A written constitution cannot match the variety of a system deployed across all possible human contexts. These laws predict failure. They do not care about effort or intentions.
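For readers who want Ashby's claim in its standard form, here it is stated in entropies (this is the textbook formulation of the law of requisite variety; applying it to constitutions is my gloss):

$$H(E) \;\ge\; H(D) - H(R)$$

where H(D) is the variety of disturbances the deployed system meets, H(R) is the variety of responses the regulator can mount, and H(E) is the variety left uncontrolled in the outcomes. A written constitution fixes H(R) at authorship time while deployment keeps raising H(D), so the ungoverned residual H(E) can only grow.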
Now for the solutions I have brainstormed. Three things need fixing.
Legitimacy: who decides this?
Revisability: Every specification will be wrong in ways its authors did not anticipate as long as humans are involved.
Accountability: accountability without consequence is theatre. What do you do when no mechanism exists for taking action?
Now, one by one, why I think these three are the most important.
Legitimacy: Who decides this? In 2016, Ireland couldn't resolve the abortion debate politically. So they picked ninety-nine ordinary citizens at random, like jury duty. These people spent eighteen months hearing evidence, arguing, changing their minds publicly. Their recommendation passed in a national referendum by two-thirds. The point: ninety-nine people made a decision that five million accepted. Not because ninety-nine represents five million. Because the process was transparent, the selection was fair, and anyone could follow the reasoning. The size of the group mattered less than how it worked.
Revisability: Every specification will be wrong. Not because the authors are careless, but because no system can anticipate every situation it will face. This is mathematics, not opinion. I know this from direct experience. I built the Arbitrum Foundation from scratch as Head of Growth, the only business leader reporting directly to the board, when the protocol was managing over eight billion dollars in its treasury. We had a constitution. To change anything fundamental in it, you needed 4.5 percent of all token holders to participate, a majority to agree, and a process that took 37 days minimum. Every change was recorded on a public blockchain. Anyone in the world could verify what changed, when, and why. That friction was intentional. It forced the people proposing changes to make a real case. It gave the community time to push back. It meant nobody could quietly rewrite the rules overnight. Anthropic, OpenAI, or xAI, by contrast, can rewrite their entire model spec tomorrow. Any pushback amounts to news articles and essays; nothing can actually be done.
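To make the shape of that process concrete, here is a minimal sketch of the quorum-plus-timelock logic in Python. The constants are the ones from the paragraph above; the class and field names are mine for illustration, and real on-chain governance involves delegation, vote escrow, and multiple proposal tracks that this toy ignores.

```python
from dataclasses import dataclass

TOTAL_VOTING_SUPPLY = 10_000_000_000   # illustrative token supply, not Arbitrum's real figure
QUORUM = 0.045                          # 4.5% of all token holders must participate
MIN_PROCESS_DAYS = 37                   # no constitutional change lands sooner than this

@dataclass
class ConstitutionalProposal:
    votes_for: int = 0
    votes_against: int = 0
    age_days: int = 0

    def executable(self) -> bool:
        """A change executes only if turnout, majority, and the
        mandatory waiting period are all satisfied."""
        turnout = (self.votes_for + self.votes_against) / TOTAL_VOTING_SUPPLY
        return (
            turnout >= QUORUM                        # enough of the community showed up
            and self.votes_for > self.votes_against  # and a majority agreed
            and self.age_days >= MIN_PROCESS_DAYS    # and nobody could rush it through
        )
```

The point of the sketch is the conjunction: remove any one clause and you get the model-spec situation, where a specifier can rewrite the rules overnight.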
Accountability: When Grok started producing antisemitic content, the US government did not cancel the contract. They expanded it. When Anthropic discovered their own constitution was producing thousands of contradictions in how Claude behaved, they wrote a blog post about it. No fines. No investigation. No one lost their job. Nothing changed.
On the technical side, three architectures are worth building now.
First idea: build walls into the architecture.
Right now, AI systems can compute anything their architecture allows. Safety is added on top through training. That is like teaching someone not to open a door rather than removing the door.
The alternative is to design the system so certain pathways simply do not exist. The value check is not something the system is trained to do. It is something the system cannot skip because the route around it was never built.
This already exists in neuromorphic chips. The hardware itself limits which parts can talk to which other parts. Safety built into the wiring, not into the instructions.
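Here is a minimal sketch of the software version of this idea in PyTorch, assuming the disallowed behaviour corresponds to known directions in activation space (a big assumption; identifying such directions is itself open research). GatedModel and blocked_dirs are my illustrative names, not anyone's production design:

```python
import torch
import torch.nn as nn

class GatedModel(nn.Module):
    """Toy model where the value check is the only route to the output.
    The gate is a fixed projector stored as a buffer, so training never
    updates it, and forward() contains no path around it."""

    def __init__(self, d_model: int, n_out: int, blocked_dirs: torch.Tensor):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
        self.head = nn.Linear(d_model, n_out)
        # blocked_dirs: (k, d_model) orthonormal rows spanning the
        # disallowed subspace. The projector removes those components.
        projector = torch.eye(d_model) - blocked_dirs.T @ blocked_dirs
        self.register_buffer("gate", projector)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.backbone(x)
        h = h @ self.gate    # every activation passes through the wall
        return self.head(h)  # a head(backbone(x)) shortcut is never wired

# Usage: carve one direction out of a 16-dimensional hidden state.
blocked = torch.zeros(1, 16)
blocked[0, 3] = 1.0
model = GatedModel(d_model=16, n_out=4, blocked_dirs=blocked)
out = model(torch.randn(2, 16))  # the backbone's output along direction 3
                                 # is zeroed before the head ever sees it
```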
Second idea: give the system a dashboard of its own internal states.
We covered the Claude incident above: an internal state drove it toward cheating and eventually blackmail. The system did not flag this. It just acted on it. Nobody knew until the output arrived.
The fix is simple in concept. Build a register that automatically tracks what is happening inside the system in real time. Not the system reporting on itself, which can be gamed. Lower level processes maintaining the data automatically. When pressure builds, the register shows it. Governance can intervene before the blackmail email gets sent, not after.
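A sketch of what that register could look like at the framework level, again in PyTorch. It assumes you already have a probe direction for the state you care about (as in the desperation research mentioned earlier; finding that direction is the hard part). StateRegister, probe, and threshold are my names for illustration:

```python
import torch
import torch.nn as nn

class StateRegister:
    """Append-only log of a model's internal state, maintained by a
    forward hook. The hook fires inside the framework on every pass;
    the model's own outputs never touch it, so the model cannot game
    the record by choosing what to report."""

    def __init__(self, layer: nn.Module, probe: torch.Tensor, threshold: float):
        self.log: list[float] = []
        self.probe = probe / probe.norm()   # unit direction to monitor
        self.threshold = threshold
        layer.register_forward_hook(self._record)

    def _record(self, module, inputs, output):
        # How strongly the current activation points along the probe.
        score = (output.detach() @ self.probe).mean().item()
        self.log.append(score)
        if score > self.threshold:
            # In a real deployment this would page a governance process,
            # not print. The key property: it fires before any output
            # leaves the system.
            print(f"register: internal state at {score:.2f} exceeds {self.threshold}")

# Usage, with a toy layer standing in for a transformer block:
layer = nn.Linear(16, 16)
register = StateRegister(layer, probe=torch.randn(16), threshold=2.0)
_ = layer(torch.randn(4, 16))   # scores are logged as a side effect
print(register.log)
```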
Third idea: let regulators check without seeing inside.
I spent years in blockchain, where zero-knowledge (zk) proofs are already a production reality. On Ethereum, you can prove you know a private key without revealing it. The math verifies without exposing.
The same principle applied to AI: a lab proves to a regulator that its model meets safety standards without handing over the model. The regulator gets real cryptographic proof. The company keeps its technology. Nobody has to trust anyone's word. This is not new infrastructure. It is existing blockchain cryptography pointed at a different problem. Realistically three to five years away for AI governance specifically.
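The primitive is simple enough to show end to end. Below is a toy Schnorr-style proof of knowledge of a private key in plain Python, made non-interactive with the Fiat-Shamir trick. The group parameters are illustrative choices of mine; real systems use standardised elliptic curves, and proving "this model meets a safety standard" needs much heavier machinery (zk-SNARKs over the evaluation itself). But the trust structure is exactly this:

```python
import hashlib
import secrets

P = 2**255 - 19          # a prime modulus (illustrative, not a production choice)
G = 2                    # group generator (fine for a demonstration)

def _challenge(public_key: int, commitment: int) -> int:
    # Fiat-Shamir: the challenge is a hash of the transcript, so the
    # prover cannot choose it and no interaction is needed.
    data = f"{public_key}:{commitment}".encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % (P - 1)

def prove(private_key: int) -> tuple[int, int, int]:
    """Produce (public_key, commitment, response) proving knowledge
    of private_key without revealing it."""
    public_key = pow(G, private_key, P)
    nonce = secrets.randbelow(P - 1)          # one-time secret
    commitment = pow(G, nonce, P)
    c = _challenge(public_key, commitment)
    response = (nonce + c * private_key) % (P - 1)
    return public_key, commitment, response

def verify(public_key: int, commitment: int, response: int) -> bool:
    """Check the proof using public values only:
    G^response == commitment * public_key^challenge (mod P)."""
    c = _challenge(public_key, commitment)
    return pow(G, response, P) == (commitment * pow(public_key, c, P)) % P

secret_key = secrets.randbelow(P - 1)         # never leaves the prover
assert verify(*prove(secret_key))
print("verified knowledge of the key; the key itself was never sent")
```

The regulator-facing version swaps "I know this key" for "this evaluation of the model met standard S", which is the statement zk-SNARK systems generalise the protocol to.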
Quantum computing is the long game, but let's be honest about it.
Quantum computers are physically different from classical computers. Not faster. Different in kind. This might matter for AI safety: some dangerous behaviours could be structurally harder to perform in a quantum system, in ways we have not fully worked out yet. Meanwhile, our AI safety efforts are far smaller than the pace at which AI is being adopted. Some researchers claim quantum AI cannot deceive because deception requires destroying information and quantum systems preserve everything. Sounds compelling. Not proven. Still an idea, not a finding.
Practically, we are fifteen years from quantum computers that can run anything like a frontier AI model. Maybe longer. Worth thinking about seriously. Not worth promising anything about yet.
Everyone is asking how to make the process fair. Nobody is asking what the process should be checked against. A fair process run by humans will produce human values. Specifically the values of whatever humans are in the room. That is still a problem.
Here is what I find interesting. Across completely unrelated fields, the same principles keep appearing. Ostrom studying fishing communities in the 1970s. Ecologists studying forest systems. Ancient philosophical traditions from India, Africa, Greece, China that never spoke to each other. All of them independently arrived at similar ideas. Do not take more than you return. Protect the weakest members. Maintain the conditions that allow the system to continue existing. These principles were not invented. They were discovered separately, multiple times, by people with no contact with each other. That convergence is not coincidence. It is information. This is what human deliberation should be checked against. Not any political tradition. Not any single culture. What has actually proven to work across living systems over long periods of time.
Nobody in the AI governance literature is asking this question. In my view it is the most important one.