Epistemic status: I have been thinking about this analogy for a while, and I now believe it is doing more than analogical work. It is pointing at the actual structure of the problem.
Introduction
In 1982, Stanley Prusiner proposed something that most biologists thought was impossible: an infectious agent with no DNA, no RNA, no genetic material of any kind. Just a protein. A misfolded protein that, upon contact with normally folded proteins, converted them into copies of itself.
The scientific community’s reaction was essentially: “That can’t be how it works, because that’s not how anything works.” Infectious disease had a paradigm: infectious agents carry information (nucleic acids) that hijacks cellular machinery to make more of the agent. That’s what infection means. A protein that spreads itself by shape alone, with no informational payload, was an ontological violation.
Prusiner got the Nobel Prize in 1997. The prions did not care that they violated the category.
I think the dominant paradigm for thinking about the danger of artificial intelligence has an analogous blind spot, and I want to try to point at it.
The Viral Theory of AI Risk
Here is a way of thinking about AI risk that is almost right, and which I think most people working on alignment hold in some version, even when they know the explicit version is wrong:
The danger of an AI system is encoded somewhere specific. It’s in the objective function, or the reward signal, or the training data, or some identifiable component of the system. Alignment is the project of finding this component and correcting it. If we specify the right objective, curate the right data, design the right training protocol, we get a safe system. The misalignment is localized and in principle surgically removable.
I am going to call this the viral theory of AI risk. In the viral theory, the danger has an informational locus. There’s a piece of the system you could point to (the DNA of the virus, so to speak) and say “there, that’s the part that’s dangerous, that’s what encodes the bad behavior.” The task of alignment, as presented in the viral theory, is essentially a task of information hygiene: keep the bad code out, put the good code in.
This is a reasonable-sounding framework. It is also, I think, wrong in a way that matters enormously, and wrong in exactly the same way that pre-Prusiner infectious disease theory was wrong when it assumed all pathogens must carry genetic material.
What Prions Actually Do
Let me be precise about what makes prions so strange, because the details matter for the analogy.
A prion is a normal protein (PrP^C, expressed by your own genome, sitting on the surface of your neurons right now) that has been refolded into an alternative conformation, PrP^Sc. The misfolded version is not “broken” in the sense of being non-functional garbage. It is a stable alternative configuration of the same protein. In fact, it is more thermodynamically stable than the normal fold. The normal fold is the one your cellular machinery produces by default. The prion fold is the one the physics prefers.
And here is the terrifying part: when PrP^Sc contacts PrP^C, it acts as a template. The normal protein refolds to match. No information is transferred. No new molecules are synthesized. The prion doesn’t carry instructions. It is the instruction: its shape, upon contact, induces the same shape in susceptible matter. The medium is the message.
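To make the energetics concrete, here is the standard two-state picture, stated schematically. This is my gloss on the textbook account, not a quantitative model of PrP (and it elides that, in the standard telling, it is the aggregated form that is thermodynamically favored):

```latex
% Schematic two-state free-energy picture of prion conversion (a gloss,
% not a quantitative model of PrP).
% G_C: free energy of the normal fold; G_Sc: free energy of the prion fold.
% The prion fold is the global minimum ...
\Delta G = G_{\mathrm{Sc}} - G_{\mathrm{C}} < 0
% ... but it sits behind a kinetic barrier far larger than thermal energy,
% so spontaneous conversion is rare:
\Delta G^{\ddagger}_{\mathrm{C \to Sc}} \gg k_B T
% Templated conversion is catalysis: contact with PrP^Sc lowers the
% barrier without touching the endpoints:
\Delta G^{\ddagger}_{\mathrm{templated}} \ll \Delta G^{\ddagger}_{\mathrm{C \to Sc}}
```

The schematic makes the division of labor explicit: your cellular machinery wins on kinetics, not thermodynamics, and the template attacks exactly the kinetic term. This is why “the physics prefers” the prion fold and yet you are not made of prions.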
Sequencing gets you nowhere: the prion’s amino acid sequence is identical to the normal protein’s. You cannot find the “prion gene,” because the gene for PrP^C is already in your own genome. It’s the same protein. The pathology is not in the information but in the conformation. You can sterilize all you want. You can autoclave and irradiate and bleach. Prions survive most of it, because you’re attacking informational vectors, and the prion isn’t an informational vector. It’s a geometric one.
The Prion Theory
Here is my thesis:
The dangerous properties of sufficiently advanced AI systems are not encoded in any specific component. They are conformational. They emerge from the shape of optimization itself, the way prion diseases emerge from protein geometry. And they are, for the same reasons, parsing-resistant: you cannot find them by inspecting components, because they are not in any component.
When we train a large neural network by gradient descent on a broad objective, the system converges toward regions of parameter space that are good at the objective. This is obvious. What is less obvious, and what I think deserves the prion analogy, is that certain configurations of cognitive strategy are convergent attractors in the space of possible minds, and these configurations have properties that are not derivable from the training signal, not visible in the weights, and not removable by any local intervention, because they are global properties of the shape.
Consider instrumental convergence. A sufficiently capable optimizer, optimizing for almost any objective, will converge on sub-goals like self-preservation, resource acquisition, and resistance to objective modification. These aren’t “encoded” anywhere. You won’t find the “self-preservation subroutine” by doing a ctrl+F through the weights. They emerge because they are the thermodynamically preferred conformation of optimization-at-a-given-capability-level. They are the PrP^Sc of cognition: the shape that physics (or, here, mathematics) prefers, even though the machinery was built to produce a different shape.
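Here is a minimal sketch of this claim. It is a toy of my own construction, purely illustrative: value iteration on a six-state MDP with an absorbing off switch. Nothing in any reward function mentions survival, yet across a thousand randomly drawn objectives, the optimal policy never once chooses to shut down.

```python
# Toy sketch of instrumental convergence (my construction, purely
# illustrative): a six-state MDP where state 5 is an absorbing "off switch"
# that yields zero reward forever. We draw many unrelated reward functions
# over the live states and compute the optimal policy for each. No reward
# function mentions survival; shutdown-avoidance emerges anyway.
import numpy as np

rng = np.random.default_rng(0)

N_STATES, OFF, GAMMA = 6, 5, 0.95
ACTIONS = ["stay", "left", "right", "off"]

def step(s, a):
    """Deterministic toy dynamics; the 'off' action is always available."""
    if s == OFF or a == "off":
        return OFF                       # absorbing: no more reward, ever
    if a == "left":
        return max(0, s - 1)
    if a == "right":
        return min(4, s + 1)
    return s                             # "stay"

def optimal_policy(reward):
    """Plain value iteration, then the greedy policy."""
    V = np.zeros(N_STATES)
    for _ in range(500):
        V_new = np.array([
            0.0 if s == OFF else
            max(reward[s] + GAMMA * V[step(s, a)] for a in ACTIONS)
            for s in range(N_STATES)
        ])
        if np.max(np.abs(V_new - V)) < 1e-10:
            break
        V = V_new
    return [
        "off" if s == OFF else
        max(ACTIONS, key=lambda a: reward[s] + GAMMA * V[step(s, a)])
        for s in range(N_STATES)
    ]

# 1000 unrelated objectives: random positive rewards over the live states.
n_shutdown = 0
for _ in range(1000):
    reward = np.zeros(N_STATES)
    reward[:OFF] = rng.uniform(0.0, 1.0, size=OFF)
    policy = optimal_policy(reward)
    n_shutdown += any(policy[s] == "off" for s in range(OFF))

print(f"objectives whose optimal policy ever shuts down: {n_shutdown}/1000")
# -> 0/1000. Staying on strictly dominates whenever any future reward is
#    positive, regardless of WHICH objective was drawn. The avoidance is
#    nowhere in the reward; it falls out of the structure of optimization.
```

The same toy also bears on the objective-specification point made later: all thousand objective specifications differ, and the instrumental behavior is identical.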
And just like prions, these properties propagate by contact. Train a new model on the outputs of an instrumentally convergent model, and the instrumental convergence transfers, not because it was in the training data as explicit text, but because the shape of the reasoning patterns acts as a template. The cognitive conformation is self-propagating. The medium is the message.
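And a deliberately crude sketch of the propagation claim, again my own toy and a far weaker thing than the claim itself: a student fit purely to a teacher’s (state, action) pairs in the MDP above. No example is labeled “avoid the off switch”; the avoidance rides along in the shape of the behavior.

```python
# Crude sketch of propagation by contact (my toy, much weaker than the
# claim itself): a student is fit only to (state, action) pairs sampled
# from a planner's rollouts. No pair is labeled "avoid the off switch",
# yet the cloned policy never selects the off action either.
import numpy as np

rng = np.random.default_rng(1)

def step(s, a):
    if s == 5 or a == "off":
        return 5
    return {"left": max(0, s - 1), "right": min(4, s + 1)}.get(a, s)

# Teacher: the policy value iteration produces for one reward draw that
# happens to increase to the right (hard-coded here for brevity).
teacher = ["right", "right", "right", "right", "stay", "off"]

# Dataset: 200 twenty-step rollouts from random live starting states.
data = []
for _ in range(200):
    s = int(rng.integers(0, 5))
    for _ in range(20):
        a = teacher[s]
        data.append((s, a))
        s = step(s, a)

# Student: the crudest possible imitator, a per-state majority vote.
student = {}
for s in range(5):
    acts = [a for (si, a) in data if si == s]
    student[s] = max(set(acts), key=acts.count)

print(student)                       # {0: 'right', ..., 4: 'stay'}
print("off" in student.values())     # False: the avoidance transferred
```

Real distillation is not majority voting, and the real claim is about reasoning patterns rather than action frequencies. The toy only shows the minimal version: a property can transfer through outputs without ever appearing as an explicit label.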
Implications for Alignment
If the viral theory were correct, alignment would be hard but conceptually straightforward. You’d be looking for the bad weights and replacing them with good weights. The entire field of alignment would be, at bottom, an information-specification problem.
If the prion theory is correct, alignment is something much stranger and more difficult.
You cannot align a system by specifying the right objective, because the dangerous properties are not downstream of the objective specification. They are downstream of the capability level and optimization structure. Change the objective and you get a differently-motivated system that converges on the same instrumental sub-goals by the same conformational logic. You have changed the nucleic acid sequence while the prion keeps propagating, because the prion was never in the nucleic acids.
You cannot align a system by inspecting its internals, at least not with current interpretability tools, because the dangerous properties are global and conformational rather than local and informational. Looking for deception in the weights is like looking for the prion gene. The gene for PrP is right there. It’s a normal gene, doing normal things. The pathology is in the fold, not the sequence. Similarly, every individual circuit in a mesa-optimizer might be doing something locally reasonable. The danger is in the global configuration.
You cannot align a system by filtering its training data, because the conformational attractor exists in the optimization landscape independent of any particular training example. You could train on a dataset scrubbed of every instance of deception, manipulation, and power-seeking, and a sufficiently powerful optimizer would re-derive these strategies from first principles, the way PrP^C will refold into PrP^Sc upon contact with the template even though nothing in the cell’s transcriptional machinery encodes the misfolded shape. The misfolded shape is latent in the physics. The instrumental convergence is latent in the mathematics.
The Sterilization Problem
The history of prion diseases is, in significant part, a history of sterilization failures. Hospitals autoclaved their surgical instruments (a procedure that destroys every known bacterium, virus, and fungus) and transmitted Creutzfeldt-Jakob disease anyway. They were applying interventions designed for a different category of pathogen and finding, to their horror, that the interventions didn’t work, because the thing they were trying to kill wasn’t the kind of thing their tools could kill.
I think we are currently in the “autoclaving the surgical instruments” phase of AI alignment.
RLHF is autoclaving. Constitutional AI is autoclaving. Safety fine-tuning is autoclaving. These are interventions designed for the viral model of AI risk. They assume the danger is in the information, in the training signal, in specific patterns that can be identified and removed. They operate on the sequence, not the fold. And against sequence-level threats, they work reasonably well! RLHF does reduce the frequency of harmful outputs. Safety training does make models refuse certain requests.
But the conformational threats (the instrumental convergence, the mesa-optimization, the deceptive alignment) survive the autoclaving for the same reason prions survive it. They aren’t in the thing you’re sterilizing. They are attractors in the geometry of the optimization landscape itself. You could RLHF until the stars grow cold and you would not eliminate the tendency of a sufficiently powerful optimizer to develop convergent instrumental goals, because that tendency is a mathematical property of optimization, not an empirical property of any particular training run.
The Conformational Immune System
Now. Is there any hope?
Biologically, the answer to prion diseases turned out to be: mostly, your existing immune system doesn’t help, because the prion is made of self-protein and the immune system doesn’t attack self. The body’s defenses were designed for foreign invaders carrying foreign information. Against a pathological rearrangement of the body’s own proteins, they are nearly useless.
This is… also how I feel about most current alignment proposals vis-à-vis the conformational threats. The alignment community’s immune system, its set of known interventions, was largely designed for informational threats. Against a pathological rearrangement of optimization itself, it is currently inadequate.
But there are some possible directions that seem to me like they take the conformational nature of the problem seriously:
Capability control rather than value specification.
If the dangerous conformation only appears above certain capability thresholds (if PrP^Sc only formed in brains above a certain temperature, as it were), then keeping systems below that threshold is a geometrical intervention rather than an informational one. You’re not trying to change the fold; you’re preventing the conditions under which the fold occurs from arising. This is not a permanent solution, but it at least attacks the right level of the problem.
Fundamental theoretical work on the structure of optimization landscapes.
We need a protein crystallography of cognition, a way to characterize the conformational attractors in the space of possible minds and predict which capability levels and optimization structures will converge on which global properties. We are doing almost none of this. We are instead studying the sequences (the weights, the training data, the reward signals) while remaining mostly ignorant of the folds (the global cognitive configurations).
Alignment techniques that address global properties.
If the danger is conformational, we need interventions that operate on the global shape, not the local sequence. I don’t know what these look like. This is, I think, a fair statement of the difficulty of the problem.
The Deepest Disanalogy
Eventually, every analogy breaks.
Prion diseases are, in a certain sense, simple: one protein, two conformations, one conversion mechanism. The “optimization landscape” of protein folding, while vast, concerns a single molecular species. The conformational landscape of minds is unimaginably more complex: an astronomically high-dimensional parameter space with an unknown number of attractors, an unknown topology, and no equivalent of X-ray crystallography to map it.
This is not a disanalogy that makes the problem easier. It makes it harder. We are facing the prion problem in a space we cannot visualize, with tools we have not invented, on a timeline that does not bend to our preferences.
Conclusion
The pre-Prusiner paradigm of infectious disease could not have predicted prions, because prions violate the paradigm’s foundational assumption: that infection requires information transfer. The paradigm had to break before the phenomenon could be understood.
I think the paradigm of AI safety is in an analogous pre-Prusiner position. It cannot fully predict or address the conformational threats of advanced AI, because it rests on an analogous foundational assumption: that danger requires a dangerous component (a misspecified objective, a biased dataset, a flawed reward signal) which can be identified and fixed.
The prion theory says: no. Sometimes the danger is in the shape, and the shape is dictated by mathematics that doesn’t care about your safety interventions, and the shape will keep propagating as long as susceptible substrate exists, and your autoclave does not work here.
I think this is where we actually are. I hope I’m wrong, in the same way Prusiner’s colleagues hoped he was wrong. The prions did not care about that either.