I. The Optimizer Stack
Before we can talk about aligning AI systems, we need to be precise about what kind of problem we're actually facing. The clearest way to see it is to look at the optimizer stack that already exists in nature.
Evolution by natural selection is a meta-optimizer. It doesn't "want" anything — it's a blind process that selects for genetic configurations that produce differential reproductive success. Over deep time, this meta-optimizer produced us: conscious agents who navigate the world by pursuing things that feel good and avoiding things that feel bad. We are the mesa-optimizers that evolution built: sub-agents serving the meta-optimizer above us in the stack.
But here's what's critical about that: evolution didn't hand us its objective function. It couldn't. Reproductive fitness is a diffuse, long-horizon, context-dependent quantity that no biological organism could compute in real time. So instead, evolution installed proxies — hunger, sexual desire, pain, social status, sweetness on the tongue. In the ancestral environment, these proxies correlated reliably enough with fitness that optimizing for them tended to produce fitness-increasing behavior.
The proxies worked. Until the context shifted.
A human eating calorie-dense food in an environment of scarcity is optimizing in a way that happens to track fitness. The same human eating the same food in a modern environment of abundance risks obesity, diminished reproductive prospects, and an early death. The proxy — "sugar tastes good, seek it out relentlessly" — didn't generalize. The mesa-optimizer (us) faithfully followed its optimization signal (valence — it feels good), and the meta-optimizer's objective (fitness) was left behind entirely.
This is Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. But my framing here is meant to be more specific than Goodhart's statement. The problem isn't that the proxy was badly chosen. In the original context, it was an excellent proxy. The problem is that no proxy, however well-chosen, is guaranteed to generalize across all contexts the optimizer will encounter. Proxies are, by their nature, context-dependent approximations of the thing you actually care about.
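To make this concrete, here is a minimal toy sketch. Everything in it, including the foods, the numbers, and the shape of the fitness function, is an illustrative invention rather than a model of real metabolism: a mesa-optimizer greedily follows a proxy signal (sweetness) that tracks the true objective (fitness) in one context and diverges from it in another.

```python
NEED = 2000.0  # daily calorie need (illustrative number)

def true_fitness(calories: float) -> float:
    """Toy fitness: calories help up to NEED, then each surplus calorie hurts."""
    return min(calories, NEED) - max(0.0, calories - NEED)

def proxy_choice(foods):
    """The mesa-optimizer: pick whatever scores highest on the proxy (sweetness)."""
    return max(foods, key=lambda food: food[1])

def oracle_choice(foods):
    """What the meta-optimizer 'wants': pick the fitness-maximizing food."""
    return max(foods, key=lambda food: true_fitness(food[2]))

# (name, sweetness, calories) -- all values illustrative
scarcity  = [("berries", 0.9, 300.0), ("roots", 0.2, 250.0), ("leaves", 0.1, 50.0)]
abundance = [("soda", 1.0, 3500.0), ("bread", 0.4, 900.0), ("salad", 0.1, 200.0)]

for label, env in [("scarcity", scarcity), ("abundance", abundance)]:
    p, o = proxy_choice(env), oracle_choice(env)
    print(f"{label:9s}: proxy picks {p[0]:7s} (fitness {true_fitness(p[2]):4.0f}), "
          f"oracle picks {o[0]:7s} (fitness {true_fitness(o[2]):4.0f})")
# scarcity : proxy picks berries (fitness  300), oracle picks berries (fitness  300)
# abundance: proxy picks soda    (fitness  500), oracle picks bread   (fitness  900)
```

In the scarcity environment the two policies agree. In the abundance environment the proxy-follower is still executing its signal perfectly, and the objective it was supposed to serve falls behind anyway.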
II. The Specification Problem in AI Alignment
This is exactly the problem we face in aligning generally intelligent AI systems, and the word generalize is doing the heavy lifting.
We are now the meta-optimizers. We are building mesa-optimizers — AI systems trained to produce outputs that score well on our specified objectives. And these systems are increasingly capable of doing exactly what we train them to do. If we train a model to produce responses that align with a model specification in a given environment, it will do just that. The capability is not in question.
The question is whether the objectives we specify will generalize — whether they will continue to track what we actually value as these systems encounter contexts that differ from their training distribution. And everything we know about proxy-based optimization says they won't. Not because the proxies are poorly chosen. Not because we aren't trying hard enough. But because specification inherently produces proxies, and proxies inherently fail to generalize.
This is the specification problem. It is not a problem of insufficient cleverness in specifying what we want. It is a structural feature of the relationship between any finite specification and the open-ended space of unforeseen contexts it must cover. RLHF, constitutional AI, model specifications written in English — these are all proxies. Sophisticated proxies. Carefully chosen proxies. But proxies that are, by the same logic that governs evolution's proxies, guaranteed to diverge from what we actually value once the context shifts far enough.
III. From Specification to Measurement
Where does that leave us? If every specification is a proxy, and every proxy eventually diverges from the thing it approximates, then the natural question is: what is the thing itself?
What are all these proxies proxies of?
Consider an analogy. Temperature is a physical property that exists in all contexts. It doesn't matter what thermometer you use — mercury, infrared, thermocouple — the reading is causally influenced by the same underlying quantity. You can start with a bad thermometer and iteratively improve it, because there's a real thing out there constraining your measurements. Temperature doesn't generalize because someone specified it well. It generalizes because it's a physical fact.
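That iterability is worth spelling out. In the toy sketch below, a thermometer with the wrong gain and offset is corrected against two known reference points, the way the freezing and boiling points of water were once used. The gain, offset, noise level, and reference points are all assumptions invented for illustration; the point is that a real quantity constrains every reading, which is what makes correction possible.

```python
import random

random.seed(1)

def bad_thermometer(true_temp: float) -> float:
    """A crude instrument: wrong gain, wrong offset, a little noise."""
    return 0.8 * true_temp + 5.0 + random.gauss(0.0, 0.5)

def calibrate(instrument, ref_low: float, ref_high: float, samples: int = 200):
    """Fit a linear correction using two known reference points."""
    low  = sum(instrument(ref_low)  for _ in range(samples)) / samples
    high = sum(instrument(ref_high) for _ in range(samples)) / samples
    gain = (ref_high - ref_low) / (high - low)
    offset = ref_low - gain * low
    return lambda t: gain * instrument(t) + offset

corrected = calibrate(bad_thermometer, 0.0, 100.0)
for t in (20.0, 37.0, 60.0):
    print(f"true {t:5.1f}  raw {bad_thermometer(t):5.1f}  corrected {corrected(t):5.1f}")
# Raw readings are off by a systematic factor; corrected readings land
# within the instrument's noise of the true temperature.
```

No analogous correction procedure exists for a specification, because there is no external quantity for its errors to be errors about.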
Now consider what happened with evolution's proxies. Evolution installed hunger, sweetness, social reward — all of which were "thermometers" roughly calibrated to fitness. But they weren't measuring fitness. They were measuring something else: how the organism felt. The mesa-optimizer — the conscious agent — wasn't optimizing for fitness at all. It was optimizing for valence. For the felt quality of its own conscious states. Toward states that feel better, away from states that feel worse.
Valence — the spectrum from suffering to wellbeing — is what we are actually optimizing for. It is what all of evolution's proxies were proxies of, from the perspective of the conscious agent doing the optimizing. And if valence is not just a useful concept but a real, measurable physical property of conscious states — if valence realism is true — then it has the same character as temperature: a quantity that exists in all contexts, that causally influences any properly constructed instrument, and that can be measured with increasing precision over time.
This reframes the alignment problem. The mistake was never that we specified our values poorly; it was assuming that specification was the right approach at all. If there exists an objective physical quantity that is what conscious agents are optimizing for, then alignment is not a specification problem. It is a measurement problem. And measurement problems, unlike specification problems, have a path to generalization: build better instruments.
IV. How Is This Not Just RLHF?
A natural objection: aren't we already doing something like this? RLHF trains a reward model on human feedback. Isn't that "measuring" what humans value?
No. RLHF is still specification. It trains a static reward model on human preference data collected within a particular distribution. When the AI system encounters a situation outside that distribution, the reward model extrapolates from patterns it learned during training. It is a map, and maps go stale. Using RLHF to navigate the unprecedented contexts that transformatively capable AI systems will encounter is like using a map of 1700s Manhattan to navigate New York City today.
The measurement approach is structurally different. A conscious agent encountering a novel situation still experiences valence in that situation. The felt quality of experience doesn't extrapolate from past data — it responds directly to the present configuration of the agent's conscious state. This is the difference between a frozen map and a live instrument. The map can be wrong in new territory, with no way to know it. The instrument keeps tracking the real quantity, because it is causally connected to that quantity, not trained on historical correlates of it.
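The contrast is easy to make concrete. In the sketch below, the "frozen map" is a linear model fit once on in-distribution data and never updated, while the "live instrument" reads the underlying quantity directly with a little sensor noise. The sine function standing in for the true quantity, and every number here, is an illustrative assumption, not a claim about what valence looks like:

```python
import math
import random

random.seed(2)

def true_value(x: float) -> float:
    """The real quantity (standing in, very loosely, for valence)."""
    return math.sin(x)

# "Training distribution": contexts x in [0, 1.5], where sin(x) happens to look linear.
xs = [i * 0.05 for i in range(31)]
ys = [true_value(x) for x in xs]

# Frozen map: a least-squares line, fit once and never updated.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
frozen_map = lambda x: slope * x + intercept

# Live instrument: reads the real quantity directly, with a little sensor noise.
live_instrument = lambda x: true_value(x) + random.gauss(0.0, 0.02)

for x in (0.5, 1.0, 3.0, 4.5):  # the last two are far off-distribution
    print(f"x={x:3.1f}  true {true_value(x):+.2f}  "
          f"map {frozen_map(x):+.2f}  instrument {live_instrument(x):+.2f}")
# On [0, 1.5] the map and the instrument agree; at x=3.0 and x=4.5 the map
# confidently reports values near +2 and +3 while the truth is +0.14 and -0.98.
```

The map's failure is silent: nothing inside the fitted line signals that it has left the territory it was built from. The instrument degrades only with its own noise, because it is coupled to the quantity itself.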
This distinction becomes critical as AI systems grow more capable and begin operating in contexts that no human has encountered before. A specified proxy, however sophisticated, can only reflect the contexts in which it was designed. A measurement of an objective physical quantity reflects whatever context the instrument finds itself in. Temperature doesn't stop being temperature on Mars. If valence is physical, it doesn't stop being valence in unprecedented circumstances.
V. Open Questions and What Comes Next
I want to be transparent about what this framework does and does not establish.
It does establish that the specification problem is structural, not a matter of insufficient effort. It does establish that if valence is an objective physical quantity, the alignment problem admits a fundamentally different approach — one based on measurement rather than specification. And it does establish that this measurement approach has a structural advantage in novel contexts where specification-based proxies are expected to fail.
What it does not yet provide is a detailed analysis of how measurement-based alignment avoids its own proxy problems. There is a proxy chain between raw valence and the AI system's optimization. Conscious agents are imperfect instruments, our current tools for reading their states are crude, and the AI system's model of the valence landscape can drift from reality. I am actively working on this proxy chain analysis, and it will be the subject of a follow-up post. The short preview: the proxy chain for measurement has a fundamentally different structure from the proxy chain for specification, because at its base there is a causal connection to the real quantity rather than a formal approximation of it. This difference has concrete implications for robustness against Goodharting. More on this soon.
There are also foundational questions I have not addressed here. Is valence realism correct? What is the physical basis of valence? Can we build instruments precise enough to measure it? These are significant open problems. But they are empirical problems — problems that admit of progressive refinement and iterative improvement — rather than the philosophical dead end of trying to formally specify something that resists formal specification.
If you are working on these questions, or if this framing resonates with problems you've been thinking about, I want to hear from you. I am actively looking for collaborators — people working at the intersection of alignment, consciousness, and the physics of valence. This is the beginning of a research agenda, not a finished product, and I believe the urgency of the alignment problem demands that we explore every approach with genuine potential to generalize.
You can reach me here or at briceflorey@gmail.com.