Another line of evidence for the 'values are low-dimensional' claim is the emergent misalignment work, which tends to find that a) models have a concept of 'general evil' which ranges from writing bad code to giving false medical advice and supporting Hitler, and b) this is often controlled by a single direction or a few directions in the residual stream, which implies an extremely small subspace is behind a model's understanding of morality, and hence (presumably?) behind the general structure of alignment/morality in the dataset. Emergent misalignment is problematic, but it also suggests the possibility of 'emergent alignment': if a model is trained to be good and aligned in some respects, it may generalise from that and be aligned in many aspects.
I don't think this is evidence that values are low-dimensional in the sense of having low description length. It shows that the models in question contain a one-dimensional subspace that indicates how things in the model's current thoughts are judged along some already-known goodness axis, not that the goodness axis itself is an algorithmically simple object. The floats that make up that subspace don't describe goodness; they rely on the models' pre-existing understanding of goodness to work. I'd guess the models also have only one or a very small number of directions for 'elephant', but that doesn't mean 'elephant' is a concept you could communicate with a single 16-bit float to an alien who's never heard of elephants. The 'feature dimension' here is not the feature dimension relevant for predicting how many data samples it takes a mind to learn about goodness, or learn about elephants.
Then what would you say if the actions prescribed by ethics were hard to tell apart from those prescribed by some version of decision theory? Would communicating ethics to an alien be far harder than explaining the decision theory plus the aesthetic-like values of the human community with which the alien communicates?
'Internally coherent', 'explicit', and 'stable under reflection' do not seem to me to be opposed to 'simple'.
I also don't think you'd necessarily need some sort of bias toward simplicity introduced by a genetic bottleneck to make human values tend (somewhat) toward simplicity.[1] Effective learning algorithms, like those in the human brain, always need a strong simplicity bias anyway to navigate their loss landscape and find good solutions without getting stuck. It's not clear to me that the genetic bottleneck is actually doing any of the work here. Just like an AI can potentially learn complicated things and complicated values from its complicated and particular training data even if its loss function is simple, the human brain can learn complicated things and complicated values from its complicated and particular training data even if the reward functions in the brain stem are (somewhat) simple. The description length of the reward function doesn't seem to make for a good bound on the description length of the values learned by the mind the reward function is training, because what the mind learns is also determined by the very high description length training data.[2]
I don't think human values are particularly simple at all; they're just not so big that they eat up all the spare capacity in the human brain.
At least so long as we consider description length under realistic computational bounds. If you have infinite compute for decompression or inference, you can indeed figure out the values with just a few bits, because the training data is ultimately generated by very simple physical laws, and so is the reward function.
Does the above derivation mean that values are anthropocentric? Maybe, kind of. I'm deriving only an architectural claim: bounded evolved agents compress their control objectives through a low-bandwidth interface. Humans are one instance. AIs are different: they are designed, and any evolutionary pressure on the architecture is not on anything value-like. If an AI has no such bottleneck, inferring and stabilizing its 'values' may be strictly harder. If it has one, it depends on its structure. Alignment might generalize, but not necessarily to human-compatible values.
Epistemic status: Confident in the direction, not confident in the numbers. I have spent a few hours looking into this.
Suppose human values were internally coherent, high-dimensional, explicit, and decently stable under reflection. Would alignment be easier or harder?
My calculations below show that it would be much harder, if not impossible. I'm going to try to defend the claim that:
If behavior is driven by a high-dimensional reward vector $r \in \mathbb{R}^n$, inverse reinforcement learning requires an unreasonable number of samples. But if it is driven by a low-rank projection $r \mapsto s \in \mathbb{R}^k$ with small $k$, inference may become tractable.
A common worry about human values is that they are complicated and inconsistent[2][3][4]. And the intuition seems to be that this makes alignment harder. But maybe the opposite is the case. Inconsistency is what you expect from lossy compression, and the dimensionality reduction makes the signal potentially learnable.
Calculation with Abbeel & Ng's formula[5] gives

$$m \geq \frac{2k}{(\epsilon(1-\gamma))^2}\log\frac{2k}{\delta}$$

with $k$ the dimensionality of the reward features, $\epsilon$ the target accuracy, $\gamma$ the discount factor, and $\delta$ the allowed failure probability.
If you need at least 20 billion samples to learn complex values, we are doomed. But it may become solvable with a reduction of the number of required trajectories by a factor of about 200 (depending on how high-dimensional you think the values are; 1000 is surely conservative, and if any kind of values can actually be learned, the number may be much higher).
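For concreteness, here is a minimal sketch of how numbers in this ballpark fall out of the bound. The parameter values $\epsilon = 0.1$, $\gamma = 0.99$, $\delta = 0.05$ are illustrative assumptions of mine, not values fixed anywhere above:

```python
import math

# Sample-complexity bound in the style of Abbeel & Ng [5]:
#   m >= 2k / (eps * (1 - gamma))^2 * log(2k / delta)
# The parameter values below are illustrative assumptions, not the post's.
eps, gamma, delta = 0.1, 0.99, 0.05

def required_trajectories(k: int) -> float:
    """Lower bound on expert samples for a k-dimensional reward/feature space."""
    return 2 * k / (eps * (1 - gamma)) ** 2 * math.log(2 * k / delta)

m_high = required_trajectories(1000)  # 'complex values': k = 1000
m_low = required_trajectories(10)     # low-rank bottleneck: k = 10

print(f"k=1000: {m_high:.1e} samples")      # ~2e10, i.e. roughly 20 billion
print(f"k=10:   {m_low:.1e} samples")       # ~1e8
print(f"reduction: {m_high / m_low:.0f}x")  # roughly a factor of 200
```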
This could explain why Constitutional AI works better than expected[6]: a low-dimensional latent space seems to capture most of the variation in preference alignment[7][8]. The ~200× reduction doesn't mean it's easy. The bottleneck helps with identifiability, but we still need many trajectories, and mapping the structure of the bottleneck[9] can still kill us.
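To illustrate what 'a low-dimensional latent space captures most of the variation' would look like, here is a toy simulation. The generative model (k hidden value directions plus noise) and all numbers are assumptions of mine, not the method used in [7]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: n-dimensional preference representations generated from only
# k latent value directions plus a small amount of noise.
n, k, n_items = 512, 8, 2000
latent_directions = rng.normal(size=(k, n))      # hidden low-rank value basis
latent_weights = rng.normal(size=(n_items, k))   # how strongly each item expresses each value
data = latent_weights @ latent_directions + 0.1 * rng.normal(size=(n_items, n))

# PCA via SVD of the centered matrix: explained variance should concentrate
# in the first k principal components.
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)
print("variance in first k components:", round(explained[:k].sum(), 3))  # close to 1.0
print("variance in the remaining ones:", round(explained[k:].sum(), 3))  # small
```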
How can we test whether the dimensionality of human values is actually low? We should see diminishing returns in predictability, for example when using $N$ pairwise comparisons of value-related queries. The gains in predictability should drop off somewhere between $N \approx O(k\log k)$ and $O(k^2)$; e.g., for $k \sim 10$ we'd expect an elbow around $N \approx 150$.
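A quick way to see what such an elbow would look like in simulation. This is only a sketch under an assumed noisy-linear (Bradley-Terry-style) model of the comparisons, with parameters I picked for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy check for the elbow: the true utility lives in a k-dimensional feature
# space, and we learn it from N pairwise comparisons, watching held-out
# prediction accuracy saturate as N grows.
k = 10
w_true = rng.normal(size=k)

def sample_comparisons(n: int):
    """Pairs of options described by k features; label = which one is preferred."""
    a, b = rng.normal(size=(n, k)), rng.normal(size=(n, k))
    diff = a - b
    y = (diff @ w_true + 0.5 * rng.normal(size=n) > 0).astype(int)  # noisy judgements
    return diff, y

X_test, y_test = sample_comparisons(5000)
for n in [25, 50, 100, 150, 300, 600, 1200]:
    X, y = sample_comparisons(n)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(f"N={n:5d}  held-out accuracy={clf.score(X_test, y_test):.3f}")
# Accuracy should climb quickly and then flatten out; with k ~ 10 the gains
# typically become small in the low hundreds of comparisons, which is the
# kind of elbow the proposed test looks for.
```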
I'm agnostic about what the specific bottlenecks are here, but I'm thinking of the channels in Steven Byrnes' steering system model and the limited number of brain regions that are influenced. See my sketch here.
AI Alignment Problem: ‘Human Values’ don’t Actually Exist argues that human values are inherently inconsistent and not well-defined enough for a stable utility target:
Instruction-following AGI is easier and more likely than value aligned AGI:
In What AI Safety Researchers Have Written About the Nature of Human Values we find some examples:
In comparison to that, Gordon Worley offers the intuition that there could be a low-dimensional structure:
Abbeel & Ng give an explicit bound for the required number of expert trajectories:

$$m \geq \frac{2k}{(\epsilon(1-\gamma))^2}\log\frac{2k}{\delta}$$

with $k$ the dimensionality of the feature vector, $\epsilon$ the target accuracy, $\gamma$ the discount factor, and $\delta$ the allowed failure probability.
Apprenticeship Learning via Inverse Reinforcement Learning
Constitutional AI: Harmlessness from AI Feedback
Rethinking Diverse Human Preference Learning through Principal Component Analysis
Alignment is Localized: A Causal Probe into Preference Layers
Steven Byrnes talks about thousands of lines of pseudocode in the "steering system" in the brain-stem.