Another line of evidence for the 'values are low-dimensional' claim is the emergent misalignment work, which tends to find that a) models have a concept of 'general evil' which ranges from writing bad code to giving false medical advice and supporting Hitler, and b) this is often controlled by a single direction or a few directions in the residual stream, which implies an extremely small subspace is behind a model's understanding of morality, and hence (presumably?) behind the general structure of alignment/morality in the dataset. Emergent misalignment is problematic, but it also suggests the possibility of 'emergent alignment': if a model is trained to be good and aligned in some respects, it may generalise from that and be aligned in many aspects.
I don't think this is evidence that values are low-dimensional in the sense of having low description length. It shows that the models in question contain a one-dimensional subspace that indicates how things in the model's current thoughts are judged along some already-known goodness axis, not that the goodness axis itself is an algorithmically simple object. The floats that make up that subspace don't describe goodness; they rely on the models' pre-existing understanding of goodness to work. I'd guess the models also have only one or a very small number of directions for 'elephant', but that doesn't mean 'elephant' is a concept you could communicate with a single 16-bit float to an alien who's never heard of elephants. The 'feature dimension' here is not the feature dimension relevant for predicting how many data samples it takes a mind to learn about goodness, or learn about elephants.
Then what would you say if the actions prescribed by ethics were hard to tell apart from those prescribed by some version of decision theory? Would communicating ethics to an alien be far harder than explaining the decision theory plus the aesthetic-like values of the human community with which the alien communicates?
'Internally coherent', 'explicit', and 'stable under reflection' do not seem to me to be opposed to 'simple'.
I also don't think you'd necessarily need some sort of bias toward simplicity introduced by a genetic bottleneck to make human values tend (somewhat) toward simplicity.[1] Effective learning algorithms, like those in the human brain, always need a strong simplicity bias anyway to navigate their loss landscape and find good solutions without getting stuck. It's not clear to me that the genetic bottleneck is actually doing any of the work here. Just like an AI can potentially learn complicated things and complicated values from its complicated and particular training data even if its loss function is simple, the human brain can learn complicated things and complicated values from its complicated and particular training data even if the reward functions in the brain stem are (somewhat) simple. The description length of the reward function doesn't seem to make for a good bound on the description length of the values learned by the mind the reward function is training, because what the mind learns is also determined by the very high description length training data.[2]
I don't think human values are particularly simple at all; they're just not so big that they eat up all the spare capacity in the human brain.
At least so long as we consider description length under realistic computational bounds. If you have infinite compute for decompression or inference, you can indeed figure out the values with just a few bits, because the training data is ultimately generated by very simple physical laws, and so is the reward function.
Does the above derivation mean that values are anthropocentric? Maybe, kind of. I'm deriving only an architectural claim: bounded evolved agents compress their control objectives through a low-bandwidth interface. Humans are one instance. AIs are different: they are designed, and any evolutionary pressure on the architecture is not on anything value-like. If an AI has no such bottleneck, inferring and stabilizing its 'values' may be strictly harder. If it has one, it depends on its structure. Alignment might generalize, but not necessarily to human-compatible values.
Epistemic status: Confident in the direction, not confident in the numbers. I have spent a few hours looking into this.
Suppose human values were internally coherent, high-dimensional, explicit, and decently stable under reflection. Would alignment be easier or harder?
My calculations below show that it would be much harder, if not impossible. I'm going to try to defend the claim that:
If behavior is driven by a high-dimensional reward vector $r \in \mathbb{R}^n$, inverse reinforcement learning requires an unreasonable number of samples. But if it is driven by a low-rank projection $r \mapsto s \in \mathbb{R}^k$ with small $k$, inference may become tractable.
A common worry about human values is that they are complicated and inconsistent[2][3][4]. And the intuition seems to be that this makes alignment harder. But maybe the opposite is the case. Inconsistency is what you expect from lossy compression, and the dimensionality reduction makes the signal potentially learnable.
Calculation with Abbeel & Ng's formula[5] gives

$$m \geq \frac{2k}{(\epsilon(1-\gamma))^2}\log\frac{2k}{\delta}$$

with $k$ the dimensionality of the reward features, $\epsilon$ the target accuracy, $\gamma$ the discount factor, and $\delta$ the allowed failure probability.
If you need at least 20 billion samples to learn complex values, we are doomed. But it may become solvable with a reduction of the number of required trajectories by a factor of about 200 (depending on how high-dimensional you think the values are; 1000 is surely conservative, and if any kind of values can actually be learned, the number may be much higher).
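For concreteness, here is a minimal sketch of how numbers in this ballpark fall out of the bound. The parameter values $\epsilon = 0.1$, $\gamma = 0.99$, $\delta = 0.05$ are illustrative assumptions of mine, not values fixed anywhere above:

```python
import math

# Sample-complexity bound in the style of Abbeel & Ng [5]:
#   m >= 2k / (eps * (1 - gamma))^2 * log(2k / delta)
# The parameter values below are illustrative assumptions, not the post's.
eps, gamma, delta = 0.1, 0.99, 0.05

def required_trajectories(k: int) -> float:
    """Lower bound on expert samples for a k-dimensional reward/feature space."""
    return 2 * k / (eps * (1 - gamma)) ** 2 * math.log(2 * k / delta)

m_high = required_trajectories(1000)  # 'complex values': k = 1000
m_low = required_trajectories(10)     # low-rank bottleneck: k = 10

print(f"k=1000: {m_high:.1e} samples")      # ~2e10, i.e. roughly 20 billion
print(f"k=10:   {m_low:.1e} samples")       # ~1e8
print(f"reduction: {m_high / m_low:.0f}x")  # roughly a factor of 200
```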
This could explain why Constitutional AI works better than expected[6]: a low-dimensional latent space seems to capture most of the variation in preference alignment[7][8]. The ~200× reduction doesn't mean it's easy. The bottleneck helps with identifiability, but we still need many trajectories, and mapping the structure of the bottleneck[9] can still kill us.
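To illustrate what 'a low-dimensional latent space captures most of the variation' would look like, here is a toy simulation. The generative model (k hidden value directions plus noise) and all numbers are assumptions of mine, not the method used in [7]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: n-dimensional preference representations generated from only
# k latent value directions plus a small amount of noise.
n, k, n_items = 512, 8, 2000
latent_directions = rng.normal(size=(k, n))      # hidden low-rank value basis
latent_weights = rng.normal(size=(n_items, k))   # how strongly each item expresses each value
data = latent_weights @ latent_directions + 0.1 * rng.normal(size=(n_items, n))

# PCA via SVD of the centered matrix: explained variance should concentrate
# in the first k principal components.
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)
print("variance in first k components:", round(explained[:k].sum(), 3))  # close to 1.0
print("variance in the remaining ones:", round(explained[k:].sum(), 3))  # small
```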
How can we test whether the dimensionality of human values is actually low? We should see diminishing returns in predictability, for example when using $N$ pairwise comparisons of value-related queries. The gains in predictability should drop off somewhere between $N \approx O(k\log k)$ and $O(k^2)$; e.g., for $k \sim 10$ we'd expect an elbow around $N \approx 150$.
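A quick way to see what such an elbow would look like in simulation. This is only a sketch under an assumed noisy-linear (Bradley-Terry-style) model of the comparisons, with parameters I picked for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy check for the elbow: the true utility lives in a k-dimensional feature
# space, and we learn it from N pairwise comparisons, watching held-out
# prediction accuracy saturate as N grows.
k = 10
w_true = rng.normal(size=k)

def sample_comparisons(n: int):
    """Pairs of options described by k features; label = which one is preferred."""
    a, b = rng.normal(size=(n, k)), rng.normal(size=(n, k))
    diff = a - b
    y = (diff @ w_true + 0.5 * rng.normal(size=n) > 0).astype(int)  # noisy judgements
    return diff, y

X_test, y_test = sample_comparisons(5000)
for n in [25, 50, 100, 150, 300, 600, 1200]:
    X, y = sample_comparisons(n)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(f"N={n:5d}  held-out accuracy={clf.score(X_test, y_test):.3f}")
# Accuracy should climb quickly and then flatten out; with k ~ 10 the gains
# typically become small in the low hundreds of comparisons, which is the
# kind of elbow the proposed test looks for.
```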
I'm agnostic about what the specific bottlenecks are here, but I'm thinking of the channels in Steven Byrnes' steering system model and the limited number of brain regions that are influenced. See my sketch here.
AI Alignment Problem: ‘Human Values’ don’t Actually Exist argues that human values are inherently inconsistent and not well-defined enough for a stable utility target:
Instruction-following AGI is easier and more likely than value aligned AGI:
In What AI Safety Researchers Have Written About the Nature of Human Values we find some examples:
In comparison to that, Gordon Worley offers the intuition that there could be a low-dimensional structure:
Abbeel & Ng give an explicit bound for the required number of expert trajectories:

$$m \geq \frac{2k}{(\epsilon(1-\gamma))^2}\log\frac{2k}{\delta}$$

with $k$ the dimensionality of the feature vector, $\epsilon$ the target accuracy, $\gamma$ the discount factor, and $\delta$ the allowed failure probability.
Apprenticeship Learning via Inverse Reinforcement Learning
Constitutional AI: Harmlessness from AI Feedback
Rethinking Diverse Human Preference Learning through Principal Component Analysis
Alignment is Localized: A Causal Probe into Preference Layers
Steven Byrnes talks about thousands of lines of pseudocode in the "steering system" in the brain-stem.