As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day for 25 days. Or until I run out of hot takes.
Everyone knows what the deal with natural abstractions is, right? Abstractions are regularities about the world that are really useful for representing its coarse-grained behavior - they're building blocks for communicating, compressing, or predicting information about the world. An abstraction is "natural" if it's so easy to learn, and so broadly useful, that most right-thinking agents will have it as part of their toolbox of abstractions.
The dream is to use natural abstractions to pick out what we want from an AI. Suppose "human values" are a natural abstraction: then both humans and a world-modeling AI would have nearly the exact same human values abstraction in their toolboxes of abstractions. If we can just activate the AI's human values abstraction, we can more or less avoid misalignment between what-humans-are-trying-to-pick-out and what-abstraction-the-AI-takes-as-its-target.
One might think that the main challenge to this plan would be if there are too few natural abstractions. If human values (or agency, or corrigibility, or whatever nice thing you want to target) aren't a natural abstraction, you lose that confidence that the human and the AI are pointing at the same thing. But it's also a challenge if there are too many natural abstractions.
Turns out, humans don't just have one abstraction that is "human values," they have a whole lot of 'em. Humans have many different languages / ontologies we use to talk about people, and these use different abstractions as building blocks. More than one of these abstractions gets called "human values," but they're living in different ontologies / get applied in different contexts.
If none of these abstractions we use to talk about human values are natural, then we're back to the first problem. But if any of them are natural, it seems just as plausible that nearly all of them are. Abstractions don't even have to be discrete - it's perfectly possible to have a continuum.
This complicates the easy alignment plan, because it means that the structure of the world is merely doing most of the work for us rather than almost all of the work. The bigger the space of semantically-similar natural abstractions you have to navigate, the more you have to be careful about your extensional definitions, and the higher standards you have to have for telling good from bad results.
It's enough for natural abstractions to work for strawberry alignment: solving a technical task with a good understanding of what it means to not leave any weird side effects, without strongly optimizing the world in the process, and safely shutting down once the task is complete. With uploads, ambitious alignment becomes much more feasible, even if it doesn't have a natural specification.
I think the NAH does a lot of work for interpretability of an AI's beliefs about things that aren't values, but I'm pretty skeptical about the "human values" natural abstraction. I think the points made in this post are good, and relatedly, I don't want the AI to be aligned to "human values"; I want it to be aligned to my values. I think there's a pretty big gap between my values and those of the average human, even subjected to something like CEV, and that this is probably true for other LW/EA types as well. Human values as they exist in nature contain fundamental terms for the in-group, disgust-based values, etc.
If your values don't happen to have the property of giving the world back to everyone else, building an AGI with your values specifically (when there are no other AGIs yet) is taking over the world. Hence human values, something that would share influence by design, a universalizable objective for everyone to agree to work towards.
On the other hand, succeeding in directly (without AGI assistance) building aligned AGIs with fixed preferences seems much less plausible (in time to prevent AI risk) than building task AIs that create uploads of specific people (a particularly useful application of strawberry alignment), to bootstrap alignment research that's actually up to the task of aligning preferences (ambitious alignment). And those uploads are agents of their own values, not human values, which makes this a governance problem.
Consider the subset of "human values" that we'd be "happy" (were we fully informed) for powerful systems to optimise for.
[Weaker version: "the subset of human values that it is existentially safe for powerful systems to optimise for".]
Let's call this subset "ideal values".
I'd guess that the "most natural" abstraction isn't "ideal values" themselves but something like "the minimal latents of ideal values".