A delayed hot take. This is pretty similar to previous comments from Rohin.
Shard theory alignment requires magic - not in the sense of magic spells, but in the technical sense of steps we need to remind ourselves we don't know how to do. Locating magic is an important step in trying to demystify it.
"Shard theory alignment" means building an AI that does good things and not bad things by encouraging an RL agent to want to do good things, via kinds of reward shaping analogous to the diamond maximizer example.
How might the story go?
- You start out with some unsupervised model of sensory data.
- On top of its representation of the world you start training an RL agent, with a carefully chosen curriculum and a reward signal that you think matches "goodness in general" on that curriculum distribution.
- This cultivates shards that want things in the vicinity of "what's good according to human values."
- These start out as mere bundles of heuristics, but eventually they generalize far enough to be self-reflective, promoting goal-directed behavior that takes into account the training process and the possibility of self-modification.
- At this point the values will lock themselves in, and future behavior will be guided by the abstractions in the learned representation of the world that the shards used to get good results in training, not by what would actually maximize the reward function you used.
The magic here is especially concentrated around how we end up with the right shards.
One magical process is how we pick the training curriculum and reward signal. If the curriculum is made up only of simple environments, then the RL agent will learn heuristics that don't need to refer to humans. But if you push the complexity up too fast, the RL process will fail, or the AI will be more likely to learn heuristics that are better than nothing but aren't what we intended. Does a Goldilocks zone where the agent learns more-or-less what we intended exist? How can we build confidence that it does, and that we've found it?
And what's in the curriculum matters a lot. Do we try to teach the AI to locate "human values" by having it be prosocial towards individuals? Which ones? To groups? Over what timescale? How do we reward it for choices on various ethical dilemmas? Or do we artificially suppress the rate of occurrence of such dilemmas? Different choices will lead to different shards. We wouldn't need to find a unique best way to do things (that's a boondoggle), but we would need to find some way of doing things that we trust enough.
Another piece of magic is how the above process lines up with generalization and self-reflectivity. If the RL agent becomes self-reflective too early, it will lock in simple goals that we don't want. If it becomes self-reflective too late, it will have started exploiting unintended maxima of the reward function. How do we know when we want the AI to lock in its values? How do we exert control over that?
If shard theory alignment seems like it has few free parameters and doesn't need a lot more work, then I think you've failed to see the magic. I think the free parameters haven't been discussed enough precisely because they need so much more work.
The part of the magic that I think we could start working on now is how to connect curricula and learned abstractions. In order to predict that a certain curriculum will cause an AI to learn what we think is good, we want to have a science of reinforcement learning advanced in both theory and data. In environments of moderate complexity (e.g. Atari, MuJoCo), we can study how to build curricula that impart different generalization behaviors, and try to make predictive models of this process. Even if shard theory alignment doesn't pan out, this sounds like good blue-sky research.
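To make "connect curricula and learned abstractions" a bit more concrete, here is a deliberately tiny sketch of the kind of experiment I have in mind. It is not shard theory and not a real benchmark: it's a made-up two-armed contextual bandit where the agent sees a "diamond" feature (what we care about) and a more salient "light" feature (a proxy). Every name and number in it (`make_obs`, `train`, the "narrow" and "broad" curricula, the 2x scale on the light) is an assumption I chose for illustration.

```python
import random

ACTIONS = (-1, 1)  # hypothetical actions: go left (-1) or go right (+1)

def make_obs(diamond, light):
    # Assumption: the proxy "light" feature is twice as salient as the diamond.
    return (float(diamond), 2.0 * light)

def q_value(weights, action, obs):
    # Linear action-value estimate, one weight vector per action.
    w = weights[action]
    return w[0] * obs[0] + w[1] * obs[1]

def train(curriculum, steps=4000, lr=0.05, seed=0):
    rng = random.Random(seed)
    weights = {a: [0.0, 0.0] for a in ACTIONS}
    for _ in range(steps):
        diamond = rng.choice(ACTIONS)
        # "narrow": the light always sits on the diamond's side.
        # "broad": the light is placed independently of the diamond.
        light = diamond if curriculum == "narrow" else rng.choice(ACTIONS)
        obs = make_obs(diamond, light)
        action = rng.choice(ACTIONS)      # uniform random exploration
        reward = float(action * diamond)  # +1 for moving toward the diamond
        err = reward - q_value(weights, action, obs)
        weights[action][0] += lr * err * obs[0]
        weights[action][1] += lr * err * obs[1]
    return weights

def greedy_action(weights, obs):
    return max(ACTIONS, key=lambda a: q_value(weights, a, obs))

def success_rate(weights, light_matches_diamond):
    # Fraction of the two possible diamond positions the greedy policy gets right.
    hits = 0
    for diamond in ACTIONS:
        light = diamond if light_matches_diamond else -diamond
        hits += greedy_action(weights, make_obs(diamond, light)) == diamond
    return hits / 2

narrow = train("narrow")
broad = train("broad")
print("narrow, light on opposite side:", success_rate(narrow, False))
print("broad,  light on opposite side:", success_rate(broad, False))
```

Both agents look identical on the narrow training distribution, but they have latched onto different features: the narrow-curriculum agent follows the light when the two cues are pulled apart, while the broad-curriculum agent follows the diamond. The hoped-for science would let us predict this sort of divergence ahead of time in environments far less trivial than this one.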
The part of the magic I think we're not ready for is self-reflectivity. Surely there's an in-principle solution to lining up the timings of desired shard formation and value lock-in, but there doesn't have to be a way for us to learn this solution in a timely manner. At best we need more abstract thought about this issue in order to figure out what an approach might even look like, and at worst I think this is a problem that necessitates a different approach.
The question of "which values?" is in a similar spot. In the story I gave above of shard theory alignment, we directly train an RL agent to learn values from some curriculum. But that's not necessarily the only solution. Yes, maybe we could build a curriculum for some good-enough interpretation of human values, informed by a future advanced science of RL. But we could also train for some value-finding process at a higher meta-level, for example. I think to a large extent, we're not ready for this question because we don't know what will be easier to do with a future science of RL - it's counting our chickens before they're hatched. But also, this is another reason to keep your eyes open for other approaches.