Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A delayed hot take. This is pretty similar to previous comments from Rohin.

Shard theory alignment requires magic - not in the sense of magic spells, but in the technical sense of steps we need to remind ourselves we don't know how to do. Locating magic is an important step in trying to demystify it.

"Shard theory alignment" means building an AI that does good things and not bad things by encouraging an RL agent to want to do good things, via kinds of reward shaping analogous to the diamond maximizer example.

How might the story go?

  1. You start out with some unsupervised model of sensory data.
  2. On top of its representation of the world you start training an RL agent, with a carefully chosen curriculum and a reward signal that you think matches "goodness in general" on that curriculum distribution.
  3. This cultivates shards that want things in the vicinity of "what's good according to human values."
  4. These start out as mere bundles of heuristics, but eventually they generalize far enough to be self-reflective, promoting goal-directed behavior that takes into account the training process and the possibility of self-modification.
  5. At this point the values will lock themselves in, and future behavior will be guided by the abstractions in the learned representation of the world that the shards used to get good results in training, not by what would actually maximize the reward function you used.

The magic here is especially concentrated around how we end up with the right shards.

One magical process is how we pick the training curriculum and reward signal. If the curriculum is made up only of simple environments, then the RL agent will learn heuristics that don't need to refer to humans. But if you push the complexity up too fast, the RL process will fail, or the AI will be more likely to learn heuristics that are better than nothing but aren't what we intended. Does a goldilocks zone where the agent learns more-or-less what we intended exist? How can we build confidence that it does, and that we've found it?

And what's in the curriculum matters a lot. Do we try to teach the AI to locate "human values" by having it be prosocial towards individuals? Which ones? To groups? Over what timescale? How do we reward it for choices on various ethical dilemmas? Or do we artificially suppress the rate of occurrence of such dilemmas? Different choices will lead to different shards. We wouldn't need to find a unique best way to do things (that's a boondoggle), but we would need to find some way of doing things that we trust enough.

Another piece of magic is how the above process lines up with generalization and self-reflectivity. If the RL agent becomes self-reflective too early, it will lock in simple goals that we don't want. If it becomes self-reflective too late, it will have started exploiting unintended maxima of the reward function. How do we know when we want the AI to lock in its values? How do we exert control over that?

---

If shard theory alignment seemed like it has few free parameters, and doesn't need a lot more work, then I think you failed to see the magic. I think the free parameters haven't been discussed enough precisely because they need so much more work.

The part of the magic that I think we could start working on now is how to connect curricula and learned abstractions. In order to predict that a certain curriculum will cause an AI to learn what we think is good, we want to have a science of reinforcement learning advanced in both theory and data. In environments of moderate complexity (e.g. Atari, MuJoCo), we can study how to build curricula that impart different generalization behaviors, and try to make predictive models of this process. Even if shard theory alignment doesn't pan out, this sounds like good blue-sky research.

The part of the magic I think we're not ready for is self-reflectivity. Surely there's an in-principle solution to lining up the timings of desired shard formation and value lock-in, but there doesn't have to be a way for us to learn this solution in a timely manner. At best we need more abstract thought about this issue in order to figure out what an approach might even look like, and at worst I think this is a problem that necessitates a different approach.

The question of "which values?" is in a similar spot. In the story I gave above of shard theory alignment, we directly train an RL agent to learn values from some curriculum. But that's not necessarily the only solution. Yes, maybe we could build a curriculum for some good-enough interpretation of human values, informed by a future advanced science of RL. But we could also train for some value-finding process at a higher meta-level, for example. I think to a large extent, we're not ready for this question because we don't know what's easier to do with future science of RL - it's counting our chickens before they're hatched. But also, this is another reason to keep your eyes open for other approaches.

10 comments

"Magic," of course, in the technical sense of stuff we need to remind ourselves we don't know how to do. I don't mean this pejoratively, locating magic is an important step in trying to demystify it.

I think this title suggests a motte/bailey, and also seems clickbait-y. I think most people scanning the title will conclude you mean it in a pejorative sense, such that shard theory requires impossibilities or unphysical miracles to actually work. I think this is clearly wrong (and I imagine you to agree). As such, I've downvoted for the moment.

AFAICT your message would be better conveyed by "Shard theory alignment has many free parameters."

If shard theory alignment seemed like it has few free parameters, and doesn't need a lot more work, then I think you failed to see the magic. I think the free parameters haven't been discussed enough precisely because they need so much more work.

The things you say about known open questions mostly seem like things I historically have said (which is not at all to say that you haven't brought any new content to the table!). For example,

  • "In environments of moderate complexity (e.g. Atari, MuJoCo), we can study how to build curricula that impart different generalization behaviors, and try to make predictive models of this process. Even if shard theory alignment doesn't pan out, this sounds like good blue-sky research." -> My shortform comment from last July. 
  • I also talk about many of the same things in the original diamond alignment post ("open questions" which I called "real research problems") and in my response to Nate's critique of it:
    • "I agree that the reflection process seems sensitive in some ways. I also give the straightforward reason why the diamond-values shouldn't blow up: Because that leads to fewer diamonds. I think this a priori case is pretty strong, but agree that there should at least be a lot more serious thinking here, eg a mathematical theory of value coalitions."

You might still maintain there should be more discussion. Yeah, sure. [EDIT: clipped a part which I think isn't responding to what you meant to communicate]

and doesn't need a lot more work,

I think it's quite plausible that you don't need much more work for shard theory alignment, because value formation really is that easy / robust.

If the curriculum is made up only of simple environments, then the RL agent will learn heuristics that don't need to refer to humans. But if you push the complexity up too fast, the RL process will fail, or the AI will be more likely to learn heuristics that are better than nothing but aren't what we intended. Does a goldilocks zone where the agent learns more-or-less what we intended exist? How can we build confidence that it does, and that we've found it?

Not obvious to me that there is something like a one-dimensional "goldilocks" zone, buttressed by bad zones. An interesting framing, but not one that feels definitive (if you meant it that way).

At best we need more abstract thought about this issue in order to figure out what an approach might even look like, and at worst I think this is a problem that necessitates a different approach.

At best, shard theory alignment just works as-is, with some thought being taken like in the diamond alignment post. 

I'll have to eat the downvote for now - I think it's worth it to use magic as a term of art, since it's 11 fewer words than "stuff we need to remind ourselves we don't know how to do," and I'm not satisfied with "free parameters."

I think it's quite plausible that you don't need much more work for shard theory alignment, because value formation really is that easy / robust.

But how do we learn that fact?

If extremely-confident-you says "the diamond-alignment post would literally work" and I say "what about these magical steps where you make choices without knowing how to build confidence in them beforehand" and extremely-confident-you says "don't worry, most choices work fine because value formation is robust," how did they learn that value formation is robust in that sense?

I think it is unlikely but plausible that shard theory alignment could turn out to be easy, if only we had the textbook from the future. But I don't think it's plausible that getting that textbook is easy. Yes, we have arguments about human values that are suggestive, but I don't see a way to go from "suggestive" to "I am actually confident" that doesn't involve de-mystifying the magic.

Wouldn't "Shard theory requires work" or "Shard theory requires novel insights" work?

Perhaps just [Shard theory alignment requires "magic"] to indicate that the word is used in a different way?

I think it's worth it to use magic as a term of art, since it's 11 fewer words than "stuff we need to remind ourselves we don't know how to do," and I'm not satisfied with "free parameters."

11 fewer words, but I don't think it communicates the intended concept! 

If you have to say "I don't mean one obvious reading of the title" as the first sentence, it's probably not a good title. This isn't a dig -- titling posts is hard, and I think it's fair to not be satisfied with the one I gave. I asked ChatGPT to generate several new titles; lightly edited:

  1. "Uncertainties left open by Shard Theory"
  2. "Limitations of Current Shard Theory"
  3. "Challenges in Applying Shard Theory"
  4. "Unanswered Questions of Shard Theory"
  5. "Exploring the Unknowns of Shard Theory"

After considering these, I think that "Reminder: shard theory leaves open important uncertainties" is better than these five, and far better than the current title. I think a better title is quite within reach.

But how do we learn that fact?

I didn't claim that I assign high credence to alignment just working out, I'm saying that it may as a matter of fact turn out that shard theory doesn't "need a lot more work," because alignment works out as a matter of fact from the obvious setups people try. 

  1. There's a degenerate version of this claim, where ST doesn't need more work because alignment is "just easy" for non-shard-theory reasons, and in that world ST "doesn't need more work" because alignment itself doesn't need more work. 
  2. There's a less degenerate version of the claim, where alignment is easy for shard-theory reasons -- e.g. agents robustly pick up a lot of values, many of which involve caring about us.

"Shard theory doesn't need more work" (in sense 2) could be true as a matter of fact, without me knowing it's true with high confidence. If you're saying "for us to become highly confident that alignment is going to work this way, we need more info", I agree. 

But I read you as saying "for this to work as a matter of fact, we need X Y Z additional research":

At best we need more abstract thought about this issue in order to figure out what an approach might even look like, and at worst I think this is a problem that necessitates a different approach.

And I think this is wrong. 2 can just be true, and we won't justifiably know it. So I usually say "It is not known to me that I know how to solve alignment", and not "I don't know how to solve alignment." 

Does that make sense?

Since it was evidently A Thing, I have caved to peer pressure :P

"Shard theory doesn't need more work" (in sense 2) could be true as a matter of fact, without me knowing it's true with high confidence. If you're saying "for us to become highly confident that alignment is going to work this way, we need more info", I agree. 

But I read you as saying "for this to work as a matter of fact, we need X Y Z additional research":

Yeah, this is a good point. I do indeed think that just plowing ahead wouldn't work as a matter of fact, even if shard theory alignment is easy-in-the-way-I-think-is-plausible, and I was vague about this.

This is because the way in which I think it's plausible for it to be easy is some case (3) that's even more restricted than (1) or (2). Like (3): If we could read the textbook from the future and use its ontology, maybe it would be easy / robust to build an RL agent that's aligned because of the shard theory alignment story.

To back up: in nontrivial cases, robustness doesn't exist in a vacuum - you have to be robust to some distribution of perturbations. For shard theory alignment to be easy, it has to be robust to the choices we have to make about building AI, and specifically to the space of different ways we might make those choices. This space of different ways we could make choices depends on the ontology we're using to think about the problem - a good ontology / way of thinking about the problem makes the right degrees of freedom "obvious," and makes it hard to do things totally wrong.

I think in real life, if we think "maybe this doesn't need more work and we just don't know it yet," what's actually going to happen is that for some of the degrees of freedom we need to set, we're going to be using an ontology that allows for perturbations where the thing's not robust, depressing the chances of success exponentially.
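To put rough numbers on "exponentially," here is a minimal sketch. It assumes, purely for illustration, that there are n degrees of freedom, that each one independently lands in a robust region with probability p, and that success requires all of them to do so. The function name, the independence assumption, and the value p = 0.9 are hypothetical and not taken from the discussion above.

```python
# Hypothetical model: n design choices, each independently landing in a
# "robust" region with probability p. Success requires every choice to land
# there, so the overall chance is p ** n. Independence is an assumption.
def success_probability(p: float, n: int) -> float:
    return p ** n

# Print how the overall chance shrinks as the number of choices grows.
for n in (1, 5, 10, 20):
    print(n, round(success_probability(0.9, n), 3))
```

Under these assumptions, even a 90% chance per choice leaves roughly a one-in-three chance of getting ten choices right. Real choices are unlikely to be independent, so this is only a picture of the shape of the problem.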

Why do you believe that “But if you push the complexity up too fast, the RL process will fail, or the AI will be more likely to learn heuristics that are better than nothing but aren't what we intended”?

I understand why this could cause the AI to fail, but why might it learn incorrect heuristics?

I mean something like getting stuck in local optima on a hard problem. An extreme example would be if I try to teach you to play chess by having you play against Stockfish over and over, and give you a reward for each piece you capture - you're going to learn to play chess in a way that trades pieces short-term but doesn't win the game.

Or, like, if you think of shard formation as inner alignment failure that works on the training distribution, the environment being too hard to navigate shrinks the "effective" training distribution that inner alignment failures generalize over.
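The chess example can be made concrete with a toy sketch. It is not chess and has no Stockfish or learning dynamics in it. It is a hypothetical four-turn game in which "capture" earns shaped reward but worsens the position, while "develop" earns nothing but improves it. All names and numbers are made up for illustration. It only shows that the plan that maximizes the per-capture reward is not a plan that wins.

```python
from itertools import product

# Toy stand-in for the chess example (all names and numbers are hypothetical).
# Each turn the agent either captures (+1 shaped reward, worse position)
# or develops (no shaped reward, better position).
TURNS = 4

def play(actions):
    """Return (shaped_reward, won) for a fixed sequence of actions."""
    shaped_reward = 0
    position = 0
    for action in actions:
        if action == "capture":
            shaped_reward += 1
            position -= 1
        else:  # "develop"
            position += 1
    won = position > 0
    return shaped_reward, won

# Enumerate every possible plan for the whole game.
all_plans = list(product(["capture", "develop"], repeat=TURNS))

# The plan an optimizer of the shaped reward alone would pick.
best_for_shaped = max(all_plans, key=lambda plan: play(plan)[0])

# The most shaped reward available to any plan that actually wins.
winning_plans = [plan for plan in all_plans if play(plan)[1]]
best_shaped_while_winning = max(play(plan)[0] for plan in winning_plans)

print(best_for_shaped, play(best_for_shaped))
print(best_shaped_while_winning)
```

In this toy, the reward-maximizing plan captures on every turn and loses, while any winning plan has to give up most of the shaped reward. That is the sense in which the learned heuristic is better than nothing but not what we intended.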

I think your objections are all basically correct, but that you treat them as dealbreakers in ways that I (a big shard-alignment fan) don't. As I understand it, your objections boil down to 1. picking the training curriculum/reward signal is hard (and design choices pose a level of challenge beyond the simple empirical does-it-work-to-produce-an-AGI) and 2. reflectivity is very hard and might cause lots of big problems, and we can’t begin to productively engage with those issues right now.


I don’t think that curriculum and reward signal are as problematic as you seem to think. From the standpoint of AI notkilleveryoneism, I think that basically any set of prosocial/human-friendly values will be sufficient, and that something directionally correct will be very easy to find. The design choices described as relating to “what’s in the curriculum” seem of secondary importance to me-- in all but the least iterative-design-friendly worlds, we can figure this out as we go, and if we figure out the notkilleveryoneism/basic corrigibility stuff in hard-takeoff worlds we would probably be able to slow down AI development long enough for iteration.


The reflectivity stuff 100% does cause huge problems that we don’t know how to solve, but I break with you in two places here-- firstly, you seem to think that we can’t do much empirical or theoretical work right now to improve our understanding of reflective processes; and secondly, you seem to assume that reflectivity involves or induces additional challenges that IMO can very readily be avoided. Regarding the former point, I think I’m doing empirical work right now that can plausibly help improve our understanding of reflectivity, and Peli Grietzer is doing theoretical work (on what he calls "praxis-based values," based on "doing X X-ingly... the intuition that some reflective values are an uroboros of means and ends") that engages with these problems as well. There’s lots of low-hanging fruit here, and for an approach to alignment that’s only been in play for about a year I think a lot of progress has been made.


Regarding the latter point, I think lots of your points surrounding lock-in might be stated too strongly. I’m a reflective goal-directed agent, and I don’t think my values are “locked in”; I can and do change my behaviors and moral views in response to new information and circumstances. Maybe you think that “lock-in” involves actual self-modification, so that e.g. an aspiring vegan would reengineer their tastebuds so that meat tastes horrible-- but creating shards that discourage this kind of behavior seems easy as pie. Overall, the problems involving “lock-in” don’t seem as hard to me as they do to you-- I think the bigger dangers (and ones we currently don’t know how to address, but might soon) are unknown unknowns and other reflectivity problems, especially those involving how desirable shards might interact in undesirable ways and push our agent towards bizarre and harmful behaviors.

you seem to think that we can’t do much empirical or theoretical work right now to improve our understanding of reflective processes

We can certainly do research now that builds towards the research we eventually need to do. But if the empirical work you're doing right now can predict when an RL agent will start taking actions to preserve its own goals, I will be surprised and even more interested than I already am.

Regarding the latter point, I think lots of your points surrounding lock-in might be stated too strongly. I’m a reflective goal-directed agent, and I don’t think my values are “locked in”; I can and do change my behaviors and moral views in response to new information and circumstances. Maybe you think that “lock-in” involves actual self-modification, so that e.g. an aspiring vegan would reengineer their tastebuds so that meat tastes horrible-- but creating shards that discourage this kind of behavior seems easy as pie. Overall, the problems involving “lock-in” don’t seem as hard to me as they do to you

Lock-in is the process that stops the RL agent from slipping down the slope to actually maximizing the reward function as written. An example in humans would be how you avoid taking heroin specifically because you know that it would strongly stimulate the literal reward calculation of your brain.

You seem to be making an implied argument like "this isn't a big problem for me, a human, so it probably happens by default in a good way in future RL agents," and I don't find that implied argument valid.

I think the bigger dangers (and ones we currently don’t know how to address, but might soon) are unknown unknowns and other reflectivity problems, especially those involving how desirable shards might interact in undesirable ways and push our agent towards bizarre and harmful behaviors.

What sort of stuff would be an example of that latter problem? If a shard-condensation process can lead to such human-undesirable generalization taken collectively, why should the individual shards that it condenses generalize the way we want when taken individually?