Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Originally a shortform comment.

Imagine we train an AI on realistic situations where it can kill people, and penalize it when it does so. Suppose that we successfully instill a strong and widely activated "If going to kill people, then don't" value shard. 

Even assuming this much, the situation seems fragile. See, many value shards are self-chaining. In The shard theory of human values, I wrote about how:

  1. A baby learns "IF juice in front of me, THEN drink",
  2. The baby is later near juice, and then turns to see it, activating the learned "reflex" heuristic, learning to turn around and look at juice when the juice is nearby,
  3. The baby is later far from juice, and bumbles around until she's near the juice, whereupon she drinks the juice via the existing heuristics. This teaches "navigate to juice when you know it's nearby."
  4. Eventually this develops into a learned planning algorithm incorporating multiple value shards (e.g. juice and friends) so as to produce a single locally coherent plan.
  5. ...

The juice shard chains into itself, as its outputs cause the learning process to further reinforce and generalize the juice-shard. This shard reinforces itself across time and thought-steps. 

But a "don't kill" shard seems like it should remain... stubby? Primitive? The "don't kill" shard can't self-chain into not doing something. If you're going to kill someone, and then don't because of the don't-kill shard, and that avoids predicted negative reward... Then maybe the "don't kill" shard gets reinforced and generalized a bit because it avoided negative reward (and so reward was higher than predicted, which I think would trigger e.g. a reinforcement event in people). 
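To make the contrast concrete, here is a minimal toy sketch (an illustration, not something from the shard theory post). It assumes a shard can be summarized by a single scalar `strength`, that each activation is reinforced by a fixed `lr`, and that the only difference between the two shards is whether the activation rate depends on the shard's own strength. The function name, parameters, and numbers are all hypothetical.

```python
def simulate_shard(steps, self_chaining, lr=0.2, ambient_rate=0.1, start=0.1):
    """Expected-value toy model of one shard's strength over time.

    Assumption (not from the post): strength is a scalar, and each
    activation adds `lr` to it in expectation.
    """
    strength = start
    history = [strength]
    for _ in range(steps):
        if self_chaining:
            # Juice-like shard: a stronger shard steers the agent into more
            # situations where it fires and gets reinforced again.
            activation_rate = min(1.0, strength)
        else:
            # Prohibition-like shard: it only fires when the environment
            # happens to present a chance to kill, regardless of its strength.
            activation_rate = ambient_rate
        strength += lr * activation_rate
        history.append(strength)
    return history


juice = simulate_shard(steps=20, self_chaining=True)
dont_kill = simulate_shard(steps=20, self_chaining=False)
```

In this sketch the juice-like shard's early growth compounds, while the prohibition's growth stays linear and is set by the environment rather than by anything the shard does.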

But—on my current guesses and intuitions[1]—that shard doesn't become more sophisticated, it doesn't become reflective, it doesn't "agentically participate" in the internal shard politics (e.g. the agent's "meta-ethics", deciding what kind of agent it "wants to become"). Other parts of the agent want things, they want paperclips or whatever, and that's harder to do if the agent isn't allowed to kill anyone. 

Crucially, the no-killing injunction can probably be steered around by the agent's other values. While the obvious route of lesioning the no-killing shard might be reflectively-predicted by the world model to lead to more murder, and therefore bid against by the no-killing shard... There are probably ways to get around this obstacle. Other value shards (e.g. paperclips and cow-breeding) might bid up lesioning plans which are optimized so as to not make the killing a salient plan feature to the reflective world-model, and thus, the plan does not activate the no-killing shard.

This line of argumentation is a point in favor of the following: Don't embed a shard which doesn't want to kill. Make a shard which wants to protect / save / help people. That can chain into itself across time.

Other points:

  • Deontology seems most durable to me when it can be justified on consequentialist grounds. Perhaps this is one mechanistic reason why. 
  • This is one point in favor of the "convergent consequentialism" hypothesis, in some form. 
  • I think that people are not usually defined by negative values (e.g. "don't kill"), but by positives, and perhaps this is important.
  1. ^ Which I won't actually detail right now.


I think the key question is how much policy-caused variance in the outcome there is.

That is, juice can chain onto itself because we assume there will be a bunch of scenarios where reinforcement-triggering depends a lot on the choices made. If you are near a juice but not in front of the juice, you can walk up to the juice, which triggers reinforcement, or you can not walk up to the juice, which doesn't trigger reinforcement. The fact that these two plausible actions differ in their reinforcement is what I am referring to with "policy-caused variance".

If you are told not to kill someone, then you can usually stay sufficiently far away from killing anyone, in such a way that the variance in killing is 0, because killing itself has a constant value of 0. (Unlike juice-drinking which might have a value of 0 or 1, depending on both scenario and actions.)

But you could also have a case with a positive value that fails to be robust/lasting, if the positive value fails to have variance. One example would be if the positive value is bounded and always achieved; for instance you might imagine that if you are always carrying a juice dispenser, juice-drinking would always have a value of 1, and therefore again there wouldn't be any variance to reinforce seeking juice.

A more subtle point is that if the task is too difficult, e.g. if you are in a desert with no juice available, then the policy-caused variance is also 0, because the juice is constant 0. This is a general point that encompasses many failures of reinforcement learning. Often, if you don't do things like reward shaping, then reinforcement learning simply fails, because the task is too complex to learn.
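A rough way to put numbers on "policy-caused variance" for the scenarios above: list the reinforcement each available action would produce, then take the variance across actions. This is only a sketch. It assumes a uniformly random policy over two hypothetical actions per scenario, and the scenario names and reinforcement values are made up to match the examples in this comment.

```python
from statistics import pvariance

# Reinforcement each action would produce, per scenario (illustrative values).
SCENARIOS = {
    "near juice": {"walk to juice": 1.0, "stay put": 0.0},
    "far from any killing": {"act one way": 0.0, "act another way": 0.0},
    "always carrying a juice dispenser": {"walk": 1.0, "stay put": 1.0},
    "desert with no juice": {"walk": 0.0, "stay put": 0.0},
}


def policy_caused_variance(reinforcement_by_action):
    """Variance of reinforcement across actions under a uniform policy."""
    return pvariance(list(reinforcement_by_action.values()))


variances = {
    name: policy_caused_variance(actions) for name, actions in SCENARIOS.items()
}
```

Only the "near juice" scenario has nonzero variance here, so it is the only one where the choice of action changes what gets reinforced.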

I think future RL algorithms will be able to succeed using less policy-caused variance than present RL algorithms require by using models to track the outcomes through many layers of interactions.

This seems related to insights from parenting (and animal training), which I mentioned, e.g., in Book Review: Kazdin's The Everyday Parenting Toolkit

positive reinforcement is scientifically proven to work much better than negative reinforcement (aka punishment).

If your point is true and the analogy holds, that might indicate that other "training" methods from parenting or animal training might also work.

Finally got around to reading this, only to discover that I independently noticed a similar thing just a couple days ago, see here for how I’m thinking about it. :)

I found it useful to compare a shard that learns to pursue juice (positive value) to one that avoids eating mouldy food (prohibition), just so they're on the same kind of framing/scale.

It feels like a possible difference between prohibitions and positive values is that positive values specify a relatively small portion of the state space that is good/desirable (there are not many states in which you're drinking juice), and hence possibly activate less frequently, or only when parts of the state space like that are accessible. Prohibitions, by contrast, specify a large part of the state space that is bad, but not so large that the complement is a small portion: there are perhaps many potential states where you eat mouldy food, but the complement of that set is still nowhere near as small as the set of states of drinking juice. The first feels more suited to forming longer-term plans towards the small part of the state space (cf. this definition of optimisation), whereas the second is less so. Shards that start doing optimisation like this are hence more likely to become agentic/self-reflective/meta-cognitive etc.

In effect, positive values are more likely/able to self-chain because they actually (kind of, implicitly) specify optimisation goals, and hence shards can optimise them, and hence grow and improve that optimisation power, whereas prohibitions specify a much larger desirable state set, and so don't require or encourage optimisation as much.

As an implication of this, I could imagine that in most real-world settings "don't kill humans" would act as you describe, but in environments where it's very easy to accidentally kill humans, such that states where you don't kill humans are actually very rare, then the "don't kill humans" shard could chain into itself more, and hence become more sophisticated/agentic/reflective. Does that seem right to you?

I think that "don't kill humans" can't chain into itself because there's not a real reason for its action-bids to systematically lead to future scenarios where it again influences logits and gets further reinforced, whereas "drink juice" does have this property. 

In the described scenario, "don't kill humans" may in fact lead to scenarios where the AI can again kill humans, but this feels like an ambient statistical property of the world (killing people is easy) and not like a property of the shard's optimization (the shard isn't influencing logits on the basis of whether those actions will lead to future opportunities to not kill people, or something?). So I do expect "don't kill people" to become more sophisticated/reflective, but I intuitively feel there remains some important difference that I can't quite articulate.

I think that "don't kill humans" can't chain into itself because there's not a real reason for its action-bids to systematically lead to future scenarios where it again influences logits and gets further reinforced, whereas "drink juice" does have this property.


I'm trying to understand why the juice shard has this property. Which of these (if any) is the explanation for this:

  • Bigger juice shards will bid on actions which will lead to juice multiple times over time, as it pushes the agent towards juice from quite far away (both temporally and spatially), and hence will be strongly reinforced when the reward comes, even though it's only a single reinforcement event (actually getting the juice).
  • Juice will be acquired more with stronger juice shards, leading to a kind of virtuous cycle, assuming that getting juice is always positive reward (or positive advantage/reinforcement, to avoid zero-point issues)

The first seems at least plausibly to also apply to "avoid moldy food", if it requires multiple steps of planning to avoid moldy food (throwing out moldy food, buying fresh ingredients and then cooking them, etc.)

The second does seem to be more specific to juice than mold, but it seems to me that's because getting juice is rare, and is something we can get better and better at, whereas avoiding moldy food is something that's fairly easy to learn, and past that there's not much reinforcement to happen. If that's the case, then I kind of see that as being covered by the rare-states explanation in my previous comment, or maybe an extension of that to "rare states and skills in which improvement leads to more reward".

Having just read tailcalled's comment, I think that is in some sense another way of phrasing what I was trying to say, where rare (but not too rare) states are likely to mean that policy-caused variance is high on those decisions. Probably policy-caused variance is more fundamental/closer as an explanation to what's actually happening in the learning process, but maybe states of certain rarity which are high-reward/reinforcement are one possible environmental feature that produces policy-caused variance.

Could this be resolved by wanting to not kill, rather than not wanting to kill? My understanding is that "Don't want X" acts as a filter of generated plans, while "Want to do X" (or "Want to not do X") will either generate plans, or at least sort of act as an optimization pressure. 

People are often defined by negative values in the sense of "don't do stuff that hurts" or "don't do stuff that results in lower status", although it might be better to phrase that as "actively avoid doing things that result in pain/lower status".

In English, “to not want X” ordinarily means “to want not-X”, not merely an absence of wanting X. I don’t know how common this is in languages generally, but in French at least, “je ne veux pas X” behaves the same way, and Google Translate suggests the same is true of many others. In fact, I would be surprised to find a language in which absence of wanting was as easy to express as want and not-want are.

My initial thoughts were:

  • On one hand, if you positively reinforce, the system will seek it out; if you negatively reinforce, the system will work around it.
  • On the other hand, there doesn't seem to be a principled difference between positive reinforcement and negative reinforcement. Like I would assume that the zero point wouldn't affect the trade-off between two actions as long as the difference was fixed.

Having thought about it a bit more, I think I managed to resolve the tension. It seems that if at least one of the actions is positive utility, then the system has a reason to maneuver you into a hypothetical state where you choose between them, while if both are negative utility then the system has a reason to actively steer you away from having to make such a choice.

(This analysis is still naive in that it doesn't account for opportunity cost).
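Here is a tiny sketch of that resolution, with the same caveat about opportunity cost. It assumes the value of entering a choice state is just the utility of the best action available there, and that staying out of the choice is worth a hypothetical `utility_of_avoiding` of 0. The function names and numbers are invented for illustration.

```python
def value_of_choice_state(action_utilities):
    """Value of being offered the choice: the best available action."""
    return max(action_utilities)


def should_steer_toward(action_utilities, utility_of_avoiding=0.0):
    """True if entering the choice state beats staying out of it."""
    return value_of_choice_state(action_utilities) > utility_of_avoiding


# Same gap of 2 between the two actions, different zero points.
positive_case = [3.0, 1.0]
negative_case = [-1.0, -3.0]

seek_positive = should_steer_toward(positive_case)  # choice is worth seeking
seek_negative = should_steer_toward(negative_case)  # choice is worth avoiding
```

The trade-off between the two actions is identical in both cases. What the zero point changes is whether the choice itself is worth steering toward.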

I'd really love to see greater formalisation of this intuition. Even what I've said above is quite ambiguous.

On the other hand, there doesn't seem to be a principled difference between positive reinforcement and negative reinforcement. Like I would assume that the zero point wouldn't affect the trade-off between two actions as long as the difference was fixed.

This is only true for optimal policies, no? For learned policies, positive reward will upweight and generalize certain circuits (like "approach juice"), while negative reward will downweight and generally-discourage those same circuits. This can then lead to path-dependent differences in generalization (e.g. whether a person pursues juice in general).
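A minimal sketch of the distinction, assuming a two-action softmax policy updated with vanilla REINFORCE and no baseline (my choice of illustration; real training setups usually subtract a baseline or use advantages, which changes this picture). The reward numbers are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """One REINFORCE update without a baseline, after sampling `action`."""
    probs = softmax(logits)
    # Gradient of log pi(action) w.r.t. logit i is (1[i == action] - probs[i]).
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


# Action 0 = "approach juice". Same gap between actions, shifted zero point.
rewards_positive = [2.0, 1.0]
rewards_negative = [-1.0, -2.0]

# The optimal action is the same under both reward schemes.
best_positive = max(range(2), key=lambda a: rewards_positive[a])
best_negative = max(range(2), key=lambda a: rewards_negative[a])

# A single sampled update on "approach juice" goes in opposite directions.
after_positive = reinforce_step([0.0, 0.0], action=0, reward=rewards_positive[0])
after_negative = reinforce_step([0.0, 0.0], action=0, reward=rewards_negative[0])
```

Both reward schemes agree on the best action, yet sampling "approach juice" raises its logit under the positive scheme and lowers it under the shifted one, which is one concrete way the zero point can matter along the learning path.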

(In general, I think reward is not best understood as an optimization target like "utility.")

Good point.

(That said, it seems like a useful check to see what the optimal policy will do. And if someone believes it won't achieve the optimal policy, it seems useful to try to understand the barrier that stops that. I don't feel quite clear on this yet).

Major stages in my own moral development...

  1. Preschool: learning "if I threaten to hit people, they can refuse to play with me, which sucks, so I guess I won't do that".  Shamefully, learning this via experience.
  2. Probably early elementary school: learning "if I lie about things, then people won't believe me, so I guess I won't do that."  Again via shameful experience.  Eventually, I developed this into a practically holy commandment; not sure what the external factors were.
    1. Some kind of scientific ethic?  Feynman with the "the easiest person to fool is yourself; to maintain scientific integrity, you have to bend over backwards, naming all the potential reasons you might be wrong" and stuff.
    2. A developing notion that lying was evil, that it could mess things up really badly, that good people who tried lying quickly regretted it (probably mostly fictional examples here), and that the only sensible solution was a complete prohibition.
  3. Middle school: took a game theory class at a summer camp; learned about the Prisoner's Dilemma and tragedy of the commons; threats and promises; and the hawk-dove game with evolutionarily stable strategies.  This profoundly affected me:
    1. The threats-and-promises thing showed that it was sometimes rational to (visibly) put yourself into a state (perhaps with explicit contracts, perhaps with emotions) where you would do something "irrational", because that could then change someone else's behavior.
    2. With the one-shot Prisoner's Dilemma, it seemed clear that, to get the best outcome for everyone, it was necessary for everyone to have an "irrational" module in their brain that led them to cooperate.  To a decent extent one can solve real-world situations with external mechanisms that make it no longer a one-shot Prisoner's Dilemma—reputation, private ownership rights—but it's not a complete solution.
    3. In the hawk-dove game, two birds meet and there's a resource and they have the option to fight for it; we figure each bird follows a certain strategy, dictated by their genes, and winning strategies will increase in prevalence.[1]  Lessons I took: there are multiple different equilibria, some better than others, some more abusive than others; if the population has a high enough fraction of people who will fight back against abuse beyond the point of "rationality", this will prevent abusers from dominating, and should be considered a public service.
  4. At some point I encountered the "non-aggression principle", and decided that it, coupled with common notions of what counts as aggression against person and property (and therefore what counts as property, which is perhaps contentious), was an excellent Schelling point for the fundament of an ethical system.  (I will reluctantly admit that "submit to the strongest man" is also a Schelling point people sometimes go for.)

On the subject of the post:

  1. The "positive, expanding" angle on things like "don't kill"—more generally, "people have rights that shouldn't be violated"—that comes to mind is: As you learn to do more cool things with your toys, imagine if someone else could come and take your toys away, or injure you.  That would be bad, wouldn't it?  And then have a belief about a connection between the rules other people follow and the rules you follow.  Some components:
    1. Have a theory of mind.
    2. Having true peers might be important—a baby (human or AI) who interacts only with the adults who control everything and are very different from them, might find it harder to see or believe in universality of rules.
  2. Another angle—similar to the above, but maybe different?—is to think positively about the rights you have to your toys, and think about the things you're thereby guaranteed to be allowed to do.
    1. I guess this would be relevant after you'd had the experience of trying to play with someone else's toys and been told a firm "no, that's not yours and you don't have permission".  For an AI, training on that seems doable.
  1. ^

    If two hawks meet, they fight, one gets injured, and the damage from the injury exceeds the value of the resource, so the expected value is negative to both birds (the basic scenario can be considered a game of chicken); if a hawk meets a dove, the dove runs away and gets zero, and the hawk gets the resource; if two doves meet, they waste some time on symbolic combat, one of them wins, and the expected value is positive.  The evolutionarily stable strategy is some fraction of hawks, some fraction of doves.

    Then there are other variations.  "Bullies" show fight and scare away doves, but will run away from a real fighter (i.e. hawk), and we figure that when two bullies meet, one gets scared and runs away first.  The bully strategy dominates the dove strategy; the equilibrium is a hawk-bully composite.

    Then we introduce the "retaliator" to defeat bullies.  Retaliators act like doves, but if the other bird shows fight, they fight back.  Against hawks and bullies they act like hawks; against doves or other retaliators they act like doves.  "Pure retaliator", or "mostly retaliator, with up to some fraction of doves", is an evolutionarily stable strategy—and so is hawk-bully.  Which one you end up with depends on your starting population.

    Further variations can be considered and explored.  For example, the most "optimal" result would be one in which all the birds understood some system by which each resource belonged to one bird and not the other, so when they met, one would act like a hawk and the other like a dove, resolving the conflict instantly.  Different systems are possible: "the biggest bird", "the bird whose territory it is", "the bird who got there first", "the bird who wins on some arbitrary visible characteristic (like tail length) not necessarily related to combat" (apparently this is a thing), and so on.  If there are multiple competing systems, then the majority will tend to push out the minority.  An evolutionarily stable (or metastable) equilibrium is "followers of one dominant system, plus up to some percentage of bullies".
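For the basic hawk-dove scenario at the start of this footnote, the mixed equilibrium can be computed directly. This sketch assumes the standard textbook parameterization, which the footnote does not spell out: in a hawk-hawk fight each bird wins the resource (`value`) or takes the injury (`injury_cost`) with equal probability, and two doves each pay `display_cost` for the symbolic combat before one of them wins at random. The parameter names and numbers are hypothetical.

```python
def expected_payoffs(hawk_fraction, value, injury_cost, display_cost):
    """Expected payoff to a hawk and to a dove in a mixed population."""
    p = hawk_fraction
    # Hawk: fights other hawks (negative on average), takes all from doves.
    hawk = p * (value - injury_cost) / 2 + (1 - p) * value
    # Dove: gets nothing from hawks, splits with doves minus display time.
    dove = (1 - p) * (value / 2 - display_cost)
    return hawk, dove


def mixed_equilibrium_hawk_fraction(value, injury_cost, display_cost):
    """Hawk fraction at which hawks and doves do equally well.

    Obtained by setting the two expected payoffs above equal and solving for p.
    """
    return (value + 2 * display_cost) / (injury_cost + 2 * display_cost)


# Injury exceeds the resource value, and dove-dove meetings are net positive.
p = mixed_equilibrium_hawk_fraction(value=4.0, injury_cost=10.0, display_cost=1.0)
hawk_payoff, dove_payoff = expected_payoffs(p, 4.0, 10.0, 1.0)
```

With these numbers the population settles at half hawks. Above that fraction doves do better, and below it hawks do better, which is what pushes the mix back toward the equilibrium.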