Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Physicist by training. Twitter: @steve47285.

I'm not sure why you seem to think that I think of optionality-empowerment estimates as requiring anything resembling omniscience.

If we assume omniscience, it allows a very convenient type of argument:

  • Argument I [invalid]: Suppose an animal has a generic empowerment drive. We want to know whether it will do X. We should ask: Is X actually empowering?

However, if we don’t assume omniscience, then we can’t make arguments of that form. Instead we need to argue:

  • Argument II [valid]: Suppose an animal has a generic empowerment drive. We want to know whether it will do X. We should ask: Has the animal come to believe (implicitly or explicitly) that doing X is empowering?

I have the (possibly false!) impression that you’ve been implicitly using Argument I sometimes. That’s how omniscience came up.

For example, has a newborn bat come to believe (implicitly or explicitly) that flapping its arm-wings is empowering? If so, how did it come to believe that? The flapping doesn’t accomplish anything, right? They’re too young and weak to fly, and don’t necessarily know that flying is an eventual option to shoot for. (I’m assuming that baby bats will practice flapping their wings even if raised away from other bats, but I didn’t check, I can look it up if it’s a crux.) We can explain a sporadic flap or two as random exploration / curiosity, but I think bats practice flapping way too much for that to be the whole explanation.

Back to play-fighting. A baby animal is sitting next to its sibling. It can either play-fight, or hang out doing nothing. (Or cuddle, or whatever else.) So why play-fight?

Here’s the answer I prefer. I note that play-fighting as a kid presumably makes you a better real-fighter as an adult. And I don’t think that’s a coincidence; I think it’s the main point. In fact, I thought that was so obvious that it went without saying. But I shouldn’t assume that—maybe you disagree!

If you agree that “child play-fighting helps train for adult real-fighting” not just coincidentally but by design, then I don’t see the “Argument II” logic going through. For example, animals will play-fight even if they’ve never seen a real fight in their life.

So again: Why don’t your dog & cat just ignore each other entirely? Sure, when they’re already play-fighting, there are immediately-obvious reasons that they don’t want to be pinned. But if they’re relaxing, and not in competition over any resources, why go out of their way to play-fight? How did they come to believe that doing so is empowering? Or if they are in competition over resources, why not real-fight, like undomesticated adult animals do?

maximizing optionality automatically learns all motor skills - even up to bipedal walking

I agree, but I don’t think that’s strong evidence that nothing else is going on in humans. For example, there’s a “newborn stepping reflex”—newborn humans have a tendency to do parts of walking, without learning, even long before their muscles and brains are ready for the whole walking behavior. So if you say “a simple generic mechanism is sufficient to explain walking”, my response is “Well it’s not sufficient to explain everything about how walking is actually implemented in humans, because when we look closely we can see non-generic things going on”.

Here’s a more theoretical perspective. Suppose I have two side-by-side RL algorithms, learning to control identical bodies. One has some kind of “generic” empowerment reward. The other has that same reward, plus a reward-shaping system directly incentivizing learning to use a small number of key affordances that are known to work well for that particular body (e.g. standing).

I think the latter would do all the same things as the former, but it would learn faster and more reliably, particularly very early on. Agree or disagree? If you agree, then we should expect to find that in the brain, right?

(When I say “more reliably”, I’m referring to the trope that programming RL agents is really finicky, more so than other types of ML. I don’t really know whether that trope is correct, though.)
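To make the two-algorithms comparison concrete, here’s a toy sketch in Python. Everything here is invented for illustration: the `ToyBody` environment, the `is_standing` affordance predicate, and the crude reachable-state empowerment proxy are hypothetical stand-ins, not a model of any real RL system or brain.

```python
import numpy as np

class ToyBody:
    """A hypothetical 1-D 'body': state is height in {0..4};
    'standing' means height >= 3. Purely illustrative."""
    n_actions = 2  # 0 = relax (drift down), 1 = push up

    def step_from(self, state, action):
        return min(state + 1, 4) if action == 1 else max(state - 1, 0)

    def is_standing(self, state):
        return state >= 3

def empowerment_proxy(state, env, n_rollouts=32, horizon=3, rng=None):
    """Crude empowerment stand-in: log of how many distinct states are
    reachable within `horizon` random-action steps from `state`."""
    rng = rng or np.random.default_rng(0)
    reached = set()
    for _ in range(n_rollouts):
        s = state
        for _ in range(horizon):
            s = env.step_from(s, int(rng.integers(env.n_actions)))
        reached.add(s)
    return float(np.log(len(reached)))

def generic_reward(state, env):
    # The first agent: generic empowerment term only.
    return empowerment_proxy(state, env)

def shaped_reward(state, env, shaping_weight=1.0):
    # The second agent: identical generic term, plus a hand-wired bonus
    # for one key affordance known (by the designer) to suit this body.
    bonus = shaping_weight * float(env.is_standing(state))
    return empowerment_proxy(state, env) + bonus
```

The point of the sketch is that the shaping bonus adds signal exactly where the generic term is weakest: very early on, before the agent has explored enough for the empowerment proxy to differentiate states.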

Sure, but there is hardly room in the brainstem to reward-shape for the [] different things humans can learn to do.

I hope we’re not having one of those silly arguments where we both agree that empowerment explains more than 0% and less than 100% of whatever, and then we’re going back and forth saying “It’s more than 0%!” “No way, it’s less than 100%!” “No way, it’s more than 0%!” …  :)

Anyway, I think the brainstem “knows about” some limited number of species-typical behaviors, and can probably execute those behaviors directly without learning, and also probably reward-shapes the cortex into learning those behaviors faster. Obviously I agree that the cortex can also learn pretty much arbitrary other behaviors, like ballet and touch-typing, which are not specifically encoded in the brainstem.

The above seems likely enough in the context of CEV (again), but otherwise false.

I think there might be a mix-up here. There are two topics of discussion:

  • One topic is: “We should look at humans and human values since those are the things we want to align an AGI to.”
  • The other topic is: “We should look at humans and human values since AGI learning algorithms are going to resemble human brain within-lifetime learning algorithms, and humans provide evidence for what those algorithms do in different training environments”.

The part of the post that you excerpted is about the latter, not the former.

Imagine that God gives you a puzzle: You get most of the machinery for a human brain but some of the innate drive neural circuitry has been erased and replaced by empty boxes. You’re allowed to fill in the boxes however you want. You’re not allowed to cheat by looking at actual humans. Your goal is to fill in the boxes such that the edited-human winds up altruistic.

So you have a go at filling in the boxes. God lets you do as many validation runs as you want. The validation runs involve raising the edited-human in a 0AD society and seeing what they wind up like. After a few iterations, you find settings where the edited-humans reliably grow up very altruistic in every 0AD society you can think to try.

Now that your validation runs are done, it’s time for the test run. So the question is: if you put the same edited-human-brain in a 2022AD society, will it also grow up altruistic on the first try?

I think a good guess is “yes”. I think that’s what Jacob is saying.

(For my part, I think Jacob’s point there is fair, and a helpful way to think about it, even if it doesn’t completely allay my concerns.)

I mostly just want to repeat my comment on your last post.

I think your opposition to graders is really opposition to simple graders, that are never updated, that can’t account for non-consequentialist aspects of plans (e.g. “sketchiness”), and that are facing an extremely large search space of possibilities including out-of-the-box ones. And I think your value-vs-evaluation distinction is kinda different from graders-vs-non-graders.


  • For “nonrobust decision-influences can be OK”—I don’t think that’s a unique feature of not-having-a-grader. If there is a grader, but the grader is of the form “Here are a billion patterns with corresponding grades, try to pattern-match your plan to all billion of those patterns and do a weighted average”, then probably you can throw out a few of those billion patterns and the grader will still work the same.
  • For “values steer optimization; they are not optimized against”—I think you’re comparing apples and oranges. Let’s say I’m a human. I want “diamonds (as understood by me)”. So I attempt to program an AGI to want “diamonds (as understood by me)”.
    • In the framework you advocate, the AGI winds up “directly” “valuing” “diamonds (as understood by the AGI)”. And this can go wrong because “diamonds (as understood by me)” may differ from “diamonds (as understood by the AGI)”. If that’s what happens, then from my perspective, the AGI “was looking for, and found, an edge-case exploit”. From the AGI’s own perspective, all it was doing was “finding an awesome out-of-the-box way to make lots of diamonds”.
    • Whereas in the grader-optimizer framework, I delegate to a grader, and the AGI does the things that increase “diamonds (as understood by the grader)”. And this can go wrong because “diamonds (as understood by me)” may differ from “diamonds (as understood by the grader)”. From my perspective, the AGI is again “looking for edge-case exploits”.
    • It’s really the same problem, but in the first case you can temporarily forget the fact that I, the programmer, exist, and then there seems not to be any conflict / exploits / optimizing-against in the system. But the conflict is still there! It’s just off-stage.
  • For “Since values steer cognition, reflective agents try to avoid adversarial inputs to their own values”—Again, first of all, it’s the AGI itself that is deciding what is or isn’t adversarial, and the things that are adversarial from the perspective of the programmer might be just a great clever out-of-the-box idea from the perspective of the AGI. Second of all, I don’t think the things you’re saying are incompatible with graders, they’re just incompatible with “simple static graders”.

Hmm, the only way I can make sense of this article is to replace the word “biases” with “heuristics” everywhere, including the title. Heuristics are useful, whereas biases are bad by definition. But heuristics tend to create biases, so I can imagine people mixing up the two terms.

Sorry if I’m misunderstanding.


One of my disagreements with your U,V,P,W,A model is that I think V & W are randomly-initialized in animals. Or maybe I’m misunderstanding what you mean by “brains also can import varying degrees of prior knowledge into other components”.

I also (relatedly?) am pretty against trying to lump the brainstem / hypothalamus and the cortex / BG / etc. into a single learning-algorithm-ish framework.

I’m not sure if this is exactly your take, but I often see a perspective (e.g. here) where someone says “We should think of the brain as a learning algorithm. Oh wait, we need to explain innate behaviors. Hmm OK, we should think of the brain as a pretrained learning algorithm.”

But I think that last step is wrong. Instead of “pretrained learning algorithm”, we can alternatively think of the brain as a learning algorithm plus other things that are not learning algorithms. For example, I think most newborn behaviors are purely driven by the brainstem, which is doing things of its own accord without any learning and without any cortex involvement.

To illustrate the difference between “pretrained learning algorithm” and “learning algorithm + other things that are not learning algorithms”:

Suppose I’m making a robot. I put in a model-based RL system. I also put in a firmware module that detects when the battery is almost empty and when it is, it shuts down the RL system, takes control, and drives the robot back to the charging station.

Leaving aside whether this is a good design for a robot, or a good model for the brain (it’s not), let’s just talk about this system. Would we describe the firmware module as “importing prior knowledge into components of the RL algorithm”? No way, right? Instead we would describe the firmware module as “a separate component from the RL algorithm”.

By the same token, I think there are a lot of things happening in the brainstem / hypothalamus which we should describe as “a separate component from the RL algorithm”.
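The robot example above can be sketched in a few lines of Python. All the names here (`BatteryFirmware`, `Robot`, the observation keys) are hypothetical, just to show the arbitration structure: the firmware is not “prior knowledge imported into the RL algorithm”; it simply preempts the RL policy entirely.

```python
class BatteryFirmware:
    """A separate, non-learning component: takes over when the
    battery is low and drives toward the charging station."""
    def __init__(self, threshold=0.1):
        self.threshold = threshold

    def wants_control(self, battery_level):
        return battery_level < self.threshold

    def action(self, position, charger_position):
        # Fixed, hand-coded behavior -- no learning involved.
        return "left" if charger_position < position else "right"

class Robot:
    def __init__(self, rl_policy, firmware):
        self.rl_policy = rl_policy
        self.firmware = firmware

    def act(self, obs):
        # Arbitration: the firmware and the RL policy are distinct
        # components; the firmware overrides rather than pretrains.
        if self.firmware.wants_control(obs["battery"]):
            return self.firmware.action(obs["position"], obs["charger"])
        return self.rl_policy(obs)
```

Nothing about the RL policy’s weights or training encodes the go-charge behavior, which is the sense in which it’s “a separate component” rather than “prior knowledge in the learning algorithm”.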

Sufficiently distant consequences is exactly what empowerment is for, as the universal approximator of long term consequences. Indeed the animals can't learn about that long term benefit through trial-and-error, but that isn't how most learning operates. Learning is mostly driven by the planning system 1 - M/P - which drives updates to V/A based on both current learned V and U - and U by default is primarily estimating empowerment and value of information as universal proxies.

[M/P is a typo for W/P right?]

Let’s say I wake up in the morning and am deciding whether or not to put a lock pick set in my pocket. There are reasons to think that this might increase my empowerment—if I find myself locked out of something, I can maybe pick the lock. There are also reasons to think that this might decrease my empowerment—let’s say, if I get frisked by a cop, I look more suspicious and have a higher chance of spurious arrest, and also I’m carrying around more weight and have less room in my pockets for other things.

So, all things considered, is it empowering or disempowering to put the lock pick set into my pocket for the day? It depends. In a city, it’s maybe empowering. On a remote mountain, it’s probably disempowering. In between, hard to say.

The moral is: I claim that figuring out what’s empowering is not a “local” / “generic” / “universal” calculation. If I do X in the morning, it is unknowable whether that was an empowering or disempowering action, in the absence of information about where I’m likely to find myself in the afternoon. And maybe I can make an intelligent guess, but I’m not omniscient. If I were a newborn, I wouldn’t even be able to guess.
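The context-dependence can be made concrete as a toy expected-value calculation. Every number and probability below is invented purely for illustration; the point is only that the sign of the answer flips with the environment, which no “local” calculation can know:

```python
def expected_empowerment_gain(p_locked_out, p_frisked,
                              gain_if_picked=1.0, loss_if_frisked=2.0,
                              carry_cost=0.1):
    """Toy expected empowerment change from carrying lock picks for a
    day. All parameters are made-up illustrative numbers."""
    return (p_locked_out * gain_if_picked
            - p_frisked * loss_if_frisked
            - carry_cost)

# City: locked doors fairly common, frisks rare -> plausibly positive.
city = expected_empowerment_gain(p_locked_out=0.3, p_frisked=0.02)

# Remote mountain: nothing to pick, no cops -> the carry cost dominates.
mountain = expected_empowerment_gain(p_locked_out=0.0, p_frisked=0.0)
```

Same action, same morning, opposite sign depending on facts about the afternoon that the agent may not have.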

So anyway, if an animal could practice skill X versus skill Y as a baby, it is (in general) unknowable which one is a more empowering course of action, in the absence of information about what kinds of situations the animal is likely to find itself in when it’s older. And the animal itself doesn’t know that—it’s just a baby.

Since I’m a smart adult human, I happen to know that:

  • it’s empowering for baby cats to practice pouncing,
  • it’s empowering for baby bats to practice arm-flapping,
  • it’s empowering for baby humans to practice grasping,
  • it’s not empowering for baby humans to practice arm-flapping,
  • it’s not empowering for baby bats to practice pouncing,
  • etc.

But I don’t know how the baby cats, bats, and humans are supposed to figure that out, via some “generic” empowerment calculation. Arm-flapping is equally immediately useless for both newborn bats and newborn humans, but newborn humans never flap their arms and newborn bats do constantly.

So yeah, it would be simple and elegant to say “the baby brain is presented with a bunch of knobs and levers and gradually discovers all the affordances of a human body”. But I don’t think that fits the data, e.g. the lack of human newborn arm-flapping experiments in comparison to newborn bats.

Instead, I think baby humans have an innate drive to stand up, an innate drive to walk, an innate drive to grasp, and probably a few other things like that. I think they already want to do those things even before they have evidence (or other rational basis to believe) that doing so is empowering.

I claim that this also fits better into a theory where:

  • (1) the layout of motor cortex is relatively consistent between different people (in the absence of brain damage),
  • (2) decorticate rats can move around in more-or-less species-typical ways,
  • (3) there’s strong evolutionary pressure to learn motor control fast, and we know that reward-shaping is helpful for that,
  • (4) there’s stuff in the brainstem that can do this kind of reward-shaping,
  • (5) lots of animals can get around reasonably well within a remarkably short time after birth,
  • (6) stimulating a certain part of the brain can create “an urge to move your arm” etc., which is independent from executing the actual motion,
  • (7) there are things like the palmar grasp reflex, Moro reflex, stepping reflex, etc.,
  • (8) there’s the sheer delight on the face of a baby standing up for the first time,
  • (9) there are certain dopamine signals (from lateral SNc & SNl) that correlate with motor actions specifically, independent of general reward, etc.

(There’s kinda a long story that I think connects all these dots, which I’m not getting into.)

(If you put a novel and useful motor affordance on a baby human—some funny grasper on their hand or something—I’m not denying that they would eventually figure out how to start using it, thanks to more generic things like curiosity, stumbling upon useful things, maybe learning-from-observation, etc. I just don’t think those kinds of things are the whole story for early acquisition of species-typical movements like grasping and standing. For example, I figure decorticate rats would probably fail to learn to use a weird novel motor affordance, but decorticate rats do move around in more-or-less species-typical ways.)

some amount of play fighting skill knowledge is prior instinctual, but much of it is also learned

Sure, I agree.

The only part of this that requires a more specific explanation is perhaps the safety aspect of play fighting: each animal is always pulling punches to varying degrees, the cat isn't using fully extended claws, neither is biting with full force, etc. That is probably the animal equivalent of empathy/altruism.

Yeah pulling punches is one thing. Another thing is that animals have universal species-specific somewhat-arbitrary signals that they’re playing, including certain sounds (laughing in humans) and gestures (“play bow” in dogs).

My more basic argument is that the desire to play-fight in the first place, as opposed to just relaxing or whatever, is an innate drive. I think we’re giving baby animals too much credit if we expect them to be thinking to themselves “gee when I grow up I might need to be good at fighting so I should practice right now instead of sitting on the comfy couch”. I claim that there isn’t any learning signal or local generic empowerment calculation that would form the basis for that.

Fun is complex and general/vague - it can be used to describe almost anything we derive pleasure from in your A.) or B.) categories.

Fair enough.

Thank you! I’ve been using the terms “inference algorithm” versus “learning algorithm” to talk about that kind of thing. What you said seems fine too, AFAIK.

I think that grader-optimization is likely to fail catastrophically when the grader is (some combination of):

  • more like “built / specified directly and exogenously by humans or other simple processes”, less like e.g. “a more and more complicated grader getting gradually built up through some learning process as the space-of-possible-plans gets gradually larger”
  • more like “looking at the eventual consequences of the plan”, less like “assessing plans for deontology and other properties” (related post) (e.g. “That plan seems to pattern-match to basilisk stuff” could be a strike against a plan, but that evaluation is not based solely on the plan’s consequences.)
  • more like “looking through tons of wildly-out-of-the-box plans”, less like “looking through a white-list of a small number of in-the-box plans”

Maybe we agree so far?

But I feel like this post is trying to go beyond that and say something broader, and I think that’s where I get off the boat.

I claim that maybe there’s a map-territory confusion going on. In particular, here are two possible situations:

  • (A) Part of the AGI algorithm involves listing out multiple plans, and another part of the algorithm involves a “grader” that grades the plans.
  • (B) Same as (A), but also assume that the high-scoring plans involve a world-model (“map”), and somewhere on that map is an explicit (metacognitive / reflective) representation of the “grader” itself, and the (represented) grader’s (represented) grade outputs (within the map) are identical to (or at least close to) the actual grader’s actual grades within the territory.

I feel like OP equivocates between these. When it’s talking about algorithms it seems to be (A), but when it’s talking about value-child and appendix C and so on, it seems to be (B).

In the case of people, I want to say that the “grader” is roughly “valence” / “the feeling that this is a good idea”.

I claim that (A), properly understood, should seem/feel almost tautological—like, it should be impossible to introspectively imagine (A) being false! It’s kinda the claim “People will do things that they feel motivated to do”, or something like that. By contrast, (B) is not tautological, or even true in general—it describes hedonists: “The person is thinking about how to get very positive valence on their own thoughts, and they’re doing whatever will lead to that”.

I think this is related to Rohin’s comment (“An AI system with a "direct (object-level) goal" is better than one with "indirect goals"”)—the AGI has a world-model / map, its “goals” are somewhere on the map (inevitably, I claim), and we can compare the option of “the goals are in the parts of the map that correspond to object-level reality (e.g. diamonds)”, versus “the goals are in the parts of the map that correspond to a little [self-reflective] portrayal of the AGI’s own evaluative module (or some other represented grader) outputting a high score”. That’s the distinction between (not-B) vs (B) respectively. But I think both options are equally (A).

(Sidenote: There are obvious reasons to think that (A) might lead to (B) in the context of powerful model-based RL algorithms. But I claim that this is not inevitable. I think OP would agree with that.)

Suppose most humans do X, where X increases empowerment. Three possibilities are:

  • (A) Most humans do X because they have an innate drive to do X; (e.g. having sex, breathing)
  • (B) Most humans do X because they have done X in the past and have learned from experience that doing X will eventually lead to good things (e.g. checking the weather forecast before going out)
  • (C) Most humans do X because they have indirectly figured out that doing X will eventually lead to good things—via either social / cultural learning, or via explicit means-end reasoning (e.g. avoiding prison, for people who have never been in prison)

I think Jacob & I both agree that there are things in all three categories, but we have disagreements where I want to put something into (A) and Jacob wants to put it into (B) or (C). Examples that came up in this post were “status-seeking / status-respecting behavior”, “fun”, and “enjoying river views”.

How do we figure it out? In general, 5 types of evidence that we can bring to bear are:

  • (1) Evidence from cases where we can rule out (C), e.g. sufficiently simple and/or young humans/animals. Then we can just see whether the animal is doing X more often than chance from the start, or whether it has to stumble upon X before it starts doing X more often than chance.
    • Example: If you’re a baby mouse who has never seen a bird (or bird-like projectile etc.) in your life, you have no rational basis for thinking that birds are dangerous. Nevertheless, lab experiments show that baby mice will run away from incoming birds, reliably, the first time. (Ref) So that has to be (A).
  • (2) Evidence from sufficiently distant consequences that we can rule out (B).
    • Example: Many animals will play-fight as children. This has a benefit (presumably) of eventually making the animals better at actual fighting as adults. But the animal can’t learn about that benefit via trial-and-error—the benefit won’t happen until perhaps years in the future. 
  • (3) Evidence from heritability—If doing X is heritable, I think an (A)-type explanation would make that fact very easy to explain—in fact, an (A)-type explanation for X would pretty much demand that doing X has nonzero heritability. Meanwhile, if doing X is heritable (in a way that’s not explained by heritability of “general intelligence”-type stuff), I don’t think (B) or (C) is immediately ruled out, but we do need to think about it and try to come up with a story of how that could work.
  • (4) Evidence from edge-cases where X is not actually empowering—Suppose doing X is usually empowering, but not always. If people do a lot of X even in edge-cases where it’s not empowering, I consider that strong evidence for (A) over (B) & (C). It’s not indisputable evidence though, because maybe you could argue that people are able to learn the simple pattern “X tends to be empowering”, but unable to learn the more complicated pattern “X tends to be empowering with the following exceptions…”. But still, I think it’s strong evidence.
    • Example: Humans can feel envy or anger or vengeance towards fictional characters, inanimate objects, etc. 
  • (5) Evidence from specific involuntary reactions, hypothalamus / brainstem involvement, etc.—For example, things that have specific universal facial expressions or sympathetic nervous system correlates, or behavior that can be reliably elicited by a particular neuropeptide injection (AgRP makes you hungry), etc., are probably (A).

A couple specific cases:

Status—I’m not sure whether Jacob is suggesting that human social status related behaviors are explained by (B) or (C) or both. But anyway I think 1,2,3,4 all push towards an (A)-type explanation for human social status behaviors. I think I would especially start with 3 (heritability)—if having high social status is generally useful for achieving a wide variety of goals, and that were the entire explanation for why people care about it, then it wouldn’t really make sense that some people care much more about status than others do, particularly in a way that (I’m pretty sure) statistically depends on their genes (including their sex) but which doesn’t much depend on their family environment (at least within a country), and which (I’m pretty sure) doesn’t particularly correlate with intelligence etc.

(As for 5, I’m not aware of e.g. some part of the hypothalamus or brainstem where stimulating it makes people feel high-status, but pretty please tell me if anyone has seen anything like that! I would be eternally grateful!)

Fun—Jacob writes “Fun is also probably an emergent consequence of value-of-information and optionality” which I take to be a claim that “fun” is (B) or (C), not (A). But I think it’s (A). I think 5 is strong evidence that fun involves (A). For one thing, decorticate rats will still do the activities we associate with “fun”, e.g. playing with each other (ref). For another thing, there’s a specific innate involuntary behavior / facial expression associated with “fun” (i.e. laughing in humans, and analogs-of-laughing in other animals), which again seems to imply (A). I also claim that 1,2,3,4 above also offer additional evidence for an (A)-type explanation of fun / play behavior, without getting into details.

One-size-fits-all introductions are hard; different people are going to have different backgrounds and preconceptions which call for different resources.

But to answer your question, if I had to pick one, in the absence of any specific information about who it’s for, I think I’d go with Ben Hilton’s 80,000 hours problem profile (August 2022).

You can do that using LeechBlock.
