My take on Jacob Cannell’s take on AGI safety

Steven Byrnes

Jacob Cannell wrote some blog posts about AGI safety / alignment and neuroscience between 2010 and 2015, which I read and enjoyed quite early on when I was first getting interested in the same topics a few years ago. So I was delighted to see him reappear on LessWrong a year ago, where he has been a prolific and thought-provoking blogger and commenter (in his free time while running a startup!). See complete list of Jacob’s blog posts and comments.

Having read a bunch of his writings, and talked to him in various blog comments sections, I thought it would be worth trying to write up the places where he and I seem to agree and disagree.

This exercise will definitely be helpful for me, hopefully helpful for Jacob, and maybe helpful for people who are already pretty familiar with at least one of our two perspectives. (My perspective is here.) I’m not sure how helpful it will be for everyone else. In particular, I’m probably skipping over, without explanation, important areas where Jacob & I already agree—of which there are many!

(Before publishing I shared this post with Jacob, and he kindly left some responses / clarifications / counterarguments, which I have interspersed in the text, in gray boxes. I might reply back to some of those—check the comments section in the near future.)

1. How to think about the human brain

1.1 “Evolved modularity” versus “Universal learning machine”

Pause for background:

A. “Evolved modularity”: This is a school of thought wherein the human brain is a mishmash of individual, specifically-evolved capabilities, including a specifically-evolved language algorithm, a specifically-evolved “intuitive biology” algorithm, a specifically-evolved “intuitive physics” algorithm, an “intuitive human social relations” algorithm, a vision-processing algorithm, etc., all somewhat intermingled for sure, but all innate. Famous advocates of “evolved modularity” these days include Steven Pinker (see How the Mind Works) and Gary Marcus. I’m unfamiliar with the history but Jacob mentions early work by Cosmides & Tooby.
B. “Universal learning machine”: Jacob made up this term in his 2015 post “The Brain as a Universal Learning Machine”, to express the diametrically-opposite school of thought, wherein the brain has one extremely powerful and versatile within-lifetime learning algorithm, and this one algorithm learns language and biology and physics and social relations etc. This school of thought is popular among machine learning people, and it tends to be emphasized by computational neuroscientists, particularly in the “connectionist” tradition.

Here are two other things that are kinda related:

“Evolutionary psychology” is the basic idea of getting insight into psychological phenomena by thinking about evolution. In principle, “evolutionary psychology” and “evolved modularity” are different things, but unfortunately people seem to conflate them sometimes. For example, I read a 2018 book entitled Beyond Evolutionary Psychology, and it was entirely devoted to (a criticism of) evolved modularity, as opposed to evolutionary psychology per se. Well, I for one think that evolved modularity is basically wrong (as usually conceived; see next subsection), but I also think that doing evolutionary psychology (i.e., getting insight into psychological phenomena by thinking about evolution) is both possible and an excellent idea. Not only that, but I also think that actual evolutionary psychologists have in fact produced lots of good insights, as long as you’re able to sift them out from a giant pile of crap, just like in every field.
“Cortical uniformity” is the idea—due originally to Vernon Mountcastle in the 1970s and popularized by Jeff Hawkins in On Intelligence—that the neocortex is more-or-less a single configuration of neurons replicated over and over—in the case of humans, either 2 million “cortical columns” or 200 million “cortical minicolumns”, depending on who you ask. Cortical uniformity is a surprising hypothesis in light of the fact that different parts of the neocortex are intimately involved in seemingly-different domains like vision, language, math, reasoning, motor control, and so on. I say “more-or-less uniform” because neither Jeff Hawkins nor anyone else to my knowledge believes in literal “cortical uniformity”. There are well-known regional differences in the neocortex, but I like to think of them as akin to learning algorithm hyperparameters and architecture. Anyway, “cortical uniformity” is closely allied to the “universal learning machine” school of thought (see §2.5.3 here), but to flesh out that story you also need to say something about the other parts of the brain that are not the neocortex. For example, both Jacob (I think) and I take another big step in the “universal learning machine” direction by hypothesizing not only (quasi)uniformity of the cortex, but also of the striatum, cerebellum, and thalamus (with some caveats). Anyway, see below.

1.2 My compromise position

To oversimplify a bit, my position on the evolved-modularity versus universal-learning-machine spectrum is:

“Universal Learning Machine” is an excellent starting point for thinking about the telencephalon (neocortex, hippocampus, amygdala, striatum, etc.), thalamus, and cerebellum.
- I.e., when we try to understand those parts of the brain, we should be mainly on the lookout for powerful large-scale learning algorithms.
“Evolved Modularity” is an excellent starting point for thinking about the hypothalamus and brainstem.
- I.e., when we try to understand those parts of the brain, we should be mainly on the lookout for lots of little components that do specific fitness-enhancing things and which are specifically encoded by the genome.
- (See, for example, my discussion of a particular cluster of cells in the hypothalamus that orchestrate hunger-related behavior in Section 3 of my recent hypothalamus post.)

1.3 How complicated are innate drives?

If the within-lifetime learning algorithm of the human brain is a kind of RL algorithm, then it needs a reward function. (I actually think this is a bit of an oversimplification, but close enough.) Let’s use the term “innate drives” to refer to the things in that reward function—avoiding pain, eating sweets, etc. The reward function, in my view, is primarily calculated in the hypothalamus and brainstem.

(For more on my picture, see my posts “Learning From Scratch” in the Brain and Two subsystems: Learning & Steering.)

Jacob and I seem to have some disagreement about how complex these innate drives are, and how much we should care about that complexity; I’m on the pro-complexity side of the debate, and Jacob is on the pro-simplicity side.

1.3.1 Example: our disagreement about habitat-related aesthetics

For an example of where we disagree, consider the landscape preferences theory within evolutionary aesthetics. Here’s wikipedia (hyperlinks and footnotes removed):

An important choice for a mobile organism is selecting a good habitat to live in. Humans are argued to have strong aesthetical preferences for landscapes which were good habitats in the ancestral environment. When young human children from different nations are asked to select which landscape they prefer, from a selection of standardized landscape photographs, there is a strong preference for savannas with trees. The East African savanna is the ancestral environment in which much of human evolution is argued to have taken place. There is also a preference for landscapes with water, with both open and wooded areas, with trees with branches at a suitable height for climbing and taking foods, with features encouraging exploration such as a path or river curving out of view, with seen or implied game animals, and with some clouds. These are all features that are often featured in calendar art and in the design of public parks.
A survey of art preferences in many different nations found that realistic painting was preferred. Favorite features were water, trees as well as other plants, humans (in particular beautiful women, children, and well-known historical figures), and animals (in particular both wild and domestic large animals). Blue, followed by green, was the favorite color. Using the survey, the study authors constructed a painting showing the preferences of each nation. Despite the many different cultures, the paintings all showed a strong similarity to landscape calendar art. The authors argued that this similarity was in fact due to the influence of the Western calendar industry. Another explanation is that these features are those evolutionary psychology predicts should be popular for evolutionary reasons.

My snap reaction is that this evolutionary story seems probably true, and Jacob’s is that it’s probably false. We were arguing about it in this thread.

The disagreement seems to be less about the specifics of the landscape painting experiment mentioned above, and more about priors.

My prior is mainly coming from the following:

By default, being in the wrong micro-habitat gives a negative reward which can be both very sparse and often irreversibly fatal (e.g. a higher chance of getting eaten by a predator, starving to death, freezing to death, burning to death, falling to death, drowning, getting stuck in the mud, etc., depending on the species).
Therefore, it’s very difficult for an animal to learn which micro-habitat to occupy purely by trial-and-error without the help of any micro-habitat-specific reward-shaping.
Such reward-shaping is straightforward to implement by doing heuristic calculations on sensory inputs.
Animal brains (specifically brainstem & hypothalamus in the case of vertebrates) seem to be perfectly set up with the corresponding machinery to do this—visual heuristics within the superior colliculus, auditory heuristics within the inferior colliculus, taste heuristics within the medulla, smell heuristics within the hypothalamus, etc.
Therefore, I have a strong prior expectation that every mobile animal including humans will find types of visual input (and sounds, smells, etc.) to be inherently “appealing” / “pleasant”, in a way that would statistically lead the animal to spend more time in “good” micro-habitats / hunting grounds / etc. and less time in “bad” ones.
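
The reward-shaping argument in the list above can be made concrete with a toy sketch. Everything in it is a hypothetical illustration: the feature names (`openness`, `cover`), the weights, and the size of the penalty are invented, not taken from any study. The only point is that a sparse outcome signal says nothing until it is too late, whereas a cheap heuristic on sensory input gives a dense signal right away.

```python
def sparse_reward(died: bool) -> float:
    """Outcome-only signal: silent until a (possibly fatal) event happens."""
    return -100.0 if died else 0.0


def habitat_heuristic(openness: float, cover: float) -> float:
    """Hypothetical innate heuristic over two crude sensory features in [0, 1].

    The features and weights are invented for illustration only.
    """
    return 0.5 * cover - 0.5 * openness


def shaped_reward(died: bool, openness: float, cover: float) -> float:
    """Sparse outcome reward plus a dense shaping term from the heuristic."""
    return sparse_reward(died) + habitat_heuristic(openness, cover)


# Two habitats in which nothing bad has happened yet.
sheltered = shaped_reward(died=False, openness=0.2, cover=0.8)
exposed = shaped_reward(died=False, openness=0.9, cover=0.1)

# The sparse signal is identical (zero) in both habitats...
print(sparse_reward(False), sparse_reward(False))
# ...but the shaped signal already prefers the sheltered one.
print(sheltered > exposed)
```

With the sparse signal alone, an animal would have to sample the fatal outcome to learn anything; the shaping term lets it prefer one habitat before that ever happens.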

Jacob’s prior is mainly coming from the following, I think:

We know that when animals choose what to look at, this decision is at least partly based on information value…
- …Both based on priors (otherwise learning is difficult, and meanwhile there’s a large literature showing that curiosity is important for ML capabilities)
- …And based on studies of novelty / curiosity drive in animal brains (Jacob cites The psychology and neuroscience of curiosity, Systems Neuroscience of Curiosity, Shared striatal activity in decisions to satisfy curiosity and hunger at the risk of electric shocks)
And once you think through the implications of an information-value drive, it elegantly accounts for just about everything we know about human aesthetics. (Jacob allows for some exceptions including sexual attraction—see below.) That includes observations about which paintings people put on walls. So there’s nothing left to explain!

(I am more-or-less on board with the first top-level bullet point here^[1], but disagree with the last bullet point.)

All that was kinda priors. Now we turn to the specifics of the landscape painting thing.

Jacob & I argued about it for a while. I think the following is one of the root causes of the disagreement:

Jacob was interpreting the hypothesis in question as “Humans have (among other things) an innate preference for looking at water, trees, etc.”.
The hypothesis that I believe is: “Humans have (among other things) a pleasant innate reaction upon looking at visual scenes for which F(visual input) takes a high value, where F is some rather simple function, I don’t know exactly what, but definitely way too simple a function to include a proper water-detector, or tree-detector, etc.”^[2]

Jacob and I both agree that the first hypothesis is wrong. (To be fair, he wasn’t getting it from nowhere—it’s probably what most advocates of the hypothesis would say that they are arguing for!)

(And this is an example of our more general dispositions where I tend to think “10% of evolutionary psychology is true important things that we need to explain, let’s get to work explaining them properly” and Jacob tends to think “90% of evolutionary psychology is crap, let’s get to work throwing it out”. These are not inconsistent! But they’re different emphases.)

Anyway, my hypothesis is coming from:

I think the function F is implemented in the superior colliculus (part of the brainstem), which is too small and low-resolution to do good image processing;
We only have 25,000 genes in our whole genome, and building a proper robust tree-detector seems too complicated for that;
There’s some evidence that the human superior colliculus has an innate human-face detector, but it’s not really a human-face detector, it’s really a detector of three dark blobs in a roughly triangular pattern, and this blob-detector incidentally triggers on faces. Likewise, an incoming-bird-detector in the mouse superior colliculus is really more like an “expanding dark blob in the upper field-of-view” detector (ref).
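
To give a feel for how simple such a function could be, here is a toy sketch of an “expanding dark blob in the upper field-of-view” detector. It is an illustration under made-up assumptions (frames as small brightness grids, an arbitrary darkness threshold, an arbitrary growth factor), not a model of what the superior colliculus actually computes. Note that it contains no notion of “bird” at all.

```python
from typing import List

# A frame is a grid of brightness values in [0, 1]; row 0 is the top of the visual field.
Frame = List[List[float]]


def dark_area_upper(frame: Frame, dark_threshold: float = 0.3) -> int:
    """Count dark pixels in the upper half of the frame (threshold is arbitrary)."""
    upper_half = frame[: len(frame) // 2]
    return sum(1 for row in upper_half for px in row if px < dark_threshold)


def looming_detector(frames: List[Frame], growth: float = 1.5) -> bool:
    """Fire if the dark area in the upper field grows on every frame and
    ends at least `growth` times larger than it started."""
    areas = [dark_area_upper(f) for f in frames]
    if len(areas) < 2 or areas[0] == 0:
        return False
    strictly_growing = all(later > earlier for earlier, later in zip(areas, areas[1:]))
    return strictly_growing and areas[-1] >= growth * areas[0]


def make_frame(size: int, blob: int) -> Frame:
    """Test helper: a bright frame with a dark blob-by-blob square in the top-left corner."""
    return [
        [0.0 if (r < blob and c < blob) else 1.0 for c in range(size)]
        for r in range(size)
    ]


approaching = [make_frame(8, k) for k in (1, 2, 3)]  # dark area 1, 4, 9
hovering = [make_frame(8, 2) for _ in range(3)]      # dark area 4, 4, 4

print(looming_detector(approaching))  # the blob expands
print(looming_detector(hovering))     # the blob stays the same size
```

Anything that darkens a growing patch of the upper visual field would trigger this detector, which is the sense in which such a circuit only incidentally detects birds.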

Let’s go back to evidence from surveys and market research on wall-calendars and paintings, mentioned in that Wikipedia excerpt above. Unfortunately, it seems that neither Jacob nor I have theories that make sharp predictions on what people will want to hang on their walls. One problem is that we both agree that people can hang things on walls for reasons related to neither “innate aesthetics” nor “information value”, like impressing your friends, or bringing back sentimental memories of your first kiss. I have the additional problem that I don’t know exactly what the alleged habitat-aesthetics function F is (and there are probably several F’s), and thus I find it perfectly plausible (indeed, expected) that F can be triggered by, say, an abstract painting which nobody in their right mind would mistake for a savannah landscape. And I have no predictions about which abstract paintings!^[3] And conversely, the question of what does or doesn’t provide information value is likewise complicated—it depends on one’s goals and prior knowledge. Thus Jacob and I were disagreeing here about whether a window view of a river provides high or low information value. (Suppose that you’ve had that same window view for the past 3 years already, and the river never has animals or boats on it.) I say the information value of that window view is roughly zero, Jacob says it’s significantly positive (“it's constantly changing with time of day lighting, weather, etc.…the river view suggests you can actually go out and explore the landscape”), and I’m not sure how we’re going to resolve that.

DALL-E 2 prompts: “the view out my window has high information value” (left) and “a window view with high information value” (right). 🤔🤔🤔

So it seems like we’re stuck, or at least our disagreement probably won’t get resolved by looking into people’s wall-art preferences.

1.3.2 …But this doesn’t seem to be a super-deep disagreement

Why don’t I think it’s a super-deep disagreement?

For one thing, I proposed that “fully describing the [reward function] of a human would probably take like thousands of lines of pseudocode” and Jacob said “sounds reasonable”.

For another thing, while we disagree about habitat-aesthetics-in-humans, there are structurally-similar cases where Jacob & I are in fact on the same page:

I brought up the case of a little camouflaged animal having an innate preference to be standing on the appropriate background to its camouflage, implemented via the superior colliculus calculating some function on visual inputs and feeding that information into the reward function (as one among many contributions to the reward function). Jacob seemed at least willing to entertain that as a plausible hypothetical.
Jacob definitely believes that there are innate sexual preferences related to the visual appearances of potential mates. Let’s turn to that next.

1.3.3 “Correlation-guided proxy matching”

Here is Jacob describing the idea of “correlation-guided proxy matching”:

Any time evolution started using a generic learning system, it had to figure out how to solve this learned symbol grounding problem, how to wire up dynamically learned concepts to extant conserved, genetically-predetermined behavioral circuits.
Evolution's general solution likely is correlation-guided proxy matching: a Matryoshka-style layered brain approach where a more hardwired oldbrain is redundantly extended rather than replaced by a more dynamic newbrain. Specific innate circuits in the oldbrain encode simple approximations of the same computational concepts/patterns as specific circuits that will typically develop in the newbrain at some critical learning stage - and the resulting firing pattern correlations thereby help oldbrain circuits locate and connect to their precise dynamic circuit counterparts in the newbrain. This is why we see replication of sensory systems in the ‘oldbrain’, even in humans who rely entirely on cortical sensory processing.

[Translation guide: When Jacob talks about “oldbrain” it’s roughly equivalent to when I talk about “hypothalamus and brainstem”.]
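
As a rough illustration of the matching step in that description, here is a toy sketch: a crude innate detector fires on a sequence of experiences, and the unit in the learned system whose activity correlates best with it is selected as its counterpart. The data, the unit layout, and the use of Pearson correlation are all assumptions made for the example; the quoted passage does not specify a mechanism at this level of detail.

```python
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either signal is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def match_proxy(innate: Sequence[float], learned_units: List[Sequence[float]]) -> int:
    """Index of the learned unit most correlated with the innate proxy signal."""
    scores = [pearson(innate, unit) for unit in learned_units]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical data over 20 experiences.
concept = [i % 2 for i in range(20)]                  # whether the "true" concept is present
# Crude innate detector: right most of the time, wrong on every 5th experience.
innate = [1 - c if i % 5 == 0 else c for i, c in enumerate(concept)]

learned_units = [
    [(i // 2) % 2 for i in range(20)],  # unit 0: tracks something unrelated
    list(concept),                      # unit 1: a precise learned version of the concept
    [0] * 20,                           # unit 2: silent
]

best = match_proxy(innate, learned_units)
print(best, pearson(innate, learned_units[best]))
```

The innate detector is wrong 20% of the time here, yet it is still enough to pick out the precise learned unit, which is the core of the idea.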

In the case of innate sexual preferences, Jacob proposes “dumb simple humanoid shape detectors and symmetry detectors etc encoding a simple sexiness visual concept”^[4] as an example.

Anyway, leaving aside some nitpicky arguments over implementation details, I see this as very much on the right track. I’m bringing it up because we’ll get back to it later.

1.3.4 Should we think of (almost) all innate drives as “an approximation to (self)-empowerment”?

Let’s loosely define “empowerment” as “having lots of options in the future”—see Jacob’s post Empowerment is (almost) all we need for better discussion, and I’ll get back to empowerment in Section 3 below in the context of AGI.
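
For readers who want something more concrete than “having lots of options”, here is a minimal sketch of one common formalization in a toy setting. It assumes a deterministic gridworld, where n-step empowerment reduces to the log of the number of distinct states reachable in n steps. The grid, the action set, and the horizon are all made up for illustration, and this is not necessarily the exact definition Jacob uses.

```python
import math
from typing import FrozenSet, Set, Tuple

State = Tuple[int, int]
# Stay in place, or move one cell in each of four directions.
ACTIONS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]


def step(state: State, action: Tuple[int, int], size: int, walls: FrozenSet[State]) -> State:
    """Deterministic transition: moves that leave the grid or hit a wall do nothing."""
    nxt = (state[0] + action[0], state[1] + action[1])
    if not (0 <= nxt[0] < size and 0 <= nxt[1] < size) or nxt in walls:
        return state
    return nxt


def reachable(state: State, horizon: int, size: int,
              walls: FrozenSet[State] = frozenset()) -> Set[State]:
    """All states the agent can end up in after exactly `horizon` actions."""
    frontier = {state}
    for _ in range(horizon):
        frontier = {step(s, a, size, walls) for s in frontier for a in ACTIONS}
    return frontier


def empowerment(state: State, horizon: int, size: int,
                walls: FrozenSet[State] = frozenset()) -> float:
    """n-step empowerment in bits, valid for deterministic dynamics only."""
    return math.log2(len(reachable(state, horizon, size, walls)))


center = empowerment((2, 2), horizon=2, size=5)
corner = empowerment((0, 0), horizon=2, size=5)
print(center > corner)  # the center of the room keeps more options open
```

In this toy world the agent in the middle of the room can reach 13 cells within two steps and the agent in the corner only 6, so the former is “more empowered”.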

If a sufficiently-clear-thinking human were deliberately trying to empower herself, she would do lots of things that humans actually do. She would stay alive and healthy, she would win friends and allies and high social status, she would gain skills and knowledge, she would accumulate money or other resources, she would stay abreast of community gossip, and so on.

Maybe you’re tempted to look at the above paragraph and say “Aha! An ‘empowerment drive’ is a grand unified theory of human innate drives!!” But that would be wrong for a couple of reasons.

The first reason is that empowerment comes apart from inclusive genetic fitness in a couple places—particularly having sex, raising children, and more generally helping close relatives survive and have children à la kin selection theory. And we see this in e.g. the human innate sex drive.

The second reason is that infants cannot realistically calculate which actions will lead to “empowerment”.

Jacob responds: On the contrary I think it's fairly clear now that the primary learning signals driving the infant brain are some combination of self-supervised learning for prediction and then value of information and optionality/empowerment for decisions (motor and planning).

The evidence for this comes from DL experiments as well as neuroscience, but also just obvious case examples:

https://www.lesswrong.com/posts/hpjou9ZnLZkSJR7sd/reflections-on-six-months-of-fatherhood

Indeed, I claim that even adult humans often do things that advance their own empowerment without understanding why and how. For example, if someone is quick to anger and vengeance, then that tendency can (indirectly via their reputation) increase their empowerment, via people learning not to mess with them. But that’s not why they’re quick to anger and vengeance—it’s just their personality! And if they haven’t read Thomas Schelling or whatever, they might never appreciate the underlying logic.

So we don’t have an innate drive for “empowerment” per se, because it’s not realistically computable. Instead:

We have a set of innate drives which can be collectively viewed as “an approximation to a hypothetical empowerment drive”. For example, innate fear-of-heights is part of an approximation to empowerment, insofar as falling off a cliff tends to be disempowering.
We will generally learn empowerment-advancing behaviors and patterns within our lifetimes, because those behaviors and patterns tend to be useful for lots of things. For example, I like having money, not because of any innate drive, but because of experience using money to get lots of other things I like.

Out of these two points, I think Jacob has more-or-less agreed with both. For the first one, he recognizes sex as a non-empowerment-related innate drive here (“The values that deviate from empowerment are near exclusively related to sex”—well that seems an overstatement given childrearing, but whatever.) For the second one, here I had proposed “There is innate stuff in the genome that makes humans want social status. Oh by the way, the reason that this stuff wound up in the genome is because social status tends to lead to empowerment, which in turn tends to lead to higher inclusive genetic fitness. Ditto curiosity, fun, etc.”, and Jacob at least “mostly” agreed.

Jacob responds: Social status drive emerges naturally from empowerment, which children acquire by learning cultural theory of mind and folk game theory through learning to communicate with and through their parents. Children quickly learn that hidden variables in their parents have huge effect on their environment and thus try to learn how to control those variables.

I mostly agree that curiosity - or value of information - is innate; which is not the same as optionality-empowerment, but is closely connected to it and a primary innate motivational drive. Fun is also probably an emergent consequence of value-of-information and optionality.

But unlike Jacob, I get the takeaway message: “OK, so at the end of the day, ‘empowerment’ is pretty useless as a way to think about human innate drives. Let’s not do that.” For example, I can say “fear-of-heights is part of an approximation to empowerment”, and that’s correct! But what’s the point? I can equally well say “fear-of-heights is part of an approximation to inclusive genetic fitness”. Or better yet, “fear-of-heights tends to stop you from falling off cliffs and getting injured or killed, which in turn would be bad for inclusive genetic fitness”. I don’t see how “empowerment” is adding anything to the conversation here.

And I think “empowerment” adds to confusion if we’re not scrupulously careful to avoid mixing up “empowerment” and “approximation-to-empowerment”. Approximations tend to come apart in new environments—that’s Goodhart’s law. We’ll get back to that in Section 3.3 below.

Likewise, we can say that “status drive is an approximation to empowerment”, and we’re correct to say that, but saying that gets us ≈0% of the way towards explaining exactly what status drive is or how it’s implemented.

(Unless you think that there’s no such thing as an innate status drive, and that humans engage in status-seeking and status-respecting behaviors purely because they’ve learned within their lifetime that those behaviors are instrumentally useful. That’s certainly a hypothesis worth entertaining, but I strongly believe that it’s wrong.)

Jacob responds (to “we can say that ‘status drive is an approximation to empowerment’”): Well no, I'd say status drive is not truly innate at all, but is learned very early on as a natural empowerment manifestation or proxy.

Infants don't even know how to control their own limbs, but they automatically learn through a powerful general empowerment learning mechanism. That same general learning signal absolutely does not - and can not - discriminate between hidden variables representing limb poses (which it seeks to control) and hidden variables representing beliefs in other humans minds (which determine constraints on the child's behavior). It simply seeks to control all such important hidden variables.

Steve sidenote: Leaving aside the question of who is correct, I think it’s helpful to note that this disagreement here has the same pattern as the one in Section 1.3.1 above—Jacob thinks that the human brain within-lifetime RL reward function is simpler (a.k.a. smaller number of different “innate drives”) and I think it’s more complicated (a.k.a. larger number of different “innate drives”).

OK, let’s switch gears to a somewhat different topic:

2. Will AGI algorithms look like brain algorithms?

2.1 The spectrum from “giant universe of possible AGI algorithms” to “one natural practical way to build AGI”

Here are two opposite schools of thought:

“Giant Universe” school-of-thought: There is a vast universe of possible AGI algorithms. If you zoom in enough, you can eventually find a tiny speck, and inside that speck is every human mind that has ever existed. (Cf. Eliezer Yudkowsky 2008.)
“Unique Solution” school-of-thought: The things we expect AGI to do (learn, understand, plan, reason, invent, etc.) comprise a problem, and maybe it turns out that there’s just one natural practical way to solve that problem. If so, we would expect future AGI algorithms to resemble human brain algorithms. (Cf. Jacob Cannell 2022.)

Before proceeding, a few points of clarification:

People can easily talk past each other by mixing up “learning algorithm” versus “trained model”. I’m closer to the “unique solution” camp when we’re talking about the learning algorithm, and I’m closer to the “giant universe” camp when we talk about the trained model.
As a particularly safety-relevant example of why I’m in the “giant universe” camp for trained models, I think human-brain-like RL with 1000 different reward functions can lead to trained models that have 1000 wildly different desires / goals / intuitions about what’s good and right. (But they all might act the same for a while thanks to instrumental convergence.) In this context, I think it’s important to remember that people can (and by default will) make AGIs with reward functions that are radically different from those of any human or animal, e.g. “reward for paperclips”. (More discussion and caveats in my post here.)
We can also reconcile the two schools of thought by the fact that the “Giant Universe” claim is about “possible” algorithms and the “Unique Solution” claim is about “practical” algorithms. Even if there is just one unique practical learning algorithm that scales to AGI, there are certainly lots of other wildly impractical ones. Two examples in the latter category (in my opinion) would be: (1) a learning algorithm that recapitulates the process of animal evolution, and (2) computable approximations to AIXI such as “AIXItl”.

Going back to those two schools of thought, and focusing on the learning algorithm not the trained model, are there any good reasons to believe in “Unique Solution”?

It seems at least plausible to me. After all, there do seem to be “natural” solutions to at least some algorithmic problems—e.g. the Fast Fourier Transform was more-or-less independently invented multiple times. Would an intelligent extraterrestrial civilization invent the belief propagation algorithm, in a form recognizable to us? Hard to say, but it seems at least plausible, right?

We get stronger evidence from the cases where AI researchers have come up with an idea and then later discovered that they had reinvented something that evolution had already put into the human brain. Examples are controversial, but arguably include Temporal Difference learning, self-supervised learning (i.e. the idea of updating models on sensory prediction errors), and feedback control. What about the overlap between deep learning and the brain (distributed representations, adjustable weights, etc.)? Well, those things were historically brought into AI from neuroscience, which complicates our ability to draw lessons. But still, the remarkable successes of more-brain-like deep learning compared to various less-brain-like alternatives in AI do seem to be at least some evidence for “Unique Solution”. (But see next subsection.)

Jacob offers another reason that he’s strongly in the “Unique Solution” school of thought, related to his claim that brains are near various theoretical efficiency limits. Leaving aside the question of whether brains are in fact near various theoretical efficiency limits (I have no strong opinion), I don’t understand this argument. Why can’t a wildly different algorithm also approach the same theoretical efficiency limits?

Well anyway, I join Jacob in the “Unique Solution” camp, albeit with a bit less confidence and for different underlying reasons. Indeed, when I explain to people why I’m working on brain-like AGI (e.g. here), I usually offer the justification that we AGI safety researchers should be making contingency plans for any plausible AGI design that we can think of, and brain-like AGI is at least plausible. But that’s just a polite diplomatic cop-out. What I really believe is that the researchers pursuing broadly-brain-like paths to AGI are the ones who will probably succeed, and everyone else will probably fail, and/or gradually pivot / converge towards brain-like approaches. If you disagree with that claim, I’m not particularly interested in arguing with you (for the obvious infohazard reasons)—we can agree to disagree, and I will fall back to my polite diplomatic cop-out answer above, and we’re all going to find out sooner or later!

2.2 How similar are brain learning algorithms versus today’s deep learning algorithms? (And implications for timelines.)

Jacob and I seem to be in agreement that human brain learning algorithms are similar in some ways and different in other ways from today’s deep learning algorithms. But I have a strong sense that Jacob expects substantially bigger similarities and substantially smaller differences than I do. That’s hard to pin down, and as above I don’t want to argue about it. We’ll find out sooner or later!

In terms of timelines, Jacob & I agree that AGI is probably already possible for a reasonable price with today’s chips and data centers, and we’re just waiting on algorithmic advances. (Jacob: “So my model absolutely is that we are limited by algorithmic knowledge. If we had that knowledge today we would be training AGI right now”.)

So then my next step is to say “OK then. How long will we be waiting on those algorithmic advances? Hmm. I dunno! Maybe 5-30 years?? Then let’s also add, umm, 5-10 more years after that to work out the kinks and run trainings before we have AGI.” (When I say “5-30” years, I have a bit more going on under the hood besides wild guessing. But not much more!)

Jacob proposes more confidently that we’ll get AGI soon (“75% by 2032”). He thinks that a certain amount of compute / memory / etc. is required to train an AGI (and we can figure out roughly how much by looking at human brain within-lifetime learning), and by the time that a great many groups around the world have easy access to this much compute / memory / etc., they will come up with whatever algorithmic advances are necessary for AGI. He writes: “Algorithmic innovation is rarely the key constraint on progress in DL, due to the vast computational training expense of testing new ideas. Ideas are cheap, hardware is not.” (I have heard that Hans Moravec’s forecasts were based on a similar assumption.)

I’m much less confident than Jacob in “ideas are cheap”. It seems to me that plenty of useful algorithms are published decades later than they theoretically could have been published, for reasons unrelated to the availability of compute. For example, Judea Pearl published the belief propagation algorithm in 1982. Why hadn’t someone already published it in 1962? Or 1922?? That’s not a rhetorical question—I’m not an expert, maybe there’s a good answer! Leave a comment if you know. But anyway, where I’m at right now is that I wouldn’t be surprised if there were, say, 10 or 20 years between lots of groups having easy access to compute sufficient for AGI, and someone actually making AGI. So I have longer timelines than Jacob, although that’s a pretty low bar by “normie” standards.

Again, this all seems probably downstream of our different opinions about how similar deep learning algorithms are to brain learning algorithms—a question which (I would argue) is slightly relevant for safety and extremely relevant for capabilities, so I don’t care to talk about it. But it certainly seems likely that Jacob is imagining smaller ideas (tweaks) which are cheap, and I’m thinking of bigger ideas which are more expensive.

2.3 Will AGI use neuromorphic (or processing-in-memory) chips?

Jacob and I both agree that (1) the first AGIs that people will make will probably use “normal” chips like GPUs or other ASICs, (2) when thousandth-generation Jupiter-brain AGIs are building Dyson spheres, they’re probably going to be using neuromorphic / processing-in-memory architectures of some sort, since those seem to have the best properties in terms of both scaling up to extremely large information capacity, and energy efficiency. (See Jacob’s discussion here).

I think I’m a bit more negative than Jacob on the current state of neuromorphic chips and technical challenges ahead, and thus I expect the transition to neuromorphic chips to happen later than Jacob expects, probably. I also put higher probability on AGI using fast serial coprocessors to unlock algorithmic possibilities that brains don’t have access to, both for early AGI and in the distant future. (Think of how “a human with a pocket calculator” can do things that a human can’t. Then think much bigger than that!) But whatever; this disagreement doesn’t seem to be too important for anything.

3. Human-empowerment as an AGI motivation

See Jacob’s recent post Empowerment is (almost) All We Need (and slightly earlier LOVE in a simbox is all you need).

Two questions immediately jump to mind:

The outer alignment question is: “Do we want to make an AGI that’s trying to ‘empower’ humanity?”

The inner alignment question is: “How would we make an AGI that’s trying to ‘empower’ humanity?”

Jacob’s answer to the latter (inner alignment) question is mostly “correlation-guided proxy matching” as described above, possibly supplemented by interpretability—see his comment here.

My perspective is that we shouldn’t really be asking these two questions separately. I think we’re going to follow Procedure X (let’s say, correlation-guided proxy matching with proxy P and hyperparameters A,B,C in environment E), and we’re going to get an AGI that’s trying to do Y. I expect that Y will not be identical to “empowerment” because perfect inner alignment is a pipe dream. So we shouldn’t ask the two questions: “(1) How similar is Y to “empowerment”, and (2) Is “empowerment” what we want?”. Instead, I think we should ask the one question “Is Y what we want?”.

So I want to push the question of empowerment to the side and just look at the actual plan. When I do, I find that Jacob’s proposals are very similar to my own! But I do think we have some minor differences worth discussing.

Jacob’s proposed plan described here suggested two things, one related to reverse-engineering social instincts in the brain, and the other related to interpretability. Let’s take them one at a time:

3.1 Social instincts / empathy

Jacob and I both agree that it would be good to understand human social instincts well enough that we could write them into future AGI source code if we wanted to (here’s my own post motivating that). We both agree that this code would probably involve something like correlation-guided proxy matching (I have a post on that too). But my impression is that Jacob expects that we’re going to get most of the way towards solving this problem by reading the (massive) existing neuroscience literature concerning morality, sociality, affects, etc., whereas I think that literature is all kinda garbage—or rather, not answering the questions that I'm interested in—and we still have our work cut out.

Jacob responds: Not quite - my prior is that success in reverse engineering human altruism (which probably depends on innate social instincts for grounding) will depend on existing neuroscience literature to about the same extent that progress in DL has.

So Jacob seems to have more of a “it’s OK we have a plan” attitude, while I’m sitting here poring over technical studies of neuropeptide receptors in the lateral septum, feeling like I’m racing the clock, even though my timelines-to-AGI are actually longer than his.

Somewhat relatedly, and echoing the discussion of innate drives above, I think Jacob expects human social instincts to be simpler than I do—maybe he expects human social instincts to comprise like 5 separable “innate reactions” (e.g. here) and I expect like 30, or whatever. So maybe he thinks we can just think about it a bit in our armchairs and write down the answer, and it will be either correct or close enough, whereas I expect more of a big research project that will produce non-obvious results.

Jacob responds: I think most of the system complexity for innate symbol grounding is split vaguely equally between sexual attraction and altruism-supporting innate social instincts, and that reverse engineering, testing and improving these mechanisms for DL agents in sim sandboxes is much of the big research project.

3.2 Interpretability

Jacob suggests that we could “use introspection/interpretability tools to more manually locate learned models of external agents (and their values/empowerment/etc), and then extract those located circuits and use them directly as proxies in the next agent”. (See also here.) I think that’s a perfectly good idea (see e.g. my comment here), and I think our disagreement (such as it is) is a bit like Jacob saying “Maybe it will work” and me saying “Maybe it won’t work”. These can both be true. Hopefully we can all agree that it would be better to have a strong positive reason to believe that our plan will definitely work, particularly given challenges related to “concept extrapolation”. (See also the rest of that post.)

Jacob has a clever additional twist on interpretability in his proposal that we could “listen in” on an AGI’s internal monologue (see here). Again, I do think this is a fine idea that could help us, particularly if we can figure out interventions that make the AGI a “verbal thinker” to the greatest extent possible. I don’t think that this offers any strong guarantees that this interpretability won’t be missing important things. For example, I’m somewhat of a verbal thinker, I guess, but my internal monologue has lots of idiosyncratic made-up terms which are only meaningful to myself. It also has lots of very different thoughts associated with the same words. Let’s explore this avenue anyway, for sure, but I don’t want to get my hopes too high.

3.3 OK, but still, is humanity-empowerment what we want?

(In other words, if we somehow made an AGI that wanted to maximize the future empowerment of “humanity”, would it be “aligned”?)

I argued just above that this is not really the right question to ask. But it’s not entirely irrelevant either. So let’s have at it.

Let’s say that “humanity” (CEV or whatever) has terminal goals T (a utopia of truth, beauty, friendship, love, fun, diversity, kittens, whatever). Let’s also say that, given the choice and knowledge and power, “humanity” would pursue instrumental empowerment-type goals P as a means to an end of achieving T.

If we make an AGI that wants humanity to wind up maximally empowered in the future, it would be “aligned” to the human pursuit of P, but “misaligned” to the human pursuit of T.

Jacob responds: The convergence theorems basically say that optimizing for P[t] converges to optimizing for T[t+d] for some sufficient timespan d. So optimizing for our empowerment today is equivalent to optimizing for our future ability to maximize our long term values, whatever they are. I think you are confusing optimizing for P[t] (current empowerment) with optimizing for P[t+d] (future empowerment). Convergence requires a sufficient time gap between the moment of empowerment and the future utility, which wouldn't occur for P[t+d] and T[t+d].

In other words, the AGI does not want humans to “cash in” their empowerment to purchase T.^[5]

Even worse, the AGI does not want humans to want to “cash in” their empowerment to purchase T.

Jacob responds: If the AGI is optimizing for rolling future discounted empowerment, that is equivalent only to optimizing for the long term components of our utility function. Long term utility never wants us to 'cash' in empowerment, and this same conflict occurs in human brains (spend vs save/invest). The obvious solution as I mentioned is to use a learned model for the short term utility, and probably learn the discount schedule.

Also it is worth noting that lower discount rates lead to more success in the long term, and lower discount rates increase the convergence (lower the importance of short term utility).
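To see the arithmetic behind that last point, here is a minimal sketch assuming plain geometric discounting. The function names and the 10-step "short term" window are illustrative assumptions, not anything taken from Jacob's posts. It computes how much of the total discounted weight lands on the first few steps for a given discount rate.

```python
def gamma_from_rate(rate: float) -> float:
    """Per-step discount factor for a per-step discount rate: gamma = 1 / (1 + rate)."""
    if rate <= 0.0:
        raise ValueError("rate must be positive")
    return 1.0 / (1.0 + rate)


def short_term_share(gamma: float, horizon: int) -> float:
    """Fraction of total discounted weight falling on the first `horizon` steps.

    Weights are gamma**k for k = 0, 1, 2, ...  The infinite sum is 1 / (1 - gamma)
    and the first `horizon` terms sum to (1 - gamma**horizon) / (1 - gamma),
    so the ratio simplifies to 1 - gamma**horizon.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must be strictly between 0 and 1")
    return 1.0 - gamma ** horizon


if __name__ == "__main__":
    # Hypothetical discount rates, chosen only to show the trend.
    for rate in (0.10, 0.01, 0.001):
        share = short_term_share(gamma_from_rate(rate), horizon=10)
        print(f"discount rate {rate}: first 10 steps carry {share:.1%} of the weight")
```

Under this assumption, a lower discount rate pushes more of the weight past any fixed short-term window, which is the sense in which short-term utility matters less. Whether real agents discount geometrically is a separate question that the sketch does not address.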

T is the whole value of the future. T is what we’re fighting for. T is the light at the end of the tunnel. If we make a powerful autonomous AGI that doesn’t care about T, then we’re doing the wrong thing!

This seems to be the obvious objection, and indeed I find it persuasive. But Jacob offers several rebuttals.

First (see here and here), I think Jacob is imagining two stages:

In Stage 1, the AGI accumulates P and gives it to humans.
In Stage 2, the now-super-empowered uplifted posthumans (or whatever) spend their P to buy T.

Jacob responds: Yeah this is what success looks like. There may be other success stories, but the main paths look like this (empowered posthumanity). So if your AGI is not working towards this path, something is probably wrong.

Steve again: (Just to be crystal-clear, I agree that this two-stage story sounds pretty great, if we can make it happen. Here I’m questioning whether it would happen, under the given assumptions.)

I’m skeptical of this story—or at least confused. It seems like the AGI would be unhappy about (post)humanity’s decision to throw out their own option value by purchasing T in stage 2. Maybe in stage 2, the AGI is no longer able to do anything about it—it’s too late, the posthumans are super-powerful and thus back in control of their own fate. But it’s not too late in stage 1! And even in stage 1, the AGI will see this “problem” coming, and so it can and will preemptively solve it.

Jacob responds: Imagine for example that mass uploading will become feasible in 2048 (with AGI's help), and we created the AGI to maximize our empowerment - in 2048. The AGI will then not care how we spend that empowerment in 2049. Now generalize that to a continuous empowerment schedule with a learned discount rate and learned short term utility, and we can avoid issues with the AGI changing our minds too much before handing over power.

Steve again: OK I agree that an AGI with the stable goal of “maximize human empowerment in 2048” would not have the specific problem I brought up here.

Thus, for example, as the AGI is going through the process of “uplifting” the humans to posthumans, it would presumably do so in a way that deletes the human desire for T and adds a direct posthuman desire for P. Right?

Jacob responds: Doubtful - that would only occur if you had no short term model of T and also a too loose conception of 'humanity' to empower.

Second (see here and here), Jacob notes that evolution was optimizing for inclusive genetic fitness, and got some amount of T incidentally. So maybe an AGI optimizing for P will also incidentally produce T. Or even better: maybe T just is what happens when an optimization process pursues P! Or in Jacob’s words:

Humans and all our complex values are the result of evolutionary optimization for a conceptually simple objective: inclusive fitness. A posthuman society transcends biology and inclusive fitness no longer applies. What is the new objective function for post-biological evolution? Post humans are still intelligent agents with varying egocentric objectives and thus still systems for which the behavioral empowerment law applies. So the outcome is a natural continuation of our memetic/cultural/technological evolution which fills the lightcone with a vast and varied complex cosmopolitan posthuman society. The values that deviate from empowerment are near exclusively related to sex which no longer serves any direct purpose, but could still serve fun and thus empowerment. Reproduction still exists but in a new form. Everything that survives or flourishes tends to do so because it ultimately serves the purpose of some higher level optimization objective.

I think there’s a Goodhart’s law problem here.

People intrinsically like fun and beauty and friendship—they’re part of the T. But simultaneously, it turns out that they serve as an approximation to human empowerment—they’re a proxy to P (see Section 1.3.4). That’s reassuring, right? No it’s not, thanks to Goodhart’s law.

I claim that if an AGI was really good at optimizing P, it would find places where fun and beauty and friendship come apart from P, and then make sure that the posthumans’ actual desire in those cases is for P, and not for fun and beauty and friendship. And the more we push into weird out-of-distribution futures, the more likely this is to happen.
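One way to make this Goodhart worry concrete is the toy sketch below. Everything in it is an assumption for illustration: `proxy_value` stands in for P, `true_value` stands in for T, and the formulas are made up so that the two agree over a familiar range and come apart outside it. It says nothing about whether real empowerment behaves this way, which is exactly what Jacob disputes in his response.

```python
def proxy_value(x: float) -> float:
    """Hypothetical stand-in for P: keeps increasing as x is pushed further."""
    return x


def true_value(x: float) -> float:
    """Hypothetical stand-in for T: tracks the proxy for small x, then falls away."""
    return x - 0.25 * x * x


def best_by_proxy(candidates):
    """An optimizer that only sees the proxy picks the candidate with the highest P."""
    return max(candidates, key=proxy_value)


# Candidate outcomes, indexed by how hard the proxy has been pushed.
familiar = [i / 10 for i in range(0, 11)]    # x in [0, 1]: the range we have experience with
extreme = [i / 10 for i in range(0, 101)]    # x in [0, 10]: far out of distribution

if __name__ == "__main__":
    for name, candidates in (("familiar", familiar), ("extreme", extreme)):
        choice = best_by_proxy(candidates)
        print(name, "choice:", choice, "proxy:", proxy_value(choice), "true:", true_value(choice))
```

Within the familiar range, ranking options by the proxy gives the same order as ranking them by the true value, so the proxy looks trustworthy. A stronger optimizer searching the wider range still maximizes the proxy, but lands where the true value is much lower.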

Jacob responds: Empowerment is a convergent efficient universal long term value approximator that any successful AGI will end up using due to the difficulties in efficiently optimizing directly for very specific values in the long term future from issues like accumulating uncertainty and the optimizer's curse. The real question then is whether the AGI is optimizing for its own empowerment, or ours.

Weird-out-of-distribution futures are exactly the scenarios where it's important that the AGI is optimizing for our empowerment rather than its own.

The AGI will probably not replace our desire for fun/beauty/friendship with P because of some combination of 1.) direct approximation of T (fun/beauty/friendship) for short term utility, 2.) a conservative model of 'humanity' to empower that prevents changing humans too much (which is necessary for any successful scheme regardless, as otherwise the AGI just assimilates us into itself to make optimizing for its self-empowerment equivalent to optimizing for 'our' values simply by redefining/changing us), 3.) control over the discount schedule.

For example, maybe some clever futuristic system of smart contracts is objectively much better at managing interpersonal coordination and trade than the old-fashioned notion of “trust and friendship”. And if the AGI sets up this smart-contract system, while simultaneously making (post)humans feel no intrinsic trust-and-friendship-related feelings and drives whatsoever, maybe those (post)humans would be “more empowered”. But I don’t care! That’s still bad! I still don’t want the AGI to do that! I want the feelings of trust and friendship to survive into the distant future!

Anyway, I don’t really know what a maximal-P future looks like. (I’m not sure that, in our current state of knowledge, P is defined well enough to answer that??) But my strong expectation is that it would not look like a complex cosmopolitan posthuman society. Maybe it would look like a universe full of computronium and machinery, working full-time to build even more computronium and machinery.

Third (from here),

“Empowerment is only a good bound of the long term component of utility functions, for some reasonable future time cutoff defining 'long term'. But I think modelling just the short term component of human utility is not nearly as difficult as accurately modelling the long term, so it's still an important win. I didn't investigate that much in the article, but that is why the title is now "Empowerment is (almost) all we need".”

OK, well, insofar as I’m opposed to empowerment, I naturally think “empowerment + other stuff” is a step in the right direction! :) However, my hunch is that for a sufficiently good choice of “other stuff”, the “empowerment” part will be rendered unnecessary or counterproductive. It seems likely that, if the future goes well, the AGI will facilitate human empowerment at the end of the day, but maybe it can do so because the AGI ultimately wants to maximize human flourishing, and it can reason that increasing human empowerment is instrumentally useful towards that end, for example.

Another thing is: Jacob writes: “no matter what your values are, optimizing for your empowerment today is identical to optimizing for your long term values today.” I think that kind of thinking is a bit confused. I reject the idea that if the AGI is making good decisions right now, then all is well. As mentioned above, if the AGI is motivated to manipulate human values, that motivation might only manifest in the AGI’s behavior way down the line, like when the AGI is uploading human brains but deleting the parts that entail an intrinsic desire for anything besides power. But while that problem will only manifest in the distant future, the time to solve it is right at the beginning, when we’re building the AGI and thus still have direct control over its motivations.

4. Simboxes

Jacob is a big fan of “simulation sandboxes”, which he calls “simboxes” for short. These are air-gapped virtual worlds which serve as environments in which you can train an AGI. See Jacob’s recent post LOVE in a simbox is all you need, section 5.

Jacob is optimistic about being able to set up simboxes such that the AGI-under-test does not escape (mainly because it doesn’t know it’s in a simbox, or even what a simbox is—as he writes, “these agents will lack even the requisite precursor words and concepts that we take for granted such as computation, simulation, etc.”), and Jacob is also optimistic that these tests will allow us to iterate our way to AGI safety / alignment.

While I’m much less optimistic than Jacob about achieving both those things simultaneously, my very important take-home message is: I think simbox testing is an excellent idea. I think we should not only be doing simbox testing in the endgame, but we should be working right now to build infrastructure and culture that makes future simbox testing maximally easy and safe and effective, and maximally likely to actually happen, not just a little but a lot. (Just like every other form of code testing and validation that we can think of.) We should also be working right now to think through exactly what simbox tests to run and how. I even previously included one ingredient of the path-to-simbox-testing—namely, feature-rich user-friendly super-secure sandbox software compatible with large-scale ML—as a Steve-endorsed shovel-ready AGI safety project on my list here.

Having said all that, I think we should mainly think of simbox testing as “an extra layer of protection” on top of other reasons to expect safe and beneficial AGI.

Specifically, I proposed in this comment two ways to think about what the simbox test is doing:

A. We’re going to have strong theoretical reasons to expect alignment, and we’re going to use simbox testing to validate those theories.
B. We’re going to have an unprincipled approach that might or might not create aligned models, and we’re going to use simbox testing to explore / tweak specific trained models and/or explore / tweak the training approach.

A is good. B is problematic, for reasons I’ll get to shortly.

But first, I want to emphasize that I see this A-vs-B distinction as a continuum, not a binary. There’s a whole spectrum from “unprincipled approach” to “strong theoretical reasons to expect alignment”, as we get a progressively more specific and fleshed-out story underlying why we expect our AGI to be aligned. For example:

All the way at the extreme of “strong theoretical reasons to expect alignment” would be Vanessa Kosoy’s research program working towards a rigorous mathematical proof of AGI safety (which I’m pessimistic about, but I wish her luck!).
All the way at “unprincipled” would be just doing capabilities research, not thinking about alignment at all, and seeing what happens with the trained models at the end. Ajeya Cotra’s “human feedback on diverse tasks” would be basically in that category.
Somewhere in between these two extremes would be, say, Alex Turner’s recent diamond-alignment post, where we engage in speculation about what the “baby AGI” is probably thinking about in different situations, and then try to send reward signals at carefully-chosen times to seed desired motivations. Or my toy example proposal here to make an AGI that learns the abstract concept “human flourishing” from observations, and then tries to maximize the extent to which its beliefs pattern-match to that abstract concept. These proposals may well fail, for sure, but at least we’re not totally in the dark when it comes to anticipating where and how they might fail, and what tests might help us figure that out.

In terms of simbox use strategy, I think “somewhere in between A and B” is all I’m hoping for, and I consider my research goal to be trying to get as close to A as possible.

Jacob’s response was: “As for A vs B: ideally you may want A but you settle mostly for B. That's just how the world often works, how DL progressed, etc. We now have more established theory of how DL works as approx bayesian inference, but what actually drove most progress was B style tinkering.”

I think Jacob is selling himself short here. I think his simbox plan has a lot of “A” in it. I think Jacob has pretty specific ideas in mind for how alignment is going to happen and how it could fail, and these ideas are informing his picture of what kind of simbox testing is most useful, and what we would be looking for, etc.

By the way, what’s the problem with B? The problem is that the simboxes will be different from reality in lots of ways. For example, Jacob proposes “these agents will lack even the requisite precursor words and concepts that we take for granted such as computation, simulation, etc.” Well, that’s a great idea if we want to prevent the AGI from escaping the sim! But that’s a terrible idea if we want to avoid any distribution-shift between the simboxes and reality! (Cf. “ontological crisis”.) And if there’s any distribution-shift, then there’s the possibility that the same training procedure will produce aligned AGIs in the simboxes and misaligned AGIs in reality.

Jacob responds: The distribution shift from humans born in 0AD to humans born in 2000AD seems fairly inconsequential for human alignment. Indeed, any useful AGI alignment mechanism should be at least as robust as human brains under such mild distribution shifts. Regardless, we can use various analogs of technological concepts if needed.

Luckily, this problem is progressively less problematic as we move from “B” towards “A”. Then we have some understanding of possible failure modes, and we can ensure that those failure modes are being probed by our simboxes.

(However, on my models, right now we are NOT close enough to “A” that all the remaining failure modes can be simbox-tested. For example, the distribution shift from “agents that are unaware of the concept of computation” to “agents that are aware of the concept of computation” is fraught with danger, difficult to reason about in our current state of knowledge (see my discussion of “concept extrapolation” here), and risky to probe in simboxes. So we still have lots more simbox-unrelated work to do, in parallel with the important simbox-prep work.)

(Thanks Jacob for bearing with me through lots of discussion over the past months, and for leaving comments above. Thanks also to Linda Linsefors & Alex Turner for critical comments on an earlier draft.)

^[1]
I say “more or less” because I think Jacob and I have some disagreements about the “neuroscience of novelty and curiosity” literature. For example, I think there’s a theory relating serotonin to information value, which Jacob likes and I dislike. But leaving aside those details, I am strongly on board with the more basic idea that the brain has an innate curiosity drive of some sort or another, and right now I don’t have much of a specific take on how it works.
^[2]
In addition to the direct effects of F (“I like looking at X because F(X) is high”), there could also be indirect effects of F (“I like looking at X because it pattern-matches to / reminds me of Y, which I like, and oh by the way the reason I like Y is because F(Y) was high when I looked at it as a child”). See discussion of “correlation-guided proxy matching” below.
^[3]
It’s not that this is unknowable, but I think figuring it out would require a heroic effort and/or detailed connectomic data about the human superior colliculus (and maybe also the neighboring parabigeminal nucleus). And someone should totally do that!! I would be very grateful!!
^[4]
UPDATE: Just to be clear, I don’t have an opinion on the specific question of whether or not humans have innate visual “sexiness”-related heuristics. I do think there has to be something that solves the “symbol grounding” problem, but I’m not confident that it’s even partly visual. It could alternatively involve the sense of smell, and/or empathetic simulation of body shape and sensations (vaguely along these lines but involving the proprioceptive and somatosensory systems). Or maybe it is visual, I don’t know.
^[5]
There’s a weird dynamic here in that I’m saying that an AGI which supposedly wants humanity to be empowered would be motivated to prevent humanity from exercising its power. Isn’t that contradictory? I think the way to square that circle is that the proposal as I understand it is for the AGI to want humanity to be empowered later—to eventually wind up empowered. However, there’s a tradeoff between empowerment-now and empowerment-later. If I’m empowered-now, then I can choose NOT to be empowered-later—e.g., by spending my money instead of hoarding it. Or jumping off a cliff. Therefore an AGI that always wants humanity to be empowered-later is an AGI that never wants humanity to be empowered-now. So then the “later” never arrives—not even at the end of the universe!!

Suppose most humans do X, where X increases empowerment. Three possibilities are:

(A) Most humans do X because they have an innate drive to do X; (e.g. having sex, breathing)
(B) Most humans do X because they have done X in the past and have learned from experience that doing X will eventually lead to good things (e.g. checking the weather forecast before going out)
(C) Most humans do X because they have indirectly figured out that doing X will eventually lead to good things—via either social / cultural learning, or via explicit means-end reasoning (e.g. avoiding prison, for people who have never been in prison)

I think Jacob & I both agree that there are things in all three categories, but we have disagreements where I want to put something into (A) and Jacob wants to put it into (B) or (C). Examples that came up in this post were “status-seeking / status-respecting behavior”, “fun”, and “enjoying river views”.

How do we figure it out? In general, 5 types of evidence that we can bring to bear are:

(1) Evidence from cases where we can rule out (C), e.g. sufficiently simple and/or young humans/animals. Then we can just see whether the animal is doing X more often than chance from the start, or whether it has to stumble upon X before it starts doing X more often than chance.
- Example: If you’re a baby mouse who has never seen a bird (or bird-like projectile etc.) in your life, you have no rational basis for thinking that birds are dangerous. Nevertheless, lab experiments show that baby mice will run away from incoming birds, reliably, the first time. (Ref) So that has to be (A).
(2) Evidence from sufficiently distant consequences that we can rule out (B).
- Example: Many animals will play-fight as children. This has a benefit (presumably) of eventually making the animals better at actual fighting as adults. But the animal can’t learn about that benefit via trial-and-error—the benefit won’t happen until perhaps years in the future.
(3) Evidence from heritability—If doing X is heritable, I think an (A)-type explanation would make that fact very easy to explain—in fact, an (A)-type explanation for X would pretty much demand that doing X has nonzero heritability. Conversely, if doing X is heritable (in a way that’s not explained by heritability of “general intelligence” type stuff), well I don’t think (B) or (C) is immediately ruled out, but we do need to think about it and try to come up with a story of how that could work.
(4) Evidence from edge-cases where X is not actually empowering—Suppose doing X is usually empowering, but not always. If people do a lot of X even in edge-cases where it’s not empowering, I consider that strong evidence for (A) over (B) & (C). It’s not indisputable evidence though, because maybe you could argue that people are able to learn the simple pattern “X tends to be empowering”, but unable to learn the more complicated pattern “X tends to be empowering with the following exceptions…”. But still, I think it’s strong evidence.
- Example: Humans can feel envy or anger or vengeance towards fictional characters, inanimate objects, etc.
(5) Evidence from specific involuntary reactions, hypothalamus / brainstem involvement, etc.—For example, things that have specific universal facial expressions or sympathetic nervous system correlates, or behavior that can be reliably elicited by a particular neuropeptide injection (AgRP makes you hungry), etc., are probably (A).

A couple specific cases:

Status—I’m not sure whether Jacob is suggesting that human social status related behaviors are explained by (B) or (C) or both. But anyway I think 1,2,3,4 all push towards an (A)-type explanation for human social status behaviors. I think I would especially start with 3 (heritability)—if having high social status is generally useful for achieving a wide variety of goals, and that were the entire explanation for why people care about it, then it wouldn’t really make sense that some people care much more about status than others do, particularly in a way that (I’m pretty sure) statistically depends on their genes (including their sex) but which doesn’t much depend on their family environment (at least within a country), and which (I’m pretty sure) doesn’t particularly correlate with intelligence etc.

(As for 5, I’m not aware of e.g. some part of the hypothalamus or brainstem where stimulating it makes people feel high-status, but pretty please tell me if anyone has seen anything like that! I would be eternally grateful!)

Fun—Jacob writes “Fun is also probably an emergent consequence of value-of-information and optionality” which I take to be a claim that “fun” is (B) or (C), not (A). But I think it’s (A). I think 5 is strong evidence that fun involves (A). For one thing, decorticate rats will still do the activities we associate with “fun”, e.g. playing with each other (ref). For another thing, there’s a specific innate involuntary behavior / facial expression associated with “fun” (i.e. laughing in humans, and analogs-of-laughing in other animals), which again seems to imply (A). I also claim that 1,2,3,4 above also offer additional evidence for an (A)-type explanation of fun / play behavior, without getting into details.

I'll start with a basic model of intelligence which is hopefully general enough to cover animals, humans, AGI, etc. You have a model-based agent with a predictive world model W learned primarily through self-supervised predictive learning (i.e. learning to predict the next 'token' for a variety of tokens), a planning/navigation subsystem P which uses W to approximately predict/sample important trajectories according to some utility function U, a value function V which computes the immediate net expected discounted future utility of actions from current state (including internal actions), and then some action function A which just samples high value actions based on V. The function of the planning subsystem P is then to train/update V.
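The W / P / U / V / A decomposition above can be sketched in a few lines. This is a toy under loud assumptions, not Jacob's actual proposal: the 6-state chain world, the made-up `utility`, the lookup-table world model, and all names are hypothetical, and the planner is swapped for a simple Dyna-style replay of remembered transitions rather than real trajectory sampling. In particular, U here is a plain goal reward, not an empowerment or value-of-information estimate.

```python
import random
from collections import defaultdict

N_STATES = 6        # toy chain world with states 0..5 (assumption)
ACTIONS = (-1, +1)  # step left or step right
GAMMA = 0.9         # discount factor (assumption)


def env_step(state, action):
    """The real environment: move along the chain, clamped at both ends."""
    return min(max(state + action, 0), N_STATES - 1)


def utility(state):
    """U: innate utility. Made up here: 1 for arriving at the right-hand end."""
    return 1.0 if state == N_STATES - 1 else 0.0


class Agent:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.world_model = {}            # W: (state, action) -> predicted next state
        self.value = defaultdict(float)  # V: (state, action) -> discounted future utility

    def learn_world_model(self, state, action, next_state):
        """Train W by prediction: record what followed each (state, action)."""
        self.world_model[(state, action)] = next_state

    def plan(self, n_updates=20):
        """P: replay imagined transitions from W and back their utility up into V."""
        known = list(self.world_model.items())
        for _ in range(n_updates if known else 0):
            (state, action), next_state = self.rng.choice(known)
            future = max(self.value[(next_state, a)] for a in ACTIONS)
            self.value[(state, action)] = utility(next_state) + GAMMA * future

    def act(self, state, explore=0.1):
        """A: choose a high-value action under V, with occasional exploration."""
        if self.rng.random() < explore:
            return self.rng.choice(ACTIONS)
        best = max(self.value[(state, a)] for a in ACTIONS)
        return self.rng.choice([a for a in ACTIONS if self.value[(state, a)] == best])


def train(agent, episodes=200, steps=30):
    """Interleave acting in the real world, updating W, and planning to update V."""
    for _ in range(episodes):
        state = 0
        for _ in range(steps):
            action = agent.act(state)
            next_state = env_step(state, action)
            agent.learn_world_model(state, action, next_state)
            agent.plan()
            state = next_state
    return agent


if __name__ == "__main__":
    trained = train(Agent(seed=0))
    for s in range(N_STATES):
        print(s, {a: round(trained.value[(s, a)], 2) for a in ACTIONS})
```

In this sketch V starts at zero everywhere, so nothing is innate except U. An innate prior on V/A, in the sense discussed next, would correspond to initializing `self.value` with non-zero entries before any experience.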

The utility function U obviously needs some innate bootstrapping, but brains also can import varying degrees of prior knowledge into other components - and most obviously into V, the value function. Many animals need key functionality 'out of the box', which you can get by starting with a useful prior on V/A. The benefit of innate prior knowledge in V/A diminishes as brains scale up in net training compute (size * training time), so that humans - with net training compute ~1e25 ops vs ~1e21 ops for a cat - rely far more on learned knowledge for V/A rather than prior/innate knowledge.

So now to translate into your 3 levels:

A.): Innate drives: Innate prior knowledge in U and in V/A.

B.): Learned from experience and subsumed into system 1: using W/P to train V/A.

C.): System 2 style reasoning: zero shot reasoning from W/P.

(1) Evidence from cases where we can rule out (C), e.g. sufficiently simple and/or young humans/animals

So your A.) - innate drives - corresponds to U or the initial state of V/A at birth. I agree the example of newborn rodents avoiding birdlike shadows is probably mostly innate V/A - value/action function prior knowledge.

(2) Evidence from sufficiently distant consequences that we can rule out (B) Example: Many animals will play-fight as children. This has a benefit (presumably) of eventually making the animals better at actual fighting as adults. But the animal can’t learn about that benefit via trial-and-error—the benefit won’t happen until perhaps years in the future.

Sufficiently distant consequences is exactly what empowerment is for, as the universal approximator of long term consequences. Indeed the animals can't learn about that long term benefit through trial-and-error, but that isn't how most learning operates. Learning is mostly driven by the planning system 1 - M/P - which drives updates to V/A based on both current learned V and U - and U by default is primarily estimating empowerment and value of information as universal proxies.

The animals play-fighting is something I have witnessed and studied recently. We have a young dog and a young cat who organically have learned to play several 'games'. The main game is a simple chase where the larger dog tries to tackle the cat. The cat tries to run/jump to safety. If the dog succeeds in catching the cat, the dog will tackle and constrain it on the ground, teasing it for a while. We - the human parents - often will interrupt the game at this point and occasionally punish the dog if it plays too rough and the cat complains. In the earliest phases the cat was about as likely to chase and attack the dog as the other way around, but over time learned it would nearly always lose wrestling matches and end up in a disempowered state.

There is another type of ambush game the cat will play in situations where it can 'attack' the dog from safety or in range to escape to safety, and then other types of less rough play fighting they do close to us.

So I suspect that some amount of play fighting skill knowledge is prior instinctual, but much of it is also learned. The dog and cat both separately enjoy catching/chasing balls or small objects, the cat play fights and 'attacks' other toys, etc. So early on in their interactions they had these skills available, but those alone are not sufficient to explain the game(s) they play together.

The chase game is well explained by empowerment drive: the cat has learned that allowing the dog to chase it down leads to an intrinsically undesirable disempowered state. This is a much better fit for the data and also has much lower intrinsic complexity than a bunch of innate drives for every specific disempowered situation, vs a general empowerment drive. It's also empowering for the dog to control and disempower the cat to some extent. So much of innate hunting skill drives seem like just variations and/or mild tweaks to empowerment.

The only part of this that requires a more specific explanation is perhaps the safety aspect of play fighting: each animal is always pulling punches to varying degrees, the cat isn't using fully extended claws, neither is biting with full force, etc. That is probably the animal equivalent of empathy/altruism.

Status—I’m not sure whether Jacob is suggesting that human social status related behaviors are explained by (B) or (C) or both. But anyway I think 1,2,3,4 all push towards an (A)-type explanation for human social status behaviors. I think I would especially start with 3 (heritability)—if having high social status is generally useful for achieving a wide variety of goals, and that were the entire explanation for why people care about it, then it wouldn’t really make sense that some people care much more about status than others do, particularly in a way that (I’m pretty sure) statistically depends on their genes

Status is almost all learned B: system 2 W/P planning driving system 1 V/A updates.

Earlier I said - and I don't see your reply yet, so I'll repeat it here:

Infants don't even know how to control their own limbs, but they automatically learn through a powerful general empowerment learning mechanism. That same general learning signal absolutely does not - and cannot - discriminate between hidden variables representing limb poses (which it seeks to control) and hidden variables representing beliefs in other humans' minds (which determine constraints on the child's behavior). It simply seeks to control all such important hidden variables.

Social status drive emerges naturally from empowerment, which children acquire by learning cultural theory of mind and folk game theory through learning to communicate with and through their parents. Children quickly learn that hidden variables in their parents have a huge effect on their environment and thus try to learn how to control those variables.

It's important to emphasize that this is all subconscious and subsumed into the value function, it's not something you are consciously aware of.

I don't see how heritability tells us much about how innate social status is. Genes can control many hyperparams which can directly or indirectly influence the later learned social status drive. One obvious example is just the relevant weightings of value-of-information (curiosity) vs optionality-empowerment and other innate components of U at different points in time (development periods). I think this is part of the explanation for children who are highly curious about the world and less concerned about social status vs the converse.

Fun—Jacob writes “Fun is also probably an emergent consequence of value-of-information and optionality” which I take to be a claim that “fun” is (B) or (C), not (A). But I think it’s (A).

Fun is complex and general/vague - it can be used to describe almost anything we derive pleasure from in your A.) or B.) categories.

Thanks!

One of my disagreements with your U,V,P,W,A model is that I think V & W are randomly-initialized in animals. Or maybe I’m misunderstanding what you mean by “brains also can import varying degrees of prior knowledge into other components”.

I also (relatedly?) am pretty against trying to lump the brainstem / hypothalamus and the cortex / BG / etc. into a single learning-algorithm-ish framework.

I’m not sure if this is exactly your take, but I often see a perspective (e.g. here) where someone says “We should think of the brain as a learning algorithm. Oh wait, we need to explain innate behaviors. Hmm OK, we should think of the brain as a pretrained learning algorithm.”

But I think that last step is wrong. Instead of “pretrained learning algorithm”, we can alternatively think of the brain as a learning algorithm plus other things that are not learning algorithms. For example, I think most newborn behaviors are purely driven by the brainstem, which is doing things of its own accord without any learning and without any cortex involvement.

To illustrate the difference between “pretrained learning algorithm” and “learning algorithm + other things that are not learning algorithms”:

Suppose I’m making a robot. I put in a model-based RL system. I also put in a firmware module that detects when the battery is almost empty and when it is, it shuts down the RL system, takes control, and drives the robot back to the charging station.

Leaving aside whether this is a good design for a robot, or a good model for the brain (it’s not), let’s just talk about this system. Would we describe the firmware module as “importing prior knowledge into components of the RL algorithm”? No way, right? Instead we would describe the firmware module as “a separate component from the RL algorithm”.
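The distinction can be made concrete with a small sketch. All names here (`Robot`, `rl_policy`, `firmware_override`, the 10% threshold) are hypothetical stand-ins. The only point is structural: the override lives outside the learner and shares no parameters with it.

```python
from dataclasses import dataclass


@dataclass
class Robot:
    battery: float             # fraction of charge remaining, 0.0 to 1.0
    low_battery: float = 0.10  # hypothetical hard-coded threshold


def rl_policy(observation):
    """Stand-in for the learned model-based RL system."""
    return "explore"


def firmware_override(robot):
    """Hard-coded module: nothing in here is learned or updated."""
    if robot.battery <= robot.low_battery:
        return "drive_to_charger"
    return None


def control_step(robot, observation):
    """The firmware is checked first; the RL system only acts if it stays silent."""
    override = firmware_override(robot)
    if override is not None:
        return override  # the RL system is bypassed entirely
    return rl_policy(observation)


low = control_step(Robot(battery=0.05), observation=None)
normal = control_step(Robot(battery=0.80), observation=None)
```

On the "pretrained learning algorithm" reading, the battery behavior would instead have to show up as initial weights inside `rl_policy` itself.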

By the same token, I think there are a lot of things happening in the brainstem / hypothalamus which we should describe as “a separate component from the RL algorithm”.

Sufficiently distant consequences is exactly what empowerment is for, as the universal approximator of long term consequences. Indeed the animals can't learn about that long term benefit through trial-and-error, but that isn't how most learning operates. Learning is mostly driven by the planning system 1 - M/P - which drives updates to V/A based on both current learned V and U - and U by default is primarily estimating empowerment and value of information as universal proxies.

[M/P is a typo for W/P right?]

Let’s say I wake up in the morning and am deciding whether or not to put a lock pick set in my pocket. There are reasons to think that this might increase my empowerment—if I find myself locked out of something, I can maybe pick the lock. There are also reasons to think that this might decrease my empowerment—let’s say, if I get frisked by a cop, I look more suspicious and have a higher chance of spurious arrest, and also I’m carrying around more weight and have less room in my pockets for other things.

So, all things considered, is it empowering or disempowering to put the lock pick set into my pocket for the day? It depends. In a city, it’s maybe empowering. On a remote mountain, it’s probably disempowering. In between, hard to say.

The moral is: I claim that figuring out what’s empowering is not a “local” / “generic” / “universal” calculation. If I do X in the morning, it is unknowable whether that was an empowering or disempowering action, in the absence of information about where I’m likely to find myself in the afternoon. And maybe I can make an intelligent guess at those, but I’m not omniscient. If I were a newborn, I wouldn’t even be able to guess.

So anyway, if an animal could practice skill X versus skill Y as a baby, it is (in general) unknowable which one is a more empowering course of action, in the absence of information about what kinds of situations the animal is likely to find itself in when it’s older. And the animal itself doesn’t know that—it’s just a baby.

Since I’m a smart adult human, I happen to know that:

it’s empowering for baby cats to practice pouncing,
it’s empowering for baby bats to practice arm-flapping,
it’s empowering for baby humans to practice grasping,
it’s not empowering for baby humans to practice arm-flapping,
it’s not empowering for baby bats to practice pouncing
etc.

But I don’t know how the baby cats, bats, and humans are supposed to figure that out, via some “generic” empowerment calculation. Arm-flapping is equally immediately useless for both newborn bats and newborn humans, but newborn humans never flap their arms and newborn bats do constantly.

So yeah, it would be simple and elegant to say “the baby brain is presented with a bunch of knobs and levers and gradually discovers all the affordances of a human body”. But I don’t think that fits the data, e.g. the lack of human newborn arm-flapping experiments in comparison to newborn bats.

Instead, I think baby humans have an innate drive to stand up, an innate drive to walk, an innate drive to grasp, and probably a few other things like that. I think they already want to do those things even before they have evidence (or other rational basis to believe) that doing so is empowering.

I claim that this also fits better into a theory where (1) the layout of motor cortex is relatively consistent between different people (in the absence of brain damage), (2) decorticate rats can move around in more-or-less species-typical ways, (3) there’s strong evolutionary pressure to learn motor control fast and we know that reward-shaping is helpful for that, (4) and that there’s stuff in the brainstem that can do this kind of reward-shaping, (5) lots of animals can get around reasonably well within a remarkably short time after birth, (6) stimulating a certain part of the brain can create “an urge to move your arm” etc. which is independent from executing the actual motion, (7) things like palmar grasp reflex, Moro reflex, stepping reflex, etc. (8) the sheer delight on the face of a baby standing up for the first time, (9) there are certain dopamine signals (from lateral SNc & SNl) that correlate with motor actions specifically, independent of general reward etc. (There’s kinda a long story, that I think connects all these dots, that I’m not getting into.)

(If you put a novel and useful motor affordance on a baby human—some funny grasper on their hand or something—I’m not denying that they would eventually figure out how to start using it, thanks to more generic things like curiosity, stumbling upon useful things, maybe learning-from-observation, etc. I just don’t think those kinds of things are the whole story for early acquisition of species-typical movements like grasping and standing. For example, I figure decorticate rats would probably fail to learn to use a weird novel motor affordance, but decorticate rats do move around in more-or-less species-typical ways.)

some amount of play fighting skill knowledge is prior instinctual, but much of it is also learned

Sure, I agree.

The only part of this that requires a more specific explanation is perhaps the safety aspect of play fighting: each animal is always pulling punches to varying degrees, the cat isn't using fully extended claws, neither is biting with full force, etc. That is probably the animal equivalent of empathy/altruism.

Yeah pulling punches is one thing. Another thing is that animals have universal species-specific somewhat-arbitrary signals that they’re playing, including certain sounds (laughing in humans) and gestures (“play bow” in dogs).

My more basic argument is that the desire to play-fight in the first place, as opposed to just relaxing or whatever, is an innate drive. I think we’re giving baby animals too much credit if we expect them to be thinking to themselves “gee when I grow up I might need to be good at fighting so I should practice right now instead of sitting on the comfy couch”. I claim that there isn’t any learning signal or local generic empowerment calculation that would form the basis for that.

Fun is complex and general/vague - it can be used to describe almost anything we derive pleasure from in your A.) or B.) categories.

Fair enough.

One of my disagreements with your U,V,P,W,A model is that I think V & W are randomly-initialized in animals. Or maybe I’m misunderstanding what you mean by “brains also can import varying degrees of prior knowledge into other components”.

I think we agree the cortex/cerebellum are randomly initialized, along with probably most of the hippocampus, BG, perhaps amygdala? and a few others. But those don't map cleanly to U, W/P, and V/A.

For example, I think most newborn behaviors are purely driven by the brainstem, which is doing things of its own accord without any learning and without any cortex involvement.

Of course - and that is just innate unlearned knowledge in V/A. V/A (value and action) generally go together, because any motor/action skills need pairing with value estimates so the BG can arbitrate (de-conflict) action selection.

The moral is: I claim that figuring out what’s empowering is not a “local” / “generic” / “universal” calculation. If I do X in the morning, it is unknowable whether that was an empowering or disempowering action, in the absence of information about where I’m likely to find myself in the afternoon. And maybe I can make an intelligent guess at those, but I’m not omniscient. If I were a newborn, I wouldn’t even be able to guess.

Empowerment and value-of-information (curiosity) estimates are always relative to current knowledge (contextual to the current wiring and state of W/P and V/A). Doing X in the morning generally will have variable optionality value depending on the contextual state, goals/plans, location, etc. I'm not sure why you seem to think that I think of optionality-empowerment estimates as requiring anything resembling omniscience.

The newborn's VoI and optionality value estimates will be completely different and focused on things like controlling flailing limbs and making sounds, moving the head, etc.

But I don’t know how the baby cats, bats, and humans are supposed to figure that out, via some “generic” empowerment calculation. Arm-flapping is equally immediately useless for both newborn bats and newborn humans, but newborn humans never flap their arms and newborn bats do constantly.

There's nothing to 'figure out' - it just works. If you're familiar with the approximate optionality-empowerment literature, it should be fairly obvious that a generic agent maximizing optionality will end up flapping its wing-arms when controlling a bat body, flailing limbs around in a newborn human body, balancing pendulums, learning to walk, etc. I've already linked all this - but maximizing optionality automatically learns all motor skills - even up to bipedal walking.

So yeah, it would be simple and elegant to say “the baby brain is presented with a bunch of knobs and levers and gradually discovers all the affordances of a human body”. But I don’t think that fits the data, e.g. the lack of human newborn arm-flapping experiments in comparison to bats.

Human babies absolutely do the equivalent experiments - most of the difference is simply due to large differences in the arm structure. The bat's long extensible arms are built to flap, the human infants' short stubby arms are built to flail.

Also keep in mind that efficient optionality is approximated/estimated from a sampling of likely actions in the current V/A set, so it naturally and automatically takes advantage of any prior knowledge there. Perhaps the bat does have prior wiring in V/A that proposes & generates simple flapping that can be improved.

Instead, I think baby humans have an innate drive to stand up, an innate drive to walk, an innate drive to grasp, and probably a few other things like that. I think they already want to do those things even before they have evidence (or other rational basis to believe) that doing so is empowering.

This just doesn't fit the data at all. Humans clearly learn to stand and walk. They may have some innate bias in V/U which makes that subgoal more attractive, but that is an intrinsically more complex addition to the basic generic underlying optionality control drive.

I claim that this also fits better into a theory where (1) the layout of motor cortex is relatively consistent between different people (in the absence of brain damage),

We've already been over that - consistent layout is not strong evidence of innate wiring. A generic learning system will learn similar solutions given similar inputs & objectives.

(2) decorticate rats can move around in more-or-less species-typical ways,

The general lesson from the decortication experiments is that smaller brain mammals rely on (their relatively smaller) cortex less. Rats/rabbits can do much without the cortex and have many motor skills available at birth. Cats/dogs need to learn a bit more, and then primates - especially larger ones - need to learn much more and rely on the cortex heavily. This is extreme in humans, to the point where there is very little innate motor ability left, and the cortex does almost everything.

(3) there’s strong evolutionary pressure to learn motor control fast and we know that reward-shaping is certainly helpful for that,

It takes humans longer than an entire rat lifespan just to learn to walk. Hardly fast.

(4) and that there’s stuff in the brainstem that can do this kind of reward-shaping,

Sure, but there is hardly room in the brainstem to reward-shape for the different things humans can learn to do.

Universal capability requires universal learning.

(5) lots of animals can get around reasonably well within a remarkably short time after birth,

Not humans.

(6) stimulating a certain part of the brain can create “an urge to move your arm” etc. which is independent from executing the actual motion,

Unless that is true for infants, it's just learned V components. I doubt infants have an urge to move the arm in a coordinated way, vs lower level muscle 'urges', but even if they did that's just some prior knowledge in V.

(If you put a novel and useful motor affordance on a baby human—some funny grasper on their hand or something—I’m not denying that they would eventually figure out how to start using it, thanks to more generic things like curiosity,

We know that humans can learn to see through their tongue - and this does not take much longer than an infant learning to see through its eyes.

I think we both agree that sensory cortex uses a pretty generic universal learning algorithm (driven by self supervised predictive learning). I just also happen to believe the same applies to motor and higher cortex (driven by some mix of VoI, optionality control, etc).

I think we’re giving baby animals too much credit if we expect them to be thinking to themselves “gee when I grow up I might need to be good at fighting so I should practice right now instead of sitting on the comfy couch”. I claim that there isn’t any learning signal or local generic empowerment calculation that would form the basis for that

Comments like these suggest you don't have the same model of optionality-empowerment as I do. When the cat was pinned down by the dog in the past, its planning subsystem computed low value for that state - mostly based on lack of optionality - and subsequently the V system internalizes this as low value for that state and states leading towards it. Afterwards when entering a room and seeing the dog on the other side, the W/P planning system quickly evaluates a few options like: (run into the center and jump up onto the table), (run into the center and jump onto the couch), (run to the right and hide behind the couch), etc - and subplan/action (run into the center ..) gets selected in part because of higher optionality. It's just an intrinsic component of how the planning system chooses options on even short timescales, and chains recursively through training V/A.
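For readers unfamiliar with the quantity being discussed, here is a minimal sketch of n-step empowerment in a deterministic toy world. In the information-theoretic formulation, empowerment is the channel capacity from action sequences to resulting states; when dynamics are deterministic, that reduces to the log of the number of distinct states reachable in n steps. The grid, the move set, and the choice to model "pinned" as a cell boxed in by walls are all illustrative assumptions. The sketch also computes reachability under the true dynamics, whereas an agent would have to use its own learned model.

```python
import math
from itertools import product

# Hypothetical move set for a square grid world.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0), "stay": (0, 0)}


def step(pos, move, walls, size):
    """Deterministic dynamics: a blocked or out-of-bounds move leaves pos unchanged."""
    dx, dy = MOVES[move]
    nxt = (pos[0] + dx, pos[1] + dy)
    inside = 0 <= nxt[0] < size and 0 <= nxt[1] < size
    return nxt if inside and nxt not in walls else pos


def empowerment(pos, n, walls, size):
    """n-step empowerment in bits: log2 of the number of distinct reachable states."""
    finals = set()
    for seq in product(MOVES, repeat=n):  # enumerate every n-step action sequence
        p = pos
        for move in seq:
            p = step(p, move, walls, size)
        finals.add(p)
    return math.log2(len(finals))


# "Pinned": a corner cell boxed in by walls, so no action changes the state.
pinned = empowerment((0, 0), n=2, walls={(1, 0), (0, 1)}, size=5)

# "Center of the room": an open cell with 13 cells reachable in two steps.
open_center = empowerment((2, 2), n=2, walls=set(), size=5)
```

Here `pinned` is 0 bits and `open_center` is log2(13), roughly 3.7 bits, which matches the intuition that being pinned is a low-optionality state.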

Thanks!

I'm not sure why you seem to think that I think of optionality-empowerment estimates as requiring anything resembling omniscience.

If we assume omniscience, it allows a very convenient type of argument:

Argument I [invalid]: Suppose an animal has a generic empowerment drive. We want to know whether it will do X. We should ask: Is X actually empowering?

However, if we don’t assume omniscience, then we can’t make arguments of that form. Instead we need to argue:

Argument II [valid]: Suppose an animal has a generic empowerment drive. We want to know whether it will do X. We should ask: Has the animal come to believe (implicitly or explicitly) that doing X is empowering?

I have the (possibly false!) impression that you’ve been implicitly using Argument I sometimes. That’s how omniscience came up.

For example, has a newborn bat come to believe (implicitly or explicitly) that flapping its arm-wings is empowering? If so, how did it come to believe that? The flapping doesn’t accomplish anything, right? They’re too young and weak to fly, and don’t necessarily know that flying is an eventual option to shoot for. (I’m assuming that baby bats will practice flapping their wings even if raised away from other bats, but I didn’t check, I can look it up if it’s a crux.) We can explain a sporadic flap or two as random exploration / curiosity, but I think bats practice flapping way too much for that to be the whole explanation.

Back to play-fighting. A baby animal is sitting next to its sibling. It can either play-fight, or hang out doing nothing. (Or cuddle, or whatever else.) So why play-fight?

Here’s the answer I prefer. I note that play-fighting as a kid presumably makes you a better real-fighter as an adult. And I don’t think that’s a coincidence; I think it’s the main point. In fact, I thought that was so obvious that it went without saying. But I shouldn’t assume that—maybe you disagree!

If you agree that “child play-fighting helps train for adult real-fighting” not just coincidentally but by design, then I don’t see the “Argument II” logic going through. For example, animals will play-fight even if they’ve never seen a real fight in their life.

So again: Why don’t your dog & cat just ignore each other entirely? Sure, when they’re already play-fighting, there are immediately-obvious reasons that they don’t want to be pinned. But if they’re relaxing, and not in competition over any resources, why go out of their way to play-fight? How did they come to believe that doing so is empowering? Or if they are in competition over resources, why not real-fight, like undomesticated adult animals do?

maximizing optionality automatically learns all motor skills - even up to bipedal walking

I agree, but I don’t think that’s strong evidence that nothing else is going on in humans. For example, there’s a “newborn stepping reflex”—newborn humans have a tendency to do parts of walking, without learning, even long before their muscles and brains are ready for the whole walking behavior. So if you say “a simple generic mechanism is sufficient to explain walking”, my response is “Well it’s not sufficient to explain everything about how walking is actually implemented in humans, because when we look closely we can see non-generic things going on”.

Here’s a more theoretical perspective. Suppose I have two side-by-side RL algorithms, learning to control identical bodies. One has some kind of “generic” empowerment reward. The other has that same reward, plus also a reward-shaping system directly incentivizing learning to use some small number of key affordances that are known to work well for that particular body (e.g. standing).

I think the latter would do all the same things as the former, but it would learn faster and more reliably, particularly very early on. Agree or disagree? If you agree, then we should expect to find that in the brain, right?
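The two-learner setup can be sketched in a few lines, for anyone who wants to experiment with it. This is a toy under heavy assumptions. A sparse goal reward on a 10-state chain stands in for the "generic" reward (it is not an empowerment calculation). The shaping term uses the standard potential-based form, gamma * phi(s') - phi(s), swapped in here for definiteness. And `potential` is a hypothetical hand-written hint about which states lie closer to the key affordance.

```python
import random

N_STATES = 10        # chain of states 0..9; state 9 stands in for e.g. "standing"
GOAL = N_STATES - 1
GAMMA = 0.95


def base_reward(s_next):
    """Stand-in for the generic reward: pays off only on reaching the goal."""
    return 1.0 if s_next == GOAL else 0.0


def potential(s):
    """Hypothetical innate hint: states nearer the key affordance score higher."""
    return s / GOAL


def shaped_reward(s, s_next):
    """Generic reward plus a potential-based shaping term."""
    return base_reward(s_next) + GAMMA * potential(s_next) - potential(s)


def train(use_shaping, episodes=50, max_steps=200, alpha=0.5, epsilon=0.1, seed=0):
    """Tabular Q-learning; returns the number of steps taken in each episode."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # actions: 0 = left, 1 = right
    steps_per_episode = []
    for _ in range(episodes):
        s, steps = 0, 0
        while s != GOAL and steps < max_steps:
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)           # explore, or break a tie at random
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = shaped_reward(s, s_next) if use_shaping else base_reward(s_next)
            future = 0.0 if s_next == GOAL else GAMMA * max(q[s_next])
            q[s][a] += alpha * (r + future - q[s][a])
            s, steps = s_next, steps + 1
        steps_per_episode.append(steps)
    return steps_per_episode


generic_curve = train(use_shaping=False)
shaped_curve = train(use_shaping=True)
```

Comparing `generic_curve` and `shaped_curve` across several seeds gives a feel for the question in this toy setting, but a 10-state chain says nothing by itself about what the brainstem does.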

(When I say “more reliably”, I’m referring to the trope that programming RL agents is really finicky, more so than other types of ML. I don’t really know if that trope is correct though.)

Sure, but there is hardly room in the brainstem to reward-shape for the [] different things humans can learn to do.

I hope we’re not having one of those silly arguments where we both agree that empowerment explains more than 0% and less than 100% of whatever, and then we’re going back and forth saying “It’s more than 0%!” “No way, it’s less than 100%!” “No way, it’s more than 0%!” … :)

Anyway, I think the brainstem “knows about” some limited number of species-typical behaviors, and can probably execute those behaviors directly without learning, and also probably reward-shapes the cortex into learning those behaviors faster. Obviously I agree that the cortex can also learn pretty much arbitrary other behaviors, like ballet and touch-typing, which are not specifically encoded in the brainstem.

I like this post! Steven Byrnes and Jacob Cannell are two people with big models of the brain and intelligence that give unique, concrete predictions, and both are large contributors to my own thinking. The post can only be excellent, and indeed it is! Byrnes doesn't always respond to Cannell how I would, but his responses usually shifted my opinion somewhat.

The convergence theorems basically say that optimizing for P[t] converges to optimizing for T[t+d] for some sufficient timespan d.

The idea of a convergence theorem showing that optimizing any objective leads to empowerment has been brought up a bunch of times in these discussions, as in this quote. Is there some well-known proof/paper where this is shown? AFAICT the original empowerment papers do not show any proof like this (I may have missed it). Is this based off of Alex Turner's work (https://arxiv.org/pdf/1912.01683.pdf), which results in a different measure than information-theoretic empowerment (but intuitively related), or something else?

Excellent post btw.

I'm glad Jacob agrees that empowerment could theoretically help arbitrary entities achieve arbitrary goals. (I recall someone who was supposedly great at board games recommending it as a fairly general strategy.) I don't see how, if empowerment is compatible with almost any goal, it could prevent the AI from changing our goals whenever this is convenient.

Perhaps he thinks we can define "empowerment" to exclude this? Quick reaction: that seems likely to be FAI-complete, and somewhat unlikely to be a fruitful approach. My understanding of physics says that pretty much any action has a physical effect on our brains. Therefore, the definition of which changes to our brains "empower" and which "disempower" us may be doing all of the heavy lifting. How does this become easier to program than CEV?

Jacob responds: The distribution shift from humans born in 0AD to humans born in 2000AD seems fairly inconsequential for human alignment.

I now have additional questions. The above seems likely enough in the context of CEV (again), but otherwise false.

The above seems likely enough in the context of CEV (again), but otherwise false.

I think there might be a mix-up here. There are two topics of discussion:

One topic is: “We should look at humans and human values since those are the things we want to align an AGI to.”
The other topic is: “We should look at humans and human values since AGI learning algorithms are going to resemble human brain within-lifetime learning algorithms, and humans provide evidence for what those algorithms do in different training environments”.

The part of the post that you excerpted is about the latter, not the former.

Imagine that God gives you a puzzle: You get most of the machinery for a human brain but some of the innate drive neural circuitry has been erased and replaced by empty boxes. You’re allowed to fill in the boxes however you want. You’re not allowed to cheat by looking at actual humans. Your goal is to fill in the boxes such that the edited-human winds up altruistic.

So you have a go at filling in the boxes. God lets you do as many validation runs as you want. The validation runs involve raising the edited-human in a 0AD society and seeing what they wind up like. After a few iterations, you find settings where the edited-humans reliably grow up very altruistic in every 0AD society you can think to try.

Now that your validation runs are done, it’s time for the test run. So the question is: if you put the same edited-human-brain in a 2022AD society, will it also grow up altruistic on the first try?

I think a good guess is “yes”. I think that’s what Jacob is saying.

(For my part, I think Jacob’s point there is fair, and a helpful way to think about it, even if it doesn’t completely allay my concerns.)

Thinking about (innate drives -> valenced world states -> associated states -> learned drives -> increasingly abstract valenced empowerment) brings up for me this question of seeking a very specific world state with high predicted valence & empowerment. And this I feel like is accurately described, but awkward to think about, from the frame of Jacob's W/P/U/V/A distinction. Like how it's accurate but difficult to think about biology from the frame of movements of protons and electrons. I think if we zoom in on the W/P plan making portion and adopt a different frame, we see a consequentialist plan generator that does directed search through projected futures based on W (world model). And this then is rather like Eliezer's Outcome Pump. If you zoom out, the Outcome Pump is one part of an agent. It's only in the zoomed in view that you see a non-sentient search process that searches for valenced empowerment over extrapolations made from running simulations of the World Model. I'd argue that something very like this planning process is occurring in AlphaZero and DeepNash (Stratego). But those have narrow world models, and search systems designed to work over narrow world models.

Quote from the Outcome Pump:

Consider again the Tragedy of Group Selectionism: Some early biologists asserted that group selection for low subpopulation sizes would produce individual restraint in breeding; and yet actually enforcing group selection in the laboratory produced cannibalism, especially of immature females. It's obvious in hindsight that, given strong selection for small subpopulation sizes, cannibals will outreproduce individuals who voluntarily forego reproductive opportunities. But eating little girls is such an un-aesthetic solution that Wynne-Edwards, Allee, Brereton, and the other group-selectionists simply didn't think of it. They only saw the solutions they would have used themselves.

So, notice this idea that humans are doing an aesthetically guided search. Seems to me this is an accurate description of human thought / planning. I think this has a lot of overlap with the aesthetic imagining of a nice picture being done by Stable Diffusion or other image models. And overlap with using Energy Models to imagine plans out of noise.

I see this paper as having some valuable insights for unifying the sort of Multi-Objective variable-strength complex valence/reward system that the neuroscience perspective describes with a need to tie these dynamically weighted objectives together into a cohesive plan of action. https://arxiv.org/abs/2211.10851 (h/t Capybasilisk)

I also put higher probability on AGI also using fast serial coprocessors to unlock algorithmic possibilities that brains don’t have access to, both for early AGI and in the distant future. (Think of how “a human with a pocket calculator” can do things that a human can’t. Then think much bigger than that!)

Does anybody know of research in this direction?

Hm, I used the inline comment function but somehow this comment doesn't show up inline.

And this is an example of our more general dispositions where I tend to think “10% of evolutionary psychology is true important things that we need to explain, let’s get to work explaining them properly” and Jacob tends to think “90% of evolutionary psychology is crap, let’s get to work throwing it out”. These are not inconsistent! But they’re different emphases.

Top highlight. Nice reflection.

Despite feeling that there are some really key points in Jacob's 'it all boils down to empowerment' point of view (which is supported by the paper I linked in my other comment), I still find myself more in agreement with Steven's points about innate drives.

(A) Most humans do X because they have an innate drive to do X; (e.g. having sex, breathing)
(B) Most humans do X because they have done X in the past and have learned from experience that doing X will eventually lead to good things (e.g. checking the weather forecast before going out)
(C) Most humans do X because they have indirectly figured out that doing X will eventually lead to good things—via either social / cultural learning, or via explicit means-end reasoning (e.g. avoiding prison, for people who have never been in prison)

So, I think a missing piece here is that 'empowerment' is perhaps better described as 'ability to reach desired states' where the desire stems from innate drives. This is a very different sense of 'empowerment' than a more neutral 'ability to reach any state' or 'ability to reach as many states as possible'.

If I had available to me a button which, when I pressed it, would give me 100 unique new ways in which it was possible for me to choose to be tortured and the ability to activate any of those tortures at will... I wouldn't press that button!

If there was another button that would give me 100 unique new ways to experience pleasure and the ability to activate those pleasures at will, I would be strongly tempted to press it.

Seems like my avoiding the 'new types of torture' button is me declining reachability / empowerment / optionality. This illustrates why I don't think a non-valenced empowerment seeking is an accurate description of human/animal behavior.
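To make the two-button contrast concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy world, the state names, the valence numbers, and the use of a simple count of reachable states as a crude stand-in for non-valenced empowerment (formal definitions of empowerment are information-theoretic and are not reproduced here). The "valenced" variant counts only reachable states with positive valence.

```python
from collections import deque


def reachable_states(transitions, start, horizon):
    """Return the set of states reachable from `start` in at most `horizon` steps."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        state, depth = frontier.popleft()
        if depth == horizon:
            continue  # do not expand beyond the planning horizon
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


def neutral_empowerment(transitions, start, horizon=1):
    """Non-valenced proxy (assumption): just count the reachable states."""
    return len(reachable_states(transitions, start, horizon))


def valenced_empowerment(transitions, valence, start, horizon=1):
    """Valenced proxy (assumption): sum only the positive valence that is reachable."""
    states = reachable_states(transitions, start, horizon)
    return sum(max(valence.get(s, 0.0), 0.0) for s in states)


def with_button(transitions, valence, prefix, value, n=100):
    """Hypothetical helper: pressing a button adds n new states reachable from 'home'."""
    new_states = [f"{prefix}_{i}" for i in range(n)]
    new_transitions = dict(transitions)
    new_transitions["home"] = list(transitions["home"]) + new_states
    new_valence = dict(valence, **{s: value for s in new_states})
    return new_transitions, new_valence


# Toy world with made-up valences.
base_t = {"home": ["work", "park"]}
base_v = {"home": 0.0, "work": 1.0, "park": 1.0}
torture_t, torture_v = with_button(base_t, base_v, "torture", -1.0)
pleasure_t, pleasure_v = with_button(base_t, base_v, "pleasure", +1.0)

for name, t, v in [("base", base_t, base_v),
                   ("torture button", torture_t, torture_v),
                   ("pleasure button", pleasure_t, pleasure_v)]:
    print(name, neutral_empowerment(t, "home"), valenced_empowerment(t, v, "home"))
```

Under these assumptions the neutral count rises equally for both buttons, while the valenced score rises only for the pleasure button. That matches the intuition above: declining the torture button gives up optionality without giving up anything desired.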

Of course, we can learn to associate innate-drive-neutral things, like money, with innate-drive-valenced empowerment. Or even innate-drive-negative things, so long as the benefit sufficiently outweighs the cost.

And once you've gotten as far as 'valenced empowerment with ability to bridge locally negative states', then you start getting into decision making about various plans over the various conceptual directions in valenced state space (with the valence originating from, but now abstracted away from, innate drives), and this to me is very much what Shard Theory is about.

Suppose most humans do X, where X increases empowerment. Three possibilities are:

(A) Most humans do X because they have an innate drive to do X; (e.g. having sex, breathing)
(B) Most humans do X because they have done X in the past and have learned from experience that doing X will eventually lead to good things (e.g. checking the weather forecast before going out)
(C) Most humans do X because they have indirectly figured out that doing X will eventually lead to good things—via either social / cultural learning, or via explicit means-end reasoning (e.g. avoiding prison, for people who have never been in prison)

How do we figure it out? In general, 5 types of evidence that we can bring to bear are:

(1) Evidence from cases where we can rule out (C), e.g. sufficiently simple and/or young humans/animals. Then we can just see whether the animal is doing X more often than chance from the start, or whether it has to stumble upon X before it starts doing X more often than chance.
- Example: If you’re a baby mouse who has never seen a bird (or bird-like projectile etc.) in your life, you have no rational basis for thinking that birds are dangerous. Nevertheless, lab experiments show that baby mice will run away from incoming birds, reliably, the first time. (Ref) So that has to be (A).
(2) Evidence from sufficiently distant consequences that we can rule out (B).
- Example: Many animals will play-fight as children. This has a benefit (presumably) of eventually making the animals better at actual fighting as adults. But the animal can’t learn about that benefit via trial-and-error—the benefit won’t happen until perhaps years in the future.
(3) Evidence from heritability—If doing X is heritable, I think an (A)-type explanation would make that fact very easy to explain—in fact, an (A)-type explanation for X would pretty much demand that doing X has nonzero heritability. Conversely, if doing X is heritable (in a way that’s not explained by heritability of “general intelligence” type stuff), well I don’t think (B) or (C) is immediately ruled out, but we do need to think about it and try to come up with a story of how that could work.
(4) Evidence from edge-cases where X is not actually empowering—Suppose doing X is usually empowering, but not always. If people do a lot of X even in edge-cases where it’s not empowering, I consider that strong evidence for (A) over (B) & (C). It’s not indisputable evidence though, because maybe you could argue that people are able to learn the simple pattern “X tends to be empowering”, but unable to learn the more complicated pattern “X tends to be empowering with the following exceptions…”. But still, I think it’s strong evidence.
- Example: Humans can feel envy or anger or vengeance towards fictional characters, inanimate objects, etc.
(5) Evidence from specific involuntary reactions, hypothalamus / brainstem involvement, etc.—For example, things that have specific universal facial expressions or sympathetic nervous system correlates, or behavior that can be reliably elicited by a particular neuropeptide injection (AgRP makes you hungry), etc., are probably (A).

A couple specific cases:

So now to translate into your 3 levels:

A.): Innate drives: Innate prior knowledge in U and in V/A.

B.): Learned from experience and subsumed into system 1: using W/P to train V/A.

C.): System 2 style reasoning: zero shot reasoning from W/P.

(1) Evidence from cases where we can rule out (C), e.g. sufficiently simple and/or young humans/animals

(2) Evidence from sufficiently distant consequences that we can rule out (B) Example: Many animals will play-fight as children. This has a benefit (presumably) of eventually making the animals better at actual fighting as adults. But the animal can’t learn about that benefit via trial-and-error—the benefit won’t happen until perhaps years in the future.

Status—I’m not sure whether Jacob is suggesting that human social status related behaviors are explained by (B) or (C) or both. But anyway I think 1,2,3,4 all push towards an (A)-type explanation for human social status behaviors. I think I would especially start with 3 (heritability)—if having high social status is generally useful for achieving a wide variety of goals, and that were the entire explanation for why people care about it, then it wouldn’t really make sense that some people care much more about status than others do, particularly in a way that (I’m pretty sure) statistically depends on their genes

Status is almost all learned B: system 2 W/P planning driving system 1 V/A updates.

Earlier I said - and I don't see your reply yet, so I'll repeat it here:

Infants don't even know how to control their own limbs, but they automatically learn through a powerful general empowerment learning mechanism. That same general learning signal absolutely does not - and can not - discriminate between hidden variables representing limb poses (which it seeks to control) and hidden variables representing beliefs in other humans minds (which determine constraints on the child's behavior). It simply seeks to control all such important hidden variables.

Social status drive emerges naturally from empowerment, which children acquire by learning cultural theory of mind and folk game theory through learning to communicate with and through their parents. Children quickly learn that hidden variables in their parents have a huge effect on their environment and thus try to learn how to control those variables.

It's important to emphasize that this is all subconscious and subsumed into the value function, it's not something you are consciously aware of.

Fun—Jacob writes “Fun is also probably an emergent consequence of value-of-information and optionality” which I take to be a claim that “fun” is (B) or (C), not (A). But I think it’s (A).

Fun is complex and general/vague - it can be used to describe almost anything we derive pleasure from in your A.) or B.) categories.

Thanks!

I also (relatedly?) am pretty against trying to lump the brainstem / hypothalamus and the cortex / BG / etc. into a single learning-algorithm-ish framework.

To illustrate the difference between “pretrained learning algorithm” and “learning algorithm + other things that are not learning algorithms”:

By the same token, I think there are a lot of things happening in the brainstem / hypothalamus which we should describe as “a separate component from the RL algorithm”.

Sufficiently distant consequences is exactly what empowerment is for, as the universal approximator of long term consequences. Indeed the animals can't learn about that long term benefit through trial-and-error, but that isn't how most learning operates. Learning is mostly driven by the planning system 1 - M/P - which drives updates to V/A based on both current learned V and U - and U by default is primarily estimating empowerment and value of information as universal proxies.

[M/P is a typo for W/P right?]

Since I’m a smart adult human, I happen to know that:

it’s empowering for baby cats to practice pouncing,
it’s empowering for baby bats to practice arm-flapping,
it’s empowering for baby humans to practice grasping,
it’s not empowering for baby humans to practice arm-flapping,
it’s not empowering for baby bats to practice pouncing
etc.

some amount of play fighting skill knowledge is prior instinctual, but much of it is also learned

Sure, I agree.

The only part of this that requires a more specific explanation is perhaps the safety aspect of play fighting: each animal is always pulling punches to varying degrees, the cat isn't using fully extended claws, neither is biting with full force, etc. That is probably the animal equivalent of empathy/altruism.

Fun is complex and general/vague - it can be used to describe almost anything we derive pleasure from in your A.) or B.) categories.

Fair enough.

One of my disagreements with your U,V,P,W,A model is that I think V & W are randomly-initialized in animals. Or maybe I’m misunderstanding what you mean by “brains also can import varying degrees of prior knowledge into other components”.

I think we agree the cortex/cerebellum are randomly initialized, along with probably most of the hippocampus, BG, perhaps amygdala? and a few others. But those don't map cleanly to U, W/P, and V/A.

For example, I think most newborn behaviors are purely driven by the brainstem, which is doing things of its own accord without any learning and without any cortex involvement.

The moral is: I claim that figuring out what’s empowering is not a “local” / “generic” / “universal” calculation. If I do X in the morning, it is unknowable whether that was an empowering or disempowering action, in the absence of information about where I’m likely to find myself in the afternoon. And maybe I can make an intelligent guess at those, but I’m not omniscient. If I were a newborn, I wouldn’t even be able to guess.

The newborn's VoI and optionality value estimates will be completely different and focused on things like controlling flailing limbs and making sounds, moving the head, etc.

But I don’t know how the baby cats, bats, and humans are supposed to figure that out, via some “generic” empowerment calculation. Arm-flapping is equally immediately useless for both newborn bats and newborn humans, but newborn humans never flap their arms and newborn bats do constantly.

So yeah, it would be simple and elegant to say “the baby brain is presented with a bunch of knobs and levers and gradually discovers all the affordances of a human body”. But I don’t think that fits the data, e.g. the lack of human newborn arm-flapping experiments in comparison to bats.

Instead, I think baby humans have an innate drive to stand up, an innate drive to walk, an innate drive to grasp, and probably a few other things like that. I think they already want to do those things even before they have evidence (or other rational basis to believe) that doing so is empowering.

I claim that this also fits better into a theory where (1) the layout of motor cortex is relatively consistent between different people (in the absence of brain damage),

We've already been over that - consistent layout is not strong evidence of innate wiring. A generic learning system will learn similar solutions given similar inputs & objectives.

(2) decorticate rats can move around in more-or-less species-typical ways,

(3) there’s strong evolutionary pressure to learn motor control fast and we know that reward-shaping is certainly helpful for that,

It takes humans longer than an entire rat lifespan just to learn to walk. Hardly fast.

(4) and that there’s stuff in the brainstem that can do this kind of reward-shaping,

Sure, but there is hardly room in the brainstem to reward-shape for the different things humans can learn to do.

Universal capability requires universal learning.

(5) lots of animals can get around reasonably well within a remarkably short time after birth,

Not humans.

(6) stimulating a certain part of the brain can create “an urge to move your arm” etc. which is independent from executing the actual motion,

(If you put a novel and useful motor affordance on a baby human—some funny grasper on their hand or something—I’m not denying that they would eventually figure out how to start using it, thanks to more generic things like curiosity,

We know that humans can learn to see through their tongue - and this does not take much longer than an infant learning to see through its eyes.

I think we’re giving baby animals too much credit if we expect them to be thinking to themselves “gee when I grow up I might need to be good at fighting so I should practice right now instead of sitting on the comfy couch”. I claim that there isn’t any learning signal or local generic empowerment calculation that would form the basis for that

Thanks!

I'm not sure why you seem to think that I think of optionality-empowerment estimates as requiring anything resembling omniscience.

If we assume omniscience, it allows a very convenient type of argument:

Argument I [invalid]: Suppose an animal has a generic empowerment drive. We want to know whether it will do X. We should ask: Is X actually empowering?

However, if we don’t assume omniscience, then we can’t make arguments of that form. Instead we need to argue:

Argument II [valid]: Suppose an animal has a generic empowerment drive. We want to know whether it will do X. We should ask: Has the animal come to believe (implicitly or explicitly) that doing X is empowering?

I have the (possibly false!) impression that you’ve been implicitly using Argument I sometimes. That’s how omniscience came up.
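The difference between the two argument forms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table of "true" payoffs, the belief dictionaries, and the function names are all hypothetical, and the numbers are made up rather than taken from any data.

```python
# Hypothetical "adult observer" knowledge of which practice pays off later.
TRUE_EMPOWERMENT = {
    ("bat", "arm_flapping"): 1.0,
    ("human", "arm_flapping"): 0.0,
    ("human", "grasping"): 1.0,
}


def argument_one(species, action):
    """Argument I (invalid): predict behavior from what is actually empowering."""
    return TRUE_EMPOWERMENT.get((species, action), 0.0) > 0.0


def argument_two(beliefs, action):
    """Argument II (valid): predict behavior from what the animal has come to believe."""
    return beliefs.get(action, 0.0) > 0.0


# A newborn has had no chance to learn that any action is empowering.
newborn_beliefs = {}

# An older animal that has stumbled onto the payoff of flapping.
experienced_bat_beliefs = {"arm_flapping": 1.0}

print(argument_one("bat", "arm_flapping"))                    # uses the observer's knowledge
print(argument_two(newborn_beliefs, "arm_flapping"))          # uses only the newborn's beliefs
print(argument_two(experienced_bat_beliefs, "arm_flapping"))  # beliefs after experience
```

In this toy setup Argument I predicts that a newborn bat flaps, but only because the observer's knowledge was smuggled in. Argument II predicts no flapping until the belief exists, so a generic empowerment drive alone leaves newborn flapping unexplained.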

Back to play-fighting. A baby animal is sitting next to its sibling. It can either play-fight, or hang out doing nothing. (Or cuddle, or whatever else.) So why play-fight?

maximizing optionality automatically learns all motor skills - even up to bipedal walking

(When I say “more reliably”, I’m referring to the trope that programming RL agents is really finicky, moreso than other types of ML. I don’t really know if that trope is correct though.)

Sure, but there is hardly room in the brainstem to reward-shape for the [] different things humans can learn to do.

The convergence theorems basically say that optimizing for P[t] converges to optimizing for T[t+d] for some sufficient timespan d.

Excellent post btw.

Jacob responds: The distribution shift from humans born in 0AD to humans born in 2000AD seems fairly inconsequential for human alignment.

I now have additional questions. The above seems likely enough in the context of CEV (again), but otherwise false.

The above seems likely enough in the context of CEV (again), but otherwise false.

I think there might be a mix-up here. There are two topics of discussion:

One topic is: “We should look at humans and human values since those are the things we want to align an AGI to.”
The other topic is: “We should look at humans and human values since AGI learning algorithms are going to resemble human brain within-lifetime learning algorithms, and humans provide evidence for what those algorithms do in different training environments”.

The part of the post that you excerpted is about the latter, not the former.

