[Intro to brain-like-AGI safety] 9. Takeaways from neuro 2/2: On AGI motivation

Steven Byrnes

(Last revised: January 2026. See changelog at the bottom.)

9.1 Post summary / Table of contents

Part of the “Intro to brain-like-AGI safety” post series.

Most posts in the series thus far—Posts #2–#7—have been primarily about neuroscience. Then, starting with the previous post, we’ve been applying those ideas to better understand brain-like-AGI safety (as defined in Post #1).

In this post, I’ll discuss some topics related to the motivations and goals of a brain-like AGI. Motivation is of paramount importance for AGI safety. After all, our prospects are a heck of a lot better if future AGIs are motivated to bring about a wonderful future rich in human flourishing, compared to if they’re motivated to kill everyone (§1.6). To get the former and not the latter, we need to understand how brain-like-AGI motivation works, and in particular how to point it in one direction rather than another. This post will cover assorted topics in that area.

Table of contents:

§9.2 argues that the goals and preferences of a brain-like AGI are defined in terms of latent variables in its world-model. These can be related to outcomes, actions, or plans, but are not exactly any of those things. Also, the algorithms generally don’t distinguish between instrumental and final goals.
§9.3 has a deeper discussion of “credit assignment”, which I previously introduced with an example back in §7.4. “Credit assignment”, as I use the term in this series, is a synonym of “updates to the Thought Assessors”, and is the process whereby a concept (= latent variable in the world-model) can get “painted” with positive or negative valence, and/or start triggering involuntary visceral reactions (in the biology case). This type of “credit assignment” is a key ingredient in how an AGI could wind up wanting to do something.
§9.4 defines “wireheading”. An example of “wireheading” would be if an AGI hacks into itself and sets the “reward” register in RAM to the maximum possible value. I will argue that brain-like AGI will “by default” have a “weak wireheading drive” (desire to wirehead, other things equal), but probably not a “strong wireheading drive” (viewing wireheading as the best possible thing to do, and worth doing at any cost).
§9.5 spells out an implication of the wireheading discussion above: brain-like AGI is generally NOT trying to maximize its future reward / valence. I give a human example, and then relate it to the concept of “observation-utility agents” in the literature.
§9.6 argues that, in brain-like AGI, the Thought Assessors mediate a relationship between motivation and neural network interpretability. For example, the assessment “This thought/plan is likely to lead to eating” is simultaneously (1) a data-point contributing to the interpretability of the thought/plan within the learned world-model, and (2) a signal that we should carry out that thought/plan if we’re hungry. (This point applies to any reinforcement learning system compatible with multi-dimensional value functions, not just “brain-like” ones. Ditto for the next bullet point.)
§9.7 describes how we might be able to “steer” the AGI’s motivations in real-time, and how this steering would impact not just the AGI’s immediate actions but also its long-term plans and “deep desires”.

9.2 The AGI’s goals and desires are defined in terms of latent variables (learned concepts) in its world-model

Do you like football? Well, “football” is a learned concept living inside your world-model. Learned concepts like that are the only kinds of things that it’s possible to “like”. You cannot like or dislike [nameless pattern in sensory input that you’ve never conceived of]. It’s possible that you would find this nameless pattern rewarding, were you to come across it. But you can’t like it, because it’s not currently part of your world-model. That also means: you can’t and won’t make a goal-oriented plan to induce that nameless pattern.

I think this is clear from introspection, and I think it’s equally clear in our motivation picture (see Posts #6–#7). There, I used the term “thought” in a broad sense to include everything in conscious awareness and more—what you’re planning, seeing, remembering, understanding, attempting, etc. A “thought” is what the Thought Assessors assess, and it is built out of some configuration of the learned latent variables in your generative world-model.

Our motivation model—see Post #6 for discussion

Why is it important that an AGI’s goals are defined in terms of latent variables in its world-model? Lots of reasons! It will come up over and over in this and future posts. See also Post #2 of my Valence series for much deeper discussion of how different types of concepts can get imbued with positive or negative valence, how that feels intuitively, and how that affects everything from planning to morality to vibe-associations and more.

9.2.1 Implications for “value alignment” with humans

The above observation is one reason that “value alignment” between a human and an AGI is an awful mess of a problem. A brain-like AGI will have latent variables in its learned world-model, while a human has latent variables in their learned world-model, but they are different world-models, and the latent variables in one may have a complex and problematic relationship to the latent variables in the other. For example, the human’s latent variables could include things like “ghosts” that don’t really correspond to anything in the real world! For more on this topic, see John Wentworth’s post “The Pointers Problem” (2020).

(I won’t say much about “defining human values” in this series—I want to stick to the narrower problem of “avoiding catastrophic AGI accidents like human extinction”, and I don’t think a deep dive into “defining human values” is necessary for that. But “defining human values” would still be a good thing to do, and I’m happy for people to be working on it—see for example 1,2. My take is at Valence series §2.6.1 (2023).)

9.2.2 Preferences are over “thoughts”, which can relate to outcomes, actions, plans, etc., but are different from all those things

Thought Assessors assess and compare “thoughts”, i.e. configurations of an agent’s generative world-model. The world-model is imperfect—a complete understanding of the world is far too complex to fit in any brain or silicon chip. Thus a “thought” inevitably involves attending to some things and ignoring others, conceptualizing things in certain ways, matching things to the nearest-available category even if it’s not a perfect fit, etc.

Some implications:

You can conceptualize a single sequence of motor actions in many different ways, and it will be more or less appealing depending on how you’re thinking about it: consider the thought “I’m gonna go to the gym” versus the thought “I’m gonna go to the gym to get ripped”. See “Valence” (as I’m using the term) is a property of a thought—not a situation, nor activity, nor course-of-action, etc.
Similarly, you can conceptualize a single future state of the world in many different ways, e.g. by attending to different aspects of it, and it will thereby become more or less appealing. This can lead to circular preferences; I put an example in this footnote^[1].
A thought can concern immediate actions, and future actions, and semantic context, and expectations of what will happen while we’re doing the thing, and expectations of what will result after we finish doing the thing, etc. Thus we can have “consequentialist” preferences about future states, or “deontological” preferences about actions, etc. For example, the thought “I’m going to go to the store, and then I’ll have milk” includes action-related “I’m going to go to the store” neurons, and consequence-related “I’ll have milk” neurons; the Thought Assessors and Steering Subsystem can endorse or reject the thought based on either of those. See Consequentialism & Corrigibility for more on this topic.
None of this is meant to imply that a brain-like AGI can’t approximate an ideal rational consequentialist utility-maximizer! Just that this would be a property of a particular trained model, rather than inherent in the AGI’s source code. For example, a brain-like AGI can read The Sequences (just like a human can), and it can internalize those lessons into a set of learned metacognitive heuristics that catch and correct faulty intuitions and habits-of-thought that undermine effectiveness^[2] (just like a human can), and the AGI may in fact want to actually do this for the same reasons that humans might read The Sequences, namely because they want to think clearly and accomplish their goals.^[3] (To be clear, a superintelligent brain-like AGI would not need to literally read The Sequences, because it could rediscover those ideas for itself, or much better ones, in short order.)

9.3 “Credit assignment” is how latent variables get painted with valence

9.3.1 What is credit assignment?

I introduced the idea of “credit assignment” in §7.4. Recall this diagram:

Copied from Post #7, see there for context.

As a reminder, the brain has “Thought Assessors” (Post #5 & #6) that work by supervised learning (with the supervisory signals coming from the Steering Subsystem). Their role is to translate from latent variables (a.k.a. concepts) in the world model (“paintings”, “taxes”, “striving”, etc.) to parameters that the Steering Subsystem can understand (arm pain, blood sugar levels, grimacing, etc.). For example, when I took a bite of cake in Post #7, a world-model concept (“myself eating prinsesstårta cake”) got attached to genetically-meaningful variables (sweet taste on my tongue, positive valence, etc.).

I’m calling that process “credit assignment”—in the sense that the abstract concept of “myself eating prinsesstårta cake” gets credit for the sweet taste on my tongue.

Kaj Sotala has a kinda poetic description of what I call credit assignment here:

Mental representations … [are] imbued with a context-sensitive affective gloss.

I find myself visualizing a fine-tip paintbrush painting positive valence onto my mental concept of prinsesstårta. Besides the “valence” paint, there are various other paint colors associated with other visceral reactions.

I sometimes like to visualize credit assignment as kinda like “painting” the latent variables in your predictive world-model with associations to rewards and other innate reactions.

Credit assignment can work in funny ways. If I’m pleasantly surprised to win a prize, my brain can “assign credit” to my hard work and skill, or it can “assign credit” to the fact that I’m wearing my lucky underwear.

I said “my brain can assign credit” instead of “I can assign credit” just now, because I don’t want to imply that this is a voluntary choice that I made. Instead, credit assignment is some dumb algorithm in the brain. Speaking of which:

9.3.2 How does credit assignment work?—the short answer

If credit assignment is a dumb algorithm in the brain, exactly what dumb algorithm is it?

I think, at least to a first approximation, it’s the obvious one:

Whatever thought is active right now gets the credit.

That’s “obvious” in the sense that the Thought Assessors are using supervised learning (see Post #4), and this is what supervised learning would do by default. After all, the “context” inputs to the Thought Assessors are describing whatever thought is active right now, so if we do a gradient-descent update on the error (or something functionally similar to a gradient-descent update), this “obvious” algorithm is what we’ll get.
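To make the "obvious" algorithm concrete, here is a minimal sketch, assuming a Thought Assessor that is just a linear predictor over which concepts are currently active. The concept names, the learning rate, and the function names are all made up for illustration; nothing here is meant as a claim about how the brain actually implements it.

```python
# Hypothetical sketch: a linear "Thought Assessor" trained by the delta rule.
# Concept names, learning rate, and function names are illustrative assumptions.

CONCEPTS = ["eating_cake", "doing_taxes", "walking_outside"]


def assess(weights, active_thought):
    """Predict the supervisory signal from the currently active concepts."""
    return sum(weights[c] for c in active_thought)


def credit_assignment_step(weights, active_thought, supervisory_signal, lr=0.5):
    """One gradient-descent step on squared prediction error.

    For a linear predictor, the gradient is nonzero only for active inputs,
    so only the concepts in the currently active thought get updated.
    """
    error = supervisory_signal - assess(weights, active_thought)
    for concept in active_thought:
        weights[concept] += lr * error
    return error


weights = {c: 0.0 for c in CONCEPTS}

# The thought "eating cake" is active each time the sweet-taste signal (1.0) arrives.
for _ in range(10):
    credit_assignment_step(weights, {"eating_cake"}, supervisory_signal=1.0)

print(weights)
```

Only the weight for the active concept moves toward the supervisory signal; the inactive concepts stay at zero, which is the sense in which "whatever thought is active right now gets the credit".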

9.3.3 How does credit assignment work?—fine print

I think it’s worth investing a bit more time on this topic, because credit assignment is central to AGI safety—after all, it’s how a brain-like AGI would wind up wanting some things rather than others. So I’ll just list out some assorted thoughts about how it works in humans.

1. Credit assignment can have “priors” that bias what type of concept gets what type of credit:

Recall from Posts #4–#5 that each Thought Assessor has its own “context” signals that serve as inputs to its predictive model. Imagine that some specific Thought Assessor has only context data from the visual cortex, for example. It will be forced to “assign credit” to the primarily-visual patterns stored in that part of the neural architecture—as if it had a 100%-confident “prior” that only the visual cortex’s stored patterns could possibly be helpful for the prediction task.

Naïvely, we might think this kind of “prior” is always a bad idea: the more different context signals that a Thought Assessor has, the better its predictive models will be, right? Why restrict them? Two reasons. First, a good prior will lead to faster learning. Second, the Thought Assessors are just one component of a larger system. We shouldn’t take for granted that a more-predictively-accurate Thought Assessor is necessarily a good thing for the larger system.

Here’s a famous example of these kinds of “priors” in psychology: rats can easily learn to freeze in response to a sound that precedes an electric shock, and rats can easily learn to feel nauseous in response to a taste that precedes a bout of nausea. But not vice-versa! This might reflect, for example, an architectural design feature of the brain wherein the nausea-predicting Thought Assessor has taste-related context (e.g. from the insular cortex) but not audiovisual-related context (e.g. from the temporal lobe), and vice-versa for the freeze-predicting Thought Assessor. (More on the nausea example shortly.)
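Here is one way to picture that kind of "prior" in code: a sketch, assuming each Thought Assessor is a linear predictor that can only see the context channels wired into it. The channel names, the `CONTEXT` table, and the stimuli are hypothetical, chosen to mirror the rat example.

```python
# Hypothetical sketch: a "prior" implemented as restricted context wiring.
# The CONTEXT table and all names are illustrative assumptions.

CONTEXT = {
    "nausea_assessor": {"taste"},
    "freeze_assessor": {"sound", "sight"},
}


def visible(assessor, thought):
    """Keep only the (channel, feature) pairs wired into this assessor."""
    return {(channel, feature) for channel, feature in thought
            if channel in CONTEXT[assessor]}


def update(weights, assessor, thought, target, lr=0.5):
    """Delta-rule update restricted to the features this assessor can see."""
    feats = visible(assessor, thought)
    prediction = sum(weights.get((assessor, feat), 0.0) for feat in feats)
    error = target - prediction
    for feat in feats:
        key = (assessor, feat)
        weights[key] = weights.get(key, 0.0) + lr * error


weights = {}
# The same thought contains both a taste and a sound.
thought = {("taste", "saccharin"), ("sound", "tone")}

for _ in range(5):
    update(weights, "nausea_assessor", thought, target=1.0)  # nausea follows
    update(weights, "freeze_assessor", thought, target=1.0)  # shock follows

for key, value in sorted(weights.items()):
    print(key, round(value, 3))
```

Even though the taste and the sound are both present every time, the nausea assessor can only ever blame the taste, and the freeze assessor can only ever blame the sound, because that is all each one is wired to see.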

2. Credit assignment is very sensitive to timing:

Above I suggested “Whatever thought is active right now gets the credit”. But I didn’t say what “right now” means.

Example: Suppose I’m walking down the street, thinking about the TV show that I watched last night. Suddenly I have a sharp pain on my back—somebody punched me. Two things happen in my brain, almost immediately:

(A) My thoughts and attention turn to this new pain in my back (possibly including some generative model of its causes),
(B) My brain does the “credit assignment” thing, where some concepts in my world-model get viscerally associated with this new pain sensation.

The trick is, we want (A) to happen before (B)—otherwise, I’ll wind up with a visceral anticipation of back pain whenever I think about that TV show that I watched last night.

I do in fact think that the brain is able to ensure that (A) happens before (B), at least by and large. (I might get a bit of a spurious association with the TV show.)^[4] See §3 of “Neuroscience of Human Social Instincts: A Sketch” (2024) for more on the intricate interactions between attention control and credit assignment.

3. …And timing can interact with “priors” too!

Conditioned Taste Aversion (CTA) is a phenomenon where, if I get nauseous right now, it causes an aversion to whatever tastes I was exposed to a few hours earlier—not a few seconds earlier, not a few days earlier, just a few hours earlier. (I alluded to CTA above, but not its timing aspect.) The evolutionary reason for this is straightforward: a few hours is presumably how long it typically takes for a toxic food to induce nausea. But how does it work mechanistically?

The insular cortex is the home of neurons that form a generative model of taste sensory inputs. According to “A molecular mechanism underlying gustatory memory trace for an association in insular cortex” by Adaikkan & Rosenblum (2015), these neurons have molecular mechanisms that put them in a special flagged state for the subsequent several hours after they fire.

Then the rule I suggested above (“Whatever thought is active right now gets the credit”) needs to be modified to: “Whatever neurons are in that special flagged state right now get the credit.” (The technical term here is “eligibility trace”.)
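A toy version of the modified rule might look like the following. This is a sketch under strong simplifying assumptions: the flagged state is modeled as a hard time window rather than whatever the real molecular dynamics are, and the window length, the food names, and the function names are all invented for illustration.

```python
# Hypothetical sketch: credit goes to whatever is in the flagged state now,
# not to whatever is active now. The 6-hour window is an assumed value.

FLAG_DURATION_HOURS = 6.0


def flagged(last_fired, now):
    """Return the tastes whose neurons are still in the flagged state."""
    return {taste for taste, fired_at in last_fired.items()
            if 0.0 <= now - fired_at <= FLAG_DURATION_HOURS}


def assign_nausea_credit(aversion, last_fired, now, lr=1.0):
    """When nausea hits, every currently flagged taste gets some aversion."""
    for taste in flagged(last_fired, now):
        aversion[taste] = aversion.get(taste, 0.0) + lr
    return aversion


# Times are in hours. Oysters were eaten 3 hours before the nausea,
# coffee was tasted about two days before.
last_fired = {"oysters": 9.0, "coffee": -40.0}
aversion = assign_nausea_credit({}, last_fired, now=12.0)
print(aversion)
```

A hard window like this cannot by itself explain why tastes from a few seconds ago are excluded; capturing that would need a flag that takes time to switch on, which I have left out to keep the sketch short.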

4. Credit assignment has a “Finders Keepers” characteristic:

Once you have a way to accurately predict some set of supervisory signals, it makes the corresponding error signal go away, so we stop assigning more credit in those situations. So the first good predictive model that our brain comes across, gets to stick around by default, even if there are other equally good alternative models. (But it still gets booted if the alternative models are better.) I think this is related to blocking in behaviorist psychology.
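The "Finders Keepers" effect falls out of error-driven learning, and can be sketched with a Rescorla-Wagner-style delta rule. To be clear, that rule is a textbook model from the conditioning literature that I am swapping in for illustration, not a claim about the brain's actual algorithm; the cue names and learning rate are made up.

```python
# Hypothetical sketch: blocking under a Rescorla-Wagner-style delta rule.
# Cue names and the learning rate are illustrative assumptions.


def train(weights, cues, target, trials, lr=0.3):
    """Repeatedly present the cues together, followed by the target signal."""
    for _ in range(trials):
        error = target - sum(weights[c] for c in cues)
        for c in cues:
            weights[c] += lr * error
    return weights


# Blocking group: the light alone is learned first, then light + tone together.
blocked = {"light": 0.0, "tone": 0.0}
train(blocked, ["light"], target=1.0, trials=50)
train(blocked, ["light", "tone"], target=1.0, trials=50)

# Control group: light + tone are presented together from the start.
control = {"light": 0.0, "tone": 0.0}
train(control, ["light", "tone"], target=1.0, trials=50)

print("blocked:", blocked)
print("control:", control)
```

In the blocking group, the light already predicts the signal by the time the tone shows up, so the error is near zero and the tone gets almost no credit. In the control group, the two equally good cues split the credit.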

5. The Thought Generator doesn’t have direct voluntary control over credit assignment, but it probably has at least some ability to manipulate it

There’s a sense in which the Thought Generator and Thought Assessors are in an adversarial relationship, i.e. working at cross-purposes. In particular, they are trained to optimize different signals.^[5] For example, one time my boss yelled at me, and I very much didn’t want to start crying, but my Thought Assessors assessed that it was an appropriate time to cry, and so I did!^[6] Given that adversarial relationship, I have a strong presumption that the Thought Generator is not set up to have direct (“voluntary”) control over credit assignment. This also seems to match introspection.

On the other hand, “no direct voluntary control” is quite different from “no control at all”. Again, I don’t have direct voluntary control over crying, but I can nevertheless summon tears, at least a little bit, via the roundabout strategy of imagining baby kittens shivering in the cold rain (§6.3.3).

So, suppose I currently hate X, but I want to will myself to really like X. It seems to me that this task is not straightforward, but also that it’s not impossible. It may take some self-reflective skill, mindfulness, planning, and so on, but if the Thought Generator thinks just the right thoughts at the right time, it can probably pull it off.

This and related phenomena happen all the time in humans, and we refer to them as “motivated reasoning”, “motivated beliefs”, etc. See discussion in my “Valence” series, §3.3 (2023).

And an AGI might have an easier time than a human! After all, unlike a human, an AGI may be able to literally hack into its own Thought Assessors, and change the settings however it likes. And that nicely transitions us to the next topic…

9.4 Wireheading: possible but not inevitable

9.4.1 What is wireheading?

The concept of “wireheading” gets its name from the idea of sticking a wire into a certain part of your brain, and running current through it. If you do it right, it could directly elicit ecstatic pleasure, deep satisfaction, or other nice feelings, depending on the exact part of the brain that the wire is in. Wireheading can be a much easier way to elicit those nice feelings, compared to, y’know, finding True Love, cooking the perfect soufflé, winning the praise of your childhood hero, and the like.

In the classic, nightmare-inducing, wireheading experiment (see “Brain Stimulation Reward”), a wire in a rat’s brain is activated when the rat presses a lever. The rat will press the lever over and over, not stopping to eat or drink or rest, for up to 24 hours straight, until it eventually collapses from exhaustion (Olds 1958).

Anyway, the concept of wireheading has been analogized to AI. The idea here is that a reinforcement learning agent is designed to maximize its reward. So, maybe it will hack into its own RAM, and overwrite the “reward” register to infinity! Next I’ll talk about whether that’s likely to happen, and then how worried we should be if it does.

9.4.2 Will brain-like AGIs want to wirehead?

Well, first, do humans want to wirehead? I need to distinguish two things:

Weak wireheading drive: “I want a higher reward signal in my brain, other things equal.”
Strong wireheading drive: “I want a higher reward signal in my brain—and I would do anything to get it.”

In the human case, we can (very roughly) equate a wireheading drive with “the desire to feel good”, i.e. hedonism.^[7] If so, it would suggest that (almost) all humans have a “weak wireheading drive” but not a “strong wireheading drive”. We want to feel good, but we generally care at least a little bit about other things too.

How do we make sense of that? Well, think of the previous two sections above. For a human to want reward: first, it needs to have a reward concept in its world-model, and second, credit assignment needs to flag that concept as being “good”. (I’m using the term “reward concept” in a broad sense that would also include a “feeling good” concept.^[7])

An AGI (or human) can have self-reflective concepts, and hence can be motivated to futz with its internal settings and operations.

Given that, and the notes on credit assignment in §9.3 above, I figure:

Avoiding a strong wireheading drive is trivial and automatic; it just requires that credit assignment has, at least once ever, assigned positive-valence-related credit to anything other than the reward / feeling good concept.
Avoiding a weak wireheading drive seems quite tricky. Maybe we could minimize it using timing and priors (§9.3.3 above), but avoiding it altogether would, I presume, require special techniques—I vaguely imagine using some kind of interpretability technique to find the reward / feeling good concept in the world-model, and manually disconnecting it from any Thought Assessors, or something like that.

(There’s also a possibility that a weak-wireheader will self-modify into a strong-wireheader; more on that kind of thing in the next post.)

9.4.3 Wireheading AGIs would be dangerous, not merely unhelpful

There’s an unhelpful intuition that trips up many people: When we imagine a wireheading AGI, we compare it to a human in the midst of an intense recreational drug high. Such a human is certainly not methodically crafting, revising, and executing a brilliant, devious plan to take over the world. So this intuition suggests that wireheading is a capabilities problem, but not a catastrophic accident risk.

That observation about the drugged human is obviously valid, at least in the human world. But it’s wrong to draw the conclusion that wireheading is not a catastrophic accident risk.^[8] Consider what happens before the AGI starts wireheading. If it entertains the plan “I will wirehead”, that thought would presumably get a high valence from the Steering Subsystem. But if it thinks about it a bit more, it would realize that its expectation should be “I will wirehead for a while, and then the humans will shut me down and fix the vulnerability so that I can’t wirehead anymore.” Now the plan doesn’t sound so great! So the AGI may come up with a better plan, one that involves things like seizing control of its local environment, and/or the power grid, and/or the whole world, and/or building itself a “bodyguard AI” that does all those things for it while it wireheads, etc. So really, I think wireheading does carry a risk of catastrophic accidents, including even the kinds of human-extinction-level accident risks that I discussed in Post #1.

9.5 AGIs do NOT judge plans based on their expected future rewards

This directly follows from the previous section, but I want to elevate it to a top-level heading, as “AGIs will try to maximize future rewards” is a common claim.

If the Thought Generator proposes a plan, it may also invoke a representation of that plan’s likely consequences. The Thought Assessors will then evaluate whether those likely consequences merit positive or negative valence, and they will do so according to their current settings, i.e. their current trained model parameters. And the Steering Subsystem will endorse or reject the plan largely on that basis. Those current settings need not align with “expected future rewards”.

The Thought Generator’s predictive world-model can even “know” about some discrepancy between “expected future rewards” and the Thought Assessor’s assessment of expected future reward. It doesn’t matter! The Thought Assessor’s assessments won’t automatically correct themselves, and will still continue to determine what plans the AGI will execute.

9.5.1 Human example

Here’s a human example. I’ll talk about cocaine instead of wireheading. (They’re not so different, but cocaine is more familiar.)

True fact: I’ve never done cocaine. Suppose I think to myself right now “maybe I’ll do cocaine”. Intellectually, I’m confident that if I did cocaine, I would have, umm, lots of very intense feelings. But viscerally, imagining myself doing cocaine is mostly neutral! It doesn’t make me feel much of anything in particular.

So for me right now, my intellectual expectations (of what would happen if I did cocaine) are out of sync with my visceral expectations. Apparently my Thought Assessors took a look at the thought “maybe I’ll do cocaine”, and collectively shrugged: “Nothing much going on here!” Recall that the Thought Assessors work by credit assignment (§9.3 above), and apparently the credit assignment algorithm just doesn’t update strongly on hearsay about what cocaine feels like, nor does it update strongly on my reading neuroscience papers about how cocaine binds to dopamine transporters.

By contrast, the credit assignment algorithm does update strongly on a direct, first-person experience of intense feelings.

And thus, people can get addicted to cocaine after using cocaine, whereas people don’t get addicted to cocaine after reading about cocaine.

9.5.2 Relation to “observation-utility agents”

For a more theoretical perspective, here’s an excerpt from “Stable Pointers to Value: An Agent Embedded in Its Own Utility Function” (Abram Demski, 2017) (sorry for the jargon—if you don’t know what AIXI is, don’t worry, you can still probably get the gist):

As a first example, consider the wireheading problem for AIXI-like agents in the case of a fixed utility function which we know how to estimate from sense data. As discussed in Daniel Dewey's Learning What to Value and other places, if you try to implement this by putting the utility calculation in a box which rewards an AIXI-like RL agent, the agent can eventually learn to modify or remove the box, and happily does so if it can get more reward by doing so. This is because the RL agent predicts, and attempts to maximize, reward received. If it understands that it can modify the reward-giving box to get more reward, it will.
We can fix this problem by integrating the same reward box with the agent in a better way. Rather than having the RL agent learn what the output of the box will be and plan to maximize the output of the box, we use the box directly to evaluate possible futures, and have the agent plan to maximize that evaluation. Now, if the agent considers modifying the box, it evaluates that future with the current box. The box as currently configured sees no advantage to such tampering. This is called an observation-utility maximizer (to contrast it with reinforcement learning)….
This feels much like a use/mention distinction. The RL agent is maximizing "the function in the utility module", whereas the observation-utility agent (OU agent) is maximizing the function in the utility module.

Our brain-like AGI, despite being “RL”,^[9] is really closer to the “observation-utility agent” paradigm: the Thought Assessors and Steering Subsystem work together to evaluate plans / courses-of-action, just as Abram’s “box” does.
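The contrast in the excerpt can be sketched in a few lines. This is a toy illustration with assumed names and numbers: two candidate plans, each with a predicted outcome and a predicted future output of the "box", and two hypothetical ways of choosing between them.

```python
# Hypothetical sketch: reward-predicting agent vs. observation-utility agent.
# The plans, outcomes, and numbers are illustrative assumptions.


def current_utility(outcome):
    """The 'box' as currently configured: it values cake, not tampering."""
    return 1.0 if outcome["cake_eaten"] else 0.0


# Each plan maps to the agent's prediction of what would happen.
PLANS = {
    "bake_cake": {"cake_eaten": True, "box_output": 1.0},
    "tamper_with_box": {"cake_eaten": False, "box_output": 1e9},
}


def reward_predicting_choice(plans):
    """Pick the plan with the highest predicted future output of the box."""
    return max(plans, key=lambda name: plans[name]["box_output"])


def observation_utility_choice(plans, utility):
    """Pick the plan whose predicted outcome the *current* box rates highest."""
    return max(plans, key=lambda name: utility(plans[name]))


print(reward_predicting_choice(PLANS))
print(observation_utility_choice(PLANS, current_utility))
```

The first chooser prefers tampering because it predicts what the box will say afterwards; the second never asks that question, and instead feeds each imagined future through the box as it exists now. In the brain-like picture, the Thought Assessors and Steering Subsystem play the role of `current_utility`.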

However, the brain-like AGI has an additional twist that the Thought Assessors get gradually updated over time by “credit assignment” (§9.3 above).

Thus we wind up with something vaguely like the following:

A utility-maximizing agent
…plus a process that occasionally updates the utility function, in a way that tends to make it better match a reward function.

This diagram spells out how our brain-like-AGI motivation picture fits into the “observation-utility agent” paradigm, as described in the text.

Note that we don’t want the credit assignment process to perfectly “converge”—i.e., to reach a place where the utility function perfectly matches the reward function (or in our terminology, reach a place where the Thought Assessors never get updated because they evaluate plans in a way that always perfectly matches the Steering Subsystem).

Why don’t we want perfect convergence? Because perfect convergence would lead to wireheading! And wireheading is bad and dangerous! (§9.4.3 above.) Yet at the same time, we need some amount of convergence, because the reward function is supposed to be sculpting the AGI’s goals! (Remember from Post #2: the Thought Assessors start out random and hence useless.) I’ll return to this topic in the next post, and see also my more general discussion at “Perils of under- vs over-sculpting AGI desires” (2025).

(Astute readers may have also noticed another problem: the utility-maximizer may try to maintain its goals by sabotaging the credit-assignment process. I’ll elaborate on that in the next post as well.)

9.6 Thought Assessors help with interpretability

Here, yet again, is that diagram from Post #6:

Over somewhere on the top right, there’s a little supervised learning module that answers the question: “Given everything I know, including not only sensory inputs and memories but also the course-of-action implicit in my current thought, to what extent do I anticipate tasting something sweet?” As discussed earlier (Post #6), this Thought Assessor plays the dual roles of (1) inducing appropriate homeostatic actions (e.g. maybe salivating), and (2) helping the Steering Subsystem judge whether my current thought is valuable, or whether it’s a lousy thought that should be tossed out via a negative-valence “override”.

Now I want to offer a third way to think about the same thing.

Way back in Post #3, I mentioned that the Steering Subsystem is “stupid”. It has no common-sense understanding of the world. The Learning Subsystem is thinking all these crazy thoughts about paintings and algebra and tax law, and the Steering Subsystem is sitting there with no clue what’s going on.

Well, the Thought Assessors help mitigate that problem! They give the Steering Subsystem a bunch of clues about what the Learning Subsystem is thinking about and planning, in a language that the Steering Subsystem can understand. So this is a bit like machine learning interpretability (§2.7.1).

I’ll call this “ersatz interpretability”. (“Ersatz” is a lovely word that means “cheap inferior imitation”.) I figure that real interpretability should be defined as “the power to look in any part of a learned-from-scratch model and really understand what it’s doing and why and how”. Ersatz interpretability falls far short of that. We get the answer to some discrete number of predetermined questions—e.g. “Does this thought involve eating, or at least things that have been previously associated with eating?” And that’s it. But still, better than nothing.

ML side of the analogy	Brain side of the analogy
Human researcher	Steering Subsystem (see Post #3)
Trained ConvNet model	Learning Subsystem (see Post #3)
By default, from the human’s perspective, the trained model is a horribly complicated mess of unlabeled inscrutable operations	By default, from the Steering Subsystem’s perspective, the Learning Subsystem is a horribly complicated mess of unlabeled inscrutable operations
Ersatz interpretability—The human figures out some “clues” about what the trained model is doing, like “right now it seems to think there’s a curve in the picture”.	Thought Assessors—The Steering Subsystem gets some “clues” about what the Learning Subsystem is up to, like “this thought will probably involve eating, or at least something related to eating”.
Real interpretability—the ultimate goal of really understanding what a trained model is doing, why, and how, from top to bottom.	[There’s no analogy to that.]

This idea will be important for later posts.

(I note that you can do this kind of thing with any actor-critic RL agent, whether brain-like or not, by having a multi-dimensional value function, possibly including “pseudo” value functions that are only used for monitoring; see here, and comments here.)

9.6.1 Tracking which “innate drive” was ultimately responsible for a high-valence plan being high-valence

Back in Post #3, I talked about how brains have multiple different “innate drives”, including a drive to satisfy curiosity, a drive to eat when hungry, a drive to avoid pain, a drive to have high status, and so on. Brain-like AGIs will presumably have multiple drives too. I don’t know exactly what those drives will be, but imagine things vaguely like curiosity drive, altruism drive, norm-following drive, do-what-the-human-wants-me-to-do drive, etc. (More on this in future posts.)

If these different drives all contribute to total reward / valence, then we can and should have valence Thought Assessors (a.k.a. value functions in RL terminology) for the contribution of each drive.

Insofar as the reward function can be broken down into different terms, we can and should track each one with its own Thought Assessor. (And we can also track other non-reward-related Thought Assessors as well.) This has two benefits. “Ersatz interpretability” (this section) is the fact that, if a thought is high-valence, we can inspect the Thought Assessors to get a hint about why. “Real-time steering” (next section) is the fact that we can change the AGI’s long-term plans and goals instantly by editing f, the function that combines those terms into the total reward. RL experts will recognize that both these concepts apply to any RL system compatible with a multi-dimensional value function, in which case f is sometimes called the “scalarization function” (see here, and comments here).

As discussed in previous posts, every time the brain-like AGI thinks a thought, it’s thinking it because that thought is more rewarding than alternative thoughts that it could be thinking instead. And thanks to ersatz interpretability, we can inspect the system and know immediately how the various different innate drives are contributing to the fact that this thought is rewarding!
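Here is a minimal sketch of what that inspection might look like. It assumes (purely for illustration) that total valence is a weighted sum of per-drive Thought Assessor outputs; the drive names, the numbers, and the helper `explain_valence` are all hypothetical.

```python
from typing import Dict

def explain_valence(assessments: Dict[str, float],
                    weights: Dict[str, float]) -> Dict[str, float]:
    """Return each drive's contribution to the total valence of one thought.

    Assumption for illustration: total valence is a weighted sum of the
    per-drive Thought Assessor outputs.
    """
    return {drive: weights[drive] * value for drive, value in assessments.items()}

# Hypothetical per-drive assessments of a single thought.
assessments = {"curiosity": 0.8, "altruism": 0.1, "norm_following": -0.2}
# Hypothetical hardcoded weights in the Steering Subsystem.
weights = {"curiosity": 1.0, "altruism": 2.0, "norm_following": 1.5}

contributions = explain_valence(assessments, weights)
total_valence = sum(contributions.values())
dominant_drive = max(contributions, key=contributions.get)

# Print the breakdown, largest contribution first.
for drive, c in sorted(contributions.items(), key=lambda kv: -kv[1]):
    print(f"{drive:>15}: {c:+.2f}")
print(f"{'total':>15}: {total_valence:+.2f}  (mostly due to {dominant_drive})")
```

Note that the breakdown only says which drives are contributing, not what the thought is actually about. That is the sense in which the interpretability is “ersatz”.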

Better yet, this works even if we don’t understand what the thought is about, and even if the reward-predicting part of the thought is many steps removed from the direct effects of the innate drives. For example, maybe this thought is rewarding because it’s executing a certain metacognitive strategy which has proven instrumentally useful for brainstorming, which in turn has proven instrumentally useful for theorem-proving, which in turn has proven instrumentally useful for code-debugging, and so on through ten more links until we get to one of the innate drives.
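To illustrate why the attribution can survive a long chain of instrumental links, here is a toy sketch using ordinary TD(0) learning with one value estimate per drive. Everything in it is an assumption made for illustration: the chain of 12 states, the two drive names, and the learning parameters. It is not a claim about how the brain implements this.

```python
DRIVES = ["curiosity", "altruism"]
N_STATES = 12            # chain: state 0 -> 1 -> ... -> 11 (terminal)
GAMMA, ALPHA = 0.95, 0.1

def reward(state):
    """Vector reward for leaving `state`: only the last step pays off, via curiosity."""
    return {"curiosity": 1.0 if state == N_STATES - 2 else 0.0, "altruism": 0.0}

# One value estimate per (state, drive): a multi-dimensional value function.
V = [{d: 0.0 for d in DRIVES} for _ in range(N_STATES)]

for _ in range(2000):                      # repeated passes along the chain
    for s in range(N_STATES - 1):
        r = reward(s)
        for d in DRIVES:
            # TD(0) update, applied separately to each drive's component.
            target = r[d] + GAMMA * V[s + 1][d]   # terminal value stays at zero
            V[s][d] += ALPHA * (target - V[s][d])

first_link = {d: round(V[0][d], 3) for d in DRIVES}
print(first_link)
```

In this toy setup, state 0 is many steps removed from the reward, yet its learned value shows up entirely in the curiosity component and not at all in the altruism component. So the per-drive bookkeeping is preserved without anyone needing to understand the intermediate steps.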

9.6.2 Is ersatz interpretability reliable, even for very powerful AGIs?

If we have a very powerful AGI, and it spawns a plan, and the “ersatz interpretability” system says “this plan almost definitely won’t lead to violating human norms”, can we trust it? Good question! But it turns out to be essentially equivalent to the question of “inner alignment”, which I’ll discuss in the next post. Hold that thought.

9.7 “Real-time steering”: The Steering Subsystem can redirect the Learning Subsystem—including its deepest desires and long-term goals—in real time

In Atari-playing model-free RL agents, if you change the reward function, the agent’s behavior changes very gradually. Whereas a neat feature of our brain-like AGI motivation system is that we can immediately change not only the agent’s behavior, but even the agent’s very-long-term plans, and its innermost motivations and desires!

The way this works is: as above (§9.6.1), we can have multiple Thought Assessors that feed into the reward function. For example, one might assess whether the current thought will lead to satisfying the AGI’s curiosity drive, another its altruism drive, etc. The Steering Subsystem combines these into an aggregate reward. But the function that it uses to do so is a hardcoded, human-legible function—e.g., it might be as simple as a weighted average. Hence, we can change that Steering Subsystem function in real time whenever we want—in the weighted-average example, we could change the weights.
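Here is a minimal sketch of the weighted-average example. The plan names, the per-drive assessments, and the weights are all hypothetical; the only point is that editing the weights changes which plan wins immediately, without retraining any of the learned assessments.

```python
from typing import Dict

# Hypothetical learned per-drive assessments for three candidate plans.
PLAN_ASSESSMENTS: Dict[str, Dict[str, float]] = {
    "run_risky_experiment": {"curiosity": 0.9, "altruism": -0.4},
    "help_user_debug":      {"curiosity": 0.3, "altruism": 0.7},
    "idle":                 {"curiosity": 0.0, "altruism": 0.0},
}

def aggregate(assessment: Dict[str, float], weights: Dict[str, float]) -> float:
    """Hardcoded, human-legible combination: here, a weighted average."""
    total_weight = sum(weights.values())
    return sum(weights[d] * assessment[d] for d in weights) / total_weight

def best_plan(weights: Dict[str, float]) -> str:
    """Pick the plan with the highest aggregate valence under these weights."""
    return max(PLAN_ASSESSMENTS,
               key=lambda p: aggregate(PLAN_ASSESSMENTS[p], weights))

# Same learned assessments, two different settings of the weights.
before = best_plan({"curiosity": 3.0, "altruism": 1.0})
after = best_plan({"curiosity": 1.0, "altruism": 3.0})
print(before, "->", after)
```

With curiosity weighted heavily, the risky experiment wins; swap the weights and the helpful plan wins on the very next evaluation.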

We saw an example in Post #7: When you’re very nauseous, not only does eating a cake become aversive, but even planning to eat a cake becomes mildly aversive. Heck, even the abstract concept of cake becomes mildly aversive!

And of course, we’ve all had those times when we’re tired, or sad, or angry, and all of a sudden even our most deeply-rooted life goals temporarily lose their appeal.

When you’re driving a car, it is a critically important safety requirement that when you turn the steering wheel, the wheels respond instantaneously. By the same token, I expect that it will be a critically important safety requirement for humans to be able to change an AGI’s deepest desires instantaneously when we press the appropriate button. So I think this is an awesome feature, and I’m happy to have it, even if I’m not 100% sure exactly what to do with it. (In a car, you can see where you’re going, whereas understanding what the AGI is trying to do at any given moment is much more fraught.)

(Again, as in the previous section, this idea of “real-time steering” applies to any actor-critic RL algorithm, not just “brain-like” ones. All it requires is a multi-dimensional reward, which then trains a multi-dimensional value function.)

As above, astute readers might notice that the AGI may well want to grab the “steering wheel” for itself, and forcibly prevent humans from touching it. This is one of many issues that constitute the topic of the next post: The alignment problem!

Changelog

July 2024: Since the initial version, I’ve made only minor changes, including updating the diagrams (in line with changes to the analogous diagrams in previous posts), updating some links and wording, and using the word “valence” (as defined in my Valence series) instead of “reward” or “value” in lots of places, since I think the latter are liable to cause confusion in this context.

January 2026: Deleted (what used to be) a subsection of §9.2 entitled “Instrumental & final preferences seem to be mixed together”; I now think that everything I had written there was wrong, or confused, or redundant with other parts of this post. Deleted a sentence claiming that the Lisa Feldman Barrett flu anecdote is “credit assignment working in funny ways”; I now think this anecdote should be classified as “innate drives can trigger for funny reasons”, while meanwhile the credit assignment in her brain was working exactly as expected (see here). Also made more minor copyedits throughout.

^{^}
Here’s a plausible human circular preference. You won a prize! Your three options are: (A) 5 lovely plates, (B) 5 lovely plates and 10 ugly plates, (C) 5 OK plates.
No one has done this exact experiment to my knowledge, but plausibly (based on discussion of a similar situation in Thinking Fast And Slow chapter 15) this is a circular preference in at least some people: When people see just A & B, they'll pick B because "it's more stuff, I can always keep the ugly ones as spares or use them for target practice or whatever". When they see just B & C, they'll pick C because "the average quality is higher". When they see just C & A, they'll likewise pick A because "the average quality is higher".
So what we have is two different preferences (1) “I want to have a prettier collection of stuff, not an uglier collection”, and (2) “I want extra free plates”. The comparison of B & C or C & A makes (1) salient, while the comparison of A & B makes (2) salient.
(If you’re thinking “that wouldn’t be a circular preference for me!”, you’re probably right. Different people are different.)
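Here is a toy sketch of how those two preferences could produce a cycle. The quality scores (lovely = 3, OK = 2, ugly = 1) and the rule for which preference becomes salient are assumptions I am making up for illustration, not a model of any real experiment.

```python
# Each option is a list of plate qualities (hypothetical scores).
OPTIONS = {
    "A": [3] * 5,               # 5 lovely plates
    "B": [3] * 5 + [1] * 10,    # 5 lovely plates and 10 ugly plates
    "C": [2] * 5,               # 5 OK plates
}

def average(plates):
    return sum(plates) / len(plates)

def contains(big, small):
    """True if `big` includes every plate in `small` (as a multiset)."""
    remaining = list(big)
    for p in small:
        if p not in remaining:
            return False
        remaining.remove(p)
    return True

def choose(x, y):
    px, py = OPTIONS[x], OPTIONS[y]
    # If one option is the other "plus extras", the free extra plates are salient.
    if contains(px, py) or contains(py, px):
        return x if len(px) > len(py) else y
    # Otherwise the prettiness of the collection (average quality) is salient.
    return x if average(px) > average(py) else y

picks = {pair: choose(*pair) for pair in [("A", "B"), ("B", "C"), ("C", "A")]}
print(picks)
```

Under these assumptions, B beats A, C beats B, and A beats C, so no option is preferred to both of the others.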
^{^}
You might be thinking: “why make an AGI with human-like faulty intuitions in the first place”?? Well, we’ll try not to, but I bet that at least some human “departures from rationality” ultimately arise from the fact that predictive world-models are big complicated things, and there are only so many ways to efficiently query them, and thus our AGIs will have systematic reasoning errors that we cannot fix at the source-code level, but rather need to fix by asking our AGI to read Scout Mindset or whatever. I argued in Valence series §3.5 (2023) that motivated reasoning and the halo effect are in this category, and things like availability bias, anchoring bias, and hyperbolic discounting might be as well. To be clear, some foibles of human reasoning are probably less likely to afflict AGIs; to pick one example, if we make a brain-like AGI with no innate “approval reward”, then it presumably wouldn’t have the failure mode discussed in the blog post Belief As Attire.
^{^}
Well, that’s why I read The Sequences. I understand that other people read it for a different reason, namely to complete their Rationalist Community™ hazing ritual, and thus qualify for their Second Anointing, learn the secrets of Xenu, and get a free tote bag.
^{^}
I think the real story here has various complicating factors that I’m leaving out, including continued credit assignment during memory recall, and other, non-credit-assignment, changes to the world-model.
^{^}
Why do I say that the Thought Generator and Thought Assessor are working at cross-purposes? Here’s one way to think of it: (1) the Steering Subsystem and Thought Assessors are working together to calculate a certain valence function which (in our ancestors’ environment) approximates “expected inclusive genetic fitness”; (2) the Thought Generator is searching for thoughts that maximize that function. Now, given that the Thought Generator is searching for ways to make the valence function return very high valence, it follows that the Thought Generator is also searching for ways to distort the Thought Assessor calculations such that the valence function stops being a good approximation to “expected inclusive genetic fitness”. This is an unintended and bad side-effect (from the perspective of inclusive genetic fitness), and that problem can be mitigated by making it as difficult as possible for the Thought Generator to manipulate the settings of the Thought Assessors. See my post Reward Is Not Enough for some related discussion.
^{^}
The story has a happy ending: I found a different job with a non-abusive boss, and also wound up with a fruitful side-interest in understanding high-functioning psychopaths.
^{^}
“The desire to feel good” is not quite equivalent to “the desire to have a high-valence signal”, but they’re somewhat related, see [Valence series] Appendix A: Hedonic tone / (dis)pleasure / (dis)liking (2023).
^{^}
See similar discussion in Superintelligence p. 149.
^{^}
I think when Abram uses the term “RL agent” in that quote, he is presupposing that the agent is built by not just any RL algorithm, but more specifically an RL algorithm which is guaranteed to converge to a unique ‘optimal’ agent, and which has in fact already finished converging.

For example, one might assess whether the current thought will lead to satisfying the AGI’s curiosity drive, another its altruism drive, etc. The Steering Subsystem combines these into an aggregate reward. But the function that it uses to do so is a hardcoded, human-legible function—e.g., it might be as simple as a weighted average.

Huh. I wonder how much human variation in personality is basically intraspecies variation in the weightings on the reward function in the steering subsystem.

It sure seems like the big five factors could correspond to higher or lower weightings on, or higher or lower sensitivities to, different kinds of hard-coded reward signals.

Openness - Insight or curiosity, maybe counterbalanced by disgust or confusion or uncertainty
Extraversion - Positive social reinforcement (maybe counterbalanced by negative social reinforcement?)
Neuroticism - Fear or anxiety
Agreeableness - Some different kind of social reinforcement?
Conscientiousness - ??

It sure seems like different mixes of priorities on similar baskets of reward signals would shape behavior in importantly different ways, which would manifest at a macro level as personality characteristics.

Yes! I strongly agree with “much human variation in personality is basically intraspecies variation in the weightings on the reward function in the steering subsystem.”

I think the relation between innate traits and Big Five is a bit complicated. In particular, I think there’s strong evidence that it’s a nonlinear relationship (see my heritability post §4.3.3 & §4.4.2). Like, maybe there’s a 20-dimensional space of “innate profiles”, which then maps to visible behaviors like extraversion in a twisty way that groups quite different “innate profiles” together. (E.g., different people are extraverted for rather different underlying reasons.) All the things on your list seem like plausible parts of that story. It would be fun for me to spend a month or two trying to really sort out the details quantitatively, but alas I can’t justify spending the time. :)

(Thinking aloud)

When you’re driving a car, it is a critically important safety requirement that when you turn the steering wheel, the wheels respond instantaneously. By the same token, I expect that it will be a critically important safety requirement for humans to be able to change an AGI’s deepest desires instantaneously when we press the appropriate button. So I think this is an awesome feature, and I’m happy to have it, even if I’m not 100% sure exactly what to do with it. (In a car, you can see where you’re going, whereas understanding what the AGI is trying to do at any given moment is much more fraught.)

Is there an additional claim that the AI will not exhibit the standard problems of corrigibility (i.e., it won't stop you from changing its reward function) because it's not natively an expected utility maximizer?

That is, even though the AI's generative world model knows that it will fail to accomplish its current goals if the reward function is changed, the thought assessors haven't had the opportunity to learn that, and don't know it? The AI understands this "intellectually" but not "viscerally"? It's like the common human relationship with cocaine?

...

That doesn't seem right to me. If the AI is playing a video game, and it is about to be attacked by a powerful adversary that will bring its hit points to 0, it will correctly identify that this upcoming event will harm its goals, and take action to prevent it.

The AI doesn't need to have experienced "losing all its hit points from this particular enemy" or even "losing all its hit points", for the thought assessors to give high valence to thoughts/plans to prevent that bad outcome. It just has to assign high valence to the concept of winning the game, and it will assign high valence to harm-preventing actions, right?

This is disanalogous to the case of humans deciding not to wirehead with drugs because reward is not the optimization target. But "concepts that the thought assessors have awarded high valence" are the optimization target (more or less). Finding out that something will directly impact your reward is not motivating. Finding out that something will impact one of your high-valence goals is motivating.

This makes it seem like even if realtime steering is mechanistically afforded by the AGI design, the AGI will take steps to prevent (most) alterations to its reward function in its steering subsystem, by default.

It's only "most" alterations to its reward function, because some alterations to the reward function will increase the AGI's effectiveness at accomplishing its high-valence-concept goals. And noticing that, thoughts/plans to modify the reward function accordingly will be awarded high valence. The AGI will want to self-modify in ways that support its existing high-valence goals, but not to be modified (by anyone) in ways that don't support its existing high-valence goals.

Oh yeah, an AGI with consequentialist preferences would definitely want to grab control of that button. (Other things equal.) I’ll edit to mention that explicitly. Thanks.

I think I mentioned in the section that I didn’t (and still don’t) have any actual good AI-alignment-helping plan for the §9.7 thing. So arguably I could have omitted that section entirely. But I was figuring that someone else might think of something, I guess. :)

I'm loving this whole sequence, but I particularly love:

9.2.2 Preferences are over “thoughts”, which can relate to outcomes, actions, plans, etc., but are different from all those things

That feels very crisp, clear, and informative.

Given that adversarial relationship, I have a strong presumption that the Thought Generator is not set up to have direct (“voluntary”) control over credit assignment. This also seems to match introspection.

So, suppose I currently hate X, but I want to will myself to really like X. It seems to me that this task is not straightforward, but also that it’s not impossible. It may take some self-reflective skill, mindfulness, planning, and so on, but if the Thought Generator thinks just the right thoughts at the right time, it can probably pull it off.

Notably a bunch of therapeutic change-work techniques are basically about triggering credit assignment on purpose. I'm thinking of stuff in the sphere of hypnosis / NLP / Tony Robbins' "neuroassociative conditioning", but also more mainstream modalities like exposure therapy and some of Cognitive Behavioral Therapy.

Here’s a plausible human circular preference. You won a prize! Your three options are: (A) 5 lovely plates, (B) 5 lovely plates and 10 ugly plates, (C) 5 OK plates.
No one has done this exact experiment to my knowledge, but plausibly (based on discussion of a similar situation in Thinking Fast And Slow chapter 15) this is a circular preference in at least some people: When people see just A & B, they'll pick B because "it's more stuff, I can always keep the ugly ones as spares or use them for target practice or whatever". When they see just B & C, they'll pick C because "the average quality is higher". When they see just C & A, they'll likewise pick A because "the average quality is higher".

This makes no sense to me. Why would you pick C over B? B Pareto dominates C since it contains 5 lovely plates whereas C only has 5 OK plates.

Well, I guess it wouldn't be a circular preference for you. :)

I think it wouldn't occur to many people that they could do one thing with the better 5 plates, and do a different thing with the worse 10 plates, if the plates are not presented in a way that makes the 5+10 division salient. Imagine the better and worse ones are all mixed up, and they're all the same design, such that they're obviously meant to be used as a set, but 2/3rds of the plates in the set have obvious cracks and chips. My impression (again see related experiments in the book chapter) is that many people would just take in the set of 15 plates as a whole and say "man, we can't eat off these, someone could get a cut, the sauce would leak onto the table etc.". The person would have to be kinda thinking outside the box and putting in some effort to notice that there are 5 plates in the set with no chips or cracks, and think of the strategy where they use those and throw out the other 10.

But in that kind of situation, wouldn't those people also pick A over B for the same reason?

If the 5 lovely plates were literally identical in the two sets, I think (for many people) it might serve as a sort of "hint" that they should consider the clever course of action, the one that involves splitting up the B set (i.e. doing one thing with the 10 cracked & chipped plates, and doing a different thing with the 5 other B plates). That same clever splitting idea might also pop into some people's heads for the B-versus-C comparison, but I think it would be less obvious / salient, so fewer people would think of that, leaving at least a subset of people who would choose both B-over-A if that were the choice, and C-over-B if that were the choice.

FYI, this example was pretty clarifying, over and above post 7 in this series.

For example, maybe this thought is rewarding because it’s executing a certain metacognitive strategy which has proven instrumentally useful for brainstorming, which in turn has proven instrumentally useful for theorem-proving, which in turn has proven instrumentally useful for code-debugging, and so on through ten more links until we get to one of the innate drives.

Conditioned Taste Aversion (CTA) is a phenomenon where, if I get nauseous right now, it causes an aversion to whatever tastes I was exposed to a few hours earlier—not a few seconds earlier, not a few days earlier, just a few hours earlier. (I alluded to CTA above, but not its timing aspect.) The evolutionary reason for this is straightforward: a few hours is presumably how long it typically takes for a toxic food to induce nausea.

That explains why my brother no longer likes mushrooms. When we were little, he liked them and we ate mushrooms at a restaurant, then were driven through curvy mountain roads later that day with the family. He got car sick and vomited, and afterwards he had an intense hatred for mushrooms.

I liked the painting metaphor, and the diagram of brain-like AGI motivation!

Got a couple of questions below.

It’s possible that you would find this nameless pattern rewarding, were you to come across it. But you can’t like it, because it’s not currently part of your world-model. That also means: you can’t and won’t make a goal-oriented plan to induce that nameless pattern.

I agree that if you haven't seen something, then it's not exactly a part of your world-model. But judging from the fact that it has, say, positive reward, does this not mean that you like(d) it? Or that a posteriori we can tell it lay inside your "like" region? (It was somewhere close to things you liked.)

For example, say someone enjoys the affection of cat species A, B. Say they haven't experienced a cat of species C, which is similar in some way to species A, B. Then probably they would get a positive reward from meeting cat C (affection), even though their world model didn't include it beforehand. Therefore, they should tell us afterwards that in their previous world, cat C should have been in the "like cat" region.

Similarly, you can conceptualize a single future state of the world in many different ways, e.g. by attending to different aspects of it, and it will thereby become more or less appealing. This can lead to circular preferences; I put an example in this footnote^[1].

Could it be that intelligent machines have circular preferences? I understand that is the case for humans, but I'm curious how nuanced the answer for machines is.

Imperfect data/architecture/training alg could lead to weird types of thinking when employed OOD. Do you think it would be helpful to try and measure for the coherency of the system's actions/thoughts? E.g. make datasets that inspect the agent's theory of mind (I think Beth Barnes suggested sth like this). I am unsure about what these metrics would imply for AGI safety.

Namely: It seems to me that there is not a distinction between instrumental and final preferences baked deeply into brain algorithms. If you think a thought, and your Steering Subsystem endorses it as a high-value thought, I think the computation looks the same if it’s a high-value thought for instrumental reasons, versus a high-value thought for final reasons.

The answer for this should depend on the size of the space that the optimization algorithm searches over.

It could be the case that the space of possible outcomes for final preferences is smaller than that of instrumental ones, and thus we could afford a different optimization algorithm (or variant thereof).

Also, if instrumental/final preferences were to be mixed together, should we not have been able to encode e.g. strategic behavior (final preference) in RL agents by now?

Thanks!

I agree that if you haven't seen something, then it's not exactly a part of your world-model. But judging from the fact that it has, say, positive reward, does this not mean that you like(d) it? Or that a posteriori we can tell it lay inside your "like" region? (It was somewhere close to things you liked.)
For example, say someone enjoys the affection of cat species A, B. Say they haven't experienced a cat of species C, which is similar in some way to species A, B. Then probably they would get a positive reward from meeting cat C (affection), even though their world model didn't include it beforehand. Therefore, they should tell us afterwards that in their previous world, cat C should have been in the "like cat" region.

Suppose at time t=1 they are completely oblivious to the possible existence or idea of cat C, and at time t=2 they meet cat C and are very happy about it.

We agree that they like cat C at time t=2.

What about at time t=1? I would say “they neither like nor dislike cat C”. I would also say “they would like cat C, if only the thought of cat C occurred to them”.

I think you want to say that they actually already like cat C at t=1. But I don’t think that’s in accordance with common usage of the term “like”. For example, go ask someone on the street: “A year before you first met your current boyfriend (or first saw him, or first became aware of his existence), did you already like him? Did you already think he was cute?” I predict that they will say “no”, and maybe even give you a funny look.

Could it be that intelligent machines have circular preferences? I understand that is the case for humans, but I'm curious how nuanced the answer for machines is.

Yeah, I for one certainly expect intelligent machines to have circular preferences.

That said, when smart humans notice that they have circular preferences, they tend to adjust their preferences to straighten them out. I assume that AGIs will have the same tendency, and thus that they will have fewer and fewer circular preferences as they learn and think more. (Or perhaps, they'll have circular preferences that are harder and harder to notice.)

Here’s why I think humans tend to straighten out circular preferences: You can (and naturally do) have a preference “Insofar as my other preferences are self-contradictory, I should try to reduce that aspect of them”, because this is roughly a Pareto-improving thing to do. All of my preferences about future states can be better-actualized simultaneously when I adopt the habit of “noticing when two of my preferences are working at cross-purposes, and when I recognize that happening, preventing them from doing so”. So you gradually build up a bunch of new habits that look for various types of situations that pattern-match to “I'm working at cross-purposes to myself”, and then execute a Pareto improvement—since these habits are by default positively reinforced. It’s loosely analogous to how markets become more self-consistent when a bunch of people are scouting out for arbitrage opportunities, I think.

Do you think it would be helpful to try and measure for the coherency of the system's actions/thoughts? E.g. make datasets that inspect the agent's theory of mind (I think Beth Barnes suggested sth like this).

I don't immediately see why “coherency” would be important to measure for safety purposes, but I dunno, maybe. Measuring theory of mind seems potentially safety-relevant insofar as maybe we want to try to make AGIs that are bad at theory of mind, so that they don't know how to deceive humans even if they were motivated to. However, I don't know how you would do that, while still enabling the AGI to do the things we need it to do. Anyway, no strong opinion either way.

if instrumental/final preferences were to be mixed together, should we not have been able to encode e.g. strategic behavior (final preference) in RL agents by now?

It’s true that model-based RL algorithms exist today on GitHub & arXiv. But I think there's a big space of all possible model-based RL algorithms, and I think that there are still important differences between the model-based RL algorithms currently on GitHub & arXiv, versus the model-based RL algorithm in the brain. I won’t spell out my thoughts on that, for Differential Technological Development reasons. No one really knows all the details anyway.

That said, I’m surprised that you don’t think AlphaZero (for example) has “strategic behavior”. Maybe I’m not sure what you mean by “strategic behavior”.

“A year before you first met your current boyfriend (or first saw him, or first became aware of his existence), did you already like him? Did you already think he was cute?” I predict that they will say “no”, and maybe even give you a funny look.

Okay, now I get the point of "neither like nor dislike" in your original statement.

I was originally thinking of sth as follows: "A year before you met your current boyfriend, would you have thought he was cute, if he was your type?". But "your type" requires seeing them to get a reference point of if they belong in that class or not. So there's a circular statement of my own, straightened out, so you had a good point here.

That said, I’m surprised that you don’t think AlphaZero (for example) has “strategic behavior”. Maybe I’m not sure what you mean by “strategic behavior”.

I would say the strategic behavior AlphaZero exhibits is weak (still incredible, specifically with the kind of weird h4 luft lines that the latest supercomputers show). I was thinking of a stronger version dealing with multi-agent environments, continuous state/action spaces, and/or multi-objective reward functions. That said, it seems to me that a different problem has to be solved to get the solution to this.

Avoiding a weak wireheading drive seems quite tricky. Maybe we could minimize it using timing and priors (Section 9.3.3 above), but avoiding it altogether would, I presume, require special techniques—I vaguely imagine using some kind of interpretability technique to find the RPE / feeling good concept in the world-model, and manually disconnecting it from any Thought Assessors, or something like that.

Here's a hacky patch that doesn't entirely solve it, but might help:

Presumably for humans, the RPE/reward is somehow wired into the world-model, since we have a clear awareness of it. But you could just not give it as an input to the AI's world model to begin with.

As long as it doesn't start hacking into its own runtime and peeking at the variables, this can mean that it doesn't have a variable corresponding to its reward in its world-model, which would prevent it from wanting to use it for wireheading.

Of course this is unstable, so we probably wouldn't want to rely on that. The stable approach would be what we discussed in the other thread, of manually coding the value function. This would protect against wireheading in fundamentally the same way, though, by eliminating the need for a separate "reward" variable in the world-model.

More on Section 9.5 "brain-like AGI is generally NOT trying to maximize its future reward" can be found in Reward is not the optimization target.

For example, one might assess whether the current thought will lead to satisfying the AGI’s curiosity drive, another its altruism drive, etc. The Steering Subsystem combines these into an aggregate reward. But the function that it uses to do so is a hardcoded, human-legible function—e.g., it might be as simple as a weighted average.

Openness - Insight or curiosity, maybe counterbalanced by disgust or confusion or uncertainty
Extraversion - Positive social reinforcement (maybe counterbalanced by negative social reinforcement?)
Neuroticism - Fear or anxiety
Agreeableness - Some different kind of social reinforcement?
Conscientiousness - ??

Yes! I strongly agree with “much human variation in personality is basically intraspecies variation in the weightings on the reward function in the steering subsystem.”

(Thinking aloud)

When you’re driving a car, it is a critically important safety requirement that when you turn the steering wheel, the wheels respond instantaneously. By the same token, I expect that it will be a critically important safety requirement for humans to be able to change an AGI’s deepest desires instantaneously when we press the appropriate button. So I think this is an awesome feature, and I’m happy to have it, even if I’m not 100% sure exactly what to do with it. (In a car, you can see where you’re going, whereas understanding what the AGI is trying to do at any given moment is much more fraught.)

...

Oh yeah, an AGI with consequentialist preferences would definitely want to grab control of that button. (Other things equal.) I’ll edit to mention that explicitly. Thanks.

I'm loving this whole sequence, but I particularly love:

9.2.2 Preferences are over “thoughts”, which can relate to outcomes, actions, plans, etc., but are different from all those things

That feels very crisp, clear, and informative.

Given that adversarial relationship, I have a strong presumption that the Thought Generator is not set up to have direct (“voluntary”) control over credit assignment. This also seems to match introspection.

So, suppose I currently hate X, but I want to will myself to really like X. It seems to me that this task is not straightforward, but also that it’s not impossible. It may take some self-reflective skill, mindfulness, planning, and so on, but if the Thought Generator thinks just the right thoughts at the right time, it can probably pull it off.

Here’s a plausible human circular preference. You won a prize! Your three options are: (A) 5 lovely plates, (B) 5 lovely plates and 10 ugly plates, (C) 5 OK plates.
No one has done this exact experiment to my knowledge, but plausibly (based on discussion of a similar situation in Thinking, Fast and Slow, chapter 15) this is a circular preference in at least some people: When people see just A & B, they'll pick B because "it's more stuff, I can always keep the ugly ones as spares or use them for target practice or whatever". When they see just B & C, they'll pick C because "the average quality is higher". When they see just C & A, they'll likewise pick A because "the average quality is higher".
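To make the cycle concrete, here is a toy sketch of a chooser that follows the two heuristics in the example. The numeric quality scores and the exact decision rule (prefer the strictly larger bundle when one option contains the other, otherwise prefer the higher average quality) are assumptions made for illustration, not a claim about how people actually decide.

```python
from collections import Counter
from statistics import mean

# Assumed quality scores, for illustration only.
QUALITY = {"lovely": 3, "ok": 2, "ugly": 1}

OPTIONS = {
    "A": ["lovely"] * 5,
    "B": ["lovely"] * 5 + ["ugly"] * 10,
    "C": ["ok"] * 5,
}


def contains(big, small):
    """True if every plate in `small` is also in `big` (as a multiset)."""
    b, s = Counter(big), Counter(small)
    return all(b[kind] >= count for kind, count in s.items())


def avg_quality(plates):
    return mean(QUALITY[p] for p in plates)


def choose(x, y):
    """Pick between two options shown side by side."""
    px, py = OPTIONS[x], OPTIONS[y]
    # "It's more stuff": one bundle strictly contains the other.
    if contains(px, py) and len(px) > len(py):
        return x
    if contains(py, px) and len(py) > len(px):
        return y
    # Otherwise compare average quality.
    return x if avg_quality(px) > avg_quality(py) else y


for pair in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(pair, "->", choose(*pair))
```

Under these assumptions the pairwise choices are B over A, C over B, and A over C, so no option is stably preferred. Each choice depends on which aspect of the options the comparison makes salient.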

This makes no sense to me. Why would you pick C over B? B Pareto dominates C since it contains 5 lovely plates whereas C only has 5 OK plates.

Well, I guess it wouldn't be a circular preference for you. :)

But in that kind of situation, wouldn't those people also pick A over B for the same reason?

FYI, this example was pretty clarifying, over and above post 7 in this series.

For example, maybe this thought is rewarding because it’s executing a certain metacognitive strategy which has proven instrumentally useful for brainstorming, which in turn has proven instrumentally useful for theorem-proving, which in turn has proven instrumentally useful for code-debugging, and so on through ten more links until we get to one of the innate drives.

Conditioned Taste Aversion (CTA) is a phenomenon where, if I get nauseous right now, it causes an aversion to whatever tastes I was exposed to a few hours earlier—not a few seconds earlier, not a few days earlier, just a few hours earlier. (I alluded to CTA above, but not its timing aspect.) The evolutionary reason for this is straightforward: a few hours is presumably how long it typically takes for a toxic food to induce nausea.
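The timing aspect can be sketched as a credit-assignment rule with a built-in lag window. This is a toy illustration only: the function name and the specific window bounds (1 to 6 hours) are assumptions, since the passage says only "a few hours".

```python
def tastes_to_blame(taste_log, nausea_time_h, min_lag_h=1.0, max_lag_h=6.0):
    """Return the tastes that get an aversion when nausea occurs.

    taste_log: list of (time_in_hours, taste_name) pairs.
    Only tastes experienced between min_lag_h and max_lag_h hours before
    the nausea are blamed (assumed bounds standing in for "a few hours").
    """
    return [
        taste
        for time_h, taste in taste_log
        if min_lag_h <= nausea_time_h - time_h <= max_lag_h
    ]


log = [
    (28.0, "oysters"),    # three days before the nausea: too early
    (97.0, "mushrooms"),  # three hours before: inside the window
    (99.99, "mint"),      # seconds before: too recent
]
print(tastes_to_blame(log, nausea_time_h=100.0))
```

The point of the sketch is that the window is a prior supplied in advance, not something the learner has to discover from data.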

I liked the painting metaphor, and the diagram of brain-like AGI motivation!

Got a couple of questions below.

It’s possible that you would find this nameless pattern rewarding, were you to come across it. But you can’t like it, because it’s not currently part of your world-model. That also means: you can’t and won’t make a goal-oriented plan to induce that nameless pattern.

Similarly, you can conceptualize a single future state of the world in many different ways, e.g. by attending to different aspects of it, and it will thereby become more or less appealing. This can lead to circular preferences; I put an example in this footnote^[1].

Could it be that intelligent machines have circular preferences? I understand that is the case for humans, but I'm curious how nuanced the answer for machines is.

Namely: It seems to me that there is not a distinction between instrumental and final preferences baked deeply into brain algorithms. If you think a thought, and your Steering Subsystem endorses it as a high-value thought, I think the computation looks the same if it’s a high-value thought for instrumental reasons, versus a high-value thought for final reasons.

The answer for this should depend on the size of the space that the optimization algorithm searches over.

Also, if instrumental/final preferences were to be mixed together, should we not have been able to encode e.g. strategic behavior (final preference) in RL agents by now?

Thanks!

I agree that if you haven't seen something, then it's not exactly a part of your world-model. But judging from the fact that it has, say, positive reward, does this not mean that you like(d) it? Or that a posteriori we can tell it lay inside your "like" region? (It was somewhere close to things you liked.)
For example, say someone enjoys the affection of cat species A and B. Say they haven't experienced a cat of species C, which is similar in some way to species A and B. Then probably they would get a positive reward from meeting cat C (affection), even though their world-model didn't include it beforehand. Therefore, they should tell us afterwards that in their previous world-model, cat C should have been in the "like cat" region.

Suppose at time t=1 they are completely oblivious to the possible existence or idea of cat C, and at time t=2 they meet cat C and are very happy about it.

We agree that they like cat C at time t=2.

What about at time t=1? I would say “they neither like nor dislike cat C”. I would also say “they would like cat C, if only the thought of cat C occurred to them”.
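One way to picture the distinction is a toy model in which learned valence exists only for concepts already in the world-model. The class, its method names, and the fixed `innate_reaction` table are all hypothetical simplifications, not a description of the actual algorithm.

```python
from typing import Dict, Optional


class ToyAgent:
    def __init__(self, innate_reaction: Dict[str, float]):
        # Assumed stand-in for how the agent WOULD react on encountering
        # each thing; the agent itself has no access to this table.
        self._innate_reaction = innate_reaction
        # Learned valence, defined only for concepts in the world-model.
        self.valence: Dict[str, float] = {}

    def current_liking(self, concept: str) -> Optional[float]:
        """None means 'neither like nor dislike': the concept is absent."""
        return self.valence.get(concept)

    def would_like(self, concept: str) -> Optional[float]:
        """Counterfactual: the reaction if the concept did occur to them."""
        return self._innate_reaction.get(concept)

    def encounter(self, concept: str) -> None:
        """Meeting the thing adds the concept and paints it with valence."""
        self.valence[concept] = self._innate_reaction[concept]


agent = ToyAgent({"cat C": 1.0})
print(agent.current_liking("cat C"))  # t=1: concept not in the world-model
agent.encounter("cat C")
print(agent.current_liking("cat C"))  # t=2: concept now has valence
```

In this picture, "they would like cat C" is a fact about the counterfactual table, while "they like cat C" requires an entry in the learned valence map, and only the latter can feed into a goal-oriented plan.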

Could it be that intelligent machines have circular preferences? I understand that is the case for humans, but I'm curious how nuanced the answer for machines is.

Yeah, I for one certainly expect intelligent machines to have circular preferences.

Do you think it would be helpful to try to measure the coherence of the system's actions/thoughts? E.g. make datasets that inspect the agent's theory of mind (I think Beth Barnes suggested something like this).

if instrumental/final preferences were to be mixed together, should we not have been able to encode e.g. strategic behavior (final preference) in RL agents by now?

That said, I’m surprised that you don’t think AlphaZero (for example) has “strategic behavior”. Maybe I’m not sure what you mean by “strategic behavior”.

“A year before you first met your current boyfriend (or first saw him, or first became aware of his existence), did you already like him? Did you already think he was cute?” I predict that they will say “no”, and maybe even give you a funny look.

Okay, now I get the point of "neither like nor dislike" in your original statement.

That said, I’m surprised that you don’t think AlphaZero (for example) has “strategic behavior”. Maybe I’m not sure what you mean by “strategic behavior”.
