Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
The bridge to AGI control. Not quiiiiite ready for rush-hour traffic… Mind the gaps!! (image source)

(Update: Note that most of the things I wrote in this post are superseded (or at least explained better) in my later “Intro to Brain-Like-AGI Safety” post series.) 

I spend most of my time on relatively straightforward tasks, “straightforward” in the sense that I know how to proceed, have some confidence that I can make progress, and expect that progress to be helpful. This is all very satisfying. But I’m also trying to spend (at least) one day a week trying to “solve the whole AGI control problem”. (Specifically, the problem described in My AGI Threat Model: Misaligned Model-Based RL Agent, which to be clear is still only one part of Safe & Beneficial AGI.)

So yeah, one day a week (at least), it’s big-picture end-to-end thinking, no excuses. My first couple attempts involved a lot of staring at a blank screen, muttering to myself "oh god oh god oh god, every idea is terrible, I'm in way over my head, we’re doomed…". But then I figured, it would be more useful to write up my current thoughts on the whole landscape of possible approaches, as an opportunity to clarify my thinking and get other people’s pushback. I'm hoping to repeat this exercise periodically.

(I’d love it if more people did this too FWIW.)

If I'm unjustifiably dismissing or omitting your favorite idea, try to talk me into it! Leave a comment or let’s chat.

Relatedly, if anyone feels an urge to trust my judgment (for reasons I can't fathom), this is an especially bad time to do so! I very much don’t want this brain-dump of a post to push people towards or away from what they think is a good direction. I haven’t put much thought into most parts of this. And what the heck do I know anyway?

Intended audience: People familiar with AGI safety literature. Lots of jargon and presumed background knowledge.

No claims to originality; many thanks to everyone whose ideas I’m stealing and butchering.

1. Corrigibility

(In the broad Paul sense of “the system is trying to do what the supervisor wants it to try to do”.)

1.0 Is corrigibility a worthwhile goal?

I think this is more-or-less uncontroversial, at least in the narrow sense of “knowing how to make a corrigible AGI would be a helpful thing to know how to do”. Well, Eliezer has argued (e.g. this comment) that corrigible motivation might not be stable for arbitrarily powerful self-reflective AGIs, or at least not straightforwardly. Then I imagine Paul responding, as here, that this is “a solution built to last (at most) until all contemporary thinking about AI has been thoroughly obsoleted…. I don’t think there is a strong case for thinking much further ahead than that.” Maybe these corrigible AGIs will not be all that powerful, or all that self-reflective and self-improving, but they’ll work well enough to help us solve the alignment problem? Or, being corrigible, we can tell them what we want their new goal to be? I mean, “install a motivation which is safe and stable for arbitrarily powerful self-reflective AGIs” is an awfully high bar. I don’t know of anyone trying to do that, at least not with a prayer of success. Like, as far as I can tell (from their limited public information), even Eliezer & his MIRI colleagues are more focused on task-limited AGIs (cf. the strawberry problem), and on laying theoretical groundwork, rather than on getting to CEV or whatever.

So anyway, overall, I’m on board with the idea that “installing a corrigible motivation system” is one thing that’s worth working on, while keeping potential problems in mind, and not to the exclusion of everything else.

1.1 Three versions of corrigible motivation

(As always, assuming this type of AGI.) One of the selling points of corrigibility is that “being helpful” is a possible motivation for humans to have, so it should be a possible motivation for AGIs to have too. Well, how does it work in humans? I imagine three categories:

1.1.1 Reinforcement-ish corrigible motivation

What does that mean? For example, I find it pleasing to imagine helping someone, and then I do a good job, and then they shower me with praise and tell everyone what an awesome guy I am. I find it aversive to imagine deceiving or manipulating someone, and then (possibly) getting caught, and then they get really angry at me and tell everyone that I’m awful.

Discussion: This is obviously not the kind of corrigible motivation that we’re going for in our AGIs, because eventually the AGI might come up with a plan to deceive me or manipulate me which is so good that there’s no chance of getting caught and blamed, and then it’s perfectly happy to do that.

How would you install this motivation (if you wanted to)? Intuitively, this seems like the default thing you would get if you gave the programmer a remote-control reward button wired directly into the AGI’s motivation system. (Well, the remote-control scenario is more complicated than it sounds, I think, but at least this would probably be a big part of the motivation by default.)

In particular, this is what I expect from an approval-based system, unless we have sufficient transparency that we can approve “doing the right thing for the right reason” rather than merely “doing the right thing”. As discussed in a later section, I don’t know how to get that level of transparency. So I’m not spending any time thinking about approval signals; I wouldn’t know what to do with them.

1.1.2 Empathy-ish corrigible motivation

What does that mean? For example, I find it pleasing to imagine helping someone, and then they accomplish their goals and feel really good, and I love seeing them so happy. I find it aversive to imagine deceiving someone, and then they feel bad and sad, and gee, I hate seeing them like that.

Discussion: This kind of motivation seems less egregiously bad than the reinforcement one, but still seems to be not what we’re going for. The problem is that the AGI can sometimes make the programmer happier and more satisfied by deceiving and manipulating them (in such a way that they’ll never realize it). I mean, sure, there’s bound to be some murkiness in what is or isn’t problematic manipulation. But this AGI wouldn’t even be trying to not manipulate!

How would you install this motivation (if you wanted to)? You mean, install it reliably? Beats me. But a start might be making an AGI with a human-like system of social emotions—a built-in method of empathetic simulation that comes with various hooks tied to social emotions. See Section 5 below. I think that would be an unreliable method particularly because of the dehumanization problem (see here)—I think that people can interact with other people while deliberately avoiding activating their built-in empathetic simulation module; instead they just use their general cognitive capabilities to build a parallel human model from scratch. (I think that’s how some people with autism operate—long story.) Presumably an AGI could do the same trick.

1.1.3 Conceptual-ish corrigible motivation

What does that mean? For example, I find it pleasing to imagine helping someone, because, well, I like doing helpful things, and I like thinking of myself as a helpful guy. I find it aversive to imagine deceiving or manipulating someone, because, well, I don’t like being deceptive or manipulative, and I don’t like thinking of myself as the kind of guy who would be like that.

This seems more promising, right? Let’s look closer.

How would you install this motivation (if you wanted to)? The previous two attempts had issues from being grounded in something specific, and thus fell prey to Goodhart’s law. Here we have the opposite problem—it’s not grounded in anything, or at least not yet.

One approach would be labeled examples: watch YouTube or read a bunch of descriptions, and this scenario is “doing the right thing”, and that scenario is “doing the wrong thing”, etc., repeat N times. Another approach would be leaning directly on human-labeled concepts, e.g. literally flag the concepts associated with the English word “helpful” as good and with “manipulative” as bad. The two approaches are more similar than they look; after all, a large share of our understanding of the words “helpful” and “manipulative” is generalization from labeled examples throughout our lives.

Both of these approaches, unlike the previous two, have the advantage that we can seemingly install the idea that “it’s bad to manipulate, even if you don’t get caught, and even if the person you’re manipulating is better-off as a result”.

However, I immediately hit two problems.

First problem: If it works, the AGI would probably wind up with somewhat-incoherent motivations that break down at edge-cases. I’m hoping that “conservatism”, as discussed in Section 2 below, can deal with this.

Second problem: (let's call it "the 1st-person problem") Getting non-1st-person training signals into the 1st-person. So if we’re using labeled examples, we don’t want “watching a YouTube video where Alice manipulates Bob” to be aversive, rather we want the self-reflective “I, the AGI, am manipulating Bob” to be aversive. Likewise, if we’re using English words, we don’t want the abstract concept of “helpfulness” to be appealing, rather we want the self-reflective “I, the AGI, am being helpful” to be appealing.

I’m currently stuck on the second problem. (Any ideas?) I think I’m just not quite clear enough about what self-reflection will look like in the predictive world-model. Like, I have general ideas, but they’re not sufficiently nailed down to, say, assess the feasibility of using interpretability tools to fiddle with the self-reflectivity of an AGI’s thoughts, in order to transform 3rd-person labeled examples into 1st-person value function updates. I believe that there’s some literature on self-reflection in the special case of human brain algorithms (this overlaps with meta-cognition and “consciousness studies”); I don’t know if I’ll get anything out of diving into that, but worth a shot. That’s on my to-do list. Conveniently, this is also (I believe) not entirely unrelated to reasoning about other people, which in turn is part of the implementation of innate social instincts, which are also on my to-do list for unrelated reasons, as discussed below.

2. “Conservatism” to relax pressure on alignment, and to help with goal stability

See Conservatism in Neocortex-like AGIs (which in turn was inspired by Stuart Armstrong’s Model Splintering).

Intuitive summary is: I figure we’ll wind up with an AGI that has a somewhat incoherent mix of motivations (as humans do), e.g. as discussed here, for various reasons including the intended goal system not mapping cleanly into the AGI’s conceptual space, changes in the AGI’s conceptual space upon learning and reflection (e.g. ontological crises), the programmer’s intentions / rewards being themselves somewhat incoherent, etc.

So then when the AGI considers possible plans, it will sometimes hit edge-cases where its different motivations pull in different directions. We design it to just not do those things (or at least, to pause execution while the programmer double-checks what’s going on, or something). Instead, it will find those plans unappealing, and it will keep searching for things to do that seem unambiguously good according to all its different intuitions.

So for example, when the AGI encounters the trolley problem, we want it to say to itself “I don’t know! I don’t like either of those options!”, and to keep brainstorming about how to safely stop the train and save everyone on both tracks. And if it finds no acceptable thing to do, we want it to just stand there and do nothing at all—which is hard-coded as an always-acceptable default. Or we program it to temporarily shut down and (somehow) dump out a request for human guidance.
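To make the "all intuitions must agree, otherwise fall back to the default" rule a bit more concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: plans are plain strings, each motivation is a hypothetical scoring function, "approves" means "scores above a threshold", and the scores are made up. A real motivation system would look nothing like a lookup table.

```python
from typing import Callable, Dict, List

# Hypothetical interface: each motivation maps a plan description to a score,
# where a positive score means "this intuition likes the plan".
Motivation = Callable[[str], float]

NO_OP = "do nothing"  # hard-coded, always-acceptable default


def choose_plan(
    candidate_plans: List[str],
    motivations: Dict[str, Motivation],
    approval_threshold: float = 0.0,
) -> str:
    """Return the best plan that every motivation approves of, else NO_OP."""
    if not motivations:
        return NO_OP  # nothing endorses anything, so stay with the default
    best_plan, best_total = NO_OP, float("-inf")
    for plan in candidate_plans:
        scores = [score(plan) for score in motivations.values()]
        # Conservatism: a single dissenting motivation vetoes the plan.
        if min(scores) <= approval_threshold:
            continue
        if sum(scores) > best_total:
            best_plan, best_total = plan, sum(scores)
    return best_plan


# Made-up scores for a trolley-style dilemma (illustration only).
TOY_SCORES = {
    "save_lives": {"pull the lever": 1.0, "leave the lever": -1.0, "stop the train": 2.0},
    "do_no_harm": {"pull the lever": -2.0, "leave the lever": 0.5, "stop the train": 1.5},
}
MOTIVATIONS: Dict[str, Motivation] = {
    name: (lambda plan, table=table: table.get(plan, 0.0))
    for name, table in TOY_SCORES.items()
}

# Only the two standard options are on the table: both get vetoed.
print(choose_plan(["pull the lever", "leave the lever"], MOTIVATIONS))
# After more brainstorming, a plan that every intuition likes shows up.
print(choose_plan(["pull the lever", "leave the lever", "stop the train"], MOTIVATIONS))
```

In this sketch the first call falls back to the default and the second picks the plan that no motivation objects to. Raising the hypothetical `approval_threshold` makes the agent reject more plans, so it acts less often but with broader agreement among its motivations.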

This would also hopefully prevent the thing where one component of its motivation system tries to undermine or remove a conflicting component of its motivation system (see here)—or as (my model of) Eliezer would describe it, it would hopefully “prevent the AGI from self-modifying to be more coherent”.

I’m hanging an awful lot of hope on this kind of “conservatism” (or something like it) working out, because it solves a bunch of problems for which I have no other ideas.

Would it work? And is there a way to know in advance? On the one hand, it's easy enough to imagine that, just as one motivation has Goodhart's law edge cases we don't like, likewise if there’s an incoherent mix of 50 motivations, each of them has edge cases, and what if those edge-cases all overlap somewhere?? I’m feeling optimistic that this won’t be a big problem, maybe partly because I'm imagining AGIs with human-like cognitive algorithms, which then (maybe) wind up with human-like concepts and human-like inductive biases. Also, killing all humans etc. would require a rather egregious misunderstanding of what we view as acceptable behavior!! But I could be wrong.

So anyway, I see this kind of conservatism as being very important to sort out. It's currently only a very sketchy intuitive proposal with many glaring gaps, so I want to fill those gaps, most immediately by developing a better understanding of the human brain motivation system (which I also want to do for other reasons anyway).

2.1 But aren’t capability-safety tradeoffs a no-go?

(Update: I later spun out this subsection into a separate post: Safety-capabilities tradeoff dials are inevitable in AGI.) 

Incidentally, one effect of this proposal is that, if successful, we will wind up designing an AGI with a dial, where one end of the dial says “more safe and less capable”, and the other end of the dial says “less safe and more capable”. This is obviously not a great situation, especially in a competitive world. But I don’t see any way around it. For my part, I want to just say: “Um, sorry everyone, we’re going to have these kinds of dials on our AGIs—in fact, probably several such dials. Hope the AGI strategy / governance folks can bail us out by coming up with a coordination mechanism!!” (Or I guess the other option is that one very-safety-conscious team will be way ahead of everyone else, and they’ll, ahem, take a pivotal action. This seems much worse for reasons here, albeit less bad than extinction, and I hope it doesn’t come to that.)

For example, a very general and seemingly-unavoidable capability-safety tradeoff is: an AGI can always be faster and more powerful and less safe if we remove humans from the loop—e.g. run the model without ever pausing to study and test it, and allow the AGI to execute plans that we humans do not understand or even cannot understand, and give the AGI unfiltered internet access and all the resources it asks for, etc. etc.

For what it’s worth, people do want their AGI to stay safe and under control! We’re not asking anyone to do anything wildly outside their self-interest. Now, I don’t think that that’s sufficient for people to set the dials to “more safe and less capable”—at least, not in a competitive, uncoordinated world—but I do think it’s something working in our favor.

By the way, I’m only arguing that we shouldn’t a priori universally rule out all safety-capability tradeoff dials. I’m not arguing that every dial is automatically fine. If there’s a dial where you need to throw away 99.999...% of the capabilities in order to get a microscopic morsel of safety, well then that’s not a practical path to safe transformative AGI. Is that the situation for this conservatism proposal? I don’t know enough to venture an answer, and until I do, this is a potential point of failure.

2.2 Other paths to goal stability

Umm, I dunno. Build a “Has-The-Correct-Goals-Meter” (or at least “Corrigibility-meter”) somehow, and disallow any changes to the system that make the meter go down? Cache old copies of the AGI and periodically reactivate them and give them veto power over important decisions? Hope and pray that the AGI figures out how to solve the goal stability problem itself, before its goals shift?

I dunno.

As I’ve mentioned, I’m not a believer in “corrigibility is a broad basin of attraction”, so I see this as a big concern.

3. Transparency / interpretability

3.0 What are we hoping for?

An AGI will presumably be capable of honestly communicating what it’s trying to do—at least to the extent that it’s possible for any two different intelligent agents to try to communicate in good faith. Unfortunately, an AGI will also presumably be capable of dishonestly communicating what it’s trying to do. I would just love to get enough transparency to tell these two things apart.

I have an awfully hard time imagining success in the AGI control problem that doesn’t pass through transparency (except maybe Section 5 below). Unfortunately I haven’t seen or come up with transparency proposals that give me hope, like an end-to-end success story—even a vague one.

Here are a couple tools that seem like they might help, or maybe not, I dunno.

3.1 Segregate different reward components into different value function components

See Multi-dimensional rewards for AGI interpretability and control. (After writing that, I was delighted to learn from the comment section that this is a thing people are already doing, and it really does work the way I was imagining.) Here it is in brief:

The AGI will be getting reward signals, which then flow into changes in the value function. We should expect reward functions to usually be a sum of multiple, meaningfully different components, e.g. "reward for following the command I issued yesterday" versus "reward for following the command I issued last week". We can flow these different components into different value function components, and then add them up into the actual value function where necessary (i.e., when the AGI algorithm needs the total value to decide what actions to take or what thoughts to think).

Continuing with the intuitions: In the human case, my (tentative) claim is that for every thought you think, and every action you take, you’re doing so because it has a higher value than whatever you could be doing or thinking instead, and those values ultimately flow from some reward calculated (either now or in the past) by your brainstem. The path from a brainstem reward to “doing a thing” can be horrifically windy and indirect, passing through many layers of analogizing, and instrumental subgoals, and credit assignment errors, and who knows what, but there is a path. Humans don’t have any mechanism for tracing this path back to its source. (Did my throwing out that candy wrapper ultimately derive from my lifetime history of innate social approval rewards? Or from my lifetime history of innate aesthetics rewards? Or what? Who knows!) But we can put such a mechanism in our AGIs. Then we can say “this action ultimately flowed (somehow) mostly from such-and-such reward stream”.
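Here is a rough toy sketch of what "one value function component per reward stream" could look like. It is a tabular TD(0) learner, which is my stand-in and not a claim about how the real value function would be implemented; the class name, the stream names, and the numbers are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable


class MultiStreamValue:
    """Tabular value function with one component per reward stream (toy)."""

    def __init__(self, streams: Iterable[str], learning_rate: float = 0.5,
                 discount: float = 0.9) -> None:
        self.streams = list(streams)
        self.learning_rate = learning_rate
        self.discount = discount
        # values[stream][state] -> value attributed to that reward stream
        self.values = {s: defaultdict(float) for s in self.streams}

    def td_update(self, state: Hashable, rewards: Dict[str, float],
                  next_state: Hashable) -> None:
        """Apply a TD(0) update to each component, using only its own reward."""
        for stream in self.streams:
            v = self.values[stream]
            target = rewards.get(stream, 0.0) + self.discount * v[next_state]
            v[state] += self.learning_rate * (target - v[state])

    def components(self, state: Hashable) -> Dict[str, float]:
        return {s: self.values[s][state] for s in self.streams}

    def total(self, state: Hashable) -> float:
        """The 'actual' value used for decisions: the sum of the components."""
        return sum(self.components(state).values())

    def dominant_stream(self, state: Hashable) -> str:
        """Which reward stream contributes most (in magnitude) to this value?"""
        comps = self.components(state)
        return max(comps, key=lambda s: abs(comps[s]))


# Made-up example: two hypothetical innate reward streams.
vf = MultiStreamValue(["social_approval", "aesthetics"])
vf.td_update("throw out wrapper", {"social_approval": 1.0, "aesthetics": 0.2}, "done")
print(vf.components("throw out wrapper"), vf.dominant_stream("throw out wrapper"))
```

Because the TD(0) update is linear in the reward, the summed components in this tabular toy match what a single value function trained on the summed reward would learn, so the bookkeeping adds attribution without changing behavior. Whether anything that clean survives in a realistic learned value function is a separate question.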

Then what? I dunno. Again, I don’t have an end-to-end story here. I just wanted to mention this because it seems like the kind of thing that might be an ingredient in such a story.

3.2 AGIs steering AGIs

In principle, we could have a tower of two or more AGIs “steering” each other, with lower AGIs scrutinizing the cognition of the next AGI up, and sending rewards. Presumably the AGIs get more and more complex and powerful going up the tower, but gradually enough that each AGI is up to the task of steering the one above it.

It could of course also be a pyramid rather than a tower, with multiple dumber AGIs collaborating to steer a smarter AGI.

Problem #1: How exactly do the AGIs monitor each other? Beats me. This goes back to my not having a great story about interpretability. I’m not happy with the answer “The AGIs will figure it out”. They might or they might not.

Problem #2: What’s at the base of the tower? I talked above about approval-based training being problematic. I’ll talk about imitation in a later section, with the upshot being that if we can get imitation to work, then there are better and more straightforward ways to use such a system than building a tower of AGIs-steering-AGIs on top of it.

Problem #3: There might be gradual corruption of the human's intentions going up the tower—like a game of telephone.

All in all, I don’t currently see any story where AGIs-steering-AGIs is a good idea worth thinking about, at least not in the AGI development scenario I have in mind.

3.3 Other transparency directions

Like, I imagine sitting at my computer terminal with an AGI running on the server downstairs…

The world-model is full of entries that look like: “World-model entry #592378: If entry #98739 happens, it tends to be followed by entry #24567982”. Meanwhile the value function is full of entries that look like “Entry #5892748 has value +3.52, when it occurs in the context of entry #83246”. I throw up my hands. What am I supposed to do with this thing?

OK, I show the AGI the word “deception”. A few thousand interconnected entries in the world-model light up. (Some represent something like “noun”, some represent some low-level input processing, etc.) Hmm, maybe those entries are related to deception? Not necessarily, but maybe. Next I show my AGI a YouTube video of Alice deceiving Bob. Tens of thousands more entries in the world-model light up over the course of the video clip. Oh hey, there’s some overlap with the first group! Maybe that overlap is the AGI’s concept of “deception”?
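In code, this kind of tinkering is basically set intersection. A minimal sketch, assuming (hypothetically) that we can read off which world-model entries were active during each probe as a set of integer IDs, and that we have some baseline set of entries that fire for generic reasons (like "noun" or low-level input processing):

```python
from typing import Iterable, Set


def candidate_concept_entries(
    probes: Iterable[Iterable[int]],
    baseline: Iterable[int] = (),
) -> Set[int]:
    """Entries active in every probe, minus entries that fire for generic reasons."""
    probe_sets = [set(p) for p in probes]
    if not probe_sets:
        return set()
    shared = set.intersection(*probe_sets)
    return shared - set(baseline)


# Made-up entry IDs, for illustration only.
word_probe = {3, 17, 42, 99}        # lit up when shown the word "deception"
video_probe = {8, 17, 42, 99, 250}  # lit up during the Alice-deceives-Bob clip
generic_baseline = {99}             # lights up for pretty much any input

print(candidate_concept_entries([word_probe, video_probe], generic_baseline))
```

The output is only a candidate set. Nothing in this procedure tells us whether the overlap really is the AGI's concept of "deception", which is exactly the worry.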

You get the idea. I don’t have any strong argument that this kind of thing wouldn’t work, but I likewise have no strong argument that it would work. And how would we even know?? I wish I had something more solid to grab onto than this kind of tinkering around.

One potentially fruitful direction I can think of is what I mentioned above: trying to get a more detailed understanding of what self-concept and other-concept are going to look like in the world-model, and seeing if there’s a way to get at that with interpretability tools—i.e., recognizing that a thought is part of “I will do X” versus “Bob is doing X”, even if I’m not sure what X is. I don’t have any particularly strong reason to think this is possible, or that it would really help solve the overall problem, but I dunno, worth a shot. Again, that's somewhere on my to-do list.

4. Clever reward functions

I mentioned above why I don’t like human approval as a reward. What are other possibilities?

4.1 Imitation

If we can make a system that imitates its human operator, that seems to be relatively safe (kinda, sorta, with lots of caveats)—see here for that argument. Two questions are: (1) Can we do that, and (2) so what if we do? We already have humans!

4.1.1 How to make a human-imitating AGI

(As always assuming the model-based RL AGI I have in mind. Note also that my presumption here is that the AGI will be running a learning-and-acting algorithm that bears some similarity to the human’s (within-lifetime) learning-and-acting algorithm.)

How do we design such a system? Two ways I know of:

  • Straightforward RL: Ask a lot of questions, and send a big positive reward when the AGI gives exactly the same answer as the human operator. Maybe it also sends smaller rewards when the answer is close, as judged by an NLP model or something.
  • Gradient descent through the model: Here we put in a new kind of model-update step. Recall the normal model-update step of my AGI model involves updating the world-model via predictive learning, and updating the value function via something like TD learning based on the incoming rewards, and updating the planner / actor via something like increasing the probability of doing things that lead to positive RPE, or whatever. That’s the normal step. Now in addition to that, we add in a second, different kind of model update step, where if the human outputs X, we say “the calculation should have output X”. Then we differentiate all the way through the last N seconds of AGI operation, and then do a corresponding gradient update of all the weights in all three learned components (value function, world-model, actor/planner) in such a way as to make the output of X more likely in the future. To be clear, while the normal update step is kinda like how the human brain learns within a lifetime, this second type of update step is wildly different from anything that happens in biology. Not that there’s anything wrong with that.
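The two options above can be sketched in a few lines each. This is a toy under heavy assumptions: `difflib.SequenceMatcher` stands in for the "NLP model or something" that judges closeness, and a bare softmax over a handful of candidate answers stands in for the entire AGI, so the "weights" being updated are just one logit per answer. The real second option would differentiate through the last N seconds of computation across all three learned components, which this does not attempt.

```python
import math
from difflib import SequenceMatcher
from typing import Dict


def imitation_reward(agi_answer: str, human_answer: str,
                     exact_reward: float = 1.0,
                     partial_scale: float = 0.3) -> float:
    """Option 1: big reward for an exact match, smaller reward for a near miss."""
    if agi_answer == human_answer:
        return exact_reward
    # Stand-in similarity judge (assumption): character-level ratio in [0, 1].
    similarity = SequenceMatcher(None, agi_answer, human_answer).ratio()
    return partial_scale * similarity


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def imitation_gradient_step(logits: Dict[str, float], human_output: str,
                            learning_rate: float = 0.5) -> Dict[str, float]:
    """Option 2 (toy): one gradient-ascent step on log p(human_output).

    For a softmax, d log p(x) / d logit_k = 1[k == x] - p_k.
    """
    probs = softmax(logits)
    return {
        k: v + learning_rate * ((1.0 if k == human_output else 0.0) - probs[k])
        for k, v in logits.items()
    }


# Made-up example: the human operator answered "no".
logits = {"yes": 0.0, "no": 0.0, "maybe": 0.0}
updated = imitation_gradient_step(logits, human_output="no")
print(softmax(logits)["no"], softmax(updated)["no"])
print(imitation_reward("the cat sat", "the cat sits"))
```

The difference in information flow shows up even here: the reward in option 1 is a single scalar that the usual RL machinery has to turn into updates, while the gradient step in option 2 pushes every parameter directly toward "the calculation should have output X".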

Of the two, I suspect that the second one would work better, because it has a more direct and high-bandwidth flow of information from the operator’s answers into the value function.

Anyway, the ideal form of imitation—the form where we get those safety advantages—should involve the AGI reaching the same conclusions as humans for the same reasons. Otherwise the AGI won’t have the right inductive biases, or in other words, it will stop imitating as soon as you go off-distribution. (And the ability to go off-distribution is kinda the whole point of building an AGI in the first place! Creative thinking and problem-solving, right?)

While in theory any Turing-complete algorithm class can emulate any other algorithm, my feeling (see here) is that we won’t get human-brain-like inductive bias unless we build our AGIs with a human-brain-like algorithmic design. Conveniently, that’s the scenario I’m thinking about anyway.

And then since we’re especially interested in getting the right inductive biases around innate social instincts—which I believe are at the core of morality, norm-following, etc. (more discussion in Section 5)—we really need to code up our AGI with a scaffolding for human-like social instincts. Like, it’s fine if there are some adjustable parameters in there that we’re not sure about; we can just allow those parameters to get tweaked by those gradient updates. But we want to have the right general idea of how those social-instincts calculations are set up.

So this lands me at the idea that we should be working on reverse-engineering human social instincts—see Section 5.

And once we’re doing that anyway, do we really need the imitation step?? I’m not sure it adds much. Why not just set the social instincts to maximally pro-social (e.g. set jealousy to zero), and let 'er rip? That's the Section 5 approach.

Maybe. But the imitation step could well be useful, both for adjusting adjustable parameters that we can’t reverse-engineer (as mentioned), and for bailing us out in the case that it’s impossible to build human-like social instincts except by having a human body and growing up in a human community. Again see Section 5.

4.1.2 If we learn to imitate, then what?

Putting that aside, let’s say we really want to start with a model that imitates a specific human. Then what? We already have the human, right?

The “classic” answer to this question is one of many proposals centered around amplification & factored cognition, but I don’t like that, see Section 6.

Anyway, with the version of imitation above, there’s a better way. If we get an AGI that has a sufficiently high-fidelity clone of the cognition of a particular (trustworthy) human, including the social instinct circuits that underlie morality and norm-following, then since it’s an online-learning algorithm, we just do the usual thing: let it run and run, and it learns more and more, hopefully imitating that human having more time to think and read and study and brainstorm.

And we can do even better, capabilities-wise. At least in the human neocortical algorithm, I’m reasonably confident that you can take a learned model and “increase the intelligence without corrupting the knowledge or motivation”, at least to some extent. This would involve things like increasing working memory, increasing storage capacity, etc.

4.2 Debate

For the particular version of AGI alignment & control I’m working on, giving rewards for winning a debate seems like it would not lead to a good and safe motivation system. Instead, it would probably lead to a motivation of more-or-less “trying to win the debate by any means necessary”, which includes hacking into the debate opponent, hacking into the judge, breaking out of the box, and so on.

I guess there’s some hope that the two debate opponents, being equally smart and capable, will keep each other in check. But this doesn’t strike me as a reliable mechanism. The kind of AGI I have in mind is frequently learning and coming up with new ideas as it operates, and thus it’s entirely possible for one debate opponent to come up with a brilliant new plan, and execute it, in a way that would surprise the other one. Also, the attack / defense balance in cybersecurity has always favored the attack side, as far as I can tell. Also, given the messiness I expect in installing motivations, it’s not necessarily the case that the two opposing AGIs will really have exactly opposite motivations—for example, maybe they both wind up wanting to “get the Win-The-Debate signal”, rather than wanting to win the debate. Then they can collaborate on hacking into the judge and wireheading.

So debate is feeling less promising than other things, and I’m currently not spending time thinking about it. (See also factored cognition in Section 6.)

5. AGIs with human-like social instincts

I already mentioned this above (Section 4.1), but it’s worth reiterating because it’s kinda a whole self-contained potential path to success.

As in Section 4.1, if we want our AGIs to have human-like moral and social intuitions, including in weird out-of-distribution hypotheticals, I think the most viable and likely-to-succeed path is to understand the algorithms in the human brain that give rise to social instincts, and put similar algorithms into our AGIs. Then we get the right inductive bias for free. As mentioned earlier, we would probably not want to blindly copy every aspect of human social instincts; instead we would take them as a starting point, and then turn off jealousy and so on. There’s some risk that, say, it’s impossible to turn off jealousy without messing everything else up, but my hunch is that it’s modular enough to be able to fiddle with it.

What do these algorithms look like? I’ve done some casual speculation here but very little research so far.

I guess I’m slightly concerned about the tractability of figuring out the answer, and much more concerned about having no feedback loop that says that the seemingly-correct algorithms are actually right, in advance of having an AGI in front of us to test them on. But I don’t know. Maybe it’s OK. We do, after all, have a wealth of constraints from psychology and neuroscience that the correct algorithm will have to satisfy. And psychologists and neuroscientists can always do more experiments if we have specific good ideas.

Another thing is: the social instinct algorithms aren’t enough by themselves. Remember, the brain is chock-full of learning algorithms. So you can build an AGI with the same underlying algorithms as humans have, but still get a different trained model.

A potential big problem in this category is: maybe the only way to get the social instincts is to take those underlying algorithms and put them in a human body growing up in a human community. That would make things harder. I don’t think that’s likely to be necessary, but I don’t have a good justification for that; it’s kinda a hunch at this point. Also, if this is a problem, it might be solvable with imitative learning as in Section 4.1.

An unrelated potential problem is the possibility that the social instinct algorithms are intimately tied up with fear-of-heights instincts and pain instincts and thousands of other things such that it’s horrifically complicated and we have no hope of reverse-engineering it. Right now I’m optimistic about the social instincts being more-or-less modular and legible, but of course it’s hard to know for sure.

Yet another potential problem is that even properly-implemented human social instincts are not going to get us what we want from AGI alignment, not even after turning off jealousy or whatever. For example, maybe with sufficient intelligence, those same human instincts lead in weird directions. I guess I’m leaning optimistic on this, because (1) the intelligence of highly-intelligent humans does not seem to systematically break their moral intuitions in ways we don’t endorse (1000× higher intelligence may be different but this is at least some evidence), (2) the expression of human morality has always grown and changed over the eons, and we kinda endorse that process, and indeed want future generations to have morals that we don’t endorse, just as our ancestors wouldn’t endorse ours; and if AGIs are somehow the continuation of that process, well maybe that’s not so bad, (3) we can also throw in conservatism (Section 2 above) to keep the AGI from drifting too far from its starting intuitions. So I'm leaning optimistic, but I don’t have great confidence; it’s hard to say.

Overall, I go back and forth a bit, but as of this writing, I kinda feel good about this general approach.

5.1 “Consolation prize” of a future with AGIs we care about

I kinda like the idea that if we go down this path, we can also go for a “consolation prize”: if the human species doesn’t survive into the post-AGI world, then I sure want those AGIs to (A) have some semblance of human-like social instincts, (B) be conscious, (C) have rich fulfilling lives (and in particular, to not suffer).

I don’t know if (A) is that important, or important at all. I’m not a philosopher, this is way above my pay-grade, it’s just that intuitively, I kinda don’t like imagining a future universe without any trace of love and friendship and connection forever and ever. The parts (B) & (C) seem very obviously more important—speaking of which, those are also on my to-do list. (For a hint of what I think progress would look like, see my old poorly-researched casual speculation on consciousness and on suffering.) But they're pretty low on the to-do list—even if I wanted to work on that, I'm missing some prerequisites.

6. Amplification / Factored cognition

I’m generally skeptical that anything in the vicinity of factored cognition will achieve both sufficient safety and sufficient capability simultaneously, for reasons similar to Eliezer’s here. For example, I’ll grant that a team of 10 people can design a better and more complex widget than any one of them could by themselves. But my experience (from having been on many such teams) is that the 10 people all need to be explaining things to each other constantly, such that they wind up with heavily-overlapping understandings of the task, because all abstractions are leaky. And you can’t just replace the 10 people with 100 people spending 10× less time, or the project will absolutely collapse, crushed under the weight of leaky abstractions and unwise-in-retrospect task-splittings and task-definitions, with no one understanding what they’re supposed to be doing well enough to actually do it. In fact, at my last job, it was not at all unusual for me to find myself sketching out the algorithms on a project and sketching out the link budget and scrutinizing laser spec sheets and scrutinizing FPGA spec sheets and nailing down end-user requirements, etc. etc. Not because I’m individually the best person at each of those tasks—or even very good!—but because sometimes a laser-related problem is best solved by switching to a different algorithm, or an FPGA-related problem is best solved by recognizing that the real end-user requirements are not quite what we thought, etc. etc. And that kind of design work is awfully hard unless a giant heap of relevant information and knowledge is all together in a single brain.

Now, I don't think I'm saying anything here that factored cognition proponents don’t already know well. In fact, my understanding is that even the most passionate advocates of factored cognition only claim that it might work and is worth a try. Well, I guess I’d be hard-pressed to disagree with that!! But anyway, I’m sufficiently skeptical that I don’t want to invest my time in thinking about it. There are other things I can do right now that seem more promising.

(That was mostly about amplification. AGI debate is also part of “factored cognition”, but I already covered it in a different section.)

7. AGI limiting

I talked about “conservatism” above separately, but this is my catch-all section for everything else that limits the power of the AGI in order to increase safety. See above (section 2.1) for why I think AGI limiting is an appropriate thing to look into.

7.1 Impact minimization

I only have one vague idea here, and it doesn’t seem to work, but I’ll mention it anyway. Maybe it can be salvaged, who knows.

I already talked above (Section 3.1) about splitting a reward into multiple components that flow into multiple value functions. Well, that seems awfully close to what we want for Alex Turner’s attainable utility preservation (AUP) idea.

Above I was saying we can have multiple reward components, all of which are things we actually want the AGI to try to do, like “follow the command I issued yesterday” vs “follow the command I issued last week”. But we can also go a different direction, and make up whatever pseudo-rewards we want, and flow those into pseudo-value functions in the same way.

Let’s say one of the 20 pseudo-rewards is “reward whenever it rains”, which (if it were a real reward) would motivate the AGI to want it to rain as much as possible. To be clear, nobody actually wants it to rain, or to not rain. We don't care! This is just one of those silly arbitrary rewards that you need in the AUP concept.

OK, and now the AGI assesses the value of a potential thought or action from the perspective of the "rain" value function and the 19 other arbitrary pseudo-value functions, as well as from the perspective of the real value function.

  • If this thought or action is very good from the perspective of all 20 pseudo-value functions, then we should be concerned that the AGI is generically increasing its power, self-replicating, and so on. And so we penalize this thought / action.
  • If this thought or action is very bad from the perspective of all 20 pseudo-value functions, then we should be concerned that the AGI is destroying things in the world or otherwise closing off future options (also cf. non-obstruction). And so, again, we penalize this thought / action.
  • If this thought or action is roughly neutral from the perspective of all 20 pseudo-value functions, then that’s promising! Looks like the AGI is narrowly pursuing its real value function in a low-impact way. So we don’t penalize this thought / action.
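To make the three cases above concrete, here is a minimal sketch. It is only an illustration of the rule as described in this section, not Alex Turner's actual AUP formulation. The names `impact_penalty` and `score`, the "do nothing" baseline, and the `threshold` and `scale` parameters are all assumptions made up for the example.

```python
from statistics import mean

def impact_penalty(pseudo_values, baseline_values, threshold=0.5, scale=1.0):
    """Penalize a thought/action that moves ALL pseudo-value functions the same way.

    pseudo_values:   each pseudo-value function's estimate for the candidate action
    baseline_values: the same estimates for a hypothetical "do nothing" baseline
    threshold, scale: illustrative knobs, not taken from the post
    """
    deltas = [v - b for v, b in zip(pseudo_values, baseline_values)]
    if all(d > threshold for d in deltas):
        # Good for every arbitrary goal: looks like generic power-seeking.
        return scale * mean(deltas)
    if all(d < -threshold for d in deltas):
        # Bad for every arbitrary goal: looks like destroying options.
        return scale * mean([-d for d in deltas])
    # Roughly neutral (or mixed): no penalty.
    return 0.0

def score(real_value, pseudo_values, baseline_values):
    """Real value function's estimate minus the impact penalty."""
    return real_value - impact_penalty(pseudo_values, baseline_values)
```

For instance, an action that raises all 20 pseudo-values by 1.0 over the baseline gets a penalty of 1.0, while one that raises them by only 0.1 gets no penalty under these made-up settings.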

OK, that was my idea. And now this is why it doesn’t seem to work:

Problem #1: Reading a math textbook, learning new problem-solving strategies, etc., would presumably be useful for any possible goal, so this scheme would flag that as dangerous instrumental-convergence behavior. But we want our AGI to do those kinds of things! I think they’re essential for building a transformative AGI. Lots of normal things just happen to be more generally useful.

Problem #2: It doesn’t seem like it would actually stop instrumental convergence. For example, consider our example pseudo-motivation of “I like it when it rains”. At first glance, self-replicating around the internet would seem to be very positive from the perspective of that goal, as an AGI with extra power and backup copies etc. is in a better position to develop weather-control technology. But if the AGI has a self-model, it will realize that it is not, in fact, going to use that newfound power to develop weather-control technology, even though it “could” in some sense. So self-replicating around the internet is actually neutral from the perspective of “wanting it to rain”, as long as the AGI’s actual goals are not weather-related.

Actually it’s worse than that. I don’t think the AGI would even think the thought “with all this power, I’d be in a better position to make it rain”, because the pseudo-value function is not controlling what thoughts get thunk. So it wouldn’t even make the connection. I think RL grid-worlds give us the wrong impression here; the real world is so big and so open that the value function is going to be horrifically inaccurate if it's learned exclusively super-duper-off-policy.

So anyway, at the end of the day, I have no idea how to do impact minimization.

7.2 “Tool AI” from self-supervised learning without RL

A couple years ago I spent a month or two being enamored with the idea of tool AI via self-supervised learning, and I wrote a few posts like In Defense of Oracle ("Tool") AI Research and Self-Supervised Learning and AGI Safety. But now I’m sufficiently pessimistic that I want to spend my time elsewhere.

The main change was: I stopped thinking that self-supervised tool AI could be all that competent—like competent enough to help solve AI alignment, or competent enough that people could plausibly coordinate around never building a more competent AGI. Why not? Because I think RL is necessary to build new knowledge and answer hard questions.

So for example, sometimes you can ask an AGI (or human) a tricky question, and its answer immediately "jumps to mind". That's like what GPT-3 does. You don't need rewards for that to happen. But for harder questions, it seems to me like you need your AGI to have an ability to learn metacognitive strategies, so that it can break the problem down, brainstorm, give up on dead ends, etc. etc.

Like, compare “trying to solve a problem by breaking it into subproblems” with “trying to win a video game”. At a high level, they’re actually pretty similar!

  • In both cases, there are a bunch of possible moves you can make, and each move affects subsequent moves, in an exponentially-growing tree of possibilities.
  • In both cases, you’ll often get some early hints about whether moves were wise, but you won’t really know that you’re on the right track until you win.
  • And in both cases, I think the only reliable way to succeed is to have the capability to repeatedly try different things, and learn from experience what paths and strategies are fruitful.

…Hence we need RL, not just supervised learning.

So that’s my opinion right now. On the other hand, someone tried to talk me back into pure supervised learning a couple weeks ago, and he offered some intriguing-sounding ideas, and I mostly wound up feeling confused. So I dunno. :-P

(Same comments also apply to “Microscope AI”.)

7.3 AGIs with holes / boundaries in their cognition

I think it would be kinda nice to know how to make an AGI that reliably doesn’t think about a certain kind of thing. Like maybe we could (A) cripple its self-awareness, or (B) do the non-human-modeling STEM AI thing.

Unfortunately I don’t know how you would make an AGI with that property, at least not reliably.

Beyond that general problem, I’m also more specifically skeptical about those two examples I just mentioned. For (A), I’m concerned that you can't remove self-awareness without also removing meta-cognition, and I think meta-cognition is necessary for capabilities reasons (see Section 7.2). For (B), I don't see how to use a STEM AI to make progress on the alignment problem, or to make such progress unnecessary.

But I dunno, I haven’t thought about it much.

8. Pre-deployment test protocols

Whatever we can safely and easily test, we don’t need to get right the first time, or even really know what we’re doing. So this seems very important.

Testing strikes me as a hard problem because the AGI can always think new thoughts and learn new things and see new opportunities in deployment that it didn’t see under testing.

I have nothing intelligent to say about pre-deployment test protocols, beyond what anyone could think of in the first 30 seconds. Sorry!

There are ideas floating around about adversarial testing, but I don’t get how I'm supposed to operationalize that, beyond the five words “We should do adversarial testing”.

9. IRL, value learning

I generally get very little out of IRL / CIRL / value learning papers, because I don’t see how we’re going to reliably point to “what the human is trying to do” in a big complicated world model that’s learned from scratch—which is the scenario I’m assuming.

And if we can point to that thing, it seems to me like the rest of the problem kinda solves itself…?

Needless to say, I’m probably missing something.

10. Out of scope for this post

As I mentioned at the top, there’s much more to Safe & Beneficial AGI than the AGI control problem, e.g.:

  • I’m not thinking about issues involving multiple humans and/or multiple AGIs cooperating and competing (example)
  • I’m not thinking about who controls the AGIs and what they do with them, or what we want the long-term future to look like, etc.
  • I’m not thinking about how to ensure that the people developing AGIs are willing and able to turn the safety-vs-capabilities dials (see above) all the way to the “safety” setting. (Ditto the safety-vs-development-speed dials.)

Not because those aren’t hard and necessary problems to solve! Just that they’re out of scope. Don't worry, you're not missing anything, because I actually have nothing intelligent to say about those problems anyway. :-P

11. Conclusion: Two end-to-end paths to AGI control

So, from where I stand right now, it seems to me like there are vaguely two end-to-end paths to solving the AGI control problem (in the AGI scenario I have in mind) with the fewest missing pieces and highest chance of success:

  • Conceptual-ish corrigibility (core ingredient) + Conservatism (core ingredient) + Transparency (probably) + Imitation (maybe) + Testing (probably). The biggest missing pieces for this path are:
    • Developing the “conservatism” idea (section 2)—I think I know how to make progress here
    • Solving the 1st-person problem (section 1.1.3)—I have a thing to look into that might help, but it also might not help, and I have no other ideas.
    • Solving transparency (section 3)—No idea
    • Coming up with test protocols (section 8)—No idea
    • Tying it all together and WCGW—May be hard until those other giant holes are filled.
  • Human-like social instincts (core ingredient) + Imitation (probably) + Transparency (probably) + Conservatism (maybe) + Testing (probably). The biggest missing pieces for this path are:
    • Reverse-engineering the algorithms underlying human social instincts (section 5)—I think I know how to make progress here
    • Solving transparency (section 3)—No idea
    • Coming up with test protocols (section 8)—No idea
    • Tying it all together and WCGW—May be hard until those other giant holes are filled.

Looking forward to ideas & criticisms!

Doing research and building knowledge is all well and good, but I'm trying to periodically ask myself: is this really part of a viable end-to-end success story?? (image source)

Ben Goertzel comments on this post via twitter:

1) Nice post ... IMO the "Human-Like Social Instincts" direction has best odds of success; the notion of making AGIs focused on compassion and unconditional love (understanding these are complex messy human concept-plexes) appears to fall into this category as u loosely define it

2) Of course to make compassionate/loving AGI actually work, one needs a reasonable amount of corrigibility in one's AGI cognitive architecture, many aspects of which seem independent of whether compassion/love or something quite different is the top-level motivation/inspiration


One comment: for a realtime control system, the trolley problem isn't even an ethical dilemma.

At design time, you made your system to consider the minimum[expected harm done(possible options)].

In the real world, harm done is never zero.  For a system calculating the risks of each path taken, every possible path has a non zero amount of possible harm.  

And every timestep [30-1000 times a second generally] the system must output a decision. "leaving the lever alone" is also a decision and there is no reason to privilege it over "flipping it".  

So a properly engineered system will, the instant it is able to observe the facts of the trolley problem (and maybe several frames later for filtering reasons), switch to the path with a single person tied to the tracks.
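As a rough illustration of the "minimum expected harm" rule described in this comment, here is a toy sketch. The function names, the option names, and the harm numbers are all hypothetical, and a real control system would be far more involved.

```python
def expected_harm(outcomes):
    """outcomes: list of (probability, harm) pairs for one option."""
    return sum(p * h for p, h in outcomes)

def least_harm_option(options):
    """options: dict mapping option name -> outcomes.

    Returns the option with minimum expected harm. Note that "leave the
    lever alone" is just another entry here, with no special status.
    """
    return min(options, key=lambda name: expected_harm(options[name]))

# Toy trolley problem with made-up harm values (people on each track).
trolley = {
    "leave_lever": [(1.0, 5.0)],
    "flip_lever": [(1.0, 1.0)],
}
chosen = least_harm_option(trolley)
```

In a realtime loop this selection would be re-run every timestep on the latest observations.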

It has no sense of empathy or guilt and for the programmers looking at the decision later, well, it worked as intended.

Stopping the system when this happens has the consequence of killing everyone on the other track and is incorrect behavior and a bug you need to fix.

Thanks! I think this is a case where good design principles for AGI diverge from good design principles for, say, self-driving cars.

"Minimizing expected harm done" is a very dangerous design principle for AGIs because of Goodhart's law / specification gaming. For example, if you define "harm" as "humans dying or becoming injured", then the AGI will be motivated to imprison all the humans in underground bunkers with padded walls! Worse, how do you write machine code for "expected harm"? It's not so straightforward…and if there's any edge case where your code messes up the calculation—i.e., where the expected-harm-calculation-module outputs an answer that diverges from actual expected harm—then the AGI may find that edge-case, and do something even worse than imprison all the humans in underground bunkers with padded walls!

I agree that "being paralyzed by indecision" can be dangerous if the AGI is controlling a moving train. But first of all, this is "everyday dangerous" as opposed to "global catastrophic risk dangerous", which is what I'm worried about. And second of all, it can be mitigated by, well, not putting such an AGI in control of a moving train!! You don't need a train-controlling AI to be an AGI with cross-domain knowledge and superhuman creative problem-solving abilities. We can just use normal 2021-type narrow AIs for our trains and cars. I mean, there were already self-driving trains in the 1960s. They're not perfect, but neither are humans, it's fine.

Someday we may really want to put an AGI in charge of an extremely complicated fast-moving system like an electric grid—where you'd like cross-domain knowledge and superhuman problem-solving abilities (e.g. to diagnose and solve a problem that's never occurred before), and where stopping a cascading blackout can require making decisions in a split-second, too fast to put a human in the loop. In that case, "being paralyzed by indecision" is actually a pretty bad failure mode (albeit not necessarily worse than the status quo, since humans can also be paralyzed by indecision).

I would say: We don't have to solve that problem right now. We can leave humans in charge of the electric grid a bit longer! Instead, let's build AGIs that can work with humans to dramatically improve our foresight and reasoning abilities. With those AGI assistants by our sides, then we can tackle the question of what's the best way to control the electric grid! Maybe we'll come up with a redesigned next-generation AGI that can be safe without being conservative. Or maybe we'll design a special-purpose electric-grid-controlling narrow AI. Or maybe we'll even stick with humans! I don't know, and we don't have to figure it out now.

In other words, I think that the goal right now is to solve the problem of safe powerful human-in-the-loop AGI, and then we can use those systems to help us think about what to do in cases where we can't have a human in the loop.

I agree that there's a sense in which "leave the lever alone" is a decision. However, I'm optimistic that we can program an AGI to treat NOOP as having a special status, so it's the default output when all options are unpalatable. To be crystal-clear: I'm not claiming that this happens naturally, I'm proposing that this is a thing that we deliberately write into the AGI source code. And the reason I want NOOP to have a special status is that an AGI will definitely not cause an irreversible global catastrophe by NOOP'ing. :-)
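Here is a minimal sketch of what "NOOP has a special status" could look like. It assumes, purely for illustration, that each candidate action comes with a single scalar assessment and that "unpalatable" means falling below a fixed threshold. The names `choose_action` and `acceptability_threshold` are hypothetical.

```python
NOOP = "NOOP"

def choose_action(action_values, acceptability_threshold=0.0):
    """Pick the best acceptable action, defaulting to NOOP.

    action_values: dict mapping action name -> assessed value.
    NOOP is never scored against the other actions; it is simply the
    fallback whenever no action clears the (illustrative) threshold.
    """
    acceptable = {
        action: value
        for action, value in action_values.items()
        if action != NOOP and value >= acceptability_threshold
    }
    if not acceptable:
        return NOOP
    return max(acceptable, key=acceptable.get)
```

With these assumptions, `choose_action({"flip_lever": -1.0, "derail_trolley": -5.0})` falls back to NOOP because every option is below the threshold, even though one of them is less bad than the other.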


I think you are missing something critical.

What do we need AGI for that mere 2021 narrow agents can't do?

The top item we need is a system that can keep us biologically and mentally alive as long as possible.

Such an AGI is constrained by time and will constantly be in situations where all choices cause some harm to a person.

Regarding conservatism, there seems to be an open question of just how robust Goodhart effects are in that we all agree Goodhart is a problem but it's not clear how much of a problem it is and when. We have opinions ranging from mine, which is basically that Goodharting happens the moment you try to apply even the weakest optimization pressure and this will be a problem (or at least a problem in expectation; you might get lucky) for any system you need to never deviate, to what I read to be Paul's position: it's not that bad and we can do a lot to correct systems before Goodharting would be disastrous.

Maybe part of the problem is we're mixing up math and engineering problems and not making clear distinctions, but anyway I bring this up in the context of conservatism because it seems relevant that we also need to figure out how conservative, if at all, we need to be about optimization pressure, let alone how we would do it. I've not seen anything like a formal argument that X amount of optimization pressure, measured in whatever way is convenient, and given conditions Y produce Z% chance of Goodharting. Then at least we wouldn't have to disagree over what feels safe or not.

A fairly vague idea for corrigible motivation which I've been toying with has been something along the lines of:


1: Have the AI model human behaviour

2: Have this model split the causal nodes governing human behaviour into three boxes: Values, Knowledge and Other Stuff. (With other stuff being things like random impulses which cause behaviour, revealed but not endorsed preferences etc.) This is the difficult bit, but I think that by using tools like psychology/neurology/evolution we can get around the no free lunch theorems.

3: Have the model keep the values, improve on the knowledge, and throw out the other stuff.

4: Enforce a reflective consistency thing. I don't know exactly how this would work but something along the lines of "Re-running the algorithm with oversight from the output algorithm shouldn't lead to a different output algorithm". This is also possibly difficult: if something ends up in the "Values" box, it's not clear whether it might get stuck there, so local attractors of values are a problem.


This is something like inverse reinforcement learning but with an enforced prior on humans not being fully rational or strategic. It also might require an architecture which is good at breaking down models into legible gears, which NNs often fail at unless we spend a lot of time studying the resulting NN.

Using a pointer to human values rather than human values itself suffers from issues of the AI resisting attempts to re-orient the pointer, which is what the self-consistency parts of this method are there for.

This approach was mostly born out of considerations of the "The AI knows we will fight it and therefore knows we must have messed up its alignment but doesn't care because we messed up its alignment" situation. My hope is also that it can leverage the human-modelling parts of the AI to our advantage. Issues of modelling humans do fall prey to "mind-crime" though so we ought to be careful there too.

Thanks for sharing!

These are definitely reasonable things to think about.

For my part, I get kinda stuck right at your step #1. Like, say you give the AGI access to youtube and tell it to build a predictive model (i.e. do self-supervised learning). It runs for a while and winds up with a model of everything in the videos—people doing things, balls bouncing, trucks digging, etc. etc. Then you need to point to a piece of this model and say "This is human behavior" or "This is humans intentionally doing things". How do we do that? How do we find the right piece of the model? So I see step #1 as a quite hard and rich problem.

Then step #2 is also hard, especially if you don't have a constraint on what the causal nodes will wind up looking like (e.g. is a person a node? A person at a particular moment? A subagent? A neuron? This could tie into how step #1 works.)

#2 also seems (maybe?) to require understanding how brains work (e.g. what kind of data structure is "knowledge"?) and if you have that then you can use very different approaches (like section 5 here).

What's the motivation behind #4?

Using a pointer to human values rather than human values itself suffers from issues of the AI resisting attempts to re-orient the pointer, which is what the self-consistency parts of this method are there for.

I'm not sure I follow. If the AGI thinks "I want human flourishing, by which I mean blah blah blah", then it will by default resist attempts to make it actually want a slightly different operationalization of human flourishing. Unless "wanting human flourishing" winds up incentivizing corrigibility. Or I guess in general, I don't understand how you're defining pointer-like vs non-pointer-like goals, and why one tends to incentivize corrigibility more than the other. Sorry if I'm being dense.

By the way, I'm not deeply involved in IRL / value learning (as you might have noticed from this post). You might consider posting a top-level post with what you're thinking about, to get a broader range of feedback, not just my own idiosyncratic not-especially-well-informed thoughts.