On motivations for MIRI's highly reliable agent design research

jessicata

(this post came out of a conversation between me and Owen Cotton-Barratt, plus a follow-up conversation with Nate)

I want to clarify my understanding of some of the motivations of MIRI's highly reliable agent design (HRAD) research (e.g. logical uncertainty, decision theory, multi-level models).

Top-level vs. subsystem reasoning

I'll distinguish between an AI system's top-level reasoning and subsystem reasoning. Top-level reasoning is the reasoning the system is doing in a way its designers understand (e.g. using well-understood algorithms); subsystem reasoning is reasoning produced by the top-level reasoning that its designers (by default) don't understand at an algorithmic level.

Here are a few examples:

AlphaGo

Top-level reasoning: MCTS, self-play, gradient descent, ...

Subsystem reasoning: whatever reasoning the policy network is doing, which might involve some sort of "prediction of consequences of moves"

Deep Q learning

Top-level reasoning: the Q-learning algorithm, gradient descent, random exploration, ...

Subsystem reasoning: whatever reasoning the Q network is doing, which might involve some sort of "prediction of future score"

Solomonoff induction

Top-level reasoning: selecting (Cartesian) hypotheses by seeing which make the best predictions

Subsystem reasoning: the reasoning of the consequentialist reasoners who come to dominate Solomonoff induction, who will use something like naturalized induction and updateless decision theory

Genetic selection

It is possible to imagine a system that learns to play video games by finding (encodings of) policies that get high scores on training games, and combining encodings of policies that do well to produce new policies.

Top-level reasoning: genetic selection

Subsystem reasoning: whatever reasoning the policies are doing (which is something like "predicting the consequences of different actions")

In the Solomonoff induction case and this case, if the algorithm is run with enough computation, the subsystem reasoning is likely to overwhelm the top-level reasoning (i.e. the system running Solomonoff induction or genetic selection will eventually come to be dominated by opaque consequentialist reasoners).
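To make the genetic selection example concrete, here is a minimal sketch. Everything in it is a hypothetical illustration rather than a description of any real system: the toy "game", the two-number policy encoding, and the names `play` and `genetic_selection` are all my assumptions. The point is only the division of labor: the top-level algorithm scores, selects, and recombines encodings without ever inspecting what a policy is "thinking", while the decision rule inside `play` stands in for the subsystem reasoning.

```python
import random


def play(policy, rng, steps=50):
    """Toy game (hypothetical): see x in [-1, 1], score a point when the
    action (0 or 1) matches whether x > 0.3.

    The policy is an opaque encoding [w, b]; its internal decision rule is
    the "subsystem reasoning" that the top level never looks inside.
    """
    score = 0
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        action = 1 if policy[0] * x + policy[1] > 0 else 0
        score += int(action == (x > 0.3))
    return score


def genetic_selection(generations=30, population_size=40, seed=0):
    """Top-level reasoning: keep high-scoring encodings and recombine them."""
    rng = random.Random(seed)
    population = [
        [rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(population_size)
    ]
    for _ in range(generations):
        # Score every policy on training games and keep the top quarter.
        ranked = sorted(population, key=lambda p: play(p, rng), reverse=True)
        survivors = ranked[: population_size // 4]
        # Combine encodings of policies that did well, plus a small mutation.
        children = []
        while len(survivors) + len(children) < population_size:
            a, b = rng.sample(survivors, 2)
            children.append(
                [rng.choice(pair) + rng.gauss(0, 0.1) for pair in zip(a, b)]
            )
        population = survivors + children
    # Pick the final policy using a longer evaluation on a fixed set of games.
    return max(
        population, key=lambda p: play(p, random.Random(seed + 1), steps=500)
    )


if __name__ == "__main__":
    best = genetic_selection()
    print("best encoding:", best)
    print("score on 1000 fresh steps:", play(best, random.Random(123), steps=1000))
```

In this toy the policies are far too small to contain anything like a consequentialist reasoner; the worry in the text concerns what happens when the policy space is rich enough, and the search long enough, for such reasoners to be found.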

Good consequentialist reasoning

Humans are capable of good consequentialist reasoning (at least in comparison to current AI systems). Humans can:

make medium-term predictions about complex systems containing other humans
make plans that take months or years to execute
learn and optimize proxies for long-term success (e.g. learning skills, gaining money)
reason about how to write a computer program

and so on. Current AI systems are not capable of good consequentialist reasoning. Superintelligent AGI systems would be capable of good consequentialist reasoning (though superintelligent narrow AI systems might not be, at least not in full generality).

The concern

Using these concepts, MIRI's main concern motivating HRAD research can be stated as something like:

1. The first AI systems capable of pivotal acts will use good consequentialist reasoning.
2. The default AI development path will not produce good consequentialist reasoning at the top level.
3. Therefore, on the default AI development path, the first AI systems capable of pivotal acts will have good consequentialist subsystem reasoning but not good consequentialist top-level reasoning.
4. Consequentialist subsystem reasoning will likely come "packaged with a random goal" in some sense, and this goal will not be aligned with human interests.
5. Therefore, the default AI development path will produce, as the first AI systems capable of pivotal acts, AI systems with goals not aligned with human interests, causing catastrophe.

Note that, even if the AI system is doing good consequentialist reasoning at the top level rather than in subsystems, this top-level reasoning must still be directed towards the correct objective for the system to be aligned. So HRAD research does not address the entire AI alignment problem.

Possible paths

Given this concern, a number of possible paths to aligned AI emerge:

Limited/tool AI

One might reject premise 1 and attempt to accomplish pivotal acts using AI systems that do not use good consequentialist reasoning. Roughly, the proposal is to have humans do the good consequentialist reasoning, and to use AI systems as tools.

The main concern with this proposal is that a system of humans and limited AIs might be much less effective (for a given level of computing resources) than an AI system capable of good consequentialist reasoning. In particular, (a) a limited AI might require a lot of human labor to do the good consequentialist reasoning, and (b) human consequentialist reasoning is likely to be less effective than superintelligent AI consequentialist reasoning.

The main hope, despite this concern, is that either "general consequentialist reasoning" is not particularly important for the kinds of tasks people will want to use AI systems for (including pivotal acts), or that some sort of global coordination will make the efficiency disadvantage less relevant.

Example research topics:

Hope that top-level reasoning stays dominant on the default AI development path

Currently, it seems like most AI systems' consequentialist reasoning is explainable in terms of top-level algorithms. For example, AlphaGo's performance is mostly explained by MCTS and the way it's trained through self-play. The subsystem reasoning is subsumed by the top-level reasoning and does not overwhelm it.

One could hope that algorithms likely to be developed in the future by default (e.g. model-based reinforcement learning) continue to be powerful enough that the top-level consequentialist reasoning is more powerful than subsystem consequentialist reasoning.

The biggest indication that this might not happen by default is that we currently don't have an in-principle theory for good reasoning (e.g. we're currently confused about logical uncertainty and multi-level models), and it doesn't look like these theories will be developed without a concerted effort. Usually, theory lags behind common practice.

Despite this, a possible reason for hope is that perhaps it's possible for AI researchers to develop enough tacit understanding of these theories for practical purposes. Currently, algorithms such as MCTS are implicitly handling some subproblem of "logical uncertainty" without having a full formal theory, and this does not seem problematic yet. It's conceivable that future algorithms will be similar to MCTS and implicitly handle larger parts of these theories in a way as well-understood as MCTS, such that good consequentialist reasoning in subsystems does not overwhelm the top-level consequentialist reasoning.

"MIRI" has a strong intuition that this won't be the case, and personally I'm somewhat confused about the details; see Nate's comments below.

On this path, the most important research topics are those that relate to directing top-level consequentialist reasoning (implemented using algorithms on the default AI development path) towards useful objectives. (Note that these research problems are also important on other paths; goals have to be specified at some point in all cases).

Example research topics:

Avoiding negative side effects
Avoiding reward hacking
Robustness to distribution shift
Safe exploration
Inverse reinforcement learning
Neural network transparency

(research topics like these are discussed in Concrete Problems in AI Safety and Alignment for Advanced Machine Learning Systems)

Figure out some core of good consequentialist reasoning and ensure that AI is developed through this paradigm

This is the main purpose of MIRI's research in HRAD. The main hope is that there is some simple core of good reasoning that can be discovered through theoretical research.

On this pathway, there isn't currently a clean argument that the right way to research good consequentialist reasoning is to study the particular MIRI research topics such as decision theory. One could imagine other approaches to studying good consequentialist reasoning (e.g. thinking about how to train model-based reinforcement learners). I think the focus on problems like decision theory is mostly based on intuitions that are (currently) hard to explicitly argue for.

Example research topics:

Logical uncertainty
Decision theory
Multi-level models
Vingean reflection

(see the agent foundations technical agenda paper for details)

Figure out how to align a "messy" AI whose good consequentialist reasoning is in a subsystem

This is the main part of Paul Christiano's research program. Disagreements about the viability of this approach are quite technical; I have previously written about some aspects of this disagreement here.

Example research topics:

Interaction with task AGI

Given this concern, it isn't immediately clear how task AGI fits into the picture. I think the main motivation for task AGI is that it alleviates some aspects of this concern but not others; ideally it requires knowing fewer aspects of good consequentialist reasoning (e.g. perhaps some decision-theoretic problems can be dodged), and has subsystems "small" enough that they will not develop good consequentialist reasoning independently.

Conclusion

I hope I have clarified what the main argument motivating HRAD research is, and what positions one can take on this argument. There seem to be significant opportunities for further clarification of arguments and disagreements, especially the MIRI intuition that there is a small core of good consequentialist reasoning that is important for AI capabilities and that can be discovered through theoretical research.

As I noted when we chatted about this in person, my intuition is less "there is some small core of good consequentialist reasoning (it has “low Kolmogorov complexity” in some sense), and this small core will be quite important for AI capabilities" and more "good consequentialist reasoning is low-K and those who understand it will be better equipped to design AGI systems where the relevant consequentialist reasoning happens in transparent boxes rather than black boxes."

Indeed, if I thought one had to understand good consequentialist reasoning in order to design a highly capable AI system, I'd be less worried by a decent margin.

The way I wrote it, I didn't mean to imply "the designers need to understand the low-K thing for the system to be highly capable", merely "the low-K thing must appear in the system somewhere for it to be highly capable". Does the second statement seem right to you?

(perhaps a weaker statement, like "for the system to be highly capable, the low-K thing must be the correct high-level understanding of the system, and so the designers must understand the low-K thing to understand the behavior of the system at a high level", would be better?)

The second statement seems pretty plausible (when we consider human-accessible AGI designs, at least), but I'm not super confident of it, and I'm not resting my argument on it.

The weaker statement you provide doesn't seem like it's addressing my concern. I expect there are ways to get highly capable reasoning (sufficient for, e.g., gaining decisive strategic advantage) without understanding low-K "good reasoning"; the concern is that said systems are much more difficult to align.

Thanks for the write-up, this is helpful for me (Owen).

My initial takes on the five steps of the argument as presented, in approximately decreasing order of how much I am on board:

Number 3 is a logical entailment, no quarrel here
Number 5 is framed as "therefore", but adds the assumption that this will lead to catastrophe. I think this is quite likely if the systems in question are extremely powerful, but less likely if they are of modest power.
Number 4 splits my intuitions. I begin with some intuition that selection pressure would significantly constrain the goal (towards something reasonable in many cases), but the example of Solomonoff Induction was surprising to me and makes me more unsure. I feel inclined to defer intuitions on this to others who have considered it more.
Number 2 I don't have a strong opinion on. I can tell myself stories which point in either direction, and neither feels compelling.
Number 1 is the step I feel most sceptical about. It seems to me likely that the first AIs which can perform pivotal acts will not perform fully general consequentialist reasoning. I expect that they will perform consequentialist reasoning within certain domains (e.g. AlphaGo in some sense reasons about consequences of moves, but has no conception of consequences in the physical world). This isn't enough to alleviate concern: some such domains might be general enough that something misbehaving in them would cause large problems. But it is enough for me to think that paying attention to scope of domains is a promising angle.

For #5, it seems like "capable of pivotal acts" is doing the work of implying that the systems are extremely powerful.
For #4, I think that selection pressure does not constrain the goal much, since different terminal goals produce similar convergent instrumental goals. I'm still uncertain about this, though; it seems at least plausible (though not likely) that an agent's goals are going to be aligned with a given task if e.g. their reproductive success is directly tied to performance on the task.
Agree on #2; I can kind of see it both ways too.
I'm also somewhat skeptical of #1. I usually think of it in terms of "how much of a competitive edge does general consequentialist reasoning give an AI project" and "how much of a competitive edge will safe AI projects have over unsafe ones, e.g. due to having more resources".

For #5, OK, there's something to this. But:

It's somewhat plausible that stabilising pivotal acts will be available before world-destroying ones;
Actually there's been a supposition smuggled in already with "the first AI systems capable of performing pivotal acts". Perhaps there will at no point be a system capable of a pivotal act. I'm not quite sure whether it's appropriate to talk about the collection of systems that exist being together capable of pivotal acts if they will not act in concert. Perhaps we'll have a collection of systems which if aligned would produce a win, or if acting together towards an unaligned goal would produce catastrophe. It's unclear that, if they each have different unaligned goals, we necessarily get catastrophe (though it's certainly not a comfortable scenario).

I like your framing for #1.

I agree that things get messier when there is a collection of AI systems rather than a single one. "Pivotal acts" mostly make sense in the context of local takeoff. In nonlocal takeoff, one of the main concerns is that goal-directed agents not aligned with human values are going to find a way to cooperate with each other.

What work is step #1 doing here? It seems like steps #2-5 would still hold even if the AGI in question were using "bad" consequentialist reasoning (e.g. domain-limited/high-K/exploitable/etc.).

In fact, is it necessary to assume that the AGI will be consequentialist at all? It seems highly probable that the first pivotal act will be taken by a system of humans+AI that is collectively behaving in a consequentialist fashion (in order to pick out a pivotal act from the set of all actions). If so, do arguments #2-#5 not apply equally well to this system as a whole, with "top-level" interpreted as something like "transparent to humans within the system"?

This post helped me understand HRAD a lot better. I'm quite confident that subsystems (SS) will be smarter than top-level systems (TS) (because meta-learning will work). So on that it seems we agree. Although, I'm not sure we have the same thing in mind by "smarter" (e.g., I don't mean that SSs will use some kind of reasoning which is different from model-based RL, just that we won't be able to easily/tractably identify what algo is being run at the SS level, because: 1. interpretability will be hard and 2. it will be a pile of hacks that won't have a simple, complete description).

I think this is the main disagreement: I don't believe that SS will work better because it will stumble upon some low-K reasoning core; I just think SS will be much better at rapid iterative improvements to AI algos. Actually, it seems possible that, lacking some good TS reasoning, SS will eventually hack itself to death :P.

I'm still a bit put-off by talking about TS vs. SS being "dominant", and I think there is possibly some difference of views lurking behind this language.

I kinda get your point here, but it would work better with specific examples, including some non-trivial toy examples of failures for superintelligent agents.

It seems like the Solomonoff induction example is illustrative; what do you think it doesn't cover?
