New paper: AGI Agent Safety by Iteratively Improving the Utility Function