I'm pleased to announce a new paper from MIRI: Toward Idealized Decision Theory.


This paper motivates the study of decision theory as necessary for aligning smarter-than-human artificial systems with human interests. We discuss the shortcomings of two standard formulations of decision theory, and demonstrate that they cannot be used to describe an idealized decision procedure suitable for approximation by artificial systems. We then explore the notions of strategy selection and logical counterfactuals, two recent insights into decision theory that point the way toward promising paths for future research.

Following the Corrigibility paper, this is the second in a series of six papers motivating MIRI's active research areas. Also included in the series will be a technical agenda, which motivates all six research areas and describes the reasons why we have selected these topics in particular, and an annotated bibliography, which compiles a fair bit of related work. I plan to post one paper every week or two for the next few months.

I've decided to start with the decision theory paper, as it's one of the meatiest. This paper compiles and summarizes quite a bit of work on decision theory that was done right here on LessWrong. There is a lot more to be said on the subject of decision theory than can fit into a single paper, but I think this one does a fairly good job of describing why we're interested in the field and summarizing some recent work in the area. The introduction is copied below. Enjoy!

As artificially intelligent machines grow more capable and autonomous, the behavior of their decision procedures becomes increasingly important. This is especially true in systems possessing great general intelligence: superintelligent systems could have a massive impact on the world (Bostrom 2014), and if a superintelligent system made poor decisions (by human standards) at a critical juncture, the results could be catastrophic (Yudkowsky 2008). When constructing systems capable of attaining superintelligence, it is important for them to use highly reliable decision procedures.

Verifying that a system works well in test conditions is not sufficient for high confidence. Consider the genetic algorithm of Bird and Layzell (2002), which, if run on a simulated representation of a circuit board, would have evolved an oscillating circuit. Running in reality, the algorithm instead re-purposed the circuit tracks on its motherboard as a makeshift radio to amplified oscillating signals from nearby computers. Smarter-than-human systems acting in reality may encounter situations beyond both the experience and the imagination of the programmers. In order to verify that an intelligent system would make good decisions in the real world, it is important to have a theoretical understanding of why that algorithm, specifically, is expected to make good decisions even in unanticipated scenarios.

What does it mean to "make good decisions"? To formalize the question, it is necessary to precisely define a process that takes a problem description and identifies the best available decision (with respect to some set of preferences1). Such a process could not be run, of course; but it would demonstrate a full understanding of the problem of decision-making. If someone cannot formally state what it means to find the best decision in theory, then they are probably not ready to construct heuristics that attempt to find the best decision in practice.

At first glance, formalizing an idealized process which identifies the best decision in theory may seem trivial: iterate over all available actions, calculate the utility that would be attained in expectation if that action were taken, and select the action which maximizes expected utility. But what are the available actions? And what are the counterfactual universes corresponding to what "would happen" if an action "were taken"? These questions are more difficult than they may seem.

The difficulty is easiest to illustrate in a deterministic setting. Consider a deterministic decision procedure embedded in a deterministic environment. There is exactly one action that the decision procedure is going to select. What, then, are the actions it "could have taken"? Identifying this set may not be easy, especially if the line between agent and environment is blurry. (Recall the genetic algorithm repurposing the motherboard as a radio.) However, action identification is not the focus of this paper.

This paper focuses on the problem of evaluating each action given the action set. The deterministic algorithm will only take one of the available actions; how is the counterfactual environment constructed, in which a deterministic part of the environment does something it doesn't? Answering this question requires a satisfactory theory of counterfactual reasoning, and that theory does not yet exist.

Many problems are characterized by their idealized solutions, and the problem of decision-making is no exception. To fully describe the problem faced by intelligent agents making decisions, it is necessary to provide an idealized procedure which takes a description of an environment and one of the agents within, and identifies the best action available to that agent. Philosophers have studied candidate procedures for quite some time, under the name of decision theory. The investigation of what is now called decision theory stretches back to Pascal and Bernoulli; more recently decision theory has been studied by Wald (1939), Lehmann (1950), Jeffrey (1965), Lewis (1981), Joyce (1999), Pearl (2000) and many others.

Various formulations of decision theory correspond to different ways of formalizing counterfactual reasoning. Unfortunately, the standard answers from the literature do not allow for the description of an idealized decision procedure. Two common formulations and their shortcomings are discussed in Section 2. Section 3 argues that these shortcomings imply the need for a better theory of counterfactual reasoning to fully describe the problem that artificially intelligent systems face when selecting actions. Sections 4 and 5 discuss two recent insights that give some reason for optimism and point the way toward promising avenues for future research. Nevertheless, Section 6 briefly discusses the pessimistic scenario in which it is not possible to fully formalize the problem of decision-making before the need arises for robust decision-making heuristics. Section 7 concludes by tying this study of decision theory back to the more general problem of aligning smarter-than-human systems with human interests.

1: For simplicity, assume von Neumann-Morgenstern rational preferences, that is, preferences describable by some utility function. The problems of decision theory arise regardless of how preferences are encoded.

New Comment
22 comments, sorted by Click to highlight new comments since:

Posting here to let the authors know I am aware this exists, may have more to say later (to see if I can unconfuse myself on my own).

I like the way that the paper is structured! It contains the clearest motivation for studying logical counterfactuals that I've yet seen.

A possible typo:

"Consider what happens if A() = const Refuse: Then A() != s, and so A() = s implies anything"

I think you left out a "for all s != Refuse" or an equivalent. Actually, I'd just go with "if Provable(A() != s)".

Nice catch. Fixed. Thanks for the proof reading, and also for the kind words!

Suppose you look back at some past events. Can you tell if a decision-theoretic agent was at work? If so, what are the signs of optimization having been applied?

Well, since an agent is embedded within its environment, the result of its optimizing must still conform to said environment's rules. You can't look for "impossible" results since the agent cannot achieve any, being as much a slave to the system as any non-agent process. So we'd need some measure of probable/natural result, to screen for improbable-to-be-natural outcomes? However, that would be circular reasoning, since to define what's "improbable to occur without agent intervention" versus "probable to occur without agent intervention" would presuppose a way to detect agenty behavior.

Best I can come up with ad-hoc would be a small subcomponent of the environment affecting a large part of the environment, shaping what the negentropy is used on. Such a "determining nexus" would be a prime candidate for an active decision-theoretic agent.

However, this approach presuposes two conditions: 1) That the agent is powerful enough for its optimizations to [a|e]ffect its environment, and 2) that the agent's goals demand a change in said environment. An agent which either isn't powerful enough, or is content enough with the status quo to stay passive (whose utility function is in alignment with the environment's detected state), cannot be detected in principle (or so I'd surmise).

So what I propose is more like a low-sensitivity screening-test for agents than a fully general solution. I'm not sure what could be done, even in principle, to encompass agents not meeting criteria 1 and/or 2. Probably just define those as not-agents and be done with it ;-).

In case of 2, you could simply manipulate the environment to see what elicits "a reaction" (in the "small part" affecting "large part" sense, especially if that "small part" was reverting the "large part" to the previous state, as much as the available negentropy post-manipulation would allow for in any case). For each manipulation which does not elicit such a reaction, a certain class of agent (that with a utility function that would have necessitated a reaction) could be ruled out.

Really smart agents could try to evade such detection algorithms by embedding themselves more broadly. One could imagine an agent doing so until it is so ubiquituous that the gullible detectors think of it as a natural phenomenon and called it Higgs field, or somesuch. :-)

I'm onto you.

I asked the question originally because:

  • it should be easier to analyze a now static configuration laid bare before you, like a video you can wind back and forth as desired, than predict something that hasn't occurred yet.

  • there should be a way to detect agency in retrospect, otherwise it's no agency at all. For simplicity (an extremely underrated virtue on this forum) let's take an agent which does not care about evading detection.

Re your conditions 1 and 2, feel free to presuppose anything which makes the problem simpler, without cheating. By cheating I mean relying on human signs of agency, like known artifacts of humanity, such as spears or cars.


A proper solution to this problem would be an optimal decision theory. Consider the decision itself as a random variable, then take some epistemic model of the world and infer, from a third-person point of view, what an optimal agent with certain knowledge and preferences should have done. Then output that decision.

I am not talking about optimal at all. Just being able to forensically detect any sign of agency rather than... what?

Minor nitpicks:

Would have been less suspenseful to mention CDT and UDT in the introduction.

Top of second half of page 8, you say a strategy-selecting agent doesn't update its action. This confused me, since strategies can have different observations map onto different actions - probably better to say something like it doesn't update its strategy based on observations that are already in the strategy.

You switch over to talking about "algorithms" a lot, without talking about what that is. Maybe something like "An agent's decision algorithm is the abstract specification of how it maps observations onto actions." You can use this to explain the notation A()=s; both sides of the equation are maps from observations to actions, the right side is just concrete rather than abstract.

When you say "the environment is an algorithm," you could cash this out as "the environment has an abstract specification that maps outputs of a strategy onto outcomes."

I had an idea, and was wondering what its fatal flaw was. For UDT, what happens if, instead of proving theorems of the form "actionx --> utilityx" , it proves theorems of the form "PA+actionx |- utilityx"?

At a first glance, this seems to remove the problem of spurious counterfactuals implying any utility value, but there's probably something big I'm missing.

PA + "A() = x" may be inconsistent (proving, e.g., that 0=1 etc.).

What happens if it only considers the action if it both failed to find "PA+A()=x" inconsistent and found a proof that PA+A()=x proves U()=x? Do an inconsistency check first and only consider/compare the action if the inconsistency check fails.

Then you get spurious proofs of inconsistency :-)

(If PA is strong enough to prove what the agent will do, then PA + "A()=x" is only consistent for one particular action x, and the specific action x for which it is consistent will be up for grabs, depending upon which proof the inconsistency checker finds.)

Got it. Thank you!

Typo, "amplified" vs "amplify":

"on its motherboard as a makeshift radio to amplified oscillating signals from nearby computers"

Fixed, thanks!

Important typo in the Evidential Blackmail section:

"The researcher, knowing this, would send the message if there was a scandal"

should be

"The researcher, knowing this, would send the message if there was not a scandal"

Oh, yeah, that was exactly the opposite of what I meant. Fixed, thanks!


Given a sufficiently accurate description represented as a Bayesian probability distribution over propositions about the environment, the proposition “the agent takes action a” has probability zero whenever the agent will not, in fact, take action a.

That's not actually true: it assumes full, post-hoc information, the famous "completed infinity". To know what action the agent will actually take, in general, requires a Halting Oracle, as you can't know the output of an algorithm, in the general case, without running it.

But I'm guessing this is all getting back towards Logical Uncertainty again, so let me just switch to Google Groups and write things there.

The first problem is to take a full description of an environment and an agent, and identify the best action available to that agent, explicitly assuming "full, post-hoc information." At this level we're not looking for a process that can be run, we're looking for a formal description of what is meant by "best available action." I would be very surprised if the resulting function could be evaluated within the described environment in general, and yeah, it will "require a halting oracle" (if the environment can implement Turing machines). Step one is not writing a practical program, step one is describing what is meant by "good decision." If you could give me a description of how to reliably identify "the best choice available" which assumed not only a halting oracle but logical omniscience and full knowledge of true arithmetic, that would constitute great progress. (The question is still somewhat ill-posed, but that's part of the problem: a well-posed question frames its answer.) At this level we're going to need something like logical counterfactuals, but we won't necessarily need logical uncertainty.

The second problem is figuring out how to do something kinda like evaluating that function inside the environment on a computer with unlimited finite computing power. This is the level where you need logical uncertainty etc. The second problem will probably be much easier to answer given an answer to the first question, though in practice I expect both problems will interact a fair bit.

Solving these two problems still doesn't give you anything practical: the idea is that answers would reveal the solution which practical heuristics must approximate if they're going to act as intended. (It's hard to write a program that reliably selects good decisions even in esoteric situations if you can't formalize what you mean by "good decisions"; it's easier to justify confidence in heuristics if you understand the solution they're intended to approximate; etc.)


Hence my switching to the MIRIx mailing list and typing out a technical post.