MIRI Research Guide

18shminux

18Kaj_Sotala

14AlexMennen

16So8res

7AlexMennen

5So8res

2[anonymous]

9devi

9Rob Bensinger

9So8res

5devi

5[anonymous]

16So8res

3[anonymous]

18IlyaShpitser

6[anonymous]

6So8res

13Kaj_Sotala

11So8res

2Kaj_Sotala

2[anonymous]

14So8res

7[anonymous]

9So8res

1Kawoomba

1[anonymous]

9Artaxerxes

5[anonymous]

5Metus

22Manfred

1[anonymous]

5Toggle

4kagt06

1Toggle

1kagt06

4Manfred

6So8res

1IlyaShpitser

1Manfred

3Curiouskid

3[anonymous]

1[anonymous]

5[anonymous]

3kgalias

2Faustus2

3So8res

2Capla

4lukeprog

1[anonymous]

1So8res

0[anonymous]

-1TheAncientGeek

2Jayson_Virissimo

1dxu

1TheAncientGeek

5dxu

1TheAncientGeek

4Kawoomba

4EHeller

3[anonymous]

3Kawoomba

3[anonymous]

-2TheAncientGeek

New Comment

Excellent guide! Intentionally or not, it also provides a succinct summary of MIRI's achievements in a very accessible way. It should probably be, after slight editing, prominently displayed on the MIRI site under a heading like achievements/results/open problems, not just as a "research guide" link.

Great work!

Finally, a note on content: the guide below discusses a number of FAI research subfields. The goal is to overview, rather than motivate, those subfields. These sketches are not intended to carry any arguments. Rather, they attempt to convey our current conclusions to readers who are already extending us significant charity. We're hard at work producing a number of documents describing why we think these particular subfields are important. (The first was released a few weeks ago, the rest should be published over the next two months.) In the meantime, please understand that the research guide is not able nor intended to provide strong motivation for these particular problems.

This might be good to mention on the canonical guide page as well.

many expect that we may ultimately need to abandon VNM rationality in order to construct Friendly agents

Wait, what?

Yep. It arises in various forms:

- Many people expect that it's sane to discuss unbounded utility functions (which violates continuity, see e.g. this post).
- It's not yet clear that VNM agents can be corrigible, and while the VNM axioms themselves seem pretty sane, there may be some hidden assumption that prevents us from easily designing agents that exhibit corrigible (or other desirable) behavior.
- Because intelligent agents with preferences characterized by a utility function seem very prone to what Bostrom calls "perverse instantiations", some expect that we might want to put effort into figuring out how to build a system that is very tightly constrained in some way such that asking about their preference ordering over universe-histories does not make sense.
- Others variously point out that humans are incoherent, or that perfect coherence requires logical omniscience and is generally impractical.
- Finally, there are a number of people who reason that if agents tend to converge on preferences characterizable by utility functions then getting superintelligence right is really hard, and that would be bad, so therefore it would be bad to "use utility functions."

(Personally, I think the first objection is valid, the second is plausible, the third is dubious, the fourth is missing the point, and the fifth is motivated reasoning, but this is a can of worms that I was hoping to avoid in the research guide.)

I agree with you about 3, 4, and 5.

I don't think that 1 is a valid objection, because if you have non-Archimedean preferences, then it is rational to forget about your "weak" preferences (preferences that can be reversed by mixing in arbitrarily small probabilities of something else) so that you can free up computational resources to optimize for your "strong" preferences, and if you do that, your remaining preferences are Archimedean. See the "Doing without Continuity" section of http://lesswrong.com/lw/fu1/why_you_must_maximize_expected_utility/.

I suppose it's conceivable that 2 could be a reason to give up VNM-rationality, but I'm pretty skeptical. One possible alternative to addressing the corrigibility problem is to just get indirect normativity right in the first place, so you don't need the programmer to be able to tamper with the value system later. Of course, you might not want to rely on that, but I suspect that if a solution to corrigibility involves abandoning VNM-rationality, you end up increasing the chance that both the value system isn't what you wanted and the corrigibility solution doesn't work either, since now the framework for the value system is probably an awful kludge instead of a utility function.

(By the way, I like the guide as a whole and I appreciate you putting it together. I feel kind of bad about my original comment consisting entirely of a minor quibble, but I have a habit of only commenting about things I disagree with.)

Thanks! FWIW, I don't strongly expect that we'll have to abandon VNM at this point, but I do still have a lot of uncertainty in the area.

I will be much more willing to discard objection 1 after resolving my confusion surrounding Pascal muggings, and I'll be much more willing to discard objection 2 once I have a formalization of "corrigibility." The fact that I still have some related confusions, combined with the fact that (in my experience) mathematical insights often reveal hidden assumptions in things I thought were watertight, combined with a dash of outside view, leads me to decently high credence (~15%, with high anticipated variance) that VNM won't cut it.

The recommended order for the papers seems really useful. I was a bit lost about where to start last time I tried reading a chunk of MIRI's research.

The old course list mentioned many more courses, in particular ones more towards Computer Science rather than Mathematics (esp. there is no AI book mentioned). Is this change mainly due to the different aims of the guides, or does it reflect an opinion in MIRI that those areas are not more likely to be useful than what a potential researcher would have studied otherwise?

I also notice that within the subfields of Logic, Model Theory seems to be replaced by Type Theory. Is this reprioritization due to changed beliefs about which is useful for FAI, or due to differences in mathematical taste between you and Louie?

Also, if you're interested in Type Theory in the foundational sense the Homotopy Type Theory book is probably more exciting since that project explicitly has this ambition.

It looks like *Artificial Intelligence: A Modern Approach* has been re-added to the up-to-date course list, near the bottom. Some of the books removed from the old course list will get recommended in the introduction to the new Sequences eBook, where they're more relevant: *The Oxford Handbook of Thinking and Reasoning*, *Thinking and Deciding*, *Harry Potter and the Methods of Rationality*, *Good and Real*, and *Universal Artificial Intelligence*.

Boolos et al.'s *Computability and Logic* replaces Mendelson's *Introduction to Mathematical Logic*, Sipser's *Introduction to the Theory of Computation*, and Cutland's *Computability*. Jaynes' *Probability Theory* and Koller/Friedman's *Probabilistic Graphical Models* replace Mitzenmacher/Upfal's *Probability and Computing* and Feller's *Introduction to Probability Theory*. *Godel Escher Bach* and the recommendations on functional programming, algorithms, numerical analysis, quantum computing, parallel computing, and machine learning are no more. If people found some of the removed textbooks useful, perhaps we can list them on a LW wiki page like http://wiki.lesswrong.com/wiki/Programming_resources.

The recommended order for the papers seems really useful.

Thanks! :-D Let me know if you want any tips/advice if and when you start on another read-through.

The old course list mentioned many more courses ... Is this change mainly due to the different aims of the guides, or does it reflect an opinion in MIRI that those areas are not more likely to be useful than what a potential researcher would have studied otherwise?

Mostly different aims of the guides. I think Louie's criterion was "subjects that seem useful or somewhat relevant to FAI research," and was developed before MIRI pivoted towards examining the technical questions.

My criterion is "prerequisites that are directly necessary to learning and understanding our active technical research," which is a narrower target.

(esp. there is no AI book mentioned)

This is representative of the difference --- it's quite nice to know what modern AI can do, but that doesn't have too much relevance to the current open technical FAI problems, which are more geared towards things like putting foundations under fields where it seems possible to get "good enough to run but not good enough to be safe" heuristics. Knowing how MDPs work is *useful*, but it isn't really necessary to understand our active research.

I also notice that within the subfields of Logic, Model Theory seems to be replaced by Type Theory.

Not really. Rather, that particular Model Theory textbook was rather brutal, and you only need the first two or so chapters to understand our Tiling Agents research, and it's *much* easier to pick up that knowledge using an "intro to logic" textbook. The "model theory" section is still quite important, though!

if you're interested in Type Theory in the foundational sense the Homotopy Type Theory book is probably more exciting

It may be more exciting, but the HoTT book has a bad habit of sending people down the homotopy rabbit hole. People with CS backgrounds will probably find it easier to pick up other type theories. (In fact, Church's "simple type theory" paper may be enough instead of an entire textbook... maybe I'll update the suggestions.)

But yeah, HoTT certainly is pretty exciting these days, and the HoTT book is a fine substitute for the one in the guide :-)

It may be more exciting, but the HoTT book has a bad habit of sending people down the homotopy rabbit hole. People with CS backgrounds will probably find it easier to pick up other type theories. (In fact, Church's "simple type theory" paper may be enough instead of an entire textbook... maybe I'll update the suggestions.)

Yeah, it could quite easily sidetrack people. But simple type theory, simply wouldn't do for foundations since you can't do much mathematics without quantifiers, or dependent types in the case of type theory. Further, IMHO, the univalence axiom is the largest selling point of type theory as foundations. Perhaps a reading guide to the relevant bits of the HoTT book would be useful for people?

it's quite nice to know what modern AI can do, but that doesn't have too much relevance to the current open technical FAI problems.

So you claim. It is *extremely* hard to take that claim seriously when the work you are doing is so far removed from practical application and ignorant of the tradeoffs made in actual AI work. Knowing what is possible with real hardware and known algorithms, and being aware of what sort of pragmatic simplifications are made in order to make general mathmatical theories computable is of prime importance. A general theory of FAI is absolutely useless if it requires simplifications -- which all real AI programs do -- that invalidate its underlying security assumptions. At the very least *Artificial Intelligence: A Modern Approach* should be on your basis list.

Changing tacks: I'm a basement AGI hacker working on singularity technology in my admitedly less than copious spare time between a real job and my family life. But I am writing code, *trying* to create an AGI. I am MIRI's nightmare scenario. Yet I care enough to be clued into what MIRI is doing, and still find your work to be absolutely without value and irrelevant to me. This should be very concerning to you.

I think you may be overestimating the current state of FAI research.

Knowing what is possible with real hardware and known algorithms, and being aware of what sort of pragmatic simplifications are made in order to make general mathmatical theories computable is of prime importance.

Ultimately, yes. But we aren't anywhere near the stage of "trying to program an FAI." We're still *mostly* at the stage where we identify fields of study (decision theory, logical uncertainty, tiling agents, etc.) in which it seems possible to stumble upon heuristics that are "good enough" to specify an intelligent system *before* understanding the field formally on a fundamental level. (For example, it seems quite possible to develop an intelligent self-modifying system before understanding "reflective trust" in theory; but it seems unlikely that such a system will be safe.)

Our research primarily focuses on putting foundations under various fields of mathematics, so that we can understand safety *in principle* before trying to develop safe practical algorithms, and therefore is not much aided by knowledge of the tradeoffs made in actual AI work.

By analogy, consider Shannon putting the foundations under computer chess in 1950, by specifying an algorithm capable of playing perfect chess. This required significant knowledge of computer science and a lot of ingenuity, but it *didn't* require 1950's-era knowledge of game-playing heuristics, nor knowledge of what then-modern programs could or could not do. He was still trying to figure out the *idealized* case, to put foundations under the field in a world where it was not yet known that a computer *could* play perfect chess. From that state of confusion, knowledge of 1950's game-playing heuristics wasn't all that necessary.

Now, of course, it took half a century to go from Shannon's idealized algorithm to a practical program that beat Kasparov, and *that* required intensive knowledge of modern practical algorithms and the relevant tradeoffs. Similarly, I do expect that we'll get to a point in FAI research where knowledge of modern AI algorithms and the tradeoffs involved is crucial.

However, we just aren't there yet.

It may *look* like all the hard work goes into the space between 1950 and Deep Blue, but in fact, the sort of foundational work done by Shannon was very important. If you tried to write a chess program capable of beating human champions *before* anyone understood an algorithm that would play perfect chess in the ideal case, you'd probably have a bad time.

We're still trying to do the FAI-equivalent of designing an *impractical* algorithm which plays perfect chess, to put foundations under what it *means* to ask for safety from a system. Otherwise, "friendliness" is just this amorphous blob with good affect attached: I don't think it's very useful to even *talk* about "friendly practical systems" before first developing a formal understanding of what we're asking for. As such, knowledge of modern AI heuristics just isn't directly important to understanding our current research.

That said, I will again note that knowledge of modern AI capabilities is net positive, and can be an asset even when doing our brand of foundational research. I appreciate your suggestion, and I may well add "AI: A modern approach" to the "other tools" section. However, it isn't currently a prereq for understanding or contributing to our active research---FAI as a field is just *that far* behind.

And I think you are underestimating what we know about AGI and friendliness (maybe not MIRI's conception of it).

Regardless, one of the things you would learn from the first few chapters of *AI: A Modern Approach* is the principles of search, of particular note that algorithms which combine forwards search (start to goal) with backwards search (goal to start) perform best. A bidirectional search which looks from both the starting point and the ending point and meets in the middle achieves *sqrt* reduction in time compared to either unidirectional case (*2b^k/2* vs *b^k*). You are advocating starting from first principles and searching ideaspace for a workable FAI design (forwards search). You should also consider simultaneously searching through the goal space of possible AGI designs for progressively more friendly architectures (backwards search, approximately), with progress in each search informing the other -- *e.g.* theoretical exploration guided in the direction of implementable designs, and only considering practical designs which are formally analyzable.

If the *sqrt* reduction seems not obvious, one way to prime your intuition is to observe that the space of implementable FAI designs is much smaller than the space of theoretical FAI designs, and that by ignoring practical concerns you are aiming for the larger, less constrained set, and possibly landing at a region of design space which is disjoint from implementable designs.

EDIT: To give an example, further work on decision theory is pretty much a useless as advanced decision theory is super-exponential in time complexity. It’s the first thing to become a heuristic in real world implementations, and existing decision theory is sufficient for learning new heuristics from an agent’s experience.

Regardless, one of the things you would learn from the first few chapters of AI: A Modern Approach is the principles of search

I have personally studied modern AI theory (not via this specific textbook, but via others), and I happen to know a fair amount about various search algorithms. I'm confused as to why you think that knowledge of search algorithms is important for FAI research, though.

I mean, many fields teach you basic principles that are applicable outside of that field, but this is true of evolutionary biology and physics as well as modern AI.

I don't deny that some understanding of search algorithms is *useful,* I'm just confused as to why you think it's *more* useful than, say, the shifts in worldview that you could get from a physics textbook.

You are advocating starting from first principles and searching ideaspace for a workable FAI design

Hmm, it appears that I failed to get my point across. We're not currently searching for workable FAI designs. We're searching for a formal description of "friendly behavior." Once we have that, we can start searching for FAI designs. *Before* we have that, the word "Friendly" doesn't mean anything specific.

the space of implementable FAI designs is much smaller than the space of theoretical FAI designs

Yes, to be sure! "Implementable FAI designs" compose the bullseye on the much wider "all FAI designs" target. But we're not at the part yet where we're creating and aiming the arrows. Rather, we're still looking for the target!

We don't know what "Friendly" means, in a formal sense. If we did know what it meant, we would be able to specify an impractical brute force algorithm specifying a system which could take unbounded finite computing power and reliably undergo an intelligence explosion and have a beneficial impact upon humanity; because brute force is *powerful*. We're trying to figure out how to write the unbounded solutions not because they're practical, but because this is how you figure out what "friendly" means in a formal sense.

(Or, in other words, the word "Friendly" is still mysterious, we're trying to ground it out. The way you do that is by figuring out what you mean given unbounded computing power and an understanding of general intelligence. Once you have *that*, you can start talking about "FAI designs," but not before.)

By contrast, we *do* have an understanding of "intelligence" in this very weak sense, in that we can specify things like AIXI (which act very "intelligently" in our universe given a pocket universe that runs hypercomputers). Clearly, there is a huge gap between the "infinite computer" understanding and understanding sufficient to build practical systems, but *we don't even have the first type of understanding* of what it means to ask for a "Friendly" system.

I definitely acknowledge that when you're searching in FAI design space, it is very important to keep practicality in mind. But we aren't searching *in* FAI design space, we're searching *for* it.

I'm confused as to why you think that knowledge of search algorithms is important for FAI research, though.

I don't think he meant to say that "knowledge of search algorithms is important for FAI research", I think he meant to say "by analogy from search algorithms, you're going to make progress faster if you research the abstract formal theory and the concrete implementation at the same time, letting progress in one guide work in the other".

I'm personally sympathetic to your argument, that there's no point in looking at the concrete implementations before we understand the formal specification in good enough detail to know what to look for in the concrete implementations... but on the other hand, I'm *also* sympathetic to the argument that if you do not also look at the concrete implementations, you may never hit the formal specifications that are actually correct.

To stretch the chess analogy, even though Shannon didn't use any 1950s knowledge of game-playing heuristics, he presumably did use something like the knowledge of chess being a two-player game that's played by the two taking alternating turns in moving different kinds of pieces on a board. If he didn't have this information to ground his search, and had instead tried to come up with a general formal algorithm for winning in *any* game (including football, tag, and 20 questions), it seems much less likely that he would have come up with anything useful.

As a more relevant example, consider the discussion about VNM rationality. Suppose that you carry out a long research program focused on understanding how to specify Friendliness in a framework built around VNM rationality, all the while research in practical AI reveals that VNM rationality is a fundamentally flawed approach for looking at decision-making, and discovers a superior framework that's much more suited for both AI design and Friendliness research. (I don't expect this to necessarily happen, but I imagine that something like that *could* happen.) If your work on Friendliness research continues while you remain ignorant of this discovery, you'll waste time pursuing a direction that can never produce a useful result, *even* on the level of an "infinite computer" understanding.

Thanks, Kaj.

To stretch the chess analogy, even though Shannon didn't use any 1950s knowledge of game-playing heuristics, he presumably did use something like the knowledge of chess being a two-player game that's played by the two taking alternating turns in moving different kinds of pieces on a board.

I agree, and I think it is important to understand computation, logic, foundations of computer science, etc. in doing FAI research. Trying to do FAI theory with no knowledge of *computers* is surely a foolish endeavor. My point was more along the lines of "modern AI textbooks mostly contain heuristics and strategies for getting good behavior out of narrow systems, and this doesn't seem like the appropriate place to get the relevant low-level knowledge."

To continue abusing the chess analogy, I completely agree that Shannon needed to know things about *chess*, but I don't think he needed to understand 1950's-era *programming* techniques (such as the formal beginnings of assembler languages and the early attempts to construct compilers). It seems to me that the field of modern AI is less like "understanding chess" and more like "understanding assembly languages" in this particular analogy.

That said, I am *not* trying to say that this is the only way to approach friendliness research. I currently think that it's one of the most promising methods, but I certainly won't discourage anyone who wants to try to do friendliness research from a completely different direction.

The only points I'm trying to make here are that (a) I think MIRI's approach *is* fairly promising, and (b) *within* this approach, an understanding of modern AI is not a prerequisite to understanding our active research.

Are there other approaches to FAI that would make significantly more use of modern narrow AI techniques? Yes, of course. (Nick Hay and Stuart Russell are poking at some of those topics today, and we occasionally get together and chat about them.) Would it be nice if MIRI could take a number of different approaches all at the same time? Yes, of course! But there are currently only three of us. I agree that it would be *nice* to be in a position where we had enough resources to try many different approaches at once, but it is currently a factual point that, *in order to understand our active research,* you don't need much narrow AI knowledge.

Thanks, that's a good clarification. May be worth explicitly mentioning something like that in the guide, too.

Kaj seems to have understood perfectly the point I was making, so I will simply point to his sibling comment. Thank you Kaj.

However your response I think reveals an even deeper disconnect. MIRI claims not to have a theory of friendliness, yet also presupposes what that theory will look like. I’m not sure what definition of friendliness you have in mind, but mine is roughly “the characteristics of an AGI which ensure it helps humanity through the singularity rather than be subsumed by it.” Such a definition would include an oracle AI -> intelligence amplification approach, for example. MIRI on the other hand appears to be aiming towards the benevolent god model in exclusion to everything else (“turn it on and walk away”).

I’m not going to try advocating for any particular approach -- I’ve done that before to Luke without much success. What I do advocate is that you do the same thing I have done and continue to do: take the broader definition of success (surviving the singularity), look at what is required to achieve that in practice, and do whatever gets us across the finish line the fastest. This *is* a race, both against UFAI and the inaction which costs the daily suffering of the present human condition.

When I did that analysis, I concluded that the benevolent god approach favored by MIRI has both the longest lead time and the lowest probability of success. Other approaches -- whose goal states are well defined -- could be achieved in the span of time it takes just to define what a benevolent god friendly AGI would look like.

I’m curious what conclusions you came to after your own assessment, assuming you did one at all.

Hmm, I'm not feeling like you're giving me any charity here. Comments such as the following all give me an impression that you're not open to actually engaging with my points:

So you claim.

...

Yet I care enough to be clued into what MIRI is doing, and still find your work to be absolutely without value and irrelevant to me. This should be very concerning to you.

...

one of the things you would learn from the first few chapters of AI: A Modern Approach is the principles of search,

...

further work on decision theory is pretty much a useless

...

assuming you did one at all.

None of these are particularly egregious, but they are all phrased somewhat aggressively, and add up to an impression that you're mostly just trying to vent. I'm trying to interpret you charitably here, but I don't feel like that's being reciprocated, and this lowers my desire to engage with your concerns (mostly by lowering my expectation that you are trying to see my viewpoint).

I also feel a bit like you're trying to argue against other people's points through me. For example, I do *not* see MIRI's active research as a "benevolent god only" approach, and I personally place low probability on a "turn it on and walk away" scenario.

Other approaches -- whose goal states are well defined -- could be achieved in the span of time it takes just to define what a benevolent god friendly AGI would look like.

Analogy: let's say that someone is trying *really hard* to build a system that takes observations and turns them into a very accurate world-model, and that the fate of humanity rides on the resulting world-model being very accurate. If someone claimed that they had a very good model-building heuristic, while lacking an understanding of information theory and Bayesian reasoning, then I would be quite skeptical of claims like "don't worry, I'm very sure that it won't get stuck at the wrong solution." Until they have a formal understanding of what it *means* for a model to get stuck, of what it *means* to use all available information, I would not be confident in their system. (How do you evaluate a heuristic before understanding what the heuristic is intended to approximate?)

Further, it seems to me implausible that they *could* become very confident in their model-building heuristic without developing a formal understanding of information theory along the way.

For similar reasons, I would be quite skeptical of any system purported to be "safe" by people with no formal understanding of what "safety" meant, and it seems implausible to me that they *could* become confident in the system's behavior without first developing a formal understanding of the intended behavior.

My apologies. I have been fruitlessly engaging with SIAI/MIRI representatives longer than you have been involved in the organization, in the hope of seing sponsored work on what I see to far more useful lines of research given the time constraints we are all working with, e.g. work on AI boxing instead of utility functions and tiling agents.

I started by showing how many of the standard arguments extracted from the sequences used in favor of MIRI's approach of direct FAI are fallacious, or at least presented unconvincingly. This didn't work out very well for either side; in retrospect I think we mostly talked past each other.

I then argued based on timelines, showing that based on available tech and information and the weak inside view that UFAI could be as close as 5-20 years away, and MIRI's own timelines did not, and still does not to my knowledge expect practical results in that short a time horizon. The response was a citation to Stuart Armstrong's paper showing an average expert opinon of AI being 50-70 years away... which was stunning considering the thesis of the paper was about just how bad it is to ask experts about questions like the ETA for human-level AI.

I then asked MIRI to consider hiring a project manager, i.e. a professional whose job it is to keep projects on time and on budget, to help make these decisions in coordinating and guiding research efforts. This suggestion was received about as well as the others.

And then I saw an updated course list from the *machine intelligence* research institute which dropped from consideration the sixty year history of people actually writing machine intelligences, both the theory and the practice that has accumulated.

So if it seemed a bit like I was trying to argue against other people's points through you, I'm sorry I guess I was. I was arguing with MIRI, which you now represent.

Regarding your example, I understand what you are saying but I don't think you are arguing against me. One way of making sure something is safe is making it unable to take actions with irreversible consequences. You don't need to be confident in a system's behavior, if that system's behavior has no effectors on the real world.

I'm all for a trustless AI running as a "physical operating system" for a positive universe. But we have time to figure out how to do that post-singularity.

Thanks---I don't really want to get into involved arguments about overall strategy on this forum, but I can address each of your points in turn. I'm afraid I only have time to sketch my justifications rather than explain them in detail, as I'm quite pressed for time these days.

I understand that these probably won't convince you, but I hope to at least demonstrate that I have been giving these sorts of things some thought.

work on AI boxing

My conclusions:

- Most types of boxing research would be premature, given how little we know about what early AGI architectures will look like.
- That said, there is some boxing work that could be done nowadays. We
*do*touch upon a lot of this stuff (or things in nearby spaces) under the "Corrigibility" banner. (See also Stuart's "low impact" posts.) - Furthermore, some of our decision theory research does include early work on boxing problems (e.g., how do you build an oracle that does not try to manipulate the programmers into giving it easier questions? Turns out this has a lot to do with how the oracle evaluates its decisions.)
- I agree that there is more work that could be done on boxing that would be positive value, but I expect it would be more speculative than, say, the tiling work.

the weak inside view that UFAI could be as close as 5-20 years away

My thoughts: "could" is ambiguous here. What probability do you put on AGI in 5 years? My personal 95% confidence interval is 5 to 150 years (including outside view, model uncertainty, etc) with a mean around 50 years and skewed a bit towards the front, and I am certainly not shooting for a strategy that has us lose 50% of the time, so I agree that we damn well better be on a 20-30 year track.

I then asked MIRI to consider hiring a project manager

I think MIRI made the right choice here. There are only three full-time FAI researchers at MIRI right now, and we're good at coordinating with each other and holding ourselves to deadlines. A project manager would be drastic overkill.

And then I saw an updated course list from the machine intelligence research institute which dropped from consideration the sixty year history of people actually writing machine intelligences

To be clear, this is no longer a course list, it's a research guide. The fact of the matter is, modern narrow AI is not necessary to understand our active research. Is it still useful for a greater context? Yes. Is there FAI work that *would* depend upon modern narrow AI research? Of course! But the subject simply isn't a prerequisite for understanding our current active research.

I do understand how this omission galled you, though. Apologies for that.

One way of making sure something is safe is making it unable to take actions with irreversible consequences.

I'm not sure what this means or how it's safe. (I wouldn't, for example, be comfortable constructing a machine that forcibly wireheads me, just because it believes it can reverse the process.)

I think that there's something useful in the space of "low impact" / "domesticity," but suggestions like "just make everything reversible" don't seem to engage with the difficulty of the problem.

You don't need to be confident in a system's behavior, if that system's behavior has no effectors on the real world.

I also don't understand what it meas to have "no effectors in the real world."

There is no such thing as "not effecting the world." Running a processor has electrical and gravitational effects on everything in the area. Wifi exists. This paper describes how an evolved clock re-purposed the circuits on its motherboard to magnify signals from nearby laptops to use as an oscillator, stunning everyone involved. *Weak* genetic programming algorithms ended up using hardware in completely unexpected ways to do things the programmers expected was impossible. So yes, we *really do* need to worry about strong intelligent processes using their hardware in unanticipated ways in order to affect the world.

See also the power of intelligence.

(Furthermore, anything that interacts with humans is essentially using humans as effectors. There's no such thing as "not having effectors on the real world.")

I agree that there's something interesting in the space of "just build AIs that don't care about the real world" (for example, constructing an AI which is only trying to affect the platonic output of its turing machine and does not care about its physical implementation), but even this has potential security holes if you look closely. Some of our decision theory work *does* touch upon this sort of possibility, but there's definitely other work to be done in this space that we aren't looking at.

But again, suggestions like "just don't let it have effectors" fail to engage with the difficulty of the problem.

The "cookbook with practical FAI advice" is in some way an optimization of a general solution. Adding constraints corresponding to limited resources / specific approaches. Making the task harder. Like adding friction to a "calculate where the trains will collide" problem.

It seems like a good idea to have *a* solution to the general problem, *something* which provably has the properties we want it to, before we deal with how that something survives the transition to actual AGI implementations.

Skipping this step (the theoretical foundation) would be tantamount to "this trick seems like it could work, for ill-defined values of 'could'".

Also, for such a cookbook to be taken seriously, and not just be more speculating, a "this derives from a general safety theorem, see the greek alphabet soup (sterile) in the appendix" would provide a much larger incentive to take the actual guidelines seriously.

In Zen, you first must know how to grow the tree before you alpha-beta prune it.

ETA: There is a case to be made that no general solution can exist (a la Halting Problem) or is practically unattainable or cannot be 'dumbed down' to work for actual approaches, and that we therefore must focus on solving only specific problems. We're not yet at that point though, IME.

See my later reply to Nate here:

Question about the Tiling Agents subfield, based on a weird idea that hit me when I was walking around.

Is the reason for all the invocations of Model Theory (rather than computability or proof-theoretic approaches to logic) the notion that the agent uses its source code, or the "source code" of its child, as a model in which it interprets its ability to prove (certainly or probabilistically) trust in the child agent?

I am overwhelmed by the work of the people involved in spin-offs from lesswrong: MIRI, CEA, CFAR, FHI, ... and any I might have forgotten.

As a fairly intelligent person involved in a physics master's degree and not at all sure that they play in the same league as the people working full-time in the organisations mentioned above, what is a higher value proposition: Working through the material with a mixture of concern for the world at large and some interest in the subject for one self to work full-time on the problems at hand or working through the degree and finding some high-paying or extremely satisfying job and donating a significant portion of one's income to a combination of the organisations mentioned above; all assuming that they care about their personal well being and their family, desiring an upper middle class lifestyle?

There is a world full of possibilities, essentially unlimited, and only a finite amount of time for any specific time of one's lifetime.

IMO, you should pick the two things you find most interesting on this list and study them. When you have a brilliant idea that nobody else has mentioned in a well-studied area, *then* it's a good time to remember how skilled everyone else is and suggest it with curiosity rather than certainty. But this should at no point dissuade you from *reading a book*.

There are times that I wish I could give a post more than one upvote, and rare cases where +infinity would be entirely appropriate. This is one of them.

Glad to see this being updated! Having a sensible map to use really motivates me towards study and helps me gamify the task. I'm working my way through the "Discrete Math" textbook right now (at about the 25% mark), and Set Theory is next on the list.

I'm new to Less Wrong and became interested in learning some of the material in the MIRI research guide when I discovered the now outdated MIRI recommended course list. I'm currently working my way through the sequences while also going through Rosen's discrete math textbook (I'm also about 25% through the book right now).

You mention that the new guide has helped to motivate you and I can relate to this. Something else that may help with motivation (thinking of myself here but could apply to others) might be to provide a way for people who plan to go through this material on their own to interact somehow (even if it's just sharing information related to their own progress, setting goals that other people can see, etc.). Also, this could create an easy way to provide MIRI with some idea of the number of people interested, what area of the guide they're in, the kind of progress they are making, and whether or not they encounter any obstacles.

One might argue that LessWrong is such a thing. :-)

Since we seem to be in more or less the same place, would you like to orchestrate some kind of coordinated study schedule? We could very easily use this site's private messages to have a brief conversation at a prescheduled time every week- hold one another to study schedules, discuss challenging exercises.

Less Wrong Study Hall is also still a reality I think, although not one I've tried personally.

Sure, I'll go ahead and send you a private message shortly. Anyone else that comes across this and is interested in participating is welcome to send me a private message.

The Less Wrong Study Hall suggestion is worth considering for this. We can discuss other possible options as well.

For the Decision Theory section, may I suggest a reference to a causality text, like Pearl 1999?

Thanks for the suggestion! Most of the relevant knowledge from *Causality* can be picked up in the recommended chapters of *Probabilistic Graphical Models*. Pearl's textbook will still get you some additional knowledge, but I'm wary of putting it as a prereq for fear of draining resources (Pearl can be a bit dense) without too much added benefit over the PGM book (which is in the "basics" section, because causal models are important in pretty much all of the subfields, not just DT.)

That's interesting. It's hard for me to tell what's easy anymore (Pinker's curse of knowledge) in this area. I do think Pearl's book isn't very hard compared to a lot of other math books, though. For example, I find "Unified Methods for Censored Longitudinal Data and Causality" a ton harder to read.

Naturally, I have a lot of arguments for reading Pearl, but my strongest argument *against* reading Pearl is that he doesn't tackle statistical issues at all, and they are very important in practice.

Just thought add links to these other "guides":

"Atoms of Neural computation": List of promising research directions for neuro-inspired AI (IOW, tries to answer the question "Deep Learning is just regression, so what could we possibly do next?")

"Physical Principles for Scalable Neural Recording": List of promising research directions for developing tools to do live recording of the brain (a separate issue from connectomics).

The following is copy-pasta'd from an email sent to the MIRIx workshops mailing list upon receiving this link:

By the way, I think we should be making a much more concerted attack on Naturalized Induction. I've given it some thought from the epistemology angle (and I've been trying to investigate the kind of algorithmic structures that can do it... but don't have time ever), and the thing about naturalized induction is that it contains the problem of building an agent that can model the same phenomenon at multiple levels of abstraction, and soundly learn about the phenomenon at all levels of abstraction based on what it sees at any of them.

If we had an agent structure that could do this, it would get a whole lot easier to specify things like a Friendly Utility Function (aka: the Hard Problem of FAI) in terms of abstractions we do know how to write down, such as, "Do What I Mean." And by "a whole lot easier", I mean it would become possible to even begin describing such things at all, rather than treating the agent's reasoning processes as a black box subject to reinforcement learning or probabilistic value learning.

I should note it would also be a nice "prestige accomplishment", since it actually consists in AI/Cog-Sci research rather than *pure* mathematics.

"Naturalized induction" is actually a mathmatical way of saying "AGI". And requesting that there be MIRI(x) workshops on naturalized induction is requesting that MIRI include actual AGI design in its scope.

Which, for the record, I agree with. I have never donated to SIAI/MIRI, although I almost did back in the day when they were funding OpenCog (I merely lacked money at the time). I will not contribute to MIRI -- money or time -- until they do bring into scope public discussion of workable AGI designs.

"Naturalized induction" is actually a mathmatical way of saying "AGI".

Well, firstly, it's not a *complete agent* in the slightest: it lacks a decision theory *and* a utility function ;-).

And requesting that there be MIRI(x) workshops on naturalized induction is requesting that MIRI include actual AGI design in its scope.

MIRI already includes naturalized induction in its scope, quite explicitly. There's just not a lot of discussion because nobody seems to have come up with a very good attack on the problem *yet*.

A quick comment, for the segment on tiling agents, on the MIRI site the recommended reading (not counting any MIRI papers) is 'a mathematical introduction to logic' by Enderton. But on this page, it instead recommends Chang and Keislers Model theory. Can this be taken to mean that both works are important recommended reading? Are they both of equal worth or should (or rather, could) one be prioritised over the other?

The one at intelligence.org is the master copy. (Chang and Keisler is useful, but unnecessarily brutal for our purposes; Enderton alone is sufficient.)

Can I access the original version somehow? In particular, it included a quote at the beginning (something to the effect of “you may feel like your not so talented compared to your uber-amazing classmates in at you prestigious program, but we very rarely find individuals who are familiar with all or most of these fields.") that I wanted to share with a math-friend who expressed a fear (which he knows to be largely irrational) that he's not smart enough to work on interesting problems.

Thanks for this - I particularly like the suggested ordering.

Is this designed to replaced the course list , or is the course list more for future tasks in that it has some actual programming subjects?

This is really awesome. I'm currently working through the books on the list. Thanks!

Would you replace any of these recommendations with a moral philosophy text? If so, which would you remove and what would you replace it with?

There's a lot of math involved in MIRI's mission right now. Anything that moral philosophy (or any other kind of philosophy) has to offer would probably be pretty useless at this point.

There's a lot of mathematicians involved in MIRIs mission right now.

If you have a lot of C# programmers in your company, you get a lot of recommendations about the importance of working in C#, and if you employ a lot of python programmers...

I don't think so. You've got your causal arrow pointing the wrong way; MIRI has a lot of mathematicians because it's been advertising for mathematicians, and it's been doing that because it needs mathematicians, because its mission involves a lot of math right now. The causal chain goes "math is needed, *therefore* mathematicians get hired", not "mathematicians are hired, *therefore* math is needed".

Math is needed for what? MIRI produces paper after paper on unimplementable idealized systems., and produces no practical solutionsm and criticizes philosophy for producing nothing practical.

Easier to solve the idealized general case first, so that you know what a solution even looks like, then adapt that to the real world with its many caveats. Divide and conquer!

There are more benefits to MIRI's approach, such as: If you find that a certain kind of system cannot exist even in an ideal environment (such as a general Halting tester), you don't need to muck around with "implementable" solutions, never knowing whether the current roadblock you face is just a function of some real-world-constraint, or fundamental in principle.

Easier to solve the idealized general case first, so that you know what a solution even looks like

This is not generally true, in my experience. Real world problems come with constraints that often make things substantially easier. The general traveling salesman problem for high N might not generally be tractable, but a specific traveling salesman problem might have all sorts of symmetry you can exploit to make things doable.

I'm reminded of a friend in grad school who created an idealized version of some field theory problems, axiomatized it, and was demonstrating the resulting system had all sorts of really useful properties. Then his adviser showed him that the only field theory that satisfied his axioms was the trivial field theory...

Easier to solve the idealized general case first, so that you know what a solution even looks like, then adapt that to the real world with its many caveats. Divide and conquer!

While this is true in math, it's *not* necessarily true in computer science, where we need constructive reasoning, but the "idealized general case" can often include paradoxical conditions such as "the agent possesses perfect information and full certainty about all aspects of its own world-model, including of itself, but still can reason coherently about counterfactuals regarding even itself." Such a creature not only doesn't make *intuitive* sense, it's basically Laplace's Demon: a walking paradox itself.

Right. MIRI's most important paper to date, *Definability of Truth in Probabilistic Logic*, isn't constructive either. However, you take what you can get.

It does depend on the problem domain a lot.

Sometimes the special case can be much easier than the fully general case, just as a DFA is a special case of a Turing Machine. In that respect, the constraints can make life a lot easier, in regards to proving certain properties about a DFA versus proving properties of an unconstrained TM.

Sometimes the special case can be harder, like going from "program a fully functional Operating System" to "program a fully functional OS which requires only 12MB of RAM".

It's correct that fully general cases can lead to impossibility results which noone should care about, since they wouldn't translate to actually implemented systems, which break some (in any case unrealistic) "ideal" condition. We shouldn't forget, after all, that no matter how powerful our future AI overlord will be, it can still be perfectly simulated by a finite state machine (no infinite tape in the real world).

(Interesting comment on Laplace's demon. I wasn't sure why you'd call it a walking paradox (as opposed to Maxwell's, what is it with famous scientists and their demons anyways), but I see there's a recent paywalled paper proving as much. Deutsch's much older *The Fabric or Reality* has some cool stuff on that as well, not that I've read it in depth.)

Right. MIRI's most important paper to date, Definability of Truth in Probabilistic Logic, isn't constructive either. However, you take what you can get.

I think there are two different kinds of constructivity being discussed here: regarding existence theorems and regarding the values of variables. We can afford to be nonconstructive about existence theorems, but if you want to characterize the value of a variable like "the optimal action for the agent to take", your solution must necessarily be constructive in the sense of being algorithmic. You can say, "the action with the highest expected utility under the agent's uncertainty at the time the action was calculated!", but of course, *that* assumes that you know how to define and calculate expected utility, which, as the paper shows, you often don't.

I brought up Laplace's Demon because it seems to me like it might be possible to treat Omega as adversarial in the No Free Lunch sense, that it might be possible that *any* decision theory can be "broken" by some sufficiently perverse situation, when we make the paradoxical assumption that our agent has unlimited computing resources *and* our adversary has unlimited computing resources *and* we can reason perfectly about each-other (ie: that Omega is Laplace's Demon *but we can reason about Omega*).

Easier to solve the idealized general case first, so that you know what a solution even looks like, then adapt that to the real world with its many caveats.

And that's something that has happened, .or will happen, .or should happen? Well. It's not something that has happened. The best .AI isn't cut down ideal systems.

There are more benefits to MIRI's approach, such as: If you find that a certain kind of system cannot exist even in an ideal environment (such as a general Halting tester), you don't need to muck around with "implementable" solutions, never

Negative results could be valuable but are far from guaranteed.

We've recently published a guide to MIRI's research on MIRI's website. It overviews some of the major open problems in FAI research, and provides reading lists for those who want to get familiar with MIRI's technical agenda.This guide updates and replaces the MIRI course list that started me on the path of becoming a MIRI researcher over a year ago. Many thanks to Louie Helm, who wrote the previous version.This guide is a bit more focused than the old course list, and points you not only towards prerequisite textbooks but also towards a number of relevant papers and technical reports in something approximating the "appropriate order."By following this guide, you can get yourself pretty close to the cutting edge of our technical research (barring some results that we haven't written up yet). If you intend to embark on that quest, you are invited to let me know; I can provide both guidance and encouragement along the way.I've reproduced the guide below. The canonical version is at intelligence.org/research-guide, and I intend to keep that version up to date. This post will not be kept current.Finally, a note on content: the guide below discusses a number of FAI research subfields. The goal is to overview, rather than motivate, those subfields. These sketches are not intended to carry any arguments. Rather, they attempt to convey our current conclusions to readers who are already extending us significant charity.We're hard at work producing a number of documents describing why we think these particular subfields are important. (The first was released a few weeks ago, the rest should be published over the next two months.) In the meantime, please understand that the research guide is not able nor intended to provide strong motivation for these particular problems.Friendly AI theory currently isn't about implementation, it's about figuring out how to ask the right questions. Even if we had unlimited finite computing resources

anda solid understanding of general intelligence, we still wouldn't know how to specify a system that would reliably have a positive impact during and after an intelligence explosion. Such is the state of our ignorance.For now, MIRI's research program aims to develop solutions that assume access to unbounded finite computing power, not because unbounded solutions are feasible, but in the hope that these solutions will help us understand which questions need to be answered in order to the lay the groundwork for the eventual specification of a Friendly AI. Hence, our current research is primarily in mathematics (as opposed to software engineering or machine learning, as many expect).

This guide outlines the topics that one can study to become able to contribute to one or more of MIRI’s active research areas.

## Table of Contents

## How to use this guide

Perhaps the shortest path to being hired as a MIRI researcher is to study the materials below, then attend the nearest MIRIx workshop a few times, then attend a MIRI workshop or two and show an ability to contribute at the cutting edge. The same path (read these materials, then work your way through some workshops) will also help if you want to research these topics at some other institution.

You can learn most of the requisite material by simply reading all of the textbooks and papers below. However, with all of the material in this guide, please do not grind away for the sake of grinding away. If you already know the material, skip ahead. If one of the active research areas fails to capture your interest, switch to a different one. If you don’t like one of the recommended textbooks, find a better one or skip it entirely. The goal is to get yourself to the front lines with a solid understanding of what our research says. Hopefully, this guide can help you achieve that goal, but don’t let it hinder you!

## The basics

It’s important to have some basic mathematical understanding before jumping directly into the active research topics. All of our research areas are well-served by a basic understanding of computation, logic, and probability theory. Below are some introductory resources to get you started.

You don’t need to go through this section chronologically. Pick up whatever is interesting, and don’t hesitate to skip back and forth between the research areas and the basics as necessary.

## Set Theory

Most of modern mathematics is formalized in set theory, and a familiarity with set theory is a great place to start.

## Computation and Logic

Computability (and the limits posed by diagonalization) are foundational to understanding what can and can't be done by computation.

chapters 1-4## Probability Theory

Probability theory is central to an understanding of rational agency, and an understanding of probability theory is critical in all of our active research areas.

chapters 3-5 and 9## Probabilistic graphical models

This book will help flesh out an understanding of how inference can be done using probabilistic world-models.

It’s also very important to understand the concept of VNM rationality, which I recommend learning from the Wikipedia article but which can also be picked up from the original book. Von Neumann and Morgenstern showed that any agent obeying a few simple consistency axioms acts with preferences characterizable by a utility function. While many expect that we may ultimately need to abandon VNM rationality in order to construct Friendly agents, the VNM framework remains the most expressive framework we have for characterizing the behavior of sufficiently powerful agents. (For example, see the

orthogonality thesisand theinstrumental convergence thesisfrom Bostrom's "The Superintelligent Will.") The concept of VNM rationality is used throughout all our active research areas.## Corrigibility

As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences.

This field of research is basically brand-new, so all it takes in order to get up to speed is to read a paper or two:

Soares et al.'s "Corrigibility" introduces the field at large, along with a few open problems.

Armstrong's "Utility indifference" discusses one potential approach for making agents indifferent between which utility function they maximize, which is a small step towards agents that allow themselves to be modified.

Some early work in corrigibility was done in discussions on the web forum LessWrong. Most of the relevant results are captured in the above papers. However, additional historical work in this area can be read in the following blog posts:

Proper value learning through indifference outlines the original utility indifference idea. It is largely interesting for historical reasons, and is subsumed by Armstrong’s Utility Indifference paper linked above.

Our current work on corrigibility focuses mainly on a small subproblem knows as the “shutdown problem:” how do you construct an agent that shuts down upon the press of a shutdown button, and which does not have incentives to cause or prevent the pressing of the button? Within that subproblem, we currently focus on the utility indifference problem: how could you construct an agent which allows you to switch which utility function it maximizes, without giving it incentives to affect whether the switch occurs? Even if we had a satisfactory solution to the utility indifference problem, this would not yield a satisfactory solution to the shutdown problem, as it still seems difficult to adequately specify “shutdown behavior” in a manner that is immune to perverse instantiation. Stuart Armstrong has written a short series of blog posts about the specification of “reduced impact” AGIs:

## Tiling agents

An agent undergoing an intelligence explosion may need to execute many self-modifications to its core algorithms and agent architecture, one after the next. Even if the agent at the beginning of this process functioned exactly as planned, if it made a single crucial mistake in choosing one of these rewrites, the end result might be very far from the intended one.

In order to prevent this, we expect that a Friendly system would need some way to limit itself to executing self-modifications only after it has gained extremely high confidence that the resulting system would still be Friendly. Self-confidence of this form, done naively, runs afoul of mathematical problems of self-reference, and it currently seems that formal systems which can gain high self-confidence must walk a fine line between self-trust and unsoundness.

(What do we mean by “high confidence”, “self-confidence”, “self-trust”, and “formal systems”? We don’t quite know yet. Part of the problem is figuring out how to formalize these intuitive concepts in a way that avoids Gödelian pitfalls.)

The study of tiling agents is the study of agents which are able to self-modify in a highly reliable way, specifically via the study of formal systems that can gain some form of confidence in similar systems.

Recommended reading:

Chang & Keisler,

Model Theory, chs. 1-2.MIRI’s existing toy models for studying tiling agents are largely based on first order logic. Understanding this logic and its nuances is crucial in order to understand the existing tools we have developed for studying formal systems capable of something approaching confidence in similar systems.

Yudkowsky & Herreshoff, "Tiling Agents for Self-Modifying AI"

This paper introduces the field of tiling agents, and some of the toy formalisms and partial solutions that MIRI has developed thus far. The paper is a little choppy, but my walkthrough might help make it go down easier.

Yudkowsky, "The procrastination paradox"

The Löbian obstacle (a problem stemming from too little “self trust”) described in the tiling agents paper turns out to be only half the problem: many solutions to the Löbian obstacle run afoul of unsoundnesses that come from too

muchself-trust. Satisfactory solutions will need to walk a fine line between these two problems.Christiano et al., "Definability of Truth in Probabilistic Logic"

This describes an early attempt to create a formal system that can reason about itself while avoiding paradoxes of self-reference. It succeeds, but has ultimately been shown to be unsound. My walkthrough for this paper may help put it into a bit more context.

Fallenstein & Soares, "Problems of self-reference in self-improving space-time embedded intelligence"

This describes our simple suggester-verifier model used for studying tiling agents, and demonstrates a toy scenario in which sound agents can successfully tile to (e.g. gain high confidence in) other similar agents.

If you’re really excited about this research topic, there are a number of other relevant tech reports. Unfortunately, most of them don't explain their motivations well, and have not yet been put into the greater context.

Fallenstein's "Procrastination in Probabilistic Logic" illustrates how Christiano et al’s probabilistic reasoning system is unsound and vulnerable to the procrastination paradox.

Fallenstein's "Decreasing mathematical strength…" describes one unsatisfactory property of Parametric Polymorphism, a partial solution to the Lobian obstacle.

Soares' "Fallenstein's monster" describes a hackish formal system which avoids the above problem. It also showcases a mechanism for restricting an agent’s goal predicate which can also be used by Parametric Polymorphism to create a less restrictive version of PP than the one explored in the tiling agents paper.

Fallenstein's "An infinitely descending sequence of sound theories…" describes a more elegant partial solution to the Lobian obstacle, which is now among our favored partial solutions.

Yudkowsky's "Distributions allowing tiling…" takes some early steps towards probabilistic tiling settings.

An understanding of recursive ordinals provides a useful context from which to understand these results, and can be gained by reading Franzén's "Transfinite progressions: a second look at completeness."

## Logical Uncertainty

Imagine a black box, with one input chute and two output chutes. A ball can be put into the input chute, and it will come out of one of the two output chutes. Inside the black box is a Rube Goldberg machine which takes the ball from the input chute to one of the output chutes. A perfect probabilistic reasoner can be uncertain about which output chute will take the ball, but only insofar as they are uncertain about which machine is inside the black box: it is assumed that if they knew the machine (and how it worked) then they would know which chute the ball would come out. It is assumed that probabilistic reasoners are

logically omniscient, that they know all logical consequences of the things they know.In reality, we are not logically omniscient: we can know precisely which machine the box implements and precisely how the machine works, and just not have the time to deduce where the ball comes out. We reason under

logical uncertainty. A formal understanding of reasoning under logical uncertainty does not yet exist, but seems necessary in the construction of a safe artificial intelligence. (Self modification involves reasoning about the unknown output of two known programs; it seems difficult to gain confidence in any reasoning system intended to do this sort of reasoning under logical uncertainty before gaining a formal understanding of idealized reasoning under logical uncertainty.)Unfortunately, the field of logical uncertainty is not yet well-understood, and I am not aware of good textbooks introducing the material. A solid understanding of probability theory is a must; consider augmenting the first few chapters of Jaynes with Feller, chapters 1, 5, 6, and 9.

An overview of the subject can be gained by reading the following papers.

Gaifman, "Concerning measures in first-order calculi." Gaifman started looking at this problem many years ago, and has largely focused on a relevant subproblem, which is the assignment of probabilities to different models of a formal system (assuming that once the model is known, all consequences of that model are known). We are now attempting to expand this approach to a more complete notion of logical uncertainty (where a reasoner can know what the model is but not know the implications of that model), but work by Gaifman is still useful to gain a historical context and an understanding of the difficulties surrounding logical uncertainty. See also

Gaifman & Snir, "Probabilities over rich languages…"

Gaifman, "Reasoning with limited resources and assigning probabilities to arithmetic statements"

Hutter et al., "Probabilities on sentences in an expressive logic" largely looks at the problem of logical uncertainty assuming access to infinite computing power (and many levels of halting oracles). Again, we take a slightly different approach, asking how an idealized reasoner should handle logical uncertainty given unlimited but finite amounts of computing power. Nevertheless, understanding Hutter’s approach (and what can be done with infinite computing power) helps flesh out one’s understanding of where the difficult questions lie.

Demski, "Logical prior probability" provides an approximately computable logical prior. Following Demski, our work largely focuses on the creation of a prior probability distribution over logical sentences, in the hopes that understanding the creation of logical priors will lead us to a better understanding of how they could be updated, and from there a better understanding of logical uncertainty more generally.

Christiano, "Non-omniscience, probabilistic inference, and metamathematics" largely follows this approach. This paper provides some early practical considerations about the generation of logical priors, and highlights a few open problems.

## Decision theory

We do not yet understand a decision algorithm which would, given access to unlimited finite computing power and an arbitrarily accurate world-model, always take the best available action. Intuitively, specifying such an algorithm (with respect to some VNM-rational set of preferences) may seem easy: simply loop through available actions and evaluate the expected utility achieved by taking that action, and then choose the action that yields the highest utility. In practice, however, this is quite difficult: the algorithm is in fact going to choose only one of the available actions, and in order to evaluate what “would have” happened if the agent instead took a different action that it “could have” taken requires a formalization of “would” and “could”: what does it mean to say that a deterministic algorithm “could have had” a different output, and how is this circumstance (which runs counter to the laws of logic and/or physics) evaluated?

Solving this problem will require a better understanding of counterfactual reasoning; this is the domain of decision theory. Modern decision theories do not provide satisfactory methods for counterfactual reasoning, and are insufficient for use in a superintelligence. Existing methods of counterfactual reasoning turn out to be unsatisfactory both in the short term (in the sense that they fail systematically on certain classes of problems) and in the long term (in the sense that self-modifying agents reasoning using bad counterfactuals would, according to those broken counterfactuals, decide that they should not in fact fix all of their flaws): see my talk "Why aint you rich?"

We are currently in the process of writing up an introduction to decision theory as an FAI problem. In the interim, I suggest the following resources in order to understand MIRI’s decision theory research:

Peterson's

An Introduction to Decision Theoryor Muehlhauser's "Decision Theory FAQ" explain the field of normative decision theory in broad strokes.Hintze's "Problem class dominance in predictive dilemmas" contrasts four different normative decision theories: CDT, EDT, TDT, and UDT, and argues that UDT dominates the others on a certain class of decision problems.

Several posts by Yudkowsky and Soares explain why causal counterfactual reasoning is not sufficient for use in an intelligent agent: "Newcomb's problem and the regret of rationality," "Causal decision theory is unsatisfactory," "An introduction to Newcomblike problems," "Newcomblike problems are the norm."

Alternative decision theories have been developed which are by no means solutions, but which constitute progress. The most promising of these is Updateless Decision Theory, developed by Wei Dai and Vladimir Slepnev among others:

Dai's "Towards a New Decision Theory" introduces UDT.

Slepnev's "A model of UDT with a halting oracle" provides an early first formalization.

Fallenstein's alternative formalization provides a probabilistic formalization.

UDT has a number of problems of its own. Unfortunately, satisfactory write-ups detailing these problems do not yet exist. Two of the open problems have been outlined in blog posts by Vladimir Slepnev:

"An example of self-fulfilling spurious proofs in UDT" explains how UDT can achieve sub-optimal results due to spurious proofs.

"Agent simulates predictor" describes a strange problem wherein it seems as if agents are rewarded for having less intelligence.

A somewhat unsatisfactory solution is discussed in the following write-up by Tsvi Benson-Tilsen:

"UDT with known search order" contains a formalization of UDT with known proof-search order and demonstrates the necessity of playing a technique known as “playing chicken with the universe” in order to avoid spurious proofs.

In order to study multi-agent settings, Patrick LaVictoire has developed a modal agents framework, which has also allowed us to make some novel progress in the field of decision theory. To understand this, you’ll first need to understand provability logic:

chapters 1-5(+6-9 if enthusiastic)

## Provability Logic

In logical toy models of agents reflecting upon systems similar to themselves, the central question is what the parent system can prove about the child system. Our Tiling Agent research makes heavy use of provability logic, which can elegantly express these problems.

This should be sufficient to help you understand the modal agents framework:

Barasz et al's "Robust cooperation in the Prisoner's dilemma": roughly, this allows us to consider agents which decide whether or not to cooperate with each other based only upon what they can

proveabout each other’s behavior. This prevents infinite regress, and in fact, the behavior of two agents which act only according to what they can prove about the behavior of the other can be determined in quadratic time using results from provability logic.Many open problems in decision theory involve multi-agent settings, and in order to contribute to cutting-edge research it is also important to understand game theory. I have heard good things about the following textbook, but have not read it myself:

Tadelis'

Game Theory: An Introduction.You also may have luck with Yvain's Game Theory sequence on LessWrong.

## Value learning

Perhaps the most promising approach for loading values into a powerful AI is to specify a criterion for

learningwhat to value. While this problem dominates the public mindspace with regards to Friendly AI problems (if you could build an FAI, what would you have it do?), we actually find that it is somewhat less approachable than many other important problems (how do you build something stable, how do you verify its decision-making behavior, etc.). That said, a number of papers on value learning exist, and can be used to understand the current state of value learning:Dewey's "Learning what to value" discusses the difficulty of the problem.

The orthogonality thesis further motivates why the problem will not be solved by default.

One approach to value learning is Bostrom & Ord's "parliamentary model," which suggests that value learning is somewhat equivalent to a voter aggregation problem, and that many value learning systems can be modeled as parliamentary voting systems (where the voters are possible utility functions).

MacAskill's "Normative Uncertainty" provides a framework for discussing normative uncertainty. Be warned, the full work, while containing many insights, is very long. You can get away with skimming parts and/or skipping around some, especially if you’re more excited about other areas of active research.

Fallenstein & Stiennon's "Loudness" discusses a concern with aggregating utility functions stemming from the fact that the preferences encoded by utility functions are preserved under positive affine transformation (e.g. as the utility function is scaled or shifted). This implies that special care is required in order to normalize the set of possible utility functions.

Owen Cotton-Barratt's "Geometric reasons for normalising…" discusses the normalization of utility functions.

De Blanc's "Ontological crises in artificial agents' value systems" discusses a separate problem in the space of value learning: how are values retained as the system’s model of reality changes drastically? It seems likely that explicit resolution mechanisms will be required, but it is not yet clear how to have an agent learn values in a manner that is robust to ontological shifts.

## Naturalized induction

How should an agent treat itself as if it is a part of the world? How should it learn as if it (and its sensors and its memory) are embedded in the environment, rather than sitting outside the environment? How can an agent make choices when its world-model stops modeling its own action as a fundamentally basic causal node and starts modelling it as a deterministic process resulting from a collection of transistors following physics? Many narrow AI systems assume an agent/environment separation, and we still have some confusion surrounding the nature of learners and actors that treat themselves as part of their environment.

We’ve been referring to this as the problem of “naturalized induction,”. While there has been little research done in this space, here is some reading that can help you better understand the problems:

Bensinger, "Naturalized induction"

## Other tools

Mastery in any subject can be a very powerful tool, especially in the realm of mathematics, where seemingly disjoint topics are actually deeply connected. Many fields of mathematics have the property that if you understand them very very well, then that understanding is useful no matter where you go. With that in mind, while the subjects listed below are not necessary in order to understand MIRI’s active research, an understanding of each of these subjects constitutes an additional tool in the mathematical toolbox that will often prove quite useful when doing new research.

## Discrete Math

Most math studies either continuous or discrete structures. Many people find discrete mathematics more intuitive, and a solid understanding of discrete mathematics will help you gain a quick handle on the discrete versions of many other mathematical tools such as group theory, topology, and information theory.

## Linear Algebra

Linear algebra is one of those tools that shows up almost everywhere in mathematics. A solid understanding of linear algebra will be helpful in many domains.

## Type Theory

Set theory commonly serves as the foundation for modern mathematics, but it's not the only available foundations. Type theory can also serve as a foundation for mathematics, and in many cases, type theory is a better fit for the problems at hand. Type theory also bridges much of the theoretical gap between computer programs and mathematical proofs, and is therefore often relevant to certain types of AI research.

## Category Theory

Category theory studies many mathematical structures at a very high level of abstraction. This can help you notice patterns in disparate branches of mathematics, and makes it much easier to transfer your mathematical tools from one domain to another.

## Topology

Topology is another one of those subjects that shows up pretty much everywhere in mathematics. A solid understanding of topology turns out to be helpful in many unexpected places.

## Computability and Complexity

MIRI's math research is working towards solutions that will eventually be relevant to computer programs. A good intuition for what computers are capable of is often essential.

## Program Verification

Program verification techniques allow programmers to become confident that a specific program will actually act according to some specification. (It is, of course, still difficult to validate that the specification describes the intended behavior.) While MIRI's work is not currently concerned with verifying real-world programs, it is quite useful to understand what modern program verification techniques can and cannot do.

## Understanding the mission

Why do this kind of research in the first place? (The first book below is the most important.)

## Superintelligence

This guide largely assumes that you’re already on board with MIRI’s mission, but if you’re wondering why so many people think this is an important and urgent area of research in the first place,

Superintelligenceprovides a nice overview.## Global Catastrophic Risks

But what about other global risks? How does AI compare to them? This book provides an introductory overview of the global risk landscape.

## The Sequences

This long series of blog posts, soon to be compiled as a book, explains much of the philosophy and cognitive science motivating MIRI's paradigm for Friendly AI research.