Original Post:

We present an algorithm [updated version], then show (given four assumptions) that in the limit, it is human-level intelligent and benign.

Will MacAskill has commented that in the seminar room he is a consequentialist, but that when it comes to real decision-making he takes seriously the lack of philosophical consensus. In the same spirit: I believe that what is here is correct, but in the absence of feedback from the Alignment Forum, I don't yet feel comfortable posting it to a place (like arXiv) where it can get cited and enter the academic record. We have submitted it to IJCAI, but we can edit or withdraw it before it is published.

I will distribute at least min($365, number of comments * $15) in prizes by April 1st (via Venmo if possible, or else Amazon gift cards, or a donation on the commenter's behalf if they prefer) to the authors of the comments here, according to the comments' quality. If one commenter finds an error, and another commenter tinkers with the setup or the assumptions in order to correct it, then I expect both comments will receive a similar prize (provided both are of prize-winning quality, and neither commenter is me). If others would like to donate to the prize pool, I'll provide a comment that you can reply to.

To organize the conversation, I'll start some comment threads below:

  • Positive feedback
  • General Concerns/Confusions
  • Minor Concerns
  • Concerns with Assumption 1
  • Concerns with Assumption 2
  • Concerns with Assumption 3
  • Concerns with Assumption 4
  • Concerns with "the box"
  • Adding to the prize pool

Edit 30/5/19: An updated version is on arXiv. I now feel comfortable with it being cited. The key changes:

  • The Title. I suspect the agent is unambitious for its entire lifetime, but the title says "asymptotically" because that's what I've shown formally. Indeed, I suspect the agent is benign for its entire lifetime, but the title says "unambitious" because that's what I've shown formally. (See the section "Concerns with Task-Completion" for an informal argument going from unambitious -> benign).
  • The Useless Computation Assumption. I've made it a slightly stronger assumption. The original version is technically correct, but the setting becomes tricky to reason about if the weak version of the assumption is true while the strong version isn't. The stronger assumption also simplifies the argument.
  • The Prior. Rather than having to do with the description length of the Turing machine simulating the environment, it has to do with the number of states in the Turing machine. This was in response to Paul's point that the finite-time behavior of the original version is really weird. This also makes the Natural Prior Assumption (now called the No Grue Assumption) a bit easier to assess.
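As a toy illustration of that change (a sketch only: the world-model class, its fields, and the numbers below are made up, and this is not the paper's actual construction), here is what it looks like to weight a handful of hand-written "world-models" by description length versus by number of states:

```python
# Minimal sketch (not the paper's construction): two toy priors over a few
# hypothetical world-models, contrasting weighting by description length
# with weighting by number of machine states. All fields/values are invented.

from dataclasses import dataclass

@dataclass
class ToyWorldModel:
    name: str
    description_length: int  # bits needed to write down the Turing machine
    num_states: int          # number of states in the Turing machine

def description_length_prior(model: ToyWorldModel) -> float:
    # Original flavour: weight ~ 2^-(description length), Kolmogorov-style.
    return 2.0 ** (-model.description_length)

def state_count_prior(model: ToyWorldModel) -> float:
    # Revised flavour: weight depends on the number of machine states instead.
    return 2.0 ** (-model.num_states)

models = [
    ToyWorldModel("simple", description_length=10, num_states=4),
    ToyWorldModel("grue-like", description_length=12, num_states=40),
]

for prior in (description_length_prior, state_count_prior):
    total = sum(prior(m) for m in models)
    print(prior.__name__, {m.name: prior(m) / total for m in models})
```

The point of the contrast is just that a machine can have a short description while using many states (or vice versa), so the two weightings can rank world-models quite differently.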

Edit 17/02/20: Published at AAAI. The prior over world-models is now totally different, and much better. There's no "amnesia antechamber" required. The Useless Computation Assumption and the No Grue Assumption are now obselete. The argument for unambitiousness now depends on the "Space Requirements Assumption", which we probed empirically. The ArXiv link is up-to-date.

Asymptotically Unambitious AGI

Thanks for a really productive conversation in the comment section so far. Here are the comments which won prizes.

Comment prizes:

Objection to the term benign (and ensuing conversation). Wei Dai. Link. $20

A plausible dangerous side-effect. Wei Dai. Link. $40

Short description length of simulated aliens predicting accurately. Wei Dai. Link. $120

Answers that look good to a human vs. actually good answers. Paul Christiano. Link. $20

Consequences of having the prior be based on K(s), with s a description of a Turing machine. Paul Christiano. Link. $90

Simulated aliens converting simple world-models into fast approximations thereof. Paul Christiano. Link. $35

Simulating suffering agents. cousin_it. Link. $20

Reusing simulation of human thoughts for simulation of future events. David Krueger. Link. $20

Options for transfer:

1) Venmo. Send me a request at @Michael-Cohen-45.

2) Send me your email address, and I’ll send you an Amazon gift card (or some other electronic gift card you’d like to specify).

3) Name a charity for me to donate the money to.

I would like to exert a bit of pressure not to do 3, and spend the money on something frivolous instead :) I want to reward your consciousness, more than...

If I have a great model of physics in hand (and I'm basically unconcerned with competitiveness, as you seem to be), why not just take the resulting simulation of the human and give it a long time to think? That seems to have fewer safety risks and to be more useful.

More generally, under what model of AI capabilities / competitiveness constraints would you want to use this procedure?

michaelcohen
I know I don't prove it, but I think this agent would be vastly superhuman, since it approaches Bayes-optimal reasoning with respect to its observations. ("Approaches" because it uses the MAP world-model, whose predictions converge to those of the full Bayes mixture.) For the asymptotic results, one has to consider environments that produce observations with the true objective probabilities (hence the appearance that I'm unconcerned with competitiveness). In practice, though, given the speed prior, the agent will require evidence to entertain slow world-models, and for the beginning of its lifetime, the agent will be using low-fidelity models of the environment and the human explorer, rendering it much more tractable than a perfect model of physics. And I think that even at that stage, well before it is doing perfect simulations of other humans, it will far surpass human performance. We manage human-level performance with very rough simulations of other humans. That leads me to think this approach is much more competitive than simulating a human and giving it a long time to think.
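A toy sketch of the "MAP approaches Bayes" remark above, under invented assumptions (a three-hypothesis class of coin-flip world-models, nothing from the paper): as the posterior concentrates, the MAP model's predictions approach the full Bayes-mixture predictions.

```python
# Toy illustration only: posterior concentration makes MAP prediction
# approach the Bayes-mixture prediction. Hypotheses and numbers are made up.

import random

random.seed(0)

hypotheses = {"h_0.2": 0.2, "h_0.5": 0.5, "h_0.8": 0.8}   # P(heads) under each model
prior = {name: 1.0 / len(hypotheses) for name in hypotheses}
true_p = 0.8

posterior = dict(prior)
for t in range(1, 201):
    obs = 1 if random.random() < true_p else 0
    # Bayesian update on the observation.
    for name, p in hypotheses.items():
        posterior[name] *= p if obs else (1 - p)
    total = sum(posterior.values())
    posterior = {name: w / total for name, w in posterior.items()}

    if t % 50 == 0:
        bayes_pred = sum(posterior[name] * p for name, p in hypotheses.items())
        map_name = max(posterior, key=posterior.get)
        print(f"t={t}: Bayes-mixture P(heads)={bayes_pred:.3f}, "
              f"MAP model {map_name} P(heads)={hypotheses[map_name]:.3f}")
```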
paulfchristiano
I'm keen on asymptotic analysis, but if we want to analyze safety asymptotically, I think we should also analyze competitiveness asymptotically. That is, if our algorithm only becomes safe in the limit because we shift to a super-uncompetitive regime, that undermines the use of the limit as an analogy for studying the finite-time behavior. (Though this is not the most interesting disagreement; probably not worth responding to anything other than the thread where I ask about "why do you need this memory stuff?")
michaelcohen
Definitely agree. I don't think it's the case that a shift to super uncompetitiveness is actually an "ingredient" to benignity, but my only discussion of that so far is in the conclusion: "We can only offer informal claims regarding what happens before BoMAI is definitely benign..."
paulfchristiano
Surely that just depends on how long you give them to think. (See also HCH.)
michaelcohen
By competitiveness, I meant usefulness per unit computation.
paulfchristiano
The algorithm takes an argmax over an exponentially large space of sequences of actions, i.e. it does 2^{episode length} model evaluations. Do you think the result is smarter than a group of humans of size 2^{episode length}? I'd bet against---the humans could do this particular brute force search, in which case you'd have a tie, but they'd probably do something smarter.
michaelcohen
I obviously haven't solved the Tractable General Intelligence problem. The question is whether this is a tractable/competitive framework. So expectimax planning would naturally get replaced with a Monte-Carlo tree search, or some better approach we haven't thought of. And I'll message you privately about a more tractable approach to identifying a maximum a posteriori world-model from a countable class (I don't assign a very high probability to it being a hugely important capabilities idea, since those aren't just lying around, but it's more than 1%).

It will be important, when considering any of these approximations, to evaluate whether they break benignity (most plausibly, I think, by introducing a new attack surface for optimization daemons). But I feel fine about deferring that research for the time being, so I defined BoMAI as doing expectimax planning instead of MCTS. Given that the setup is basically a straight reinforcement learner with a weird prior, I think that at that level of abstraction, the ceiling of competitiveness is quite high.
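As a concrete aside on the planning step under discussion (a minimal sketch with an invented binary action space and a stand-in world-model, not BoMAI's planner): brute-force expectimax over an episode evaluates |actions|^(episode length) action sequences, which is exactly the step an MCTS-style approximation would replace.

```python
# Sketch only: exhaustive search over action sequences for one episode,
# maximising expected return under a stand-in world-model. The action space,
# return function, and names below are made up for illustration.

from itertools import product
from typing import Callable, Sequence, Tuple

def brute_force_plan(
    episode_length: int,
    actions: Sequence[int],
    expected_return: Callable[[Tuple[int, ...]], float],
) -> Tuple[int, ...]:
    """Return the action sequence maximising the model's expected return."""
    best_seq, best_value = (), float("-inf")
    # |actions| ** episode_length candidate sequences are evaluated here.
    for seq in product(actions, repeat=episode_length):
        value = expected_return(seq)
        if value > best_value:
            best_seq, best_value = seq, value
    return best_seq

# Stand-in "world-model": prefers taking action 1 early in the episode (made up).
def toy_expected_return(seq: Tuple[int, ...]) -> float:
    return sum(a / (t + 1) for t, a in enumerate(seq))

print(brute_force_plan(episode_length=4, actions=(0, 1), expected_return=toy_expected_return))
```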
paulfchristiano
I'm sympathetic to this picture, though I'd probably be inclined to try to model it explicitly, by making some assumption about what the planning algorithm can actually do, and then showing how to use an algorithm with that property. I do think "just write down the algorithm, and be happier if it looks like a 'normal' algorithm" is an OK starting point, though.

Stepping back from this particular thread, I think the main problem with competitiveness is that you are just getting "answers that look good to a human" rather than "actually good answers." If I try to use such a system to navigate a complicated world, containing lots of other people with more liberal AI advisors helping them do crazy stuff, I'm going to quickly be left behind. It's certainly reasonable to try to solve safety problems without attending to this kind of competitiveness, though I think this kind of asymptotic safety is actually easier than you make it sound (under the implicit "nothing goes irreversibly wrong at any finite time" assumption).
michaelcohen
Starting a new thread on this: here.

From Paul:

I think the main problem with competitiveness is that you are just getting "answers that look good to a human" rather than "actually good answers."

The comment was here, but I think it deserves its own thread. Wei makes the same point here (point number 3), and our ensuing conversation is also relevant to this thread.

My answers to Wei were two-fold: one is that if benignity is established, it's possible to safely tinker with the setup until hopefully "answers that look good to a human" resembles good answers (we ...