We present an algorithm [updated version], then show (given four assumptions) that in the limit, it is human-level intelligent and benign.
Will MacAskill has commented that in the seminar room, he is a consequentialist, but for decision-making, he takes seriously the lack of a philosophical consensus. I believe that what is here is correct, but in the absence of feedback from the Alignment Forum, I don't yet feel comfortable posting it to a place (like arXiv) where it can get cited and enter the academic record. We have submitted it to IJCAI, but we can edit or revoke it before it is printed.
I will distribute at least min($365, number of comments * $15) in prizes by April 1st (via venmo if possible, or else Amazon gift cards, or a donation on their behalf if they prefer) to the authors of the comments here, according to the comments' quality. If one commenter finds an error, and another commenter tinkers with the setup or tinkers with the assumptions in order to correct it, then I expect both comments will receive a similar prize (if those comments are at the level of prize-winning, and neither person is me). If others would like to donate to the prize pool, I'll provide a comment that you can reply to.
To organize the conversation, I'll start some comment threads below:
Edit 30/5/19: An updated version is on arXiv. I now feel comfortable with it being cited. The key changes:
Edit 17/02/20: Published at AAAI. The prior over world-models is now totally different, and much better. There's no "amnesia antechamber" required. The Useless Computation Assumption and the No Grue Assumption are now obselete. The argument for unambitiousness now depends on the "Space Requirements Assumption", which we probed empirically. The ArXiv link is up-to-date.
Thanks for a really productive conversation in the comment section so far. Here are the comments which won prizes.
Objection to the term benign (and ensuing conversation). Wei Dei. Link. $20
A plausible dangerous side-effect. Wei Dai. Link. $40
Short description length of simulated aliens predicting accurately. Wei Dai. Link. $120
Answers that look good to a human vs. actually good answers. Paul Christiano. Link. $20
Consequences of having the prior be based on K(s), with s a description of a Turing machine. Paul Christiano. Link. $90
Simulated aliens converting simple world-models into fast approximations thereof. Paul Christiano. Link. $35
Simulating suffering agents. cousin_it. Link. $20
Reusing simulation of human thoughts for simulation of future events. David Krueger. Link. $20
Options for transfer:
1) Venmo. Send me a request at @Michael-Cohen-45.
2) Send me your email address, and I’ll send you an Amazon gift card (or some other electronic gift card you’d like to specify).
3) Name a charity for me to donate the money to.
I would like to exert a bit of pressure not to do 3, and spend the money on something frivolous instead :) I want to reward your consciousness, more than... (read more)
If I have a great model of physics in hand (and I'm basically unconcerned with competitiveness, as you seem to be), why not just take the resulting simulation of the human and give it a long time to think? That seems to have fewer safety risks and to be more useful.
More generally, under what model of AI capabilities / competitiveness constraints would you want to use this procedure?
I think the main problem with competitiveness is that you are just getting "answers that look good to a human" rather than "actually good answers."
The comment was here, but I think it deserves its own thread. Wei makes the same point here (point number 3), and our ensuing conversation is also relevant to this thread.
My answers to Wei were two-fold: one is that if benignity is established, it's possible to safely tinker with the setup until hopefully "answers that look good to a human" resembles good answers (we ... (read more)
Here is an old post of mine on the hope that "computationally simplest model describing the box" is actually a physical model of the box. I'm less optimistic than you are, but it's certainly plausible.
From the perspective of optimization daemons / inner alignment, I think like the interesting question is: if inner alignment turns out to be a hard problem for training cognitive policies, do we expect it to become much easier by training predictive models? I'd bet against at 1:1 odds, but not 1:2 odds.
Given that you are taking limits, I don't see why you need any of the machinery with forgetting or with memory-based world models (and if you did really need that machinery, it seems like your proof would have other problems). My understanding is:
Comment thread: positive feedback
Comment thread: general concerns/confusions
To clarify, I'm looking for:
"We're talking about what you do, not what you do."
"Suppose you give us a new toy/summarized toy, something like a room, an inside-view view thing, and ask them to explain what you desire."
"Ah," you reply, "I'm asking what you think about how your life would go if you liv
Comment thread: concerns with Assumption 1
Comment thread: concerns with "the box"
Comment thread: concerns with Assumption 4
Comment thread: concerns with Assumption 3
Comment thread: concerns with Assumption 2
Comment thread: adding to the prize pool
If you would like to contribute, please comment with the amount. If you have venmo, please send the amount to @Michael-Cohen-45. If not, we can discuss.
Comment thread: minor concerns