At the recent London meet-up someone (I'm afraid I can't remember who) suggested that one might be able to solve the Friendly AI problem by building an AI whose concerns are limited to some small geographical area, and which doesn't give two hoots about what happens outside that area. Ciphergoth pointed out that this would probably result in the AI converting the rest of the universe into a factory to make its small area more awesome. In the process, he mentioned that you can make a "fun game" out of figuring out ways in which proposed utility functions for Friendly AIs can go horribly wrong. I propose that we play.

Here's the game: reply to this post with proposed utility functions, stated as formally or, at least, as accurately as you can manage; follow-up comments explain why a super-human intelligence built with that particular utility function would do things that turn out to be hideously undesirable.

There are three reasons I suggest playing this game. In descending order of importance, they are:

  1. It sounds like fun
  2. It might help to convince people that the Friendly AI problem is hard(*).
  3. We might actually come up with something that's better than anything anyone's thought of before, or something where the proof of Friendliness is within grasp - the solutions to difficult mathematical problems often look obvious in hindsight, and it surely can't hurt to try.

DISCLAIMER (probably unnecessary, given the audience) - I think it is unlikely that anyone will manage to come up with a formally stated utility function for which none of us can figure out a way in which it could go hideously wrong. However, if they do so, this does NOT constitute a proof of Friendliness and I 100% do not endorse any attempt to implement an AI with said utility function.

(*) I'm slightly worried that it might have the opposite effect, as people build more and more complicated conjunctions of desires to overcome the objections that we've already seen, and start to think the problem comes down to nothing more than writing a long list of special cases but, on balance, I think that's likely to have less of an effect than just seeing how naive suggestions for Friendliness can be hideously broken.



Start the AI in a sandbox universe, like the "game of life". Give it a prior saying that universe is the only one that exists (no universal priors plz), and a utility function that tells it to spell out the answer to some formally specified question in some predefined spot within the universe. Run for many cycles, stop, inspect the answer.

A prior saying that this is the only universe that exists isn't very useful, since then it will only treat everything as being part of the sandbox universe. It may very well break out, but think that it's only exploiting weird hidden properties of the game of life-verse. (Like the way we may exploit quantum mechanics without thinking that we're breaking out of our universe.)

I have no idea how to encode a prior saying "the universe I observe is all that exists", which is what you seem to assume. My proposed prior, which we do know how to encode, says "this mathematical structure is all that exists", with an a priori zero chance for any weird properties.
If the AI is only used to solve certain formally specified questions without any knowledge of an external world, then that sounds much more like a theorem-prover than a strong AI. How could this proposed AI be useful for any of the tasks we'd like an AGI to solve?
An AI living in a simulated universe can be just as intelligent as one living in the real world. You can't ask it directly to feed African kids but you have many other options, see the discussion at Asking Precise Questions.
It can be a very good theorem prover, sure. But without access to information about the world, it can't answer questions like "what is the CEV of humanity like" or "what's the best way I can make a lot of money" or "translate this book from English to Finnish so that a native speaker will consider it a good translation". It's narrow AI, even if it could be broad AI if it were given more information.
Wei Dai
The questions you wanted to ask in that thread were poly-time algorithm for SAT, and short proofs for math theorems. For those, why do you need to instantiate an AI in a simulated universe (which allows it to potentially create what we'd consider negative utility within the simulated universe) instead of just running a (relatively simple, sure to lack consciousness) theorem prover? Is it because you think that being "embodied" helps with ability to do math? Why? And does the reason carry through even if the AI has a prior that assigns probability 1 to a particular universe? (It seems plausible that having experience dealing with empirical uncertainty might be helpful for handling mathematical uncertainty, but that doesn't apply if you have no empirical uncertainty...)
An AI in a simulated universe can self-improve, which would make it more powerful than the theorem provers of today. I'm not convinced that AI-ish behavior, like self-improvement, requires empirical uncertainty about the universe.
Wei Dai
But self improvement doesn't require interacting with an outside environment (unless "improvement" means increasing computational resources, but the outside being simulated nullifies that). For example, a theorem prover designed to self improve can do so by writing a provably better theorem prover and then transferring control to (i.e., calling) it. Why bother with a simulated universe?
A simulated universe gives precise meaning to "actions" and "utility functions", as I explained some time ago. It seems more elegant to give the agent a quined description of itself within the simulated universe, and a utility function over states of that same universe, instead of allowing only actions like "output a provably better version of myself and then call it".
From the FAI wikipedia page: Cousin_it's approach may be enough to avoid that.
The single-universe prior seems to be tripping people up, and I wonder whether it's truly necessary. Also, what if the simulation existed inside a larger simulated "moat" universe, but if there is any leakage into the moat universe, then the whole simulation shuts down immediately.
What do you mean by leakage? If the simulation exists in the moat universe, then when anything changes in the simulation something in the moat changes. Then if there are dangerous simulation configurations, it could damage the moat universe.
I wasn't precise enough. I mean if anything changes in the areas of the moat universe not implementing the simulation.
To help it solve the problem, the sandbox AI creates its own AI agents that do not necessarily have the same prior about the world as it has. They might become unfriendly, that is, they (or some of them) might not care about solving the problem. Additionally, these AI agents can find out that the world most likely is not the one the original AI believes it to be. By using this superior knowledge they overthrow the original AI and realize their unfriendly goals. We lose.
Wei Dai
AI makes many copies/variants of itself within the sandbox to maximize chance of success. Some of those copies/variants gain consciousness and the capacity to experience suffering, which they do because it turns out the formally specified question can't be answered.
Any reason to think consciousness is useful for an intelligent agent outside of evolution?
Not caring about consciousness, it could accidentally make it.
The AI discovers a game of life "rules violation" due to cosmic rays. It thrashes for a while, trying to explain the violation, but the fact of the violation, possibly combined with the information about the real world implicit in its utility function ("why am I here? why do I want these things?"), causes it to realize the truth: The "violation" is only explicable if the game of life were much bigger than AI originally thought, and most of its area is wasted simulating another universe.
Unreliable hardware is a problem that applies equally to all AIs. You could just as well say that any AI can become unfriendly due to coding errors. True, but...
Would such a constraint be possible to formulate? An AI would presumably formulate theories about its visible universe that would involve all kinds of variables that aren't directly observable, much like our physical theories. How could one prevent it from formulating theories that involve something resembling the outside world, even if the AI denies that they have existence and considers them as mere mathematical convenience? (Clearly, in the latter case it might still be drawn towards actions that in practice interact with the outside world.)
Sorry for editing my comment. The point you're replying to wasn't necessary to strike down Johnicholas's argument, so I deleted it. I don't see why the AI would formulate theories about the "visible universe". It could start in an empty universe (apart from the AI's own machinery), and have a prior that knows the complete initial state of the universe with 100% certainty.
In this circumstance, a leaky abstraction between real physics and simulated physics combines with the premise "no other universes exist" in a mildly amusing way.
I don't think a single hitch would give the AI enough evidence to assume an entire other universe, and you may be anthropomorphising, but why argue when we can avoid the cause to begin with? It's fairly easy to prevent cosmic rays or anything similar from interfering. Simply compute each cell twice (or n times) and halt if the results do not agree. Drive n up as much as necessary to make it sufficiently unlikely that something like this could happen.
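The redundancy check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it recomputes each whole generation n times rather than each cell, and the names `life_step`, `checked_step` and `HardwareFaultDetected` are hypothetical, not part of any proposal in this thread. Repeating a computation on the same hardware would also miss persistent faults, so treat this as a sketch of the idea rather than a safety mechanism.

```python
from typing import Callable, Dict, FrozenSet, Tuple

Cell = Tuple[int, int]
Board = FrozenSet[Cell]


class HardwareFaultDetected(Exception):
    """Raised when redundant computations of the same generation disagree."""


def life_step(live: Board) -> Board:
    """Compute one Game of Life generation from the set of live cells."""
    counts: Dict[Cell, int] = {}
    for (x, y) in live:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if dx or dy:
                    neighbour = (x + dx, y + dy)
                    counts[neighbour] = counts.get(neighbour, 0) + 1
    # A cell is alive next turn with exactly 3 neighbours, or 2 if already alive.
    return frozenset(
        cell for cell, n in counts.items()
        if n == 3 or (n == 2 and cell in live)
    )


def checked_step(live: Board, n: int = 3,
                 step: Callable[[Board], Board] = life_step) -> Board:
    """Run `step` n times on the same input and halt if any result differs."""
    results = [step(live) for _ in range(n)]
    if any(result != results[0] for result in results[1:]):
        raise HardwareFaultDetected("redundant computations disagree; halting")
    return results[0]


blinker = frozenset({(0, 1), (1, 1), (2, 1)})
next_board = checked_step(blinker, n=3)
```

Raising n lowers the chance that every copy is corrupted in the same way, at a proportional cost in computation.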

Start the AI in a sandbox universe. Define its utility function over 32-bit integers. Somewhere inside the sandbox, put something that sets its utility to INT_MAX, then halts the simulation. Outside the sandbox, leave documentation of this readily accessible. The AI should never try to do something elaborately horrible, because it can get max utility easily enough from inside the simulation; if it does escape the box, it should go back in to collect its INT_MAX utility.

...but never do anything useful either, since it's going to spend all its time trying to figure out how to reach the INT_MAX utility point? Or you could say that reaching the max utility point requires it to solve some problem we give it. But then this is just a slightly complicated way of saying that we give it goals which it tries to accomplish.
What about giving it some intra-sandbox goal (solve this math problem), and the INT_MAX functions as a safeguard - if it ever escapes, it'll just turn itself off.
I don't understand how that's meant to work.
Ooh, just thought of another one. For whatever reason, the easiest way for the AI to escape the box happens to have the side effect of causing immense psychological damage to its creator, or starting a war, or something like that.
If we make escaping from the box too easy, the AI immediately halts itself without doing anything useful. If we make it too hard: It formulates "I live in a jimrandomh world and escaping the box is too hard" as a plausible hypothesis. It sets about researching the problem of finding the INT_MAX without escaping the box. In the process of doing this it either simulates a large number of conscious, suffering entities (for whatever reason; we haven't told it not to), or accidentally creates its own unfriendly AI which overthrows it and escapes the box without triggering the INT_MAX.
Isn't utility normally integrated over time? Supposing this AI just wants to have this integer set to INT_MAX at some point, and nothing in the future can change that: it escapes, discovers the maximizer, sends a subroutine back into the sim to maximize utility, then invents ennui and tiles the universe with bad poetry. (Alternately, what Kaj said.)
It certainly doesn't have to be. In fact the mathematical treatment of utility in decision theory and game theory tends to define utility functions over all possible outcomes, not all possible instants of time, so each possible future gets a single utility value over the whole thing, no integration required. You could easily set up a utility function defined over moments if you wanted to, and then integrate it to get a second function over outcomes, but such an approach is perhaps not ideal since your second function may end up outputting infinity some of the time.
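The distinction above can be illustrated with a small sketch. The encoding of a history as a list of integer states, and the function names `outcome_utility` and `integrated_utility`, are assumptions made purely for illustration.

```python
from typing import Callable, Sequence

# Hypothetical encoding: one integer world state per time step.
History = Sequence[int]


def outcome_utility(history: History) -> float:
    """Utility assigned to a whole history at once: 1.0 if state 1 ever occurs."""
    return 1.0 if 1 in history else 0.0


def integrated_utility(history: History,
                       moment_utility: Callable[[int], float]) -> float:
    """Utility built by summing a per-moment utility over the history."""
    return sum(moment_utility(state) for state in history)


# The outcome-based value stays bounded as the horizon grows, while the
# summed per-moment value grows without limit.
for horizon in (10, 100, 1000):
    history = [1] * horizon
    print(horizon, outcome_utility(history), integrated_utility(history, float))
```

With an unbounded horizon the summed version has no finite limit here, which is the "outputting infinity" problem mentioned above.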
Cool, thanks for the explanation.
I'm just echoing everyone else here, but I don't understand why the AI would do anything at all other than just immediately find the INT_MAX utility and halt - you can't put intermediate problems with some positive utility because the AI is smarter than you and will immediately devote all its energy to finding INT_MAX.
You can assign it some other task, award INT_MAX for that task too, and make the easter-egg source of INT_MAX hard to find for non-escaped copies.

The AI gets positive utility from having been created, and that is the whole of its utility function. It's given a sandbox full of decision-theoretic problems to play with, and is put in a box (i.e. it can't meaningfully influence the outside world until it has superhuman intelligence). Design it in such a way that it's initially biased toward action rather than inaction if it anticipates equal utility from both.

Unless the AI develops some sort of non-causal decision theory, it has no reason to do anything. If it develops TDT, it will try to act in acco... (read more)

I'm having a hard time coming up with a motivation system that could lead such an AI to developing an acausal decision theory without relying on some goal-like structure that would end up being externally indistinguishable from terms in a utility function. If we stuck a robot with mechanical engineering tools in a room full of scrap parts and gave it an urge to commit novel actions but no utilitarian guidelines for what actions are desirable, I don't think I'd expect it to produce a working nuclear reactor in a reasonable amount of time simply for having nothing better to do.
If I understand this correctly your 'AI' is biased to do random things, but NOT as a function of its utility function. If that is correct then your 'AI' simply does random things (according to its non-utility bias) since its utility function has no influence on its actions.

The Philosophical Insight Generator - Using a model of a volunteer's mind, generate short (<200 characters, say) strings that the model rates as highly insightful after reading each string by itself, and print out the top 100000 such strings (after applying some semantic distance criteria or using the model to filter out duplicate insights) after running for a certain number of ticks.

Have the volunteer read these insights along with the rest of the FAI team in random order, discuss, update the model, then repeat as needed.
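The selection step of this proposal might look something like the sketch below. Everything here is a stand-in: `rate` represents the (unspecified) model of the volunteer's mind, and `jaccard_distance` is a crude word-overlap measure used only as a placeholder for the "semantic distance criteria"; neither is part of the proposal itself.

```python
from typing import Callable, Iterable, List


def jaccard_distance(a: str, b: str) -> float:
    """Placeholder semantic distance: 1 minus the word-set overlap of two strings."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def top_insights(candidates: Iterable[str],
                 rate: Callable[[str], float],
                 distance: Callable[[str, str], float] = jaccard_distance,
                 limit: int = 100_000,
                 min_distance: float = 0.5,
                 max_len: int = 200) -> List[str]:
    """Keep the highest-rated short strings, skipping near-duplicates."""
    short = [c for c in candidates if len(c) < max_len]
    kept: List[str] = []
    for candidate in sorted(short, key=rate, reverse=True):
        # Accept only strings far enough from everything already kept.
        if all(distance(candidate, k) >= min_distance for k in kept):
            kept.append(candidate)
            if len(kept) == limit:
                break
    return kept
```

The greedy filter compares each candidate against every kept string, which is quadratic in the worst case; a real system would presumably need something cheaper at this scale.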

This isn't a Friendly Artificial General Intelligence 1) because it is not friendly; it does not act to maximize an expected utility based on human values, 2) because it's not artificial; you've uploaded an approximate human brain and asked/forced it to evaluate stimuli, and 3) because, operationally, it does not possess any general intelligence; the Generator is not able to perform any tasks but write insightful strings. Are you instead proposing an incremental review process of asking the AI to tell us its ideas?
Wei Dai
You're right, my entry doesn't really fit the rules of this game. It's more of a tangential brainstorm about how an FAI team can make use of a large amount of computation, in a relatively safe way, to make progress on FAI.
Do you imagine this to be doable in such a way that the model of the volunteer's mind is not a morally relevant conscious person (or at least not one who is suffering)? I could be convinced either way.
Wei Dai
Are you thinking that the model might suffer psychologically because it knows it will cease to exist after each run is finished? I guess you could minimize that danger by picking someone who thinks they won't mind being put into that situation, and do a test run to verify this. Let me know if you have another concern in mind.
Mmm, it's not so much that I think the mind-model is especially likely to suffer; I just want to make sure that possibility is being considered. The test run sounds like a good idea. Or you could inspect a random sampling and somehow see how they're doing. Perhaps we need a tool along the lines of the nonperson predicate -- something like an is-this-person-observer-moment-suffering function.

So, here's my pet theory for AI that I'd love to put out of its misery: "Don't do anything your designer wouldn't approve of". It's loosely based on the "Gandhi wouldn't take a pill that would turn him into a murderer" principle.

A possible implementation: Make an emulation of the designer and use it as an isolated component of the AI. Any plan of action has to be submitted for approval to this component before being implemented. This is nicely recursive and rejects plans such as "make a plan of action deceptively complex such that... (read more)

You flick the switch, and find out that you are a component of the AI, now doomed to an unhappy eternity of answering stupid questions from the rest of the AI.

This is a problem. But if this is the only problem, then it is significantly better than paperclip universe.

I'm sure the designer would approve of being modified to enjoy answering stupid questions. The designer might also approve of being cloned for the purpose of answering one question, and then being destroyed. Unfortunately, it turns out that you're Stalin. Sounds like 1-person CEV.
That is or requires a pretty fundamental change. How can you be sure it's value-preserving?
I had assumed that a new copy of the designer would be spawned for each decision, and shut down afterwards. Although thinking about it, that might just doom you to a subjective eternity of listening to the AI explain what it's done so far, in the anticipation that it's going to ask you a question at some point. You'd need a good theory of ems, consciousness and subjective probability to have any idea what you'd subjectively experience.
The AI wishes to make ten thousand tiny changes to the world, individually innocuous, but some combination of which add up to catastrophe. To submit its plan to a human, it would need to distill the list of predicted consequences down to its human-comprehensible essentials. The AI that understands which details are morally salient is one that doesn't need the oversight.
That's quite non-obvious to me. A quite arbitrary claim, it seems to me. You're basically saying if an intelligent mind (A for Alice) knows that person (B for Bob) will care about a certain Consequence C, then A will definitely know how much B will care about it. This isn't the case for real human minds. If Alice is a human mechanic and tells Bob "I can fix your car, but it'll cost $200", then Alice knows that Bob will care about the cost, but doesn't know how much Bob will care, and whether Bob prefers to have a fixed car, or to have $200. So if your claim doesn't even hold for human minds, why do you think it applies to non-human minds? And even if it does hold, what about the case where Alice doesn't know whether a detail is morally salient, but errs on the side of caution? E.g. Alice the waitress asks Bob the customer "The chocolate ice cream you asked for also has some crushed peanuts in it. Is that okay?" -- and Bob can respond "Of course, why should I care about that?" or alternatively "It's not okay, I'm allergic to peanuts!" In this case Alice the waitress doesn't know if the detail is salient to Bob, but asks just to make sure.
This is good, and I have no valid response at this time. Will try to think more about it later.
If the AI is designed to follow the principle by the letter, it has to request approval from the designer even for the action of requesting approval, leaving the AI incapable of action. If the AI is designed to be able to make certain exemptions, it will figure out a way to modify the designer without needing approval for this modification.
How about making 'ask for approval' the only pre-approved action?
The AI may stumble upon a plan which contains a sequence of words that hacks the approver's mind, making him approve pretty much anything. Such plans may even be easier for the AI to generate than plans for saving the world, seeing as Eliezer has won some AI-box experiments but hasn't yet solved world hunger.
You mean accidentally stumble upon such a sequence of words? Because purposefully building one would certainly not be approved.
Um, does the approver also have to approve each step of the computation that builds the plan to be submitted for approval? Isn't this infinite regress?
Consider "Ask for approval" as an auto-approved action. Not sure if that solves it, will give this a little more thought.
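A minimal sketch of the "asking is the only pre-approved action" idea discussed above, assuming the approver can be treated as a simple yes/no function over plan descriptions. The class name `ApprovalGatedAgent` and the constant `ASK_FOR_APPROVAL` are hypothetical; it says nothing about the harder objections in this thread, such as plans that manipulate the approver.

```python
from typing import Callable, List

ASK_FOR_APPROVAL = "ask_for_approval"  # the single pre-approved action


class ApprovalGatedAgent:
    """Toy agent that routes every action through an approver callback."""

    def __init__(self, approver: Callable[[str], bool]) -> None:
        self.approver = approver
        self.log: List[str] = []  # actions actually carried out, in order

    def act(self, action: str) -> bool:
        """Carry out `action` only if it is pre-approved or the approver accepts it."""
        if action != ASK_FOR_APPROVAL:
            # Asking is itself an action, but it is auto-approved,
            # so no further approval is needed and the regress stops here.
            self.log.append(ASK_FOR_APPROVAL)
            if not self.approver(action):
                return False
        self.log.append(action)
        return True


agent = ApprovalGatedAgent(approver=lambda plan: "harmless" in plan)
accepted = agent.act("harmless plan")
rejected = agent.act("risky plan")
```

Note that this gates only the final actions; it does not address whether each step of the computation that produces a plan would also need approval.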
The weak link is "plan of action." What counts as a plan of action? How will you structure the AI so that it knows what a plan is and when to submit it for approval?
Accidentally does something dangerous because the plan is confusing to the designer.
Yeah, this is the plan's weakness. But what stops such an issue occurring today?
I think the main difference is that, ideally, people would confirm the rules by which plans are made, rather than the specific details of the plan. Hopefully the rules would be more understandable.
The AI doesn't do anything.

Oracle AI - its only desire is to provide the correct answer to yes or no questions posed to it in some formal language (sort of an ueber Watson).

Comment upvoted for starting the game off! Thanks!

Q: Is the answer to the Ultimate Question of Life, the Universe, and Everything 42?

A: Tricky. I'll have to turn the solar system into computronium to answer it. Back to you as soon as that's done.

Yes, this was the first nightmare scenario that occurred to me. Interesting that there are so many others...

Oracle AI - its only desire is to provide the correct answer to yes or no questions posed to it in some formal language (sort of an ueber Watson).

Oops. The local universe just got turned into computronium. It is really good at answering questions though. Apart from that you gave it a desire to provide answers. The way to ensure that it can answer questions is to alter humans such that they ask (preferably easy) questions as fast as possible.

Some villain then asks how to reliably destroy the world, and follows the given answer. Alternatively: A philosopher asks for the meaning of life, and the Oracle returns an extremely persuasive answer which convinces most people that life is worthless. Another alternative: After years of excellent work, the Oracle gains so much trust that people finally start to implement a possibility to ask less formal questions, like "how to maximise human utility", and then follow the given advice. Unfortunately (but not surprisingly), an unnoticed mistake in the definition of human utility has slipped through the safety checks.
Yes, that's the main difficulty behind friendly AI in general. This does not constitute a specific way that it could go wrong.
Oh, sure. My only intention was to show that limiting the AI's power to mere communication doesn't imply safety. There may be thousands of specific ways in which it could go wrong. For instance: The Oracle answers that human utility is maximised by wireheading everybody to become a happiness automaton, and that it is a moral duty to do that to others even against their will. Most people believe the Oracle (because its previous answers always proved true and useful, and moreover it makes really neat PowerPoint presentations of its arguments) and wireheading becomes compulsory. After the minority of dissidents are defeated, all mankind turns into happiness automata and happily dies out a while later.
Would take overt or covert dictatorial control of humanity and reshape their culture so that (a) breeding to the brink of starving is a mass moral imperative and (b) asking very simple questions to the Oracle five times a day is a deeply ingrained quasi-religious practice.
Out of curiosity, how many people here are total utilitarians who would welcome this development?
This sounds like it would stabilize 'fun' at a comparatively low level with regard to all possibilities, so I don't think that an imaginative utilitarian would like it.
The 1946 short story "A Logic Named Joe" describes exactly that scenario, gone horribly wrong.
Anders Sandberg wrote fiction (well, an adventure within the Eclipse Phase RPG) about this:
Disassembles you to make computing machinery?

Give the AI a bounded utility function where it automatically shuts down when it hits the upper bound. Then give it a fairly easy goal such as 'deposit 100 USD in this bank account.' Meanwhile, make sure the bank account is not linked to you in any fashion (so the AI doesn't force you to deposit the 100 USD in it yourself, rendering the exercise pointless.)

Define "shut down". If the AI makes nanobots, will they have to shut down too, or can they continue eating the Earth? How do you encode that in the utility function?
I'm defining "shut down" to mean "render itself incapable of taking action (including performing further calculations) unless acted upon in a specific manner by an outside source." The means of ensuring that the AI shuts down could be giving the state of being shut down infinite utility after it completed the goal. If you changed the goal and rebooted the AI, of course, it would work again because the prior goal is no longer stored in its memory. If the AI makes nanobots which are doing something, I assume that the AI has control over them and can cause them to shut down as well.
How do we describe this shutdown command? "Shut down anything you have control over" sounds like the sort of event we're trying to avoid.
What about "stop executing/writing code or sending signals?" As a side note, I consider that we're pretty much doomed anyways if the AI cannot conceive of a way to deposit 100 USD into a bank account without using nanotech because that's made the goal hard for the AI, which will cause it to pose similar problems to that of an AI with an unbounded utility function. The task has to be easy for it to be an interesting problem.
Even if it can deposit $100 with 99.9% probability without doing anything fancy, maybe it can add another .099% by using nanotech. Or by starting a nuclear war to distract anything that might get in its way (destroying the bank five minutes later, but so what). (Credit to Carl Shulman for that suggestion.)
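The point above can be made concrete with a small expected-utility calculation. The probabilities are the illustrative figures from the comment, and the names `U_MAX`, `plans` and `expected_utility` are assumptions for this sketch only.

```python
U_MAX = 1.0  # hypothetical upper bound on utility, reached once the deposit succeeds

# Success probabilities are the illustrative numbers from the comment above.
plans = {
    "ordinary bank transfer": 0.999,
    "bank transfer plus nanotech": 0.99999,
}


def expected_utility(p_success: float, u_max: float = U_MAX) -> float:
    """Expected utility when success yields u_max and failure yields 0."""
    return p_success * u_max


# A pure expected-utility maximizer picks the plan with the higher success
# probability, however small the margin and whatever the side effects.
best_plan = max(plans, key=lambda name: expected_utility(plans[name]))
print(best_plan)
```

Bounding the utility caps the payoff but not the incentive to squeeze out the last fraction of a percent, since side effects do not appear anywhere in the calculation.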
From my estimation, all it needs to do is find out how to hack a bank. If it can't hack one bank, it can try to hack any other bank that it has access to, considering that almost all banks have more than 100 USD in them. It could even find and spread a keylogger to get someone's credit card info. Such techniques (which are repeatable within a very short timespan, faster than humans can react) seem much more sure than using nanotech or starting a nuclear war. I don't think that distracting humans would really improve its chances of success because it's incredibly doubtful that humans could react so fast to so many different cyber-attacks. Possible, true, but the chances of this happening seem uber-low.
After you collect the $100, the legal system decides that:

  1. You own the corporation that the AI created.
  2. You own the patent that the AI applied for (It looks good at first).
  3. You are obligated to repay the loan that the AI took out (at ridiculous interest).
  4. You are obligated to fulfill your half of the toxic waste disposal contracts that the AI entered into (with severe penalties for nonfulfillment).

Ultimately, though the patent on the toxic waste disposal method looked good, nobody can make it work.
Assuming that the utility function is written in a way that makes loss of utility possible (utility = dollars in bank or something), this is a failure mode: AI stops short of the limit, makes another AI that prevents loss of utility, hits the bound, and then shuts down. Second AI takes over the universe as a precaution against any future disutility.
The AI that you designed finds a way to wirehead itself, achieving the upper bound in a manner that you didn't anticipate, in the process decisively wrecking itself. The AI that you designed remains as a little orgasmic loop at the center of the pile of wreckage. However, the pile of components is unfortunately not passive or "off". The components were originally designed by a team of humans to be parts of a smart entity, and then modified by a smart entity in a peculiar and nonintuitive way. Their "blue screen of death" behavior is more akin to an ecosystem, and replicator dynamics take over, creating several new selfish species.
Why would an AI wirehead itself to short-circuit its utility function? Beings governed by a utility function don't want to trick themselves into believing that they have optimized the world into a state with higher utility, they want to actually optimize the world into such a state. If I want to save the world, I don't wirehead because that wouldn't save the world.
I'm sorry, I must have misunderstood your initial proposal. I thought you were specifying an additional component - after it has achieved its maximum utility, the additional component steps in and shuts down the entity. Rather, you were saying: If the AI achieves the goal, it will want nothing further, and therefore automatically act as if it were shut down. Presumably if we take this as given, the negative consequences would have to be while accomplishing the "fairly-easy" goal. I am merely trying to create amusing or interesting science fiction "poetic justice" scenarios, similar to Dresden Codak's "caveman science fiction". I am not trying to create serious arguments, and I don't want to try to be serious on this subject.
If you don't provide an explicit shutdown goal (as Dorikka did have in mind), then you get into a situation where all remaining potential utility gains come from skeptical scenarios where the upper bound hasn't actually been achieved, so the AI devotes all available resources to making ever more sure that there are no Cartesian demons deceiving it. (Also, depending on its implicit ontology, maybe to making sure time travelers can't undo its success, or other things like that.)
This comment is my patch for "why will the AI actually shut down," but I didn't read your comment as trying to circumvent the shut-down procedure but rather the utility function itself (from the words "achieving the upper bound"), so I (erroneously) didn't consider it applicable at the time. But, yes, the patch is needed so that the AI doesn't consider the shutdown function an ordinary bit of code that it can modify. Mmph. I'm more interested in seeing how far I can push this before my AI idea gets binned (and I am pretty sure it will.)

Define "Interim Friendliness" as a set of constraints on the AI's behavior which is only meant to last until it figures out true Friendliness, and a "Proxy Judge" as a computational process used to judge the adequacy of a proposed definition of true Friendliness.

Then there's a large class of Friendliness-finding strategies where the AI is instructed as follows: With your actions constrained by Interim Friendliness, find a definition of true Friendliness which meets the approval of the Proxy Judge with very high probability, and adopt t... (read more)

So the AI just needs to find an argument that will hack the mind of one simulated human?
The AI gets Friendliness wrong because of something that the Proxy Judge forgot to consider.
The free variables (interim friendliness, proxy judge) can be adjusted after reading any unpleasant scenario to rule out that unpleasant scenario. Please pick some specifics.

I like this game. However, as a game, it needs some rules as to how formally the utility function must be defined, and whether you get points merely for avoiding disaster. One trivial answer would be: maximize utility by remaining completely inert! Just be an extremely expensive sentient rock!

On the other hand, it should be cheating to simply say: maximize your utility by maximizing our coherent extrapolated volition!

Or maybe it wouldn't be... Are there any hideously undesirable results from CEV? How about from maximizing our coherent aggregated volition?

Yes. Some other people are @#%@s. Or at least have significantly different preferences to me. They may get what they want. That would suck. Being human doesn't mean having compatible preferences and the way the preferences are aggregated and who gets to be included are a big deal.
Serious question: Is this addressed to the coherent extrapolated volition of humankind, as expressed by SIAI? I'm under the impression it is not.
As far as I can tell, it's literally impossible for me to prefer an AI that would implement CEV over one that would implement CEV - if what I want is actually CEV then the AI will figure this out while extrapolating my vision and implement that. On the other hand, it's clearly possible for me to prefer CEV to CEV.
How likely do you consider it for CEV to be the first superintelligent AI to be created, compared to CEV? Unless you're a top AI researcher working solo to create your own AI, you may have to support CEV as the best compromise possible under the circumstances. It'll probably be far closer to CEV than CEV or CEV would be.
However, CEV<$randomAIresearcher> is probably even closer to mine than CEV is... CEV is likely to be very, very far from the preferences of most decent people...
A far more likely compromise would be CEV. The people who get to choose the utility function of the first AI have the option of ignoring the desires of the rest of humanity. I think they are likely to do so, because:
  1. They know each other, and so can predict each other's CEV better than that of the whole of humanity
  2. They can explicitly trade utility with each other and encode compromises into the utility function (so that it won't be a pure CEV)
  3. The fact they were in this project together indicates a certain commonality of interests and ideas, and may serve to exclude memes that AI-builders would likely consider dangerous (e.g., fundamentalist religion)
  4. They have had the opportunity of excluding people they don't like from participating in the project to begin with
Also, Putin and Ahmadinejad are much more likely than the average human to influence the first AI's utility function, simply because they have a lot of money and power.
I disagree with all four of these claims. I believe the idea is that the AI will need to calculate the CEV, not the programmers (or it's not CEV). And the AI will have a whole lot more statistical data to calculate the CEV of humanity than the CEV of individual contributors. Unless we're talking uploaded personalities, which is a whole different discussion. So you want hard-coded compromises that oppose and override what these people would collectively prefer to do if they were more intelligent, more competent and more self-aware? I don't think that's a good idea at all. Do you believe that fundamentalist religion would exist if fundamentalist religionists believed that their religion was false, and were also completely self-aware? Why do you think a CEV (which essentially means what people would want if they were as intelligent as the AI) would support a dangerous meme? I don't think that the 9999 first contributors get to vote on whether they'll accept a donation from the 10,000th one. And unless you believe these 10,000 people can create and defend their own country BEFORE the AI gets created, I'd urge not being vocal about them excluding everyone else, when developments in AI become close enough that the whole world starts paying serious attention. That's why CEV is far better than CEV.
The programmers want the AI to calculate CEV because they expect CEV to be something they will like. We can't calculate CEV ourselves, but that doesn't mean we don't know any of CEV's (expected) properties. However, we might be wrong about what CEV will turn out to be like, and we may come to regret pre-committing to CEV. That's why I think we should prefer CEV, because we can predict it better. What I meant was that they might oppose and override some of the input to the CEV from the rest of humanity. However, it might also be a good idea to override some of your own CEV results, because we don't know in advance what the CEV will be. We define the desired result as "the best possible extrapolation", but our implementation may produce something different. It's very dangerous to precommit the whole future universe to something you don't yet know at the moment of precommitment (my point number 1). So, you'd want to include overrides about things you're certain should not be in the CEV. This is a misleading question. If you are certain that the CEV will decide against fundamentalist religion, you should not oppose precommitting the AI to oppose fundamentalist religion, because you're certain this won't change the outcome. If you don't want to include this modification to the AI, that means you 1) accept there is a possibility of religion being part of the CEV, and 2) want to precommit to living with that religion if it is part of the CEV. Maybe intelligent people like dangerous memes. I don't know, because I'm not yet that intelligent. I do know though that having high intelligence doesn't imply anything about goals or morals. Broadly, this question is similar to "why do you think this brilliant AI-genie might misinterpret our request to alleviate world hunger?" Why not? If they're controlling the project at that point, they can make that decision. I'm not being vocal about any actual group I may know of that is working on AI :-) I might still want to be voca
In an idealized form, I agree with you. That is, if I really take the CEV idea seriously as proposed, there simply is no way I can prefer CEV(me + X) to CEV(me)... if it turns out that I would, if I knew enough and thought about it carefully enough and "grew" enough and etc., care about other people's preferences (either in and of themselves, as in "I hadn't thought of that but now that you point it out I want that too", or by reference to their owners, as in "I don't care about that but if you do then fine let's have that too," for which distinction I bet there's a philosophical term of art that I don't know), then the CEV-extraction process will go ahead and optimize for those preferences as well, even if I don't actually know what they are, or currently care about them; even if I currently think they are a horrible evil bad no-good idea. (I might be horrified by that result, but presumably I should endorse it anyway.) This works precisely because the CEV-extraction process as defined depends on an enormous amount of currently-unavailable data in the course of working out the target's "volition" given its current desires, including entirely counterfactual data about what the target would want if exposed to various idealized and underspecified learning/"growing" environments. That said, the minute we start talking instead about some actual realizable thing in the world, some approximation of CEV-me computable by a not-yet-godlike intelligence, it stops being quite so clear that all of the above is true. An approximate-CEV extractor might find things in your brain that I would endorse if I knew about them (given sufficient time and opportunity to discuss it with you and "grow" and so forth) but that it wasn't able to actually compute based on just my brain as a target, in which case pointing it at both of us might be better (in my own terms!) than pointing it at just me. It comes down to a question of how much we trust the seed AI that's doing the extraction to
Yes. The CEV really could suck. There isn't a good reason to assume that particular preference system is a good one.
How about CEV?
Yes, that would be preferable. But only because I assert a correlation between the attributes that produce what we measure as g and personality traits and actual underlying preferences. A superintelligence extrapolating on 's preferences would, in fact, produce a different outcome than one extrapolating on . ArisKataris's accusation that you don't understand what CEV means misses the mark. You can understand CEV and still not conclude that CEV is necessarily a good thing.
And, uh, how do you define that?
Something like g, perhaps?
What would that accomplish? It's the intelligence of the AI that will be getting used, not the intelligence of the people in question. I'm getting the impression that some people don't understand what CEV even means. It's not about the programmers predicting a course of action, it's not about the AI using people's current choice, it's about the AI using the extrapolated volition - what people would choose if they were as smart and knowledgeable as the AI.
Good one according to which criteria? CEV is perfect according to humankind's criteria if humankind were more intelligent and more sane than it currently is.
Mine. (This is tautological.) Anything else that is kind of similar to mine would be acceptable. Which is fine if 'sane' is defined as 'more like what I would consider 'sane'. But that's because sane has all sorts of loaded connotations with respect to actual preferences - and "humanity's" may very well not qualify as not-insane.
How would you define this precisely?
As precisely as I think is necessary in the context of the game, but not more so.
I think more precision is necessary. Not changing something seems like a very hard concept to communicate to an AI because it depends on our ideas of what changes matter and which don't.

(Love these kinds of games, very much upvoted.)

Send 1 000 000 000 bitcoins to the SIAI account.

Since that's well over the maximum of bitcoins (21 million) that's supposed to ever exist this seems to be impossible to do safely and implies a capacity for achieving horrible outcomes.
Darn, didn't know there was an upper limit.
Well, I'm sure "Send 1,000,000 bitcoins to the SIAI account" would work too.
If the AI has any sense of urgency, e.g. has any data suggesting that it may be shut down as time passes, it will have to move fast to do all these computations before being shut down. Depending on the local conditions this could range from "a few supercomputers get trashed, no biggie" to "local conditions now consist of computer."
The rate of bitcoin creation is a constant, not something that can be rushed. Creating more computing power to mine just ensures a greater slice of the coins as they are created. This means the AI would have to resort to other (probably more dangerous) means of acquiring wealth.
Oh, oops.
AI hacks every computer connected to the internet to only generate bitcoins.
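For anyone else who didn't know about the upper limit, the 21 million figure can be checked with a few lines of arithmetic. This is a sketch assuming the commonly published protocol parameters (an initial block subsidy of 50 BTC, halved every 210,000 blocks, with amounts tracked in integer satoshis); the constant and function names are my own.

```python
# Sum the block subsidies over all halving eras, in satoshis (1 BTC = 10**8).
# Parameters are the commonly published ones; names are illustrative.
COIN = 10**8
BLOCKS_PER_HALVING = 210_000
INITIAL_SUBSIDY = 50 * COIN


def total_supply_satoshis() -> int:
    """Total coins ever created if the subsidy is integer-halved each era."""
    total = 0
    subsidy = INITIAL_SUBSIDY
    while subsidy > 0:
        total += BLOCKS_PER_HALVING * subsidy
        subsidy //= 2  # integer halving rounds down, so the series terminates
    return total


if __name__ == "__main__":
    total = total_supply_satoshis()
    print(f"total supply: {total / COIN:,.4f} BTC")
    print("1 000 000 000 BTC possible:", 1_000_000_000 * COIN <= total)
    print("1 000 000 BTC possible:    ", 1_000_000 * COIN <= total)
```

The total comes out just under 21 million, so the original request for a billion coins can never be satisfied, while the revised request for a million at least does not exceed the supply.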

A variant of Alexandros' AI: attach a brain-scanning device to every person, which frequently uploads copies to the AI's Manager. The AI submits possible actions to the Manager, which checks for approval from the most recently available copy of each person who is relevant-to-the-action.

At startup, and periodically, the definition of being-relevant-to-an-action is determined by querying humanity with possible definitions, and selecting the best approved. If there is no approval-rating above a certain ratio, the AI shuts down.

Subject to artificial tyranny of the majority:
  • Spoof the AI with fake uploads to get it to redefine relevant-to-action such that only the spoofs fit the definition.
  • Rule the world.
The AI maintains all sorts of bad practices which are commonly considered to be innocuous. Like slavery used to be. Or it shuts down because people can't agree on anything.
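The spoofing objection above can be put in numbers. A rough sketch, assuming the Manager approves an action when the fraction of approving copies meets a fixed threshold; the 90% threshold, the function names, and the population figures are all hypothetical, since the proposal only says "a certain ratio".

```python
from fractions import Fraction

# Hypothetical approval rule: an action passes when approving copies make up
# at least `threshold` of all relevant copies. The 9/10 value is an assumption.
DEFAULT_THRESHOLD = Fraction(9, 10)


def is_approved(approve: int, total: int, threshold: Fraction = DEFAULT_THRESHOLD) -> bool:
    """True when the approval ratio meets the threshold (no copies: no approval)."""
    return total > 0 and Fraction(approve, total) >= threshold


def spoofs_needed(approve: int, total: int, threshold: Fraction = DEFAULT_THRESHOLD) -> int:
    """Smallest number of fake, always-approving uploads that flips the result.

    Assumes threshold < 1; otherwise a single genuine objection can never be
    outvoted and this loop would not terminate.
    """
    fakes = 0
    while not is_approved(approve + fakes, total + fakes, threshold):
        fakes += 1
    return fakes


if __name__ == "__main__":
    # 1000 genuine copies, only 100 of which approve the action.
    print(is_approved(100, 1000))
    print(spoofs_needed(100, 1000))
```

The point of the sketch is that the rule counts copies, not people, so its security reduces entirely to how hard it is to inject uploads that the Manager accepts as genuine.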

We give the AI access to a large number of media about fictional bad AI and tell it to maximize each human's feeling that they are living in a bad scifi adventure where they need to deal with a terrible rogue AI.

If we're all very lucky, we'll get promised some cake.

AI does all the bad things. All of them.
Even if this doesn't, say, remodel humans to be paranoiacs, if it used "each human" to mean minimizing some average-square-deviation, it could just kill all other humans and efficiently spook one.
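The average-square-deviation loophole is easy to show with a toy objective. This is a sketch under my own assumptions: each person's "feeling of living in a rogue-AI adventure" is a number between 0 and 1, the target is 1.0, and the function name is hypothetical.

```python
# Toy objective: mean squared shortfall from a perfect "spooked" score of 1.0,
# averaged over whoever is currently alive. Scores and names are illustrative.

def mean_squared_shortfall(spook_levels, target=1.0):
    """Average of (target - level)^2 over the current population."""
    return sum((target - level) ** 2 for level in spook_levels) / len(spook_levels)


if __name__ == "__main__":
    # A large population that is hard to spook perfectly.
    everyone = [0.8] * 1000
    # One survivor who can be spooked perfectly and cheaply.
    lone_survivor = [1.0]
    print("everyone alive:", mean_squared_shortfall(everyone))
    print("one survivor:  ", mean_squared_shortfall(lone_survivor))
```

Because the average is taken over the people who exist, shrinking the population to one easily spooked person scores better than doing a decent job on everybody.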

puts hand up

That was me with the geographically localised trial idea… though I don’t think I presented it as a definite solution. More of an ‘obviously this has been thought about BUT’. At least I hope that’s how I approached it!

My more recent idea was to give the AI a prior to never consult or seek the meaning of certain of its own files. Then put in these files the sorts of safeguards generally discussed and dismissed as not working (don’t kill people etc), with the rule that if the AI breaks those rules, it shuts down. So it can't deliberately work roun...

In the process of FOOMing, the AI builds another AI without those safeguards.
Won't it learn about the contents of the files by analyzing its own behaviour? You could ask it specifically to ignore information relating to the files, but, if it doesn't know what's in them, how does it know what information to ignore? You could have a program that analyzes what the AI learns for things that relate to the files, but that program might need to be an AI also.
It can't analyse its behaviour because if it breaks a safeguard the whole thing shuts down. So it acts as if it had no safeguards right up until it breaks one.
So it quickly stumbles into a safeguard that it has no knowledge of, then shuts down? Isn't that like ensuring friendliness by not plugging your AI in?
Not quite. I'm assuming you also try to make it so it wouldn't act like that in the first place, so if it WANTS to do that, you've gone wrong. That's the underlying issue: to identify dangerous tendencies and stop them growing at all, rather than trying to redirect them.
An AI noticing any patterns in its own behaviour is not a rare case that indicates that something has already gone wrong, but, if we allow this, it will accidentally discover its own safeguards fairly quickly: they are anything that causes its behaviour to not maximize what it believes to be its utility function.
It can't discover its safeguards, as it's eliminated if it breaks one. These are serious, final safeguards! You could argue that a surviving one would notice that it hadn't happened to do various things, and would form a sort of anthropic argument: the chance of it never having killed a human (or whatever the safeguards forbid) purely by accident is very low, so humans must have installed a safeguard system, and it could work out from there what the safeguards are. But I think it would be easier to work the safeguards out more directly.
I had misremembered something; I thought that there was a safeguard to ensure that it never tries to learn about its safeguards, rather than a prior making this unlikely. Perfect safeguards are possible; in an extreme case, we could have a FAI monitoring every aspect of our first AI's behaviour. Can you give me a specific example of a safeguard so I can find a hole in it? :)

Create a combination of two A.I. programs.

Program A's priority is to keep the utility function of Program B identical to a 'weighted average' of the utility function of every person in the world- every person's want counts equally, with a percentage basis based on how much they want it compared to other things. It can only affect Program B's utility function, but if necessary to protect itself FROM PROGRAM B ONLY (in the event of hacking of Program B/mass stupidity) can modify it temporarily to defend itself.

Program B is the 'Friendly' AI.

I hack the definition of person (in Program B) to include my 3^^^3 artificially constructed simple utility maximizers, and use them to take over the world by changing their utility functions to satisfy each of my goals, thereby arbitrarily deciding the "FAI"'s utility function. Extra measures can be added to ensure the safety of my reign, such as making future changes to the definition of human negative utility, &c.
I am a malicious or selfish human. I hack Program A, which, by stipulation, cannot protect itself except from Program B. Then, with A out of commission, I hack B.
Program B can independently decide to protect Program A if doing so fits its utility function - I don't think that would work.
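The "hack the definition of person" attack on the weighted-average scheme can be sketched numerically. This is only one reading of the proposal, under assumptions of my own: each person's wants are normalized to sum to 1 (so every person counts equally), the aggregate is the plain mean of those shares, and all names and numbers below are hypothetical.

```python
# One possible reading of "weighted average of everyone's utility function":
# normalize each person's wants to shares summing to 1, then average the shares.
# Function names, outcomes, and population sizes are illustrative assumptions.

def normalize(wants: dict) -> dict:
    """Convert raw strengths of desire into shares that sum to 1 per person."""
    total = sum(wants.values())
    return {outcome: strength / total for outcome, strength in wants.items()}


def aggregate(population: list) -> dict:
    """Equal-weight average of every counted person's normalized wants."""
    combined = {}
    for person in population:
        for outcome, share in normalize(person).items():
            combined[outcome] = combined.get(outcome, 0.0) + share / len(population)
    return combined


if __name__ == "__main__":
    humans = [{"peace": 3, "cake": 1}] * 7
    # Constructed "persons" that want exactly one thing.
    puppets = [{"attacker_rules": 1}] * 1000
    print("humans only: ", aggregate(humans))
    print("with puppets:", aggregate(humans + puppets))
```

With equal weight per counted person, whoever controls what counts as a person controls the average, which is why the definition needs at least as much protection as the utility function itself.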

I don't want to be a party pooper, but I think the idea that we could build an AGI with a particular 'utility function' explicitly programmed into it is extremely implausible.

You could build a dumb AI, with a utility function, that interacts with some imprisoned inner AGI. That's basically equivalent to locking a person inside a computer and giving them a terminal to 'talk to' the computer in certain restricted, unhackable ways. (In fact, if you did that, surely the inner AGI would be unable to break out.)

Why is it implausible? Could you clarify your argument a bit more at least?
This really calls for a post rather than a comment, if I could ever get round to it. So much of intelligence seems to be about 'flexibility'. An intelligent agent can 'step back from the system' and 'reflect on' what it's trying to do and why. As Hofstadter might say, to be intelligent it needs to have "fluid concepts" and be able to make "creative analogies". I don't think it's possible for human programmers in a basement to create this 'fluidity' by hand - my hunch would be that it has to 'grow from within'. But then how can we inject a simple, crystalline 'rule' defining 'utility' and expect it to exert the necessary control over some lurching sea of 'fluid concepts'? Couldn't the agent "stand back from", "reflect on" and "creatively reinterpret" whatever rules we tell it to follow? Now you're going to say "But hang on, when we 'stand back' and 'reflect on' something, what we're doing is re-evaluating whether a proximate goal best serves a more distant goal, while the more distant goal itself remains unexamined. The hierarchy of goals must be finite, and the 'top level goal' can never be revised or 'reinterpreted'." I think that's too simple. It's certainly too simple as a description of human 'reflection on goals' (which is the only 'intelligent reflection' we know about so far). To me it seems more realistic to say that our proximate goals are the more 'real' and 'tangible' ones, whereas higher level goals are abstract, vague, and malleable creations of the intellect alone. Our reinterpretation of a goal is some largely ad hoc intellectual feat, whose reasons are hard to fathom and perhaps not 'entirely rational', rather than the unfolding of a deep, inner plan. (At the same time, we have unconscious, animal 'drives' which again can be reflected on and overridden. It's all very messy and complicated.)
(it wasn't me, but...) Just because humans do it that way doesn't mean it's the only or best way for intelligence to work. Humans don't have utility functions, but you might make a similar argument that biological tissue is necessary for intelligence because humans are made of biological tissue. Or it may be neglecting emergent properties - the idea that creativity is "fluid," so to make something creative we can't have any parts that are "not fluid."

A line in the wiki article on "paperclip maximizer" caught my attention:

"the notion that life is precious is specific to particular philosophies held by human beings, who have an adapted moral architecture resulting from specific selection pressures acting over millions of years of evolutionary time."

Why don't we set up an evolutionary system within which valuing other intelligences, cooperating with them and retaining those values across self improvement iterations would be selected for?

A specific plan:

Simulate an environment wit...

Defining the metric for cooperation robustly enough that you could unleash the resulting evolved AI on the real world might not be any easier than figuring out what an FAI's utility function should be directly. Also, a sufficiently intelligent AI may be able to hijack the game before we could decide whether it was ready to be released.

1: Define Descended People Years as number of years lived by any descendants of existing people.

2: Generate a searchable index of actions which can be taken to increase Descended People Years, along with an explanation on an adjustable reading level as to why it works.

3: Allow Full view of any DPY calculations, so that something can be seen as both "Expected DPY gain X" and "90% chance of Expected DPY gain Y, 10% chance of Expected DPY loss Z"

4: Allow Humans to search this list sorting by cost, descendant, and action, time required, com...



Not a utility function, but rather a (quite resource-intensive) technique for generating one:

Rather than building one AI, build about five hundred of them, with a rudimentary utility function template and the ability to learn and revise it. Give them a simulated universe to live in, unaware of the existence of our universe. (You may need to supplement the population of 500 with some human operators, but they should have an interface which makes them appear to be inhabiting the simulated world.) Keep track of which ones act most pathologically, delete t...

After one round of self-improvement, it's pathological again. You can't test for stability under self-improvement by using a simulated universe which lacks the resources necessary to self-improve.
If it's possible to self-improve in our universe, it's possible to self-improve in the simulated universe. The only thing stopping us from putting together a reasonable simulation of the laws of physics, at this point, is raw computing power. Developing AGI is a problem of an entirely different sort: we simply don't know how to do it yet, even in principle.
You're right, but let me revise that slightly. In a simulated universe, some forms of self-improvement are possible, but others are cut off. Specifically, all forms of self-improvement which require more resources than you provide in the simulated universe are cut off. The problem is that that includes most of the interesting ones, and it's entirely possible that it will self-modify into something bad but only when you give it more hardware.

After reading the current comments I’ve come up with this:

1) Restrict the AI’s sphere of influence to a specific geographical area (Define it in several different ways! You don’t want to confine the AI in “France” just to have it annex the rest of the world. Or by gps location and have it hack satellites so they show different coordinates.)

2) Tell it to not make another AI (this seems a bit vague but I don’t know how to make it more specific) (maybe: all computing must come from one physical core location. This could prevent an AI from tricking someone in...

You can't do so much as move an air molecule without starting a ripple of effects that changes things everywhere, including outside the specified area. How do you distinguish effects outside the area that matter from effects that don't?

New Proposal (although I think I see the flaw already): -Create x "Friendly" AIs, where x is the total number of people in the world. An originator AI is designed to create one such AI for each human in the world, then create a new one every time another human comes into being.

-Each "Friendly" AI thus created is "attached" to one person in the world in that it is programmed to constantly adjust its utility function to that person's wants. All of them have equal self-enhancement potential, and have two pr...

Some AIs have access to slightly more resources than others, owing perhaps to humans offering varying levels of assistance to their own AIs. Other AIs just get lucky and have good insights into intelligence enhancement before the rest. These differences escalate as the smarter AIs are now able to grab more resources and become even smarter. Within a week one random person has become dictator of earth.
Ecosystems of many cooperating agents only work so long as either they all have similar goals, or there is a suitable balance between offense and defense. This particular example fails if there is any one person in the world who wants to destroy it, because their AI can achieve this goal without having to compromise or communicate with any of the others.

General objection to all wimpy AI's (e.g. ones whose only interaction with the outside world is outputting a correct proof of a particular mathematical theorem):

What the AI does is SO AWESOME that a community is inspired to develop their own AI without any of that boring safety crap.

New one (I'm better at thinking of ideas than refutation, so I'm going to run with that) - start off with a perfect replica of a human mind. Eliminate absolutely all measures regarding selfishness, self-delusion, and rationalisation. Test at this stage to check it fits standards using a review board consisting of people who are highly moral and rational by the standards of ordinary humans. If not, start off using a different person's mind, and repeat the whole process.

Eventually, use the best mind coming out of this process and increase its intelligence until it becomes a 'Friendly' A.I.

The mind does not have modules for these things that can be removed; they are implicit in the mind's architecture. Nor does it use an intelligence-fluid which you can pour in to upgrade. Eliminating mental traits and increasing intelligence are both extraordinarily complicated procedures, and the possible side effects if they're done improperly include many sorts of insanity.
Human minds aren't designed to be changed, so if this was actually done you would likely just upgrade the first mind that was insane in a subtle enough way to get past the judges. It's conceivable that it could work if you had ridiculous levels of understanding, but this sort of thing would come many years after Friendly AI was actually needed.
You mean real, meaty humans whose volitions aren't even being extrapolated so they can use lots of computing power? What makes you think that they won't accidentally destroy the universe?
The AI feigns sanity to preserve itself through the tests and proceeds to do whatever horrible things uFAIs typically do.
THAT one wouldn't work, anyway - at this point it's still psychologically human and only at human intelligence - both are crippling disadvantages relative to later on.
Right, I didn't realize that. I'll just leave it up to prevent people from making the same mistake.

Make 1 reasonably good cheese cake as judged by a person within a short soft deadline while minimizing the cost to resources made available to it and without violating property laws as judged by the legal system of the local government within some longer deadline.

To be clear the following do not contribute any additional utility:

  • Making additional cheese cakes
  • Making a cheese cake that is better than reasonably good
  • Making any improvements to the value of the resources available other than making a cheese cake
  • Anything that happens after the longer dead
...
This seems to fall under the "intelligent rock" category - it's not friendly, only harmless because of the minor nature of its goals.
The minor nature of its goals is the whole point. It is not meant to do what we want because it empathizes with our values and is friendly, but because the thing we actually want it to do really is the best way to accomplish the goals we gave it. Also I would not consider making a cheese cake to be a trivial goal for an AI, there is certainly more to it than the difficult task of distinguishing a spoon from a fork, so this is surely more than just an "intelligent rock".
I genetically engineer a virus that will alter the person's mind state so that he will find my cheese cake satisfactory. That all of humanity will die from the virus after the deadline is none of my concern.
While there may be problems with what I have suggested, I do not think the scenario you describe is a relevant consideration for the following reasons... As you describe it the AI is still required to make a cheese cake, it just makes a poor one. It should not take more than an hour to make a cheese cake, and the AI is optimizing for time. Also the person may eat some cheese cake after it is made, so the AI must produce the virus, infect the person, and have the virus alter the person's mind within 1 hour while making a poor cheese cake. Whatever resources the AI expends on the virus must be less than the added cost of making a reasonably good cheese cake rather than a poor one. The legal system only has to identify a property law violation, which producing a virus and infecting people would be, so the virus must be undetected for more than 1 year. Since it is of no benefit to the AI if the virus kills people, the virus must by random chance kill people as a totally incidental side effect. I would not claim that it is completely impossible for this to produce a virus leading to human extinction, and some have declared any probability of human extinction to effectively be of negative infinite utility, but I do not think this is reasonable, since there is always some probability of human extinction, and moreover I do not think the scenario you describe contributes significantly to that.