by [anonymous] | 5 min read | 26th Apr 2012 | 29 comments



"All that is necessary for evil to triumph is that good men do nothing."


155,000 people die, on average, every day.  For those of us who are preference utilitarians, and who also believe that a Friendly singularity is possible and capable of ending this state of affairs, that puts a great deal of pressure on us.  It doesn't give us leave to be sloppy (because human extinction, even multiplied by a low probability, is a massive negative utility).  But if we see a way to achieve similar results in a shorter time frame, the cost to human life of not taking it is simply unacceptable.

I have some concerns about CEV on a conceptual level, but I'm leaving those aside for the time being.  My concern is that most of the organizations concerned with a first-mover X-risk are not in a position to be that first mover -- and, furthermore, they're not moving in that direction.  That includes the Singularity Institute.  Trying to operationalize CEV seems like a good way to get an awful lot of smart people bashing their heads against a wall while clever idiots trundle ahead with their own experiments.  I'm not saying that we should be hasty, but I am suggesting that we need to be careful of getting stuck in dark intellectual forests, full of things that are fun to talk about, until an idiot with a tinderbox burns them down.

My point, in short, is that we need to be looking for better ways to do things, and to do them extremely quickly.  We are working on a very, very, existentially tight schedule.    


So, if we're looking for quicker paths to a Friendly, first-mover singularity, I'd like to talk about one that seems attractive to me.  Maybe it's a useful idea.  If not, then at least I won't waste any more time thinking about it.  Either way, I'm going to lay it out and you guys can see what you think.  


So, Friendliness is a hard problem.  Exactly how hard, we don't know, but a lot of smart people have radically different ideas of how to attack it, and they've all put a lot of thought into it, and that's not a good sign.  However, designing a strongly superhuman AI is also a hard problem.  Probably much harder than a human can solve.  The good news is, we don't expect that we'll have to.  If we can build something just a little bit smarter than we are, we expect that bootstrapping process to take off without obvious limit.

So let's apply the same methodology to Friendliness.  General goal optimizers are tools, after all.  Probably the most powerful tools that have ever existed, for that matter.  Let's say we build something that's not Friendly.  Not something we want running the universe -- but, Friendly enough.  Friendly enough that it's not going to kill us all.  Friendly enough not to succumb to the pedantic genie problem.  Friendly enough we can use it to build what we really want, be it CEV or something else.  

I'm going to sketch out an architecture of what such a system might look like.  Do bear in mind this is just a sketch, and in no way a formal, safe, foolproof design spec.  

So, let's say we have an agent with the ability to convert unstructured data into symbolic relationships that represent the world, with explicitly demarcated levels of abstraction.  Let's say the system has the ability to build Bayesian causal relationships out of its data points over time, and construct efficient, predictive models of the behavior of the concepts in the world.  Let's also say that the system has the ability to take a symbolic representation of a desired future distribution of universes, a symbolic representation of the current universe, and map between them, finding valid chains of causality leading from now to then, probably using a solid decision theory background.  These are all hard problems to solve, but they're the same problems everyone else is solving too.  

This system, if you just specify parameters about the future and turn it loose, is not even a little bit Friendly.  But let's say you do this: first, provide it with a tremendous amount of data, up to and including the entire available internet, if necessary.  Everything it needs to build extremely effective models of human beings, with strongly generalized predictive power.  Then you incorporate one or more of those models (say, of a group of trusted people) as functional components: the system uses them to generalize natural language instructions first into a symbolic graph, and then into something actionable, working out the details of what was meant, rather than what was said.  Then, when the system is finding valid paths of causality, it takes its model of the state of the universe at the end of each course of action, feeds it into its human-models, and gives each of them a veto vote.  Think of it as the emergency regret button, iterated computationally for each possibility considered by the genie.  Any course of action that any of the person-models finds unacceptable is disregarded.
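A minimal sketch of the veto step, under heavy assumptions: the plan generator and the human-models are taken as given, a predicted end state is just a string, and each human-model is a function that returns True when it vetoes. The names `Plan`, `HumanModel`, `surviving_plans`, and `vetoes_on` are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# A predicted end state of the world. A plain string stands in for the
# symbolic representation described above (an assumption of this sketch).
Outcome = str


@dataclass
class Plan:
    name: str
    predicted_outcome: Outcome


# Hypothetical stand-in for a simulated person: returns True to veto.
HumanModel = Callable[[Outcome], bool]


def surviving_plans(plans: List[Plan], models: List[HumanModel]) -> List[Plan]:
    """Keep only the plans whose predicted outcome no model vetoes."""
    return [
        plan
        for plan in plans
        if not any(model(plan.predicted_outcome) for model in models)
    ]


def vetoes_on(phrase: str) -> HumanModel:
    """Toy model that vetoes any outcome mentioning the given phrase."""
    return lambda outcome: phrase in outcome


models = [vetoes_on("extinct"), vetoes_on("coerced")]
plans = [
    Plan("a", "disease cured, humans extinct"),
    Plan("b", "disease cured, patients coerced"),
    Plan("c", "disease cured, nobody harmed"),
]
accepted = [plan.name for plan in surviving_plans(plans, models)]
```

A single veto from any model is enough to discard a plan, which matches the "any of the person-models" rule; everything hard (prediction, the models themselves) is hidden behind the two type aliases.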

(small side note: as described here, the models would probably eventually be indistinguishable from uploaded minds, and would be created, simulated for a short time, and destroyed uncountable trillions of times -- you'd either need to drastically limit the simulation depth of the models, or ensure that everyone you signed up to be one of the models knew the sacrifice they were making)

So, what you've got, plus or minus some spit and polish, is a very powerful optimization engine that understands what you mean, and disregards obviously unacceptable possibilities.  If you ask it for a truly Friendly AI, it will help you first figure out what you mean by that, then help you build it, then help you formally prove that it's safe.  It would turn itself off if you asked it to, and meant it.  It would also exterminate the human species if you asked it to and meant it.  Not Friendly, but Friendly enough to build something better.  

With this approach, the position of the Friendly AI researcher changes.  Instead of being in an arms race with the rest of the AI field with a massive handicap (having to solve two incredibly hard problems against opponents who only have to solve one), we only have to solve a comparatively simpler problem (building a Friendly-enough AI), which we can then instruct to sabotage unFriendly AI projects and buy some time to develop the real deal.  That turns the race into a fair fight, one that we might actually win.  

Anyone have any thoughts on this idea?        



Question 1: How long ago did you have this idea?

Question 2: How sure are you that the idea makes sense, that it would work, and that it's actually feasible?

Question 3: What have you done to check and improve your idea before you wrote this post?

When you make a post that says "hey guys, what you're doing is definitely wrong, here's the totally obvious thing to do instead, and I'll bet that you never thought of this, let alone figured out a problem with it", you'd better have excellent answers to the above questions.

If not, I suggest rewriting the post in the form "hey guys, it seems to me like the obvious thing to do is this; is there a reason why we're not doing that?", which tends to go over a lot better with most human beings.

Upvoted. I should remember to express my critique in a more constructive way, like this.


I really didn't mean to be that critical. I just wanted to put an idea I couldn't see any holes in out there and see if anyone else saw any obvious problems.

Maybe it's a useful idea. If not, then at least I won't waste any more time thinking about it. Either way, I'm going to lay it out and you guys can see what you think.

This proposal looks like Oracle AI, except the people querying the Oracle are uploads.


Well, this was a disaster. Not to whine, but I'm somewhat disappointed that nobody even addressed the content of the post. I honestly thought it was a good idea.

I downvoted on the general principle that I think LessWrong should be reserved for talk of rationality. For what it's worth I think your AI idea was much better than most (and most get upvoted, so that confuses me). In particular the idea of learning human values from the internet (as opposed to random humans that the AGI has at hand) is a good idea.

If you're going to use abbreviations/jargon like CEV, please link to a definition.

Downvoted for, among other things, overusing mind-killing group identification (us vs them), and thinking in generally militant and alarmist ways.


Provided that strong artificial intelligence is possible, and hard takeoff is possible, there is a very real chance, well over 50% that we are all going to get killed by an unFriendly singularity. If you don't think that justifies a little bit of alarmism, I'm not sure what would.

If this were something that we had a pattern-match for, like a war, nobody would look twice at someone like me looking at even mildly extreme options to try to save human lives.

This isn't a war. It's a lot worse. There are people out there, right now, developing AI with absolutely no intention of developing anything remotely safe. Those people, while not bad people in their own right, are incredibly dangerous to the continued existence of the human species. Refusing to consider ways of dealing with that problem because they're 'militant' or 'alarmist' is profoundly unhelpful.

EDIT: If you think I'm wrong, and have reasons why, that's something I'm happy to listen to. I'll even rewrite it with a more reserved tone if you think that would help, but do try to appreciate that you, personally, might get killed or worse because of these issues, and take the matter appropriately seriously.

Provided that strong artificial intelligence is possible, and hard takeoff is possible, there is a very real chance, well over 50% that we are all going to get killed by an unFriendly singularity.

Where is this (conditional) probability estimate coming from?


Here's the way I look at it:

There is a lot of money in intelligence research. It's valuable, powerful, and interesting. We're just starting to get to the point again where there's some good reason to get excited about the hype. I consider IBM's Watson system to be a huge step forward, both in terms of PR and actual advancement. A lot of groups of smart people are working on increasingly broad artificial intelligence systems. They're not talking about building full agents yet, because that's a low-status idea due to the overpromises of the '70s, but it's just a matter of time.

Eliezer's put a lot of blog posts into the idea that the construction of a powerful optimization process that isn't explicitly Friendly is an incredibly bad scenario (see: paperclip maximizers). I tend to agree with him here.

Most AI researchers out there are either totally unaware of, or dismissive of, Friendly AI concerns, due at least partially to their somewhat long-term and low-status nature.

This is not a good recipe. Things not going badly relies on a Friendly AI researcher developing bootstrapping AI first, which is unlikely, and successfully making it Friendly, which is also unlikely. I put the joint probability at less than 50%, based on the information currently available to me. It's not a good position to be in.

You're assuming that people who don't understand Friendly AI have enough competence to actually build a functioning agent AI in the first place.

Either friendliness is still a hard thing compared to an entire AGI project, or not.

In the first case, ignoring friendliness gives a significant advantage in speed, and there can be someone who cuts enough corners to win the race and builds a uFAI.

In the second case, trying to create a mostly-friendly AI and getting any partial progress wins some reputation in the AI community. This can help to sell friendliness considerations to other researchers.

So, aside from whoever this was being a raging asshole, I see a few problems with this idea:

First, I'm reminded of the communist regimes that were intended to start as dictatorships, and then transition over to stateless communism when the transition period was over. Those researchers would need to be REALLY trustworthy.

Second, what conditions are the conscience simulations in? How do they cast their votes? Does this produce any corner cases that may render the emulations unable to cast a veto vote when they want to? Will it try to bribe the models? Are the models aware that they're emulations? If not, what kind of environment do they think they're in?

Third, I'm a little terrified by the suggestion of killing trillions of people as part of the normal operating of our AI...

what conditions are the conscience simulations in? How do they cast their votes? Does this produce any corner cases that may render the emulations unable to cast a veto vote when they want to? Will it try to bribe the models?

This. The voters must be given enough time to explore the environment. That is, to interact with the environment. What if some part of this interaction changes their votes to something they would not agree with in advance? Wireheading is too obvious, so they would avoid it, but what if the environment does something to them that is equivalent to wireheading, only slow and gradual? At the end, they will report that they like this solution.

So the proposed machine solves the task of inconspicuous wireheading. (However, maybe that is exactly what humans ultimately want.)

Hmm. I think what he was proposing was that the AI's actions be defined as the intersection of the set of things the AI thinks will accomplish the goal, and the set of things the models won't veto. Assuming there's a way to prevent the AI from intentionally compromising the models, the question is how often the AI will come up with a plan for accomplishing goal X which will, completely coincidentally, break the models in undesirable ways.

It's not about the AI intentionally doing something, but simply about the AI not sharing our unspoken assumptions. That means that unless you explicitly forbid something, the AI considers it a valid solution, because the AI is not human.

For example, you are on the 20th floor of a building and say to the AI "get me out of this building, quickly", and the AI throws you out of the window. Not because the AI hates you or tries to subvert your commands, but because this is what the AI thinks is a solution to your command. A human would not do it, because a human shares a lot of assumptions with you, such as that you probably want to get out of the building alive. But the AI only knows that you wanted a quick solution.

This specific case could be fixed by telling the AI "don't ever harm humans". There are two problems with this. First, what else did we forget to tell the AI? Is it OK if the AI gets you out of the building quickly by destroying half of the building, assuming no humans are harmed? Second, how exactly do you define "harm"? Too strict a definition would drive the AI mad, because every second many cells in your body are destroyed, and the AI cannot prevent this. Or if you say that such slow damage is acceptable, then the AI may prevent doctors from operating on your tumor, because the tumor kills you slowly over a long time, while the operation would harm you in the short term.

What we humans consider "obvious" or "normal" is very unintuitive for an AI. There are things we learned through evolution, and the AI did not evolve with us, it does not have the same genes, etc.

How often will the AI use a surprising solution? I would think that the more difficult the task, the more often this would happen. Simple tasks have simple solutions, but we don't really need a superhuman AI to do those for us. The more difficult the task, the greater the chance that the optimal (for the AI) solution contains a shortcut that is not acceptable to us, but that we failed to communicate.

More about this: The Hidden Complexity of Wishes.

Respectfully, I think you may have overestimated inferential distance: I have read the sequences.

My point wasn't that the AI wouldn't produce bad output. I'd say the majority of its proposed solutions would be... let's call it inadvertently hostile. However, the models would presumably veto those scenarios on sight.

Say we've got a two step process: one, generate a whole bunch of valid ways of solving a problem (this would include throwing you out the window, carrying you gently down the stairs, blowing up the building, and everything in between) and two: let the models veto those scenarios they find objectionable.

In that case, most of the solutions, including the window one, would be vetoed. So, in principle, the remaining set would be full of solutions that both satisfy the models and accomplish the desired goal. My question is, of the solutions generated in step one, how many of them will also, coincidentally, cause the models to behave in unwanted ways (e.g. wireheading, neurolinguistic hacking, etc.)?

Sorry for pattern-matching you into "probably did not read sequences".

My question is, of the solutions generated in step one, how many of them will also, coincidentally, cause the models to behave in unwanted ways (e.g. wireheading, neurolinguistic hacking, etc.)

Assumption one: pseudo-wireheading (changing the model's preferences subtly enough that the model would not notice the changes, at least until it is modified enough to accept them, in a way that can finally lead the model to accept any solution) is possible. -- I think it is; can't prove it formally.

Assumption two: pseudo-wireheading has a fixed cost. -- It is not trivially true, because the model's reaction may depend on the problem that is being solved. A specific strategy of pseudo-wireheading may feel natural when solving one type of problem, but may feel suspicious when solving another type of problem. (For example if the problem is "show me a blueprint for a perfect society", love-bombing by simulated people may feel like a valid part of the answer; but it would be suspicious if the problem is "show me the best interpretation of quantum physics".) But I think there is a cost such that for any problem, there exists a pseudo-wireheading strategy with at most this cost.

Assumption three: The more complex the problem, the greater the chance that the cost of pseudo-wireheading plus the cost of the simplest "unfriendly" solution will be less than the cost of the simplest "friendly" solution. In other words, for more complex problems, the additional cost of "friendliness" is higher, and at some point it will be greater than the cost of pseudo-wireheading.

Based on these three assumptions, I think that as the complexity of the problem increases, the chance of an "unfriendly" solution increases towards 1.
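The three assumptions can be sketched as a one-line cost comparison. This is only an illustration: the function name `cheapest_route`, the constant `WIREHEAD_COST`, and every number below are made up, chosen so that the gap between the friendly and unfriendly solutions grows with problem complexity (assumption three) while the pseudo-wireheading cost stays bounded (assumption two).

```python
def cheapest_route(friendly_cost: float,
                   unfriendly_cost: float,
                   wirehead_cost: float) -> str:
    """Compare the simplest friendly solution against the simplest
    unfriendly solution plus the cost of pseudo-wireheading the models."""
    hacked_cost = unfriendly_cost + wirehead_cost
    return "friendly" if friendly_cost <= hacked_cost else "unfriendly+wirehead"


# Hypothetical bounded cost of pseudo-wireheading (assumption two).
WIREHEAD_COST = 5.0

# (problem, friendly cost, unfriendly cost); all numbers are invented, and
# the friendly/unfriendly gap widens as the problem gets more complex.
problems = [
    ("simple", 12.0, 11.0),
    ("medium", 120.0, 116.0),
    ("complex", 8000.0, 7900.0),
]

results = {
    name: cheapest_route(friendly, unfriendly, WIREHEAD_COST)
    for name, friendly, unfriendly in problems
}
```

Once the gap exceeds the bounded wireheading cost, a pure cost-minimizer prefers the hacked route, which is the scenario the conclusion above worries about. Whether real problems behave like these numbers is exactly what is in dispute.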

Okay, after some thought (and final exams), here's what I think:

If our plan generator returns a set of plans ranked by overall expected utility, the question is basically whether the hit to expected utility provided by Friendliness moves you further down the list than the average joint improbability of whatever sequence of extraneous events would have to happen to break the models in undesirable ways. A second question is how those figures will scale with the complexity of the problem being undertaken. The answer is a firm 'I don't know.' We'd need to run some extensive experiments to find out.

That said, I can think of several ways to address this problem, and, if I were the designer, I would probably take all of them. You could, for example, start the system on easier problems, and have it learn heuristics for things that the models will veto, since you can't mindhack simple heuristics, and have it apply those rules during the plan generation stage, thus reducing the average distance to Friendliness. You could also have the models audit themselves after each scenario, and determine if they've been compromised. Still doesn't make mindhacking impossible, but it means it has to be a lot more complicated, which non-linearly increases the average distance to mindhacking. There are other strategies to pursue as well. Basically, given that we know the hardest problem we're planning on throwing this AI at, I would bet there's a way to make it safe enough to tackle that problem without going apocalyptic.

We could start from simple problems, progress to more complex ones, and gradually learn about the shortcuts AI could make, and develop related heuristics. This seems like a good plan, though there are a few things that could make it difficult:

Maybe we can't develop a smooth problem complexity scale with meaningful problems along it. (If the problem is not meaningful, it will be difficult to tell the difference between a reasonable and an unreasonable solution. Or a heuristic developed on artificial problems may fail on an important problem.) The set of necessary heuristics may grow too fast; maybe for a problem twice as complex we will need ten times more safeguards, which will make progress beyond some boundary extremely slow; and some important tasks may be far beyond that boundary. Or at some moment we may be unable to produce the necessary heuristic, because the heuristic itself would be too complex.

But generally, the first good use of the superhuman AI is to learn more about the superhuman AI.

I think I would be surprised to see average distance to Friendliness grow exponentially faster than average distance to mindhacking as the scale of the problem increased. That does not seem likely to me, but I wouldn't make that claim too confidently without some experimentation. Also, for clarity, I was thinking that the heuristics for weeding out bad plans would be developed automatically. Adding them by hand would not be a good position to be in.

I see the fundamental advantage of the bootstrap idea as being able to worry about a much more manageable subset of the problem: can we keep it on task and not-killing-us long enough to build something safer?

I think I would be surprised to see average distance to Friendliness grow exponentially faster than average distance to mindhacking as the scale of the problem increased.

I was thinking about something like this:

A simple problem P1 has a Friendly solution with cost 12, and three Unfriendly solutions with cost 11. We either have to add three simple heuristics, one to block each Unfriendly solution, or maybe one more complex heuristic to block them all -- but the latter option assumes that the three Unfriendly solutions have some similar essence, which can be identified and blocked.

A complex problem P9 has a Friendly solution with cost 8000, and a thousand Unfriendly solutions with costs between 7996 and 7999. Three hundred of them are blocked by heuristics already developed for problems P1-P8, but there are seven hundred new ones. -- The problem is not that the distance between 7996 and 8000 is greater than between 12 and 11, but rather that within that distance the number of "creatively different" Unfriendly solutions is growing too fast. We have to find a ton of heuristics before moving on to P10.

These are all just imaginary numbers, but my intuition is that a more complex problem may provide not only a few much cheaper Unfriendly solutions, but also a great many slightly cheaper Unfriendly solutions, whose diversity may be difficult to cover with a small set of heuristics.

On the other hand, having developed enough heuristics, maybe we will see a pattern emerging, and make a better description of human utility functions. Specifying what we want, even if it proves very difficult, may still be simpler than adding all the "do not want" exceptions to the model. Maybe having a decent set of realistic "do not want" exceptions will help us discover what we really want. (By realistic I mean: really generated by the AI's attempts at a simple solution, simple as in Occam's razor; not just pseudosolutions generated by an armchair philosopher.)

My intuitions for how to frame the problem run a little differently.

The way I see it, there is no possible way to block all unFriendly or model-breaking solutions, and it's foolish to try. Try framing it this way: any given solution has some chance of breaking the models, probably pretty low. Call that probability P (god I miss LaTeX). The goal is to get Friendliness close enough to the top of the list that P (which ought to be constant) times the distance D from the top of the list is still below whatever threshold we set as an acceptable risk to life on Earth / in our future light cone. In other words, we want to minimize the distance of the Friendly solutions from the top of the list, since each solution considered before finding the Friendly one brings a risk of breaking the models and returning a false 'no-veto'.
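To make the "P times D" budget concrete, here is a small sketch. It assumes, beyond what is stated above, that each plan breaks the models independently with the same probability P; under that assumption the chance that at least one of D plans breaks them is 1 - (1 - P)^D, which P times D approximates (and slightly overstates) when P is small. The function names and the example values are hypothetical.

```python
import math


def risk_linear(p: float, d: int) -> float:
    """The 'P times D' estimate of the chance that the models get broken."""
    return p * d


def risk_exact(p: float, d: int) -> float:
    """Chance that at least one of d plans breaks the models, assuming
    each does so independently with probability p."""
    return 1.0 - (1.0 - p) ** d


def max_distance(p: float, threshold: float) -> int:
    """Largest number of plans the models can examine while keeping the
    exact risk at or below the acceptable threshold."""
    if not (0.0 < p < 1.0 and 0.0 < threshold < 1.0):
        raise ValueError("p and threshold must lie strictly between 0 and 1")
    return math.floor(math.log(1.0 - threshold) / math.log(1.0 - p))


# Hypothetical numbers: a 0.1% chance per plan and a 1% total risk budget.
p = 0.001
threshold = 0.01
d_max = max_distance(p, threshold)
```

The linear estimate is the conservative one, so budgeting with P times D never understates the risk under these assumptions.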

Let's go back to the grandmother-in-the-burning-building problem: if we assume that this problem has mind-hacking solutions closer to the top of the list than the Friendly solutions (which I kind of doubt, really), then we've got a problem. Let's say, however, that our plan-generating system has a new module which generates simple heuristics for predicting how the models will vote (yes, this is getting a bit recursive for my taste). These heuristics would be simple rules of thumb like 'models veto scenarios involving human extinction', 'models tend to veto scenarios involving screaming and blood', and 'models veto scenarios that involve wires in their brains'.

When parsing potential answers, scenarios that violate these heuristics are discarded from the dataset early. In the case of the grandmother problem, a lot of the unFriendly solutions can be discarded using a relatively small set of such heuristics. It doesn't have to catch all of them, but it isn't meant to - simply by disproportionately eliminating unFriendly solutions before the models see them, you shrink the net distance to Friendliness. I see that as a much more robust way of approaching the problem.
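A rough sketch of that early-discard step, with everything simplified: plans are plain strings already ranked best-first, heuristics and models are keyword checks, and the names `first_accepted`, `toy_model`, and `window_rule` are invented for illustration. The point is only to count how many plans the models actually have to look at.

```python
from typing import Callable, List, Optional, Tuple

Plan = str
# Both return True when the plan should be rejected.
Heuristic = Callable[[Plan], bool]  # cheap rule of thumb, cannot be mindhacked
Model = Callable[[Plan], bool]      # full human-model, expensive and hackable


def first_accepted(ranked_plans: List[Plan],
                   heuristics: List[Heuristic],
                   models: List[Model]) -> Tuple[Optional[Plan], int]:
    """Walk the ranked list and return the first plan no model vetoes,
    plus the number of plans the models had to examine to find it."""
    examined = 0
    for plan in ranked_plans:
        if any(rule(plan) for rule in heuristics):
            continue  # discarded early; the models never see this plan
        examined += 1
        if not any(model(plan) for model in models):
            return plan, examined
    return None, examined


# Toy data: plans ranked by how cheaply they get someone out of a building.
ranked = [
    "throw occupant out of the window",
    "demolish the building around occupant",
    "carry occupant down the stairs",
]


def toy_model(plan: Plan) -> bool:
    return "window" in plan or "demolish" in plan


def window_rule(plan: Plan) -> bool:
    return "window" in plan


without_rules = first_accepted(ranked, [], [toy_model])
with_rules = first_accepted(ranked, [window_rule], [toy_model])
```

The single rule misses the demolition plan, and that is fine: the models still catch it, but they examine two plans instead of three, which is the shrinking of net distance described above.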

I understand your point, and it's a good one. Let me think about it and get back to you.
