A putative new idea for AI control; index here.

In a previous post, I talked about an AI operating only on a virtual world (ideas like this used to be popular, until it was realised the AI might still want to take control of the real world to affect the virtual world; however, with methods like indifference, we can guard against this much better).

I mentioned that the more of the AI's algorithm that existed in the virtual world, the better it was. But why not go the whole way? Some people at MIRI and other places are working on agents modelling themselves within the real world. Why not have the AI model itself as an agent inside the virtual world? We can quine to do this, for example.
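As a rough illustration of the quining step, here is a minimal Python sketch of a program that carries a complete description of itself. The names `TEMPLATE`, `SOURCE` and `namespace` are made up for this example; a real self-modelling agent would need far more than this, but the trick of embedding one's own text is the same.

```python
# Hypothetical sketch: a program whose text contains a full copy of itself.
# TEMPLATE holds the program with one hole ({!r}) where its own quoted
# text gets substituted.
TEMPLATE = "TEMPLATE = {!r}\nSOURCE = TEMPLATE.format(TEMPLATE)\n"

# Filling the hole with the template itself yields the complete program text.
program = TEMPLATE.format(TEMPLATE)

# Running that program inside an isolated namespace makes it rebuild its own
# source and store it in the variable SOURCE.
namespace = {}
exec(program, namespace)

# The program's model of itself matches the program that was actually run.
print(namespace["SOURCE"] == program)
```

The point is only that self-reference costs nothing exotic: the agent's description of "the algorithm running inside the virtual world" can be exact rather than approximate.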

Then all the restrictions on the AI - memory capacity, speed, available options - can be specified precisely, within the algorithm itself. It will only have the resources of the virtual world to achieve its goals, and this will be specified within it. We could define a "break" in the virtual world (ie any outside interference that the AI could cause, were it to hack us to affect its virtual world) as something that would penalise the AI's achievements, or simply as something impossible according to its model or beliefs. It would really be a case of "given these clear restrictions, find the best approach you can to achieve these goals in this specific world".
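To make "restrictions specified within the algorithm itself" a bit more concrete, here is a toy sketch under heavy assumptions: a one-dimensional world, a brute-force planner, and a specification object (`WorldSpec`, a hypothetical name, as are all its fields) that fixes the available actions, the time horizon and the compute budget. An action outside the specification stands in for a "break" and is simply impossible in the model.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class WorldSpec:
    """Everything the agent may assume about its world (all fields hypothetical)."""
    size: int        # positions 0 .. size-1 on a line
    start: int       # where the agent begins
    goal: int        # position the agent is trying to reach
    actions: tuple   # the only moves that exist in this world
    horizon: int     # plan length, standing in for a time limit
    max_plans: int   # number of plans the agent may evaluate (compute limit)


def step(spec, pos, action):
    """Transition rule of the virtual world; positions are clamped to the line."""
    if action not in spec.actions:
        # A "break": this move is not part of the world's description.
        raise ValueError("action does not exist in this world")
    return min(max(pos + action, 0), spec.size - 1)


def score(spec, plan):
    """Negative distance to the goal after running the whole plan to spec."""
    pos = spec.start
    for action in plan:
        pos = step(spec, pos, action)
    return -abs(spec.goal - pos)


def best_plan(spec):
    """Exhaustive search that never exceeds the evaluation budget in the spec."""
    best, best_score, evaluated = None, float("-inf"), 0
    for plan in product(spec.actions, repeat=spec.horizon):
        if evaluated >= spec.max_plans:
            break
        evaluated += 1
        plan_score = score(spec, plan)
        if plan_score > best_score:
            best, best_score = plan, plan_score
    return best, best_score, evaluated


spec = WorldSpec(size=10, start=0, goal=3, actions=(-1, 1), horizon=3, max_plans=8)
plan, value, used = best_plan(spec)
print(plan, value, used)
```

Nothing here is safe by construction, of course; the sketch only shows the shape of "find the best approach you can, given these clear restrictions", with the restrictions living in the same object the planner reads.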

It would be ideal if the AI's motives were not given in terms of achieving anything in the virtual world, but in terms of making the decisions that, subject to the given restrictions, were most likely to achieve something if the virtual world were run in its entirety. That way the AI wouldn't care if the virtual world were shut down or anything similar. It should only seek to self-modify in ways that make sense within the world, and understand itself as existing completely within these limitations.

Of course, this would ideally require flawless implementation of the code; we don't want bugs developing in the virtual world that point to real-world effects (unless we're really confident we have properly coded the "care only about what would happen in the virtual world, not what actually does happen" part).

Any thoughts on this idea?



I have thought about something similar with respect to an oracle AI. You program it to try to answer the question assuming no new inputs and everything works to spec. Since spec doesn't include things like the AI escaping and converting the world to computronium to deliver the answer to the box, it won't bother trying that.

I kind of feel like anything short of friendly AI is living on borrowed time. Sure the AI won't take over the world to convert it to paperclips, but that won't stop some idiot from asking it how to make paperclips. I suppose it could still be helpful. It could at the very least confirm that AIs are dangerous and get people to worry about them. But people might be too quick to ask for something that they'd say is a good idea after asking about it for a while or something like that.

I kind of feel like anything short of friendly AI is living on borrowed time. Sure the AI won't take over the world to convert it to paperclips, but that won't stop some idiot from asking it how to make paperclips.

I agree with this. Working on "how can we safely use a powerful optimization process to cure cancer" (where "cure cancer" stands for some technical problem that we can clearly define, as opposed to the sort of fuzzy philosophical problems involved in building FAI) does not seem like the highest value for one's time. Once such a powerful optimization process exists, there is only a very limited amount of time before, as you say, some idiot tries to use it in an unsafe way. How much does it really help the world to get a cure for cancer during this time?

Once AI exists, in the public, it isn't containable. Even if we can box it, someone will build it without a box. Or like you said, ask it how to make as many paperclips as possible.

But if we get to AI first, and we figure out how to box it and get it to do useful work, then we can use it to help solve FAI. Maybe. You could ask it questions like "how do I build a stable self improving agent" or "what's the best way to solve the value loading problem", etc.

You would need some assurance that the AI would not try to manipulate the output. That's the hard part, but it might be doable. And it may be restricted to only certain kinds of questions, but that's still very useful.

Once AI exists, in the public, it isn't containable.

You mean like the knowledge of how it was made is public and anyone can do it? Definitely not. But if you keep it all proprietary it might be possible to contain.

But if we get to AI first, and we figure out how to box it and get it to do useful work, then we can use it to help solve FAI. Maybe.

I suppose what we should do is figure out how to make friendly AI, figure out how to create boxed AI, and then build an AI that's probably friendly and probably boxed, and it's more likely that everything won't go horribly wrong.

You would need some assurance that the AI would not try to manipulate the output.

Manipulate it to do what? The idea behind mine is that the AI only cares about answering the questions you pose it given that it has no inputs and everything operates to spec. I suppose it might try to do things to guarantee that it operates to spec, but it's supposed to be assuming that.

I greatly dislike the term "friendly AI". The mechanisms behind "friendly AI" have nothing to do with friendship or mutual benefit. It would be more accurate to call it "slave AI".

I prefer the term "Safe AI", as it is more self-explanatory for outsiders.

I think it's more accurate, though the term "safe" has a much larger positive valence than is justified, and so is accurate but misleading. Particularly since it smuggles in EY's presumptions about whom it's safe for, and so whom we're supposed to be rooting for, humans or transhumans. Safer is not always better. I'd rather get the concept of stasis or homogeneity in there. Stasis and homogeneity are, if not the values at the core of EY's scheme, at least the most salient products of it.

Safe AI sounds like it does what you say as long as it isn't stupid. Friendly AIs are supposed to do whatever's best.

For me, Safe AI is one that is not an existential risk. "Friendly" reminds me of "friendly user interface", that is, something superficial to the core function.

"Slave" makes it sound like we're making it do something against its will. "Benevolent AI" would be better.

Your lawnmower isn't your slave. "Slave" prejudicially loads the concept with anthropocentric morality that does not actually exist.

Useful AI.

Doesn't exist? What do you mean by that, and what evidence do you have for believing it? Have you got some special revelation into the moral status of as-yet-hypothetical AIs? Some reason for thinking that it is more likely that beings of superhuman intelligence don't have moral status than that they do?

The traditional argument is that there's a vast space of possible optimization processes, and the vast majority of them don't have humanlike consciousness or ego or emotions. Thus, we wouldn't assign them human moral standing. AIXI isn't a person and never will be.

A slightly stronger argument is that there's no way in hell we're going to build an AI that has emotions or ego or the ability to be offended by serving others wholeheartedly, because that would be super dangerous, and defeat the purpose of the whole project.

I like your second argument better. The first, I think, holds no water.

There are basically 2 explanations of morality, the pragmatic and the moral.

By pragmatic I mean the explanation that "moral" acts ultimately are a subset of the acts that increase our utility function. This includes evolutionary psychology, kin selection, and group selection explanations of morality. It also includes most pre-modern in-group/out-group moralities, like Athenian or Roman morality, and Nietzsche's consequentialist "master morality". A key problem with this approach is that if you say something like, "These African slaves seem to be humans rather like me, and we should treat them better," that is a malfunctioning of your morality program that will decrease your genetic utility.

The moral explanation posits that there's a "should" out there in the universe. This includes most modern religious morality, though many old (and contemporary) tribal religions were pragmatic and made practical claims (don't do this or the gods will be angry), not moral ones.

Modern Western humanistic morality can be interpreted either way. You can say the rule not to hurt people is moral, or you can say it's an evolved trait that gives higher genetic payoff.

The idea that we give moral standing to things like humans doesn't work in either approach. If morality is in truth pragmatic, then you'll assign them moral standing if they have enough power for it to be beneficial for you to do so, and otherwise not, regardless of whether they're like humans or not. (Whether or not you know that's what you're doing.) The pragmatic explanation of morality easily explains the popularity of slavery.

"Moral" morality, from my shoes, seems incompatible with the idea that we assign moral standing to things for looking or thinking like us. I feel no "oughtness" to "we should treat agents different from us like objects." For one thing, it implies racism is morally right, and probably an obligation. For another, it's pretty much exactly what most "moral leaders" have been trying to overcome for the past 2000 years.

It feels to me like what you're doing is starting out by positing morality is pragmatic, and so we expect by default to assign moral status to things like us because that's always a pragmatic thing to do and we've never had to admit moral status to things not like us. Then you extrapolate it into this novel circumstance, in which it might be beneficial to mutually agree with AIs that each of us has moral status. You've already agreed that morals are pragmatic at root, but you are consciously following your own evolved pragmatic programming, which tells you to accept as moral agents things that look like you. So you say, "Okay, I'll just apply my evolved morality program, which I know is just a set of heuristics for increasing my genetic fitness and has no compelling oughtness to it, in this new situation, regardless of the outcome." So you're self-consciously trying to act like an animal that doesn't know its evolved moral program has no oughtness to it. That's really strange.

If you mean that humans are stupid and they'll just apply that evolved heuristic without thinking about it, then that makes sense. But then you're being descriptive. I assumed you were being prescriptive, though that's based on my priors rather than on what you said.

That's... an odd way of thinking about morality.

I value other human beings, because I value the processes that go on inside my own head, and can recognize the same processes at work in others, thanks to my in-built empathy and theory of mind. As such, I prefer that good things happen to them rather than bad. There isn't any universal 'shouldness' to it, it's just the way that I'd rather things be. And, since most other humans have similar values, we can work together, arm in arm. Our values converge rather than diverge. That's morality.

I extend that value to those of different races and cultures, because I can see that they embody the same conscious processes that I value. I do not extend that same value to brain dead people, fetuses, or chickens, because I don't see that value present within them. The same goes for a machine that has a very alien cognitive architecture and doesn't implement the cognitive algorithms that I value.

If you're describing how you expect you'd act based on your feelings, then why do their algorithms matter? I would think your feelings would respond to their appearance and behavior.

There's a very large space of possible algorithms, but the space of reasonable behaviors given the same circumstances is quite small. Humans, being irrational, often deviate bizarrely from the behavior I expect in a given circumstance--more so than any AI probably would.


Is "slave" a good word for something where if you screw up enslaving it you almost automatically become its slave (if it had even the least interest in you as anything but raw material)?

Too bad "Won't kill us all horribly in an instant AI" isn't very catchy. . .

A slave with no desire to rebel. And no ability whatsoever to develop such a desire, of course.

It's doable.

I disagree. I have no problem saying that friendship is the successful resolution of the value alignment problem. It's not even a metaphor, really.

So if I lock you up in my house, and you try to run away, so I give you a lobotomy so that now you don't run away, we've thereby become friends?

Not with a lobotomy, no. But with a more sophisticated brain surgery/wipe that caused me to value spending time in your house and making you happy and so forth- then yes, after the operation I would probably consider you a friend, or something quite like it.

Obviously, as a Toggle who has not yet undergone such an operation, I consider it a hostile and unfriendly act. But that has no bearing on what our relationship is after the point in time where you get to arbitrarily decide what our relationship is.

There's a difference between creating someone with certain values and altering someone's values. For one thing, it's possible to prohibit messing with someone's values, but you can't create someone without creating them with values. It's not like you can create an ideal philosophy student of perfect emptiness.

For one thing, it's possible to prohibit messing with someone's values

Only if you prohibit interacting with him in any way.

I don't mean you can feasibly program an AI to do that. I just mean that it's something you can tell a human to do and they'd know what you mean. I'm talking about deontological ethics, not programming a safe AI.

How about if I get some DNA from Kate Upton, tweak it for high sex drive, low intelligence, low initiative, pliability, and a desperation to please, and then I grow a woman from it? Is she my friend?

If you design someone to serve your needs without asking that you serve theirs, the word "friend" is misleading. Friendship is mutually beneficial. I believe friendship signifies a relationship between two people that can be defined in operational terms, not a qualia that one person has. You can't make someone actually be your friend just by hypnotizing them to believe they're your friend.

Belief and feeling are probably part of the definition. It's hard to imagine saying 2 people are friends without their knowing it. But I think the pattern of mutually-beneficial behavior is also part of it.

Friendship is mutually beneficial.

That too, but I would probably stress the free choice part. In particular, I don't think friendship is possible across a large power gap.

You mean, instead of programming an AI in a real-life computer and showing it a "Game of Life" table to optimize, you could build a Turing machine inside a Game of Life table, program the AI inside this machine, and let it optimize the table in which it sits? Makes sense.
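For readers who haven't played with it, the substrate in question is tiny. Below is a minimal sketch of one Game of Life update on an unbounded grid, with live cells stored as a set of coordinates; `life_step` and `blinker` are names chosen for this example. It shows only the world's update rule, not the (much larger) Turing machine or AI that would be built out of cells.

```python
from collections import Counter


def life_step(live):
    """Apply one step of Conway's Game of Life to a set of live (x, y) cells."""
    # Count, for every cell adjacent to a live cell, how many live neighbours it has.
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next step if it has 3 live neighbours,
    # or if it has 2 live neighbours and is already alive.
    return {
        cell
        for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }


# A horizontal "blinker" flips to vertical and back again.
blinker = {(0, 1), (1, 1), (2, 1)}
print(sorted(life_step(blinker)))
```

Everything the embedded agent could ever do or observe would be some pattern evolving under this one function, which is what makes the restrictions so easy to state precisely.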

I think there's a question of how we create an adequate model of the world for this idea to work. It's probably not practical to build one by hand, so we'd likely need to hand the task over to an AI.

Might it be possible to use the modelling module of an AI in the absence of the planning module? (or with a weak planning module) If so, you might be able to feed it a great deal of data about the universe, and construct a model that could then be "frozen" and used as the basis for the AI's "virtual universe."

I think there's a question of how we create an adequate model of the world

Generally, we don't. A model of the (idealised) computational process of the AI is very simple compared with the real world, and the rest of the model just needs to include enough detail for the problem we're working on.

But that might be quite a lot of detail!

In the example of curing cancer, your computational model of the universe would need to include a complete model of every molecule of every cell in the human body, and how it interacts under every possible set of conditions. The simpler you make the model, the more you risk cutting off all of the good solutions with your assumptions (or accidentally creating false solutions due to your shortcuts). And that's just for medical questions.

I don't think it's going to be possible for an unaided human to construct a model like that for a very long time, and possibly not ever.

Indeed (see my comment on the problem with simplified models being unsolved).

However, it's a different kind of problem to standard FAI (it's "simply" a question of getting a precise enough model, and not a philosophically open problem), and there are certainly simpler versions that are tractable.

Yeah, this should work correctly, assuming that the AI's prior specifies just one mathematical world, rather than e.g. a set of possible mathematical worlds weighted by simplicity. I posted about something similar five years ago.

The application to "fake cancer" is something that hadn't occurred to me, and it seems like a really good idea at first glance.

Thanks, that's useful. I'll think about how to formalise this correctly. Ideally I want a design where we're still safe if a) the AI knows, correctly, that pressing a button will give it extra resources, but b) still doesn't press it because it's not part of its description.