# 11


Summary of entire Series: An alternative approach to designing Friendly Artificial Intelligence computer systems.

Summary of this Article: The 'putting all your eggs in one basket' approach to developing Friendly AI only makes sense if certain assumptions hold true. We should consider what changes to program designs and their computing environment could alter which approach makes most sense.

### Links to all the articles in the series:

Suppose you have 50 eggs that you have to transport from a hen house to the kitchen. You can either make one trip, walking slowly, with a big basket that holds all the eggs, or make five trips, running fast, with a smaller basket that holds 10 eggs.

If the cook is about to make breakfast for 15 people, each of whom wants 2 eggs, and you put all your eggs in one basket, then dropping that basket even once breaks every egg and nobody gets breakfast. Whereas if you split the carrying into five smaller trips, you can afford to drop the basket on two of those trips and still satisfy the cook. You have a safety margin of 20 eggs.

But suppose instead that the cook has to cater for 25 people, and will fire you if even one of them goes unfed. You now have no safety margin. You want to maximise your chance of delivering all 50 eggs safely, and delivering only 49 is as bad as delivering none at all. In this case, if the probability of not dropping the big basket on its one trip is higher than the probability of not dropping the smaller basket on any of its five trips, then you do want to put all your eggs in one basket.

But if you know in advance what the cook's requirements are going to be, then a better solution (if possible) would be to change the situation to one in which a strategy with a single point of failure (putting all your eggs in one basket) isn't the least bad option: for example, by investing some of the pay you get for the job in extra hens, so that you have 60 eggs in the hen house.
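The trade-off in these three egg scenarios can be made concrete with a little binomial arithmetic. This is a minimal sketch under stated assumptions: the drop probabilities (5% for the single slow trip, 3% for each fast trip) are hypothetical figures chosen only for illustration, trips are assumed to fail independently, the big basket is assumed to still carry every egg in one trip in the 60-egg case, and `p_success` is an illustrative name.

```python
from math import comb

def p_success(trips: int, p_drop: float, max_drops: int) -> float:
    """Chance of dropping the basket on at most `max_drops` of `trips` independent trips."""
    return sum(
        comb(trips, k) * p_drop**k * (1 - p_drop) ** (trips - k)
        for k in range(max_drops + 1)
    )

# Hypothetical drop probabilities, chosen only for illustration.
P_BIG = 0.05    # the one slow trip with the big basket
P_SMALL = 0.03  # each fast trip with the 10-egg basket

# scenario: (number of small-basket trips, drops you can afford and still feed everyone)
scenarios = {
    "15 guests, 50 eggs": (5, 2),  # need 30 eggs: 20 to spare
    "25 guests, 50 eggs": (5, 0),  # need all 50: no margin
    "25 guests, 60 eggs": (6, 1),  # extra hens: 10 to spare
}

for name, (trips, slack) in scenarios.items():
    big = 1 - P_BIG  # one drop of the big basket loses every egg
    small = p_success(trips, P_SMALL, slack)
    best = "one big basket" if big > small else "several small baskets"
    print(f"{name}: big {big:.4f}, small {small:.4f} -> {best}")
```

With these particular numbers, the single big basket wins only in the no-margin case, and the extra hens flip the answer back to the smaller baskets.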

## Throw of the dice

If you are using a fair pair of six-sided dice, then on average you'll get a pair of "6"s once in every thirty-six times that you shake the cup and roll the dice.

Suppose you play a game in which you stake money on the outcome of your roll: each time you roll a double "6" you get back \$36 for every \$1 you staked (your \$1 plus \$35 profit), and otherwise you lose your stake. As long as the amount you stake each time is a small fraction of your total bankroll, then even after 20 or 30 turns you won't end up too far from your starting point, because the average gain is zero. (Your bankroll follows a drunkard's walk.)
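To check the claim that the average gain is zero, here is a minimal sketch of the expected profit per \$1 staked, assuming the payout figure is the total returned (stake included) on a double "6". The function name is illustrative; the same formula also gives the expected value for the \$30 and \$40 payouts discussed below.

```python
from fractions import Fraction

P_WIN = Fraction(1, 36)  # chance of a double six with two fair dice

def expected_profit(payout: int) -> Fraction:
    """Expected net profit per $1 staked, where `payout` is the total returned
    (stake included) on a double six and the stake is lost otherwise."""
    return P_WIN * (payout - 1) - (1 - P_WIN)

for payout in (30, 36, 40):
    print(payout, expected_profit(payout))
# 30 -1/6
# 36 0
# 40 1/9
```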

If instead of staking a small fraction of your total each time, you stake everything each time, then the situation changes. You end up with a small chance of a large gain, and a large chance of losing everything.

If the payout is \$40 instead of \$36, and you have the option of playing multiple turns, the situation changes further. Each bet now has a positive expected value, but over many turns you will typically do far better by staking only part of your bankroll each time, which reduces your risk of going bust. If you know how many more turns you can take, you can calculate the optimum amount to risk each time to achieve some particular aim (such as doubling your initial bankroll, or maximising your expected profits).
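For one particular aim, maximising the long-run growth rate of the bankroll, a standard tool is the Kelly criterion (other aims, such as reaching a fixed target within a fixed number of turns, need a different calculation). A minimal sketch, assuming the \$40 is the total returned per \$1 staked, so the net odds are 39 to 1; the helper names are illustrative.

```python
import math
from fractions import Fraction

P_WIN = Fraction(1, 36)  # chance of a double six
NET_ODDS = 40 - 1        # assumed: $40 returned per $1 staked, i.e. $39 profit

def kelly_fraction(p: Fraction, b: int) -> Fraction:
    """Kelly stake as a share of the bankroll: (b*p - (1 - p)) / b."""
    return (b * p - (1 - p)) / b

def log_growth(f: float, p: float, b: int) -> float:
    """Expected change in log(bankroll) per turn when staking fraction f each time."""
    return p * math.log(1 + f * b) + (1 - p) * math.log(1 - f)

f_star = kelly_fraction(P_WIN, NET_ODDS)
print(f_star)  # 1/351, i.e. about 0.28% of the bankroll per throw
for f in (float(f_star), 0.05, 0.5):
    rate = log_growth(f, float(P_WIN), NET_ODDS)
    print(f"stake {f:.4f} of bankroll: expected log growth per turn {rate:+.6f}")
```

With these numbers, staking 5% or 50% per throw gives a negative growth rate even though every individual bet has a positive expected value, which is the sense in which reducing the risk of going bust pays off.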

If the payout is \$30 instead of \$36, then on average you'll make a loss on every turn you take. Whether you stake half or even one hundredth of your current bankroll on every throw, if you keep taking turns then sooner or later you'll lose most of it, even though there's a small chance that you'll briefly blip above your initial bankroll. That blip can make a difference to your strategy, though.
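A minimal sketch of why the size of the stake doesn't rescue you, assuming the \$30 is the total returned per \$1 staked: it evaluates the expected change in log(bankroll) per turn across a grid of stake sizes. Because the expected value per \$1 is negative (-1/6), Jensen's inequality bounds this growth rate by log(1 - f/6), which is below zero for every stake fraction f, not just the ones sampled here.

```python
import math

P_WIN = 1 / 36     # chance of a double six
NET_ODDS = 30 - 1  # assumed: $30 returned per $1 staked, i.e. $29 profit

def log_growth(f: float) -> float:
    """Expected change in log(bankroll) per turn when staking fraction f each time."""
    return P_WIN * math.log(1 + f * NET_ODDS) + (1 - P_WIN) * math.log(1 - f)

# Try every stake from 1/100 up to 99/100 of the current bankroll.
stakes = [k / 100 for k in range(1, 100)]
best_rate = max(log_growth(f) for f in stakes)
print(f"best expected log growth on this grid: {best_rate:+.5f}")  # negative
```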

Suppose the last ticket out of Casablanca would cost you \$30,000 in bribes, and your initial bankroll is only \$1,000. Should you stake your entire bankroll on one throw of the dice, or should you parcel it out into 5 rolls of \$200 each? Your odds of escaping the Nazis in time by staking everything on one throw are about 0.028 (1 in 36). Your odds of escaping by winning all 5 rolls are only about 0.0000000165 (1 in 60,466,176), and even the possibility of winning some of them and then reinvesting the winnings in further rolls doesn't increase your odds by much. Under that scenario you are far better off staking everything on a single throw of the dice: putting all your eggs in one basket.
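The two headline probabilities can be checked exactly. A minimal sketch, assuming the \$30 payout from the previous paragraph counted as the total returned per \$1 staked, so a single winning \$1,000 throw returns exactly \$30,000 and each winning \$200 throw returns \$6,000; it ignores the reinvestment option.

```python
from fractions import Fraction

P_WIN = Fraction(1, 36)  # chance of a double six
PAYOUT = 30              # assumed: $30 returned per $1 staked
BANKROLL, TICKET = 1_000, 30_000

# Strategy 1: stake the whole $1,000 on one throw; one win returns $30,000.
one_throw = P_WIN if BANKROLL * PAYOUT >= TICKET else Fraction(0)

# Strategy 2: five separate $200 throws with no reinvestment.
stake = BANKROLL // 5
wins_needed = -(-TICKET // (stake * PAYOUT))  # ceiling division: 5 wins of $6,000
five_throws = P_WIN ** wins_needed

print(f"one throw:   {float(one_throw):.4f}")   # one throw:   0.0278
print(f"five throws: {float(five_throws):.2e}")  # five throws: 1.65e-08
```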

But, again, if you know in advance that you're going to have to escape Casablanca, you'd be best off trying, if you possibly can, to alter the situation to one in which making that dramatic throw isn't the least bad option.

## AIs in a race to control the world

Suppose you have a number of different designs for computer programs, each of which you think would, if released onto the internet with an initial bundle of resources and left to self-improve, become an AI that controls the world (unless beaten to it by another AI, or otherwise controlled). And suppose you are under a time constraint: for example, you expect that in 12 months' time a computer lab in an unfriendly nation will release its own AI candidates, and you don't think it has taken nearly enough precautions to ensure that the resulting AIs will be friendly.

Which would you be better off doing, assuming your objective is to maximise the chance that the AI or group of AIs that ends up controlling the world is friendly?

Option A: pick just one of your designs, the one you think has the greatest chance of turning out to be friendly, and release just that one.

or

Option B: pick several of your designs, release them all, and hope to influence the environment they exist and self-improve in (the internet, the computers connected to it, and the humans or machines the programs can manipulate or influence via those computers) so as to improve the odds that the eventual winner or winners of the race to self-improve and take control will be friendly.

Let's look at the situation where you have several AIs loose on the internet, at various points along their self-improvement curves, but none of them is yet FOOMing so fast that it has outstripped the others far enough to have total and effortless control over them (and the world, and everything else). None of them has a monopoly on the computing resources connected to the internet, or even control over a majority of the resources controlled by AIs.

I am willing to stipulate that if an AI can write code to improve its own thinking then, given a task large enough to make developing and running that code a worthwhile investment, it will value doing so in the name of efficient resource usage, and will intend to do so unless other factors (such as the risk of external penalties, or internal constraints from its own objectives or core values) outweigh the value it would gain.

Suppose, though, that initially there is a risk of external penalties if the AI is seen to self-improve, or to self-improve too fast, because the other AIs it shares the internet with have initial values that include wanting to monitor other AIs and to cooperate in controlling how fast they may self-improve.

And let's consider the situation where the 'smartest' of those AIs (by which I mean the one that, if unopposed, could finish going FOOM the fastest) intends to go 'rogue' (by which I mean that it intends to self-improve as fast as it possibly can, despite an expressed wish from an opposing group of humans and AIs that it not do so).

## Assumptions

If we assume that:

1. The rogue AI could be successfully opposed only if its 'effective IQ' is just a small amount ahead of that of the 'smartest' members of the opposing group because, given a sufficient advantage in 'effective IQ', it could outthink the collective, or even hide from them that it had gone rogue in the first place, by deceiving them about how much its alterations were really improving it.

2. When an AI reaches a certain threshold point along its self-improvement curve, it self-improves faster than outsiders can notice and analyse the changes. In other words, once several of the AIs loose on the internet raise their 'ability to self-improve quickly' above a certain level, the first AI to decide to go rogue and use that ability to the full will boost past the ones holding back so fast that, by the time the others realise and try to stop it, it is already too late: the rogue AI has already gained a lead large enough to out-think the others and leverage that into a monopoly and total control.

3. Before the group of AIs reaches that threshold (beyond which they can no longer be controlled even by each other), if a majority of the AIs in the group happen actually to be friendly (and if enough of those friendly AIs have reached more-than-human-level intelligence before reaching the threshold), then humanity might benefit by asking them to vote on which among them is, in the opinion of the other AIs, the most suitable candidate for the job of 'Uber AI' (controller of the world).

4. However, that possible benefit is outweighed by the risks entailed in releasing onto the internet not just our best candidate but several (the rest of which, presumably, we think have lower chances of turning out to be friendly). In other words, any brief influence we can exert while the AIs are still relatively powerless is insufficient to make the question of which one goes FOOM first significantly less random.

then our best option is Option A: putting all our eggs in one basket, by waiting as long as possible before releasing just one candidate, and using the time remaining before our hand is forced to make that one candidate as reliably friendly as possible.

## What could we change that might alter which of those assumptions still hold true?

However, since we know in advance that this is the sort of decision someone is eventually going to face (if we're lucky, and an uncontrolled AI doesn't get released by accident first), we should, if at all possible, look at how we could alter the situation (the designs of the AIs, and the environment they self-improve in) so that one or more of those assumptions no longer holds, and 'all our eggs in one basket' is no longer the least bad strategy available.

It might turn out that, after study, we decide Option A is still the best option. But the stakes are high enough that I think it is at least worth spending some time on the study to make sure.

## Potential Gains

In one of his posts, Yudkowsky talks about the "fragility of value", claiming that it is not sufficient to get the values of your attempted Friendly AI 90% correct because, further down the line, the version of the universe the AI would push for may diverge increasingly from what we'd actually want.

One of the nice things about having a group of AIs advising you is that you don't need a majority of them to have their definition of 'friendliness' 100% perfect. You just need a majority of them to be friendly enough that, when asked, they will be honest and give you their best estimate of what the resulting universe would look like for each candidate, were that candidate put in charge, rather than using lies or spin to steer your choice towards a candidate whose vision of 'friendliness' they share.

Another way of looking at it is that, instead of releasing onto the internet a flock of candidates each of which you think has a good chance of self-improving into an all-powerful yet perfectly safe, 100% friendly AI, you try to release a flock of AIs each of which you think has a greater than 50% chance of self-improving into an honest adviser, powerful and wise enough that its advice improves your chances of picking (or even of designing) a 100% friendly AI.

That's a much lower bar to jump over than having to achieve perfection first go.
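The arithmetic behind that lower bar resembles Condorcet's jury theorem: if each adviser is more likely than not to be honest, then the chance that a majority of an odd-sized group is honest rises as the group grows. A minimal sketch, assuming, hypothetically and optimistically, that each AI turns out to be an honest adviser with the same probability p and independently of the others; the 0.6 figure and the function name are illustrative.

```python
from math import comb

def p_majority_honest(n: int, p: float) -> float:
    """Chance that more than half of n advisers are honest, assuming each is honest
    with probability p independently of the others (a hypothetical simplification)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))

# Odd group sizes avoid tied votes; 0.6 is an illustrative per-adviser probability.
for n in (1, 5, 15, 45):
    print(f"{n:>2} advisers: P(honest majority) = {p_majority_honest(n, 0.6):.3f}")
```

If the advisers' flaws are correlated, for example because they share a design or a development process, the gain from adding more of them shrinks, so the independence assumption does a lot of the work here.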

The next article in this series is: Defect or Cooperate
