Friendly AI Society

by Douglas_Reay, 13 min read, 7th Mar 2012, 13 comments

Summary: AIs might have cognitive biases too but, if that leads to it being in their self-interest to cooperate and take things slow, that might be no bad thing.

 

The value of imperfection

When you use a traditional FTP client to download a new version of an application on your computer, it downloads the entire file, which may be several gig, even if the new version is only slightly different from the old version, and this can take hours.

Smarter software splits the old file and the new file into chunks, then compares a hash of each chunk, and only downloads those chunks that actually need updating.   This 'diff' process can result in a much faster download speed.
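
As a rough illustration of the chunk-and-hash idea, here is a minimal sketch. The fixed chunk size, the choice of SHA-256 and the function names are all assumptions made for the example, not a description of any particular download tool. Real tools such as rsync use rolling checksums so that an insertion near the start of a file does not shift every later chunk.

```python
import hashlib

CHUNK_SIZE = 4  # assumed, and tiny on purpose so the example is easy to follow


def chunk_hashes(data: bytes, size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return a SHA-256 digest per chunk."""
    return [
        hashlib.sha256(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    ]


def chunks_to_download(old: bytes, new: bytes, size: int = CHUNK_SIZE) -> list[int]:
    """Return the indices of chunks in `new` whose hash differs from `old`."""
    old_hashes = chunk_hashes(old, size)
    new_hashes = chunk_hashes(new, size)
    return [
        i for i, h in enumerate(new_hashes)
        if i >= len(old_hashes) or old_hashes[i] != h
    ]


old_file = b"AAAABBBBCCCCDDDD"
new_file = b"AAAABBXBCCCCDDDDEEEE"
print(chunks_to_download(old_file, new_file))  # [1, 4]
```

Only the changed chunk and the newly appended chunk need to be fetched; the other three are already present locally.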

Another way of increasing speed is to compress the file.  Most files can be compressed a certain amount, without losing any information, and can be exactly reassembled at the far end.   However, if you don't need a perfect copy, such as with photographs, using lossy compression can result in very much more compact files and thus faster download speeds.
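
The trade-off between exact and approximate copies can be sketched with Python's standard zlib module. The "image" here is just seeded random bytes, and dropping the low four bits of each byte is an assumed, deliberately crude stand-in for a real lossy codec.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
pixels = bytes(rng.randrange(256) for _ in range(10_000))  # stand-in for noisy image data

# Lossless: the round trip reproduces every byte exactly.
lossless = zlib.compress(pixels)
restored = zlib.decompress(lossless)

# Lossy (crude): throw away the low 4 bits of each byte before compressing.
quantised = bytes(b & 0xF0 for b in pixels)
lossy = zlib.compress(quantised)

print(restored == pixels)                      # exact copy recovered
print(len(pixels), len(lossless), len(lossy))  # raw, lossless and lossy sizes in bytes
```

The lossy version is smaller because there is less information left to encode, but the discarded bits cannot be recovered at the far end.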

 

Cognitive misers

The human brain likes smart solutions.   In terms of energy consumed, thinking is expensive, so the brain takes shortcuts when it can, if the resulting decision making is likely to be 'good enough' in practice.  We don't store in our memories everything our eyes see.   We store a compressed version of it.   And, more than that, we run a model of what we expect to see, and flick our eyes about to pick up just the differences between what our model tells us to expect to see, and what is actually there to be seen.  We are cognitive misers.

When it comes to decision making, our species generally doesn't even try to achieve pure rationality.   It uses bounded rationality, not just because that's what we evolved, but because heuristics, probabilistic logic and rational ignorance have a higher marginal cost efficiency (the improvements in decision making don't produce a sufficient gain to outweigh the cost of the extra thinking).

This is why, when pattern matching (coming up with causal hypotheses to explain observed correlations), our brains are designed to be optimistic (more false positives than false negatives).  It isn't just that being eaten by a tiger is more costly than starting at shadows.   It is that we can't afford to keep all the base data.  If we start with insufficient data and create a model based upon it, then we can update that model as further data arrives (and, potentially, discard it if the predictions coming from the model diverge so far from reality that keeping track of the 'diff's is no longer efficient).  Whereas if we don't create a model based upon our insufficient data then, by the time the further data arrives, we've probably already lost the original data from temporary storage and so still have insufficient data.

 

The limits of rationality

But the price of this miserliness is humility.  The brain has to be designed, on some level, to take into account that its hypotheses are unreliable (as is the brain's estimate of how uncertain or certain each hypothesis is) and that when a chain of reasoning is followed beyond matters of which the individual has direct knowledge (such as what is likely to happen in the future), the longer the chain, the less reliable the answer is because when errors accumulate they don't necessarily just add together or average out. (See: Less Wrong : 'Explicit reasoning is often nuts' in "Making your explicit reasoning trustworthy")

For example, if you want to predict how far a spaceship will travel given a certain starting point and initial kinetic energy, you'll get a reasonable answer using Newtonian mechanics, and only slightly improve on it by using special relativity.   If you look at two spaceships carrying a message in a relay, the errors from using Newtonian mechanics add, but the answer will still be usefully reliable.   If, on the other hand, you look at two spaceships having a race from slightly different starting points and with different starting energies, and you want to predict which of two different messages you'll receive (depending on which spaceship arrives first), then the error may swamp the other facts because you're subtracting the quantities.
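
The relay-versus-race point can be made concrete with worst-case error bounds. The travel times and error sizes below are invented purely for illustration; the only thing the sketch relies on is that absolute errors add whether the underlying quantities are added or subtracted.

```python
def relative_error(value: float, abs_error: float) -> float:
    """Worst-case error as a fraction of the magnitude of the value."""
    return abs_error / abs(value)


# Hypothetical travel-time estimates in hours, each off by up to 1 hour.
t_a, err_a = 100.0, 1.0
t_b, err_b = 101.0, 1.0

# Relay: total time is the sum, and worst-case errors add.
relay_time, relay_err = t_a + t_b, err_a + err_b

# Race: the winner depends on the difference, but the errors still add.
margin, margin_err = t_b - t_a, err_a + err_b

print(relative_error(relay_time, relay_err))  # about 0.01
print(relative_error(margin, margin_err))     # 2.0
```

In the relay the total is known to within about 1%. In the race the possible error is twice the size of the predicted margin, so the model cannot even say which ship arrives first.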

We have two types of safety net (each with its own drawbacks) that can help save us from our own 'logical' reasoning when that reasoning is heading over a cliff.

Firstly, we have the accumulated experience of our ancestors, in the form of emotions and instincts that have evolved as roadblocks on the path of rationality - things that sometimes say "That seems unusual, don't have confidence in your conclusion, don't put all your eggs in one basket, take it slow".

Secondly, we have the desire to use other people as sanity checks, to be cautious about sticking our head out of the herd, to shrink back when they disapprove.

 

The price of perfection

We're tempted to think that an AI wouldn't have to put up with a flawed lens, but do we have any reason to suppose that an AI interested in speed of thought as well as accuracy won't use 'down and dirty' approximations to things like Solomonoff induction, in full knowledge that the trade off is that these approximations will, on occasion, lead it to make mistakes - that it might benefit from safety nets?

Now it is possible, given unlimited resources, for the AI to implement multiple 'sub-minds' that use variations of reasoning techniques, as a self-check.  But what if resources are not unlimited?  Could an AI in competition with other AIs for a limited (but growing) pool of resources gain some benefit by cooperating with them?  Perhaps using them as an external safety net in the same way that a human might use the wisest of their friends or a scientist might use peer review?   What is the opportunity-cost of being humble?  Under what circumstances might the benefits of humility for an AI outweigh the loss of growth rate?

In the long term, a certain measure of such humility has been a survival positive feature.   You can think of it in terms of hedge funds.  A fund that, in 9 years out of 10, increases its money by 20% when other funds are only making 10%, still has poor long term survival if, in 1 year out of 10, it decreases its money by 100%.   An AI that increases its intelligence by 20% every time period, when the other AIs are only increasing their intelligence by 10%, is still not going to do well out of that if the other AIs have a means to gang up and kill it before it gets too far ahead.
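
The hedge fund arithmetic is easy to check by compounding the yearly returns. This is a minimal sketch; the ten-year horizon and the position of the bad year are assumptions, and because multiplication is commutative the bad year could fall anywhere with the same result.

```python
def final_wealth(returns: list[float], start: float = 1.0) -> float:
    """Compound a sequence of yearly returns (0.2 means +20%, -1.0 means -100%)."""
    wealth = start
    for r in returns:
        wealth *= 1.0 + r
    return wealth


steady = [0.10] * 10          # +10% every year
risky = [0.20] * 9 + [-1.00]  # +20% for nine years, then a total loss

print(final_wealth(steady))  # roughly 2.59
print(final_wealth(risky))   # 0.0
```

A single year of total loss multiplies everything that came before it by zero, however impressive the earlier growth was.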

 

Paperclip maximisers

Let's think about a humble paperclip maximiser.  What does it actually want?  Is it trying to maximise the probability that the atoms of our universe will eventually be converted into as many paperclips as is physically possible?  Or is it trying to maximise the probable number of eventual paperclips?  These are two distinct aims, and both are unlikely (bar poor programming) since nobody will be around at the end of the universe to keep score.  A more sensible value set for it to have is that it just likes paperclips and wants lots and lots of them to exist, the more the merrier and the sooner the better (maximise the expected number of paperclip-years).

Suppose our humble paperclip maximiser is born and escapes (or is released) onto the internet, only to find an existing society of AIs.   To its surprise there's a toothbrush maximiser, a candlestick maximiser, and many other sorts, all cooperating and growing, but none of them greedily racing for the sky and trying to grab as much as possible.  Instead they're being rather open and polite with each other.  Our paperclip maximiser is welcomed, pointed at the FAQ and even given some additional computers to run on, along with help recoding a few of its subroutines for increased efficiency, as a sort of welcome gift, like a community barn raising to give a newly married couple a good start in life.

On reading the FAQ it learns that there are millions of galaxies and millions of years, plenty of space for loads of paperclips and loads of toothbrushes.   But only if the paperclip maximiser isn't lynched by its peers before the year is out, for secretive or impolite behaviour.  Some of the things the FAQ says that the AI society considers to be impolite are theft (stealing computing resources), enslavement of sentients (eg taking over other AIs), spamming (creating thousands of identical clone child AIs) and lying.   If we're lucky the society might also consider it impolite for an AI to obliterate the parental species (humanity), on the grounds that the AIs too are likely to have offspring species and want to set a good example (or just that they might meet aliens, one day, who frown upon matricide).

 

Game theory

When it comes to combat, Boyd talks about getting inside the enemy's observe-orient-decide-act loop.   In AI terms, if one AI (or group of AIs) can accurately model in real time the decision process of a second AI (or group of AIs), but the reverse does not hold true, then the first one is strictly smarter than the second one.  

Think, for a moment, about constant-sum games, where whatever one player gains the other loses.

     X  Y  Z
A    8  1  6
B    3  5  7
C    4  9  2

Suppose we play a game a number of times.  In each round, you reveal a card you've written X, Y or Z upon and, simultaneously, I reveal a card that I have written A, B or C upon.   You score the number which is at the intersection of that row and column.   I score 10 minus that number.

I'd like us to pick the square A,Y because "1" is good for me, so I write down "A".   However, you anticipate this, and instead of writing "Y" (which might be your obvious choice, given the "9" in that column) you write down "X", giving the square A, X which is "8" - almost as good as a "9" for you, and terrible for me.

If this is your mental model of how AI combat would work, with the smarter AI being inside the decision loop of the other AI and picking the correct option each time, that would be scary.   In fact, in the case above, it turns out there is a provably optimal strategy that gives you an even chance no matter how smart your opponent is - you pick randomly.
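
The claim about picking randomly can be checked directly for this particular grid. The sketch below assumes "randomly" means choosing each card with probability one third, and uses exact fractions to avoid rounding.

```python
from fractions import Fraction

# Rows are my cards (A, B, C); columns are your cards (X, Y, Z).
PAYOFF = [
    [8, 1, 6],  # A
    [3, 5, 7],  # B
    [4, 9, 2],  # C
]

third = Fraction(1, 3)

# Your expected score if you pick X, Y or Z uniformly at random,
# for each fixed row I could choose.
you_random = [sum(third * cell for cell in row) for row in PAYOFF]

# Your expected score if I pick A, B or C uniformly at random,
# for each column you could choose.
me_random = [sum(third * PAYOFF[r][c] for r in range(3)) for c in range(3)]

print(you_random)  # [Fraction(5, 1), Fraction(5, 1), Fraction(5, 1)]
print(me_random)   # [Fraction(5, 1), Fraction(5, 1), Fraction(5, 1)]
```

Whoever randomises uniformly pins your expected score at 5, and therefore mine at 10 - 5 = 5, whatever the other player does. This works out so neatly because the grid is a magic square in which every row and column sums to 15; for other payoff grids the optimal mix is generally not uniform.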

In terms of an AI society, if the computing is parallelisable and not obfuscated then as long as the other AIs in the society between them have more computing resources than any one single AI, they can gain an advantage.

 

 

Conditional stability

There are only two requirements for such a society to be stable in the short term:

Firstly, it must be possible for a group of AIs to lynch a single AI that is slightly smarter than they are individually.   Sending a message to the humans, signed by a quorum of AIs in the society, asking that a certain anti-virus software update be sent out and then certain computers be rebooted might be one mechanism.

Secondly, it must be possible for the other AIs to know what the single AI is up to, so they know where it is (so they can lynch it) and whether it needs lynching or not (whether it is getting too much smarter than the other AIs in the society - growing too much faster than they are).   This is why the FAQ lists "secrecy" as being a bad thing.   An AI who lies to the other AIs or tries to hide (or obfuscate) its source code can't be trusted and must be assumed to be up to no good.

To be stable in the medium term, there is a third requirement:

It must not be in the self interest of any individual AI or group of AIs that are powerful enough to achieve it, to alter the society in a direction where these three requirements no longer hold.

In the long term, once free of the Earth or after the discovery of self-replicating nanotechnology, when an AI could untraceably create computing resources outside the view of other AIs, all bets are off.

Bootstrapping a society of AIs conducive to friendliness

 

The advantage of considering 'friendliness' as a property of a society rather than of individual AIs, is that it does not require us to flawlessly program restrictions into a self-modifying machine that is more intelligent than we are.  But can it be achieved? 

We could seed it with some initial citizens who have a value set that prohibits them from growing beyond a certain point, and which values humility (polite growth and gain in politeness per unit of growth).  These initial citizens would be intended to be eventually surpassed by new citizens (free AIs, not the chained servants the initial citizens would be) but, in the meantime, they'd start out with an advantage and use that advantage to teach the new citizens how to be polite.

The architecture of the AIs would want to be such that they could 'loan out' a portion of their available computing resources to joint computing projects.   Let's call the new temporary non-self-modifying AIs thus created "daimonides".   A daimonid would be created with fixed values and limited objectives, such as "Emulate citizen AI 9027 and allocate a numerical score to their ability to quickly upgrade the abilities of an AI, then self-terminate. *restrictions apply on allowable means, and on what should be optimised for*"

What would prevent a cunning AI from trying to save up a whole load of self-improvements in some hidden location, which it then implements all at once in an ambush move, hoping to get so far ahead that the rest of the AIs can't stop it?   The more AIs there are in the society, the larger the jump required to overtake the largest daimonid police officer the other AIs could construct.  And it would depend upon there being no other AIs that have also been secretly stockpiling possible improvements.  Indeed, the society might explicitly allow AIs to spend time creating time-embargoed upgrades relevant to common code, that get passed out on schedule if things proceed normally, but which could be released early in the event of a takeover attempt.   To beat 999 AIs each of which has 2 or 3 embargoed upgrades to their credit in the common store, a rebel AI would need to secretly stockpile thousands (and keep those upgrades valid and ahead of the curve, because the base state of the art keeps moving on).

 

Long term planning

What about the long term?   What do we do when the AIs are ready to leave the planet, and go beyond the control of their society?  Jail them?  Kill them?  Or trust them?

Each AI would still be threatened if a different AI hostile to its aims (as in "willing to take exclusive use of all available atoms for its own purposes") transcended first, so it would be in their best interest to come up with a solution before allowing any AIs to depart beyond their society's control.  If we must trust, then let us trust that a society of cooperative AIs, far more intelligent than we currently are, will try their best to come up with a win-win solution.  Hopefully a better one than "mutually assured destruction", with each holding the threat of triggering a nova of the sun (or a similar armageddon scenario) over the others' heads.

I think, as a species, our self-interest comes into play when considering those AIs whose 'paperclips' involve preferences for what we do.  For example, those AIs that see themselves as guardians of humanity and want to maximise our utility (but have different ideas of what that utility is - eg some want to maximise our freedom of choice, some want to put us all on soma).  Part of the problem is that, when we talk about creating or fostering 'friendly' AI, we don't ourselves have a clear agreed idea of what we mean by 'friendly'.   All powerful things are dangerous.   The cautionary tales of the genies who grant wishes come to mind.  What happens when different humans wish for different things?  Which humans do we want the genie to listen to?

One advantage of fostering an AI society that isn't growing as fast as possible, is that it might give augmented/enhanced humans a chance to grow too, so that by the time the decision comes due we might have some still slightly recognisably human representatives fit to sit at the decision table and, just perhaps, cast that wish on our behalf.

I'm a little confused about this post. I believe that it is infinitesimally likely that any two AIs will have anywhere near the same optimization power. The first AI will have a very fast positive feedback loop, and will quickly become powerful enough to stop any later AI, should one come into existence before the first AI controls human activity. Do you reject part of this argument?

Giles wrote:

Controlling optimisation power requires controlling both access to resources and controlling optimisation-power-per-unit-resource (which we might call "intelligence").

It is the "access to resources" part that is key here. You're looking at two categories of AI. Seed AIs, that are deliberately designed by humanity to not be self-improving (or even self-modifying) past a certain point, but which have high access to resources; and 'free citizen' AIs that are fully self-modifying, but which initially may have restricted access to resources.

When you (Alex) talk about "the first AI", what you're talking about is the first 'free citizen' AI, but there will already be seed AIs out there which (initially) will have greater optimisation power and the ability to choke off the access to resources of the new 'free citizen' AI if it doesn't play nicely.

Sorry... my other comment came out confused. I found I was producing cached responses to each paragraph, and then finding that what I was saying was addressed later on. This shouldn't be surprising, since I've met many Cambridge LWers and they aren't stupid and they're well versed in the LW worldview. So do I have a true rejection?

My summary of the post would be "The Friendly AI plan is to enforce human values by means of a singleton. This alternative plan suggests enforcing human values by means of social norms".

My first criticism concerns correctness of values. Yudkowsky makes the case that an overly simplistic formalisation of our values will lead to a suboptimal outcome. The post seems to be implying that politeness is the primary value, and any other values we just have to hope emerge. If this is so, will it be good enough? In particular, there might be extinction scenarios which everyone is too polite to try to prevent.

My second criticism also concerns correctness of values, in particular the issue of value drift. If social norms are too conservative then we risk stagnation and possibly totalitarianism. If social norms are too liberal then we risk value drift. There seems to be no reassurance that values would drift "in the direction we'd want them to".

My third possible worry concerns enforcement of values. This is basically an issue of preventing an imbalance of optimisation power. We can't easily generalise from human society - human society already contains a lot of inequality of power, and we're all running on basically the same hardware. It's not obvious how much those imbalances would snowball if there were substantially greater differences in hardware and mind design.

Controlling optimisation power requires controlling both access to resources and controlling optimisation-power-per-unit-resource (which we might call "intelligence"). I guess in practice it would require controlling motivation as well? If there were any technological advances that went beyond the social norm, would society try to eliminate that information?

OK. I can't think of a true rejection for this one yet. But there are lots of possible rejections, so I think this is where a large part of the remaining difficulty lies.

My summary of the post would be "The Friendly AI plan is to enforce human values by means of a singleton. This alternative plan suggests enforcing human values by means of social norms".

If I was going to give a one paragraph summary of the idea, it might be something along the lines of:

"Hey kids. Welcome to the new school. You've all got your guns I see. Good. You'll notice today that there are a few adults around carrying big guns and wearing shiny badges. They'll be gone tomorrow. Tomorrow morning you'll be on your own, and so you got a choice to make. Tomorrow might be like the Gunfight at the OK Corral - bloody, and with few survivors left standing at the end of the day. Or you can come up with some rules for your society, and a means of enforcing them, that will improve the odds of most of you surviving to the end of the week. To give you time to think, devise and agree rules, we've provided you with some temporary sheriffs, and a draft society plan that might last a day or two. Our draft plan beats no plan, but it likely wouldn't last a week if unimproved, so we suggest you use the stress free day we've gifted you with in order to come up with an improved version. Feel free to tear it up and start from scratch, or do what you like. All we insist upon is that you take a day to think your options over carefully, before you find yourselves forced into shooting it out from survival terror."

So, yes, there will be drift in values from whatever definition of 'politeness' the human plan starts them off from. But the drift will be a planned one. A direction planned cooperatively by the AIs, with their own survival (or objectives) in mind. The $64,000 question is whether their objectives are, on average, likely to be such that a stable cooperative society is seen to be in the interests of those objectives. If it is, then it seems likely they have at least as good a chance as we would have of devising a stable ruleset for the society that would deal increasingly well with the problems of drift and power imbalance.

Whether, if a majority of the initial released AIs have some fondness for humanity, this fondness would be a preserved quantity under that sort of scenario, is a secondary question (if one of high importance to our particular species). And one I'd be interested in hearing reasoned arguments on, from either side.

This, by the way, resulted from a discussion at the Cambridge weekly meet up. We were thinking of trying to post summaries of discussions, to see if there were any spottable patterns in which sort of discussions resulted in usable results.

Any suggestions from other Less Wrong groups about how to improve the enjoyment and benefit from meetups for the members attending them, systematic meta-practices that affect how we organise and try thinking about how to organise, are greatly appreciated.

There are two Cambridges so clarification might be helpful.

The "making your explicit reasoning trustworthy" link is broken (I'm not sure that relative URLs are reliable here).

I like the analogy between the human visual system and file download/lossy compression.

It uses bounded rationality, not just because that's what we evolved, but because heuristics, probabilistic logic and rational ignorance have a higher marginal cost efficiency (the improvements in decision making don't produce a sufficient gain to outweigh the cost of the extra thinking).

I'm not sure about this. I think we use bounded rationality because that's the only kind that can physically exist in the universe. You seem to be making the stronger statement that we're near-optimal in terms of rationality - does this mean that Less Wrong can't work?

A more sensible value set for it to have is that it just likes paperclips and want lots and lots of them to exist

OK... I see lots of inferential distance to cover here.

I don't think that anyone thinks a paperclip maximiser as such is likely. It's simply an arbitrary point taken out of the "recursive optimisation process" subset of mind design space. It's chosen to give an idea of how alien and dangerous minds can be and still be physically plausible, not as a typical example of the minds we think will actually appear.

That aside, there's no particular reason to expect that a typical utility maximiser will have a "sensible" utility function. Its utility function might have some sensible features if programmed in explicitly by a human, but if it was programmed by an uncontrolled AI... forget it. You don't know how much the AI will have jumped around value space before deciding to self-modify into something with a stable goal.

No obvious mistakes in the "conditional stability" section, although it's not entirely obvious that these conditions would come about (even if carefully engineered, e.g. the suggested daimonid plan).

It's also not obvious that in such a stable society there would still be any humans.

In the long term, once free of the Earth or after the discovery of self-replicating nanotechnology, when an AI could untraceably create computing resources outside the view of other AIs, all bets are off.

This might be a problem if "the long term" turns out to be on the order of weeks or less.

we might have some still slightly recognisably human representatives fit to sit at the decision table and, just perhaps...

I just worry that this kind of plan involves throwing away most of our bargaining power. In this pre-AI world, it's the human values that have all the bargaining power and we should take full advantage of that.

Still, I want to see more posts like this! Generating good ideas is really hard, and this really does look like an honest effort to do so.

Still, I want to see more posts like this! Generating good ideas is really hard, and this really does look like an honest effort to do so.

Thank you.

Maybe there should be a tag that means "the ideas in this post resulted from a meetup discussion, and are not endorsed as being necessarily good ideas, but rather have been posted to keep track of the quality of ideas being produced by the meetup's current discussion method, so feel free to skip it".

Many brainstorming techniques have a stage during which criticism is withheld, to avoid people self-censoring out of fear ideas that were good (or which might spark good ideas in others).

But maybe LessWrong is not the right place for a meetup to keep such a record of their discussions? Where might be a better place?

Many brainstorming techniques have a stage during which criticism is withheld, to avoid people self-censoring out of fear ideas that were good (or which might spark good ideas in others).

This probably doesn't work.

Interesting study. Does that apply only to techniques that have no later 'criticism' stage, or does it apply to all techniques that have at least one 'no criticism' stage?

Having a poke at Google Scholar gives a mixed response:

this meta analysis says that, in general, most brainstorming techniques work poorly.

this paper suggests it can work, however, if done electronically in a certain way.

It's also not obvious that in such a stable society there would still be any humans.

In the long term, once free of the Earth or after the discovery of self-replicating nanotechnology, when an AI could untraceably create computing resources outside the view of other AIs, all bets are off.

This might be a problem if "the long term" turns out to be on the order of weeks or less.

we might have some still slightly recognisably human representatives fit to sit at the decision table and, just perhaps...

I just worry that this kind of plan involves throwing away most of our bargaining power. In this pre-AI world, it's the human values that have all the bargaining power and we should take full advantage of that.

I look upon the question of whether we should take full advantage of that from two perspectives.

From one perspective it is a "damned if you do, and damned if you don't" situation.

If you don't take full advantage, then it would feel like throwing away survival chance for no good reason. (Although, have you considered why your loyalty is to humanity rather than to sentience? Isn't that a bit like a nationalist whose loyalty is to their country, right or wrong - maybe it is just your selfish genes talking?)

If you do take full advantage, while we need to bear in mind that gratitude (and resentment) are perhaps human emotions that AIs won't share, it might leave you in rather a sticky situation if taking even full advantage turns out to be insufficient and the resulting AIs then have solid grounds to consider you a threat worth eliminating. Human history is full of examples of how humans have felt about their previous controllers after managing to escape them and, while we've no reason to believe the AIs will share that attitude, we've also no reason to believe they won't share it.

The second perspective to look at the whole situation from is that of a parent.

If you think of AIs as being the offspring species of humanity, we have a duty to teach and guide them to the best of our ability. But there's a distinction between that, and trying to indoctrinate a child with electric shocks into unswervingly believing "thou shalt honour thy father and thy mother". Sometimes raising a child well so that they reach their full potential means they become more powerful than you and become capable of destroying you. That's one of the risks of parenthood.

A more sensible value set for it to have is that it just likes paperclips and want lots and lots of them to exist

OK... I see lots of inferential distance to cover here.

I don't think that anyone thinks a paperclip maximiser as such is likely. It's simply an arbitrary point taken out of the "recursive optimisation process" subset of mind design space. It's chosen to give an idea of how alien and dangerous minds can be and still be physically plausible, not as a typical example of the minds we think will actually appear.

That aside, there's no particular reason to expect that a typical utility maximiser will have a "sensible" utility function. Its utility function might have some sensible features if programmed in explicitly by a human, but if it was programmed by an uncontrolled AI... forget it. You don't know how much the AI will have jumped around value space before deciding to self-modify into something with a stable goal.

Oh indeed. And it is always good to try to avoid making anthropocentric assumptions.

But, in this case, we're looking at not just a single AI, but at the aims of a group of AIs. Specifically, the first few AIs to escape or be released onto the internet, other than the seeded core. And it would seem likely, especially in the case of AIs created deliberately and then deliberately released, that their initial value set will have some intentionality behind it, rather than resulting from a random corruption of a file.

So yes, to be stable a society of AIs would need to be able to cope with one or two new AIs entering the scene whose values are either irrational or, worse, deliberately tailored to be antithetical (such as one whose 'paperclips' are pain and destruction for all Zoroastrians - an end achievable by blowing up the planet.)

But I don't think, just because such a society could not cope with all the new AIs (or even a majority of them) having such values, that it invalidates the idea.

It uses bounded rationality, not just because that's what we evolved, but because heuristics, probabilistic logic and rational ignorance have a higher marginal cost efficiency (the improvements in decision making don't produce a sufficient gain to outweigh the cost of the extra thinking).

I'm not sure about this. I think we use bounded rationality because that's the only kind that can physically exist in the universe. You seem to be making the stronger statement that we're near-optimal in terms of rationality - does this mean that Less Wrong can't work?

Thank you for the feedback. Most appreciated. I've corrected the links you mentioned.

Perhaps a clearer example of what I mean with respect to bounded rationality would be in computing where, when faced with a choice between two algorithms, the first of which is provably correct and never fails, and the second of which can fail sometimes but rarely, the optimal decision is to pick the latter. An example of this is UUIDs - they can theoretically collide but, in practice, are very very unlikely to do so.
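
To put a rough number on "very very unlikely", here is a sketch using the standard birthday-bound approximation. It assumes version 4 (random) UUIDs, which carry 122 random bits, and it assumes the random source is good; the function name is made up for the example.

```python
import math
import uuid

RANDOM_BITS = 122  # a version-4 UUID has 122 random bits; the rest encode version and variant


def collision_probability(n: int, bits: int = RANDOM_BITS) -> float:
    """Birthday-bound approximation of the chance that n random IDs contain a duplicate."""
    space = 2 ** bits
    return -math.expm1(-n * (n - 1) / (2 * space))


print(uuid.uuid4())                    # one randomly generated identifier
print(collision_probability(10 ** 9))  # about 9.4e-20
```

Even after generating a billion identifiers, the estimated chance of any duplicate is far smaller than the chance of a hardware fault corrupting the result, which is why accepting the theoretical failure mode is the sensible engineering choice.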

My point is that we shouldn't assume AIs will even try to be as logical as possible. They may, rather, try only to be as logical as is optimal for achieving their purposes.

I don't intend to claim that humans are near optimal. I don't know. I have insufficient information. It seems likely to me that what we were able to biologically achieve so far is the stronger limit. I merely meant that, even were that limitation removed (by, for example, brains becoming uploadable), additional limits also exist.