Consider the following scenario. MIRI succeeds beyond my wildest expectations. It comes up with a friendliness theory, and then uses it to make provably friendly AGI before anyone else can make an unfriendly one. And then a year and a half later, we find that Eliezer Yudkowsky has become the designated god-emperor of the lightcone, and the rest of the major MIRI researchers are his ministers. Whoops.

 

My guess for the probability of this type of scenario, given a huge MIRI success along those lines, is around 15%. The reasoning is straightforward. (1) We don't know what's going on inside any particular person's head. (2) Many or most humans are selfish. (3) Looking altruistic is more likely to draw support than explicitly setting out to take over the world. (4) Human acting abilities, while limited, are likely adequate (for example, spies seem quite successful at concealing their motives). I'd say those four things are reasonably independent and jointly sufficient for some deception to be happening, so guessing at some probabilities, it works out to something like 1 × 0.5 × 0.8 × 0.5 = 0.2†, at least if the person is sufficiently determined to achieve their goal no matter what.
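Spelled out, for concreteness (the labels are just shorthand for the four conditions, and every number is nothing better than a guess):

```python
# Back-of-the-envelope version of the product above; every figure is a rough guess.
p_opaque = 1.0             # (1) we can't see inside anyone's head
p_selfish = 0.5            # (2) this particular person is selfish
p_altruistic_cover = 0.8   # (3) they'd pose as altruistic rather than grab power openly
p_acting = 0.5             # (4) their acting is good enough to sustain the front

print(p_opaque * p_selfish * p_altruistic_cover * p_acting)  # 0.2
```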

Or to put it in a more interesting and LW-referring way, it has been argued that we might not be able to distinguish friendly from unfriendly AGI during their "take-off" period, since both would probably act the same way to take over the world (before optimising it in a friendly or unfriendly way). We might also expect that as a person becomes more agenty, their behaviour more closely approximates that of an AGI, since an AGI is pretty much the embodiment of agency. It certainly seems like Eliezer is quite agenty for a human, so clearly we presently find ourselves in his take-off period!

 

So, supposing we don't want god-emperor!Eliezer, what do we do? First, I think it is important to recognise that this problem is almost certainly not urgent. Selfish-Eliezer!optimisations and altruistic-Eliezer!optimisations probably require nearly the same work to not end up being paperclips!optimisations, so for the near future it doesn't particularly matter if he's secretly out to take over the world or not. In addition, having Eliezer be the god-emperor of the lightcone is probably not all that bad—it might not be optimal, but it's likely to be very much better than any unfriendly outcome, and probably better than the status quo. It does, however, seem like a useful issue to address in the longer term for any organisation that plans to make a friendly(-ish) AGI.

 

Off-hand, I can think of three plausible ideas:

1. Extreme transparency (i.e. increase acting difficulty). If we attach microphones and cameras to all the main researchers and have a team of a hundred volunteers go over every minute of recordings every day, it would be hard to conceal any deceptive plans.

2. Very good lie detection (i.e. find out what's in people's minds). While current lie detector technologies are likely to be inadequate, it seems possible that by the time the issue becomes relevant we might have much better technology. That said, it might also be the case that if someone keeps simulating having a particular goal set for decades, she may be able to easily generate responses based on that goal set without the responses registering as lies in any conventional sense.

3. Some sort of organisational solution. I'm not sure what this would look like, to be honest, but it seems like some organisation patterns should be more likely to produce friendly outcomes than others independently of transparency. For example, if the essential team is huge and organised so different members check each others' work, it's likely to be more difficult to conceal or organise any conspiracy (although there would be other problems).

What I think would not work is letting the group write all the code and then checking it over. First, there are likely to be enough talented programmers in it that they'd have a good shot at hiding whatever they wanted, and, second, it's not clear that it would be possible to stop the group from just changing some key bits at the last moment (e.g. from the CEV of the humans in the world to the CEV of the humans in the room) and then launching without further oversight.

 

As I said, this is by no means a priority, but I think it would be useful to start thinking about the problems sooner rather than later, so we don't end up being caught off guard. Also, while I specifically mention Eliezer in this post, he simply happens to provide the most salient example, and most points are equally (or in some cases more) relevant to the general case of anyone working on AGI.

 

† I probably picked those numbers in large part to justify my above "15%", but you get the idea.


28 comments

Extreme transparency increases the probability of government intervention at some late stage of the game (attempting to create a friendly AI constitutes attempting to create an extremely powerful weapon).

See what I did there? I changed the genre from sci-fi to political thriller.


So if the awareness that MIRI is working on an AGI in secret (or rather, without sharing details) happens to get to the government, and they consider it a powerful weapon as you say...then what? You know what they do to grandiose visionaries working on powerful weapons in their backyard who explicitly don't want to share, and whose goals pretty clearly compromise their position, right?

Related worry that I've been meaning to ask about for a while:

Given that there is still plenty of controversy over which types of unusual human minds to consider "pathological" instead of just rare variants, how is MIRI planning to decide which ones are included in CEV? My skin in the game: I'm one of the Autistic Spectrum people who feel like "curing my autism" would make me into a different person whom I don't care about. I'm still transhumanist; I still want intelligence enhancements, external boosts to my executive function and sensory processing on demand, and the ability to override the nastiest of my brain chemistry. But even with all of that I would still know myself as very different from neurotypicals. I naturally see the world in different categories than most people do, and I don't think in anything like words or a normal human language. Maybe more relevantly, I have a far higher tolerance for sphexishness, even a need for it, than most people of comparable intelligence to me.

Fun theory for me would be a little different, and I think that there really are a lot of people who would consider what I did with eternity to be somewhat sad and pathetic, maybe even horrifying. I think it could be an empathic uncanny valley effect or just an actual basic drive people have, to make everybody be the same. I'm worried that this could be an actual terminal value for some people that would hold up under reflective equilibrium.

I'm not too freaked out, because I think the consensus is that since Autistic people already exist and some are happy, we should have a right to continue to exist and even make more of ourselves. But I actually believe that if we didn't exist it would be right to create us, and I worry that most neurotypicals' extrapolated volition would not create all the other variations on human minds that should exist but don't yet.

If it matters, up to $1000 for MIRI this year could be at stake in answering this concern. I say this in a blatant and open effort to incentivize Eliezer etc. to answer me. I hope that I'm not out of line for waving money around like this, because this really is a big part of my choice about whether FAI is good enough. I really want to give what I can to prevent existential threats, but I consider a singularity overly dominated by neurotypicals to be a shriek.

If it matters, up to $1000 for MIRI this year could be at stake in answering this concern. I say this in a blatant and open effort to incentivize Eliezer etc. to answer me. I hope that I'm not out of line for waving money around like this, because this really is a big part of my choice about whether FAI is good enough.

Not out of line at all. You are encouraged to use economics like that.

Did MIRI answer you? I would expect them to have answered by now, and I'm curious about the answer.

The other problem with checking the code is that an FAI's Friendliness content is also going to consist significantly or mostly of things the FAI has learned, in its own cognitive representation. Keeping these cognitive representations transparent is going to be an important issue, but basically you'd have to trust that the tool (and possibly AI skill) somebody tells you translates the cognitive content really does so, and that the AI is answering questions honestly.

The main reason this isn't completely hopeless for external assurance (by a trusted party, i.e., they have to be trusted not to destroy the world or start a competing project using gleaned insights) is that the FAI team can be expected to spend effort on maintaining their own assurance of Friendliness, and their own ability to be assured that goal-system content is transparent. Still, we're not talking about anything nearly as easy as checking the code to see if the EVIL variable is set to 1.

have a team of a hundred volunteers

Or, as they came to be known later, the hundred Grand Moffs.

While I am amused by the idea, in practice, I'm not sure it's possible to keep a conspiracy that large from leaking in modern times. Also, if more volunteers are readily accepted, it would be impractical to bribe them all.

Don't worry. If the Eliezer conspiracy fails and one of the Grand Moffs betrays them, a year later the Hanson upload clan will successfully coordinate a take-over. After all, their coordination is evolutionarily optimal.

How important is averting God Emperor Yudkowsky given that it's an insanely powerful AI, which would lead to a much, much more benign and pleasant utopia than the (imo highly questionable) Fnargl? Much better than Wireheads for the Wirehead God.

I've actually thought, for a while, that Obedient AI might be better than the Strict Utilitarian Optimization models preferred by LW.

I think you're assigning too little weight to "provably friendly", but this is pretty funny.

How much would you be willing to wager that you will be able to follow the proof of friendly for the specific AI which gets implemented?

Very little. I don't like my odds. If Eliezer has provable friendliness theorems but not an AI, it's in his and everyone's interest to distribute the generalized theorem to everyone possible so that anyone working on recursive AGI has a chance to make it friendly, which means the algorithms will be checked by many, publicly. If Eliezer has the theorems and an AI ready to implement, there's nothing I can do about it at all. So why worry?

You could Trust Eliezer and everyone who checked his theorems and then put money towards the supercomputer cluster which is testing every string of bits for 'friendly AI'.

In fact, at that point that project should become humanity's only notable priority.

You could Trust Eliezer and everyone who checked his theorems and then put money towards the supercomputer cluster which is testing every string of bits for 'friendly AI'.

In fact, at that point that project should become humanity's only notable priority.

Probably not. That project does not sound efficient (to the extent that it sounds unfeasible).

Does the expected return on "this has a one in 2^(10^10) chance of generating a proven friendly AI" lose to anything else in terms of expected result? If so, we should be doing that to the exclusion of FAI research already.

Does the expected return on "this has a one in 2^(10^10) chance of generating a proven friendly AI" lose to anything else in terms of expected result?

Yes. It loses to any sane strategy of finding an FAI that is not "testing every string of bits".

Testing every string of bits is a nice 'proof of concept' translation of "can prove a string of bits is an FAI if it in fact is an FAI" to "actually have an FAI". It just isn't one that works in practice unless, say, you have a halting oracle in your pocket.

If I had a halting oracle, I would have AGI.

By your logic, testing an actual AI against the proofs of friendliness isn't worth the expense, because there might be an inconclusive result.

If I had a halting oracle, I would have AGI.

No you wouldn't.

By your logic, testing an actual AI against the proofs of friendliness isn't worth the expense, because there might be an inconclusive result.

No, my logic doesn't claim that either.

I write a program which recognizes AGI and halts when it finds one. It tests all programs smaller than 1 GB. If the oracle says it halts, I narrow the search space; if it says it doesn't, I expand the search space. Each search takes only the time of one halting-oracle query.
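In rough Python, this is the shape of the scheme; halting_oracle, recognizes_agi, and the brute-force enumerator are all stubs standing in for things nobody actually has (a constant-time halting oracle and the generalized friendliness proof):

```python
from itertools import product


def halting_oracle(thunk) -> bool:
    """Hypothetical constant-time halting oracle: True iff thunk() would eventually halt."""
    raise NotImplementedError  # we do not have one of these


def recognizes_agi(program: bytes) -> bool:
    """Hypothetical test from the (generalized) friendliness proof: is this bit string an AGI?"""
    raise NotImplementedError  # we do not have one of these either


def programs_up_to(max_bytes: int):
    """Every bit string of at most max_bytes bytes (astronomically many of them)."""
    for n in range(max_bytes + 1):
        for cells in product(range(256), repeat=n):
            yield bytes(cells)


def searcher(max_bytes: int):
    """Halts iff some program of at most max_bytes bytes is an AGI."""
    for program in programs_up_to(max_bytes):
        if recognizes_agi(program):
            return program   # halt: found one
    while True:              # nothing that small: never halt
        pass


def narrow_or_expand(upper_bound: int) -> int:
    """Binary-search the size bound, one oracle query per step:
    'if it halts, narrow; if it doesn't, expand.'"""
    lo, hi = 0, upper_bound
    while lo < hi:
        mid = (lo + hi) // 2
        if halting_oracle(lambda: searcher(mid)):
            hi = mid         # an AGI exists at <= mid bytes: narrow the search space
        else:
            lo = mid + 1     # none that small: expand
    return lo
```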

We already have a way to prove that a string of bits is FAI; it is trivial to generalize that to AI.

"I have a string of bits. I do not know if it is FAI or not, but I have a way of testing that is perfectly specific and very sensitive. The odds of an AI researcher writing a FAI are vanishingly small. Should I test it if it was written by an AI researcher? Should I test it if it isn't?

We already have a way to prove that a string of bits is FAI; it is trivial to generalize that to AI.

Yes, if you have that proof and a halting oracle then you have AGI. That is vastly different to the claim "If I had a halting oracle I would have an AGI".

That proof was a prerequisite for the idea to try the brute force method of finding the FAI.

Someone could also set an end condition that only an AGI is likely to ever reach and use iterated UTMs to find a program that reaches the desired end condition. I'm not sure how to figure out a priori what someone smarter than me would do and test for it, but it isn't the first thing I would do with a constant-time halting oracle. The first thing I would do is test a program that derives further implications of first-order logic and halts when it finds a contradiction. Then I would add the negation of the Gödel statement and test that program.
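A rough sketch of those two experiments, with halting_oracle and consequences_of again being stubs for things nobody actually has:

```python
def halting_oracle(thunk) -> bool:
    """Hypothetical constant-time halting oracle: True iff thunk() would eventually halt."""
    raise NotImplementedError  # we do not have one


def consequences_of(axioms):
    """Hypothetical enumerator of everything the axioms prove, one theorem at a time."""
    raise NotImplementedError  # nor this, as a neatly packaged generator


def derives_contradiction(axioms):
    """Halts iff the axioms prove a contradiction, i.e. iff the theory is inconsistent."""
    for theorem in consequences_of(axioms):
        if theorem == "0 = 1":  # stand-in for deriving an outright contradiction
            return theorem
    # a consistent theory has endlessly many consequences, so this never returns


def consistency_of(axioms) -> bool:
    """One oracle query answers 'is this theory consistent?'; adding the negation of the
    theory's Gödel statement to the axioms and asking again is the second experiment."""
    return not halting_oracle(lambda: derives_contradiction(axioms))
```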

I usually assume "provably friendly" means "will provably optimise for complex human-like values correctly" and thus includes both actual humanity-wide values and one person's values (and the two options can plausibly be switched between at a late stage of the design process).

And, well, I meant for it to be a little funny, so I'll take that as a win!

Friendly means something like "will optimize for the appropriate complex human-like values correctly."

Saying "we don't have clear criteria for appropriate human values" is just another way of saying that defining Friendly is hard.

Provably Friendly means we have a mathematical proof that an AI will be Friendly before we start running the AI.

An AI that gives its designer ultimate power over humanity is almost certainly not Friendly, even if it was Provably designer-godlike-powers implementing.

How do you define "appropriate"? It seems a little circular. Friendly AI is AI that optimises for appropriate values, and appropriate values are the ones for which we'd want a Friendly AI to optimise.

You might say that "appropriate" values are ones which "we" would like to see the future optimised towards, but I think whether these even exist humanity-wide is an open question (and I'm leaning towards "no"), in which case you should probably have a contingency definition for what to do if they, in fact, do not.

I would also be shocked if there were a "provable" definition of "appropriate" (as opposed to the friendliness of the program being provable with respect to some definition of "appropriate").

That said, it might also be the case that if someone keeps simulating having a particular goal set for decades, she may be able to easily generate responses based on that goal set without the responses registering as lies in any conventional sense.

I think it's at least as probable that the "simulated" goal set will become the bulk of the real one. That's if one single person starts out with the disguised plan. A conspiracy might be more stable, but also more likely to be revealed before completion.

I think it's at least as probable that the "simulated" goal set will become the bulk of the real one.

It's possible, but I think if you're clever about it, you can precommit yourself to reverting to your original preferences once the objective is within your grasp. Every time you accomplish something good or endure something bad, you can tell yourself "All this is for my ultimate goal!" Then when you can actually get either your ultimate goal or your pretend ultimate goal, you'll be able to think to yourself "Remember all those things I did for this? Can't betray them now!" Or some other analogous plan. But I would agree that the trade-off here is probably having an easier time pretending vs having greater fidelity to your original goal set. Probably.

One possibility to prevent the god-emperor scenario is for multiple teams to simultaneously implement and turn on their own best efforts at FAI. All the teams should check all the other teams' FAI candidates, and nothing should be turned on until all the teams think it's safe. The first thing the new FAIs should do is compare their goals with each other and terminate all instances immediately if it looks like there are any incompatible goals.

One weakness is that most teams might blindly accept the most competent team's submission, especially if that team is vastly more competent. Breaking a competent team up may reduce that risk but would also reduce the likelihood of successful FAI. Another weakness is that perhaps multiple teams implementing an FAI will always produce slightly different goals that will cause immediate termination of the FAI instances. There is always the increasing risk over time of a third party (or one of the FAI teams accidentally) turning on uFAI, too.