Consider the following scenario. MIRI succeeds beyond my wildest expectations. It comes up with a friendliness theory, and then uses it to make provably friendly AGI before anyone else can make an unfriendly one. And then a year and a half later, we find that Eliezer Yudkowsky has become the designated god-emperor of the lightcone, and the rest of the major MIRI researchers are his ministers. Whoops.

 

My guess for the probability of this type of scenario, given a huge MIRI success along those lines, is around 15%. The reasoning is straightforward. (1) We don't know what's going on inside any particular person's head. (2) Many or most humans are selfish. (3) Looking altruistic is more likely to draw support than explicitly setting out to take over the world. (4) Human acting abilities, while limited, are likely adequate (for example, spies seem quite successful at concealing their motives). I'd say those four things are reasonably independent and jointly sufficient for some deception to be happening, so guessing at some probabilities, it works out to something like 1 × 0.5 × 0.8 × 0.5 = 0.2†, at least if the person is sufficiently determined to achieve their goal no matter what.
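Spelled out, for concreteness (the labels are just shorthand for the four conditions, and every number is nothing better than a guess):

```python
# Back-of-the-envelope version of the product above; every figure is a rough guess.
p_opaque = 1.0             # (1) we can't see inside anyone's head
p_selfish = 0.5            # (2) this particular person is selfish
p_altruistic_cover = 0.8   # (3) they'd pose as altruistic rather than grab power openly
p_acting = 0.5             # (4) their acting is good enough to sustain the front

print(p_opaque * p_selfish * p_altruistic_cover * p_acting)  # 0.2
```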

Or to put it in a more interesting and LW-referring way, it has been argued that we might not be able to distinguish friendly from unfriendly AGI during their "take-off" period, since both would probably act the same way to take over the world (before optimising it in a friendly or unfriendly way). We might also expect that as a person becomes more agenty, their behaviour more closely approximates that of an AGI, since an AGI is pretty much the embodiment of agency. It certainly seems like Eliezer is quite agenty for a human, so clearly we presently find ourselves in his take-off period!

 

So, supposing we don't want god-emperor!Eliezer, what do we do? First, I think it is important to recognise that this problem is almost certainly not urgent. Selfish-Eliezer!optimisations and altruistic-Eliezer!optimisations probably require nearly the same work to not end up being paperclips!optimisations, so for the near future it doesn't particularly matter if he's secretly out to take over the world or not. In addition, having Eliezer be the god-emperor of the lightcone is probably not all that bad—it might not be optimal, but it's likely to be very much better than any unfriendly outcome, and probably better than the status quo. It does, however, seem like a useful issue to address in the longer term for any organisation that plans to make a friendly(-ish) AGI.

 

Off-hand, I can think of three plausible ideas:

1. Extreme transparency (i.e. increase acting difficulty). If we attach microphones and cameras to all the main researchers and have a team of a hundred volunteers go over every minute of recordings every day, it would be hard to conceal any deceptive plans.

2. Very good lie detection (i.e. find out what's in people's minds). While current lie detector technologies are likely to be inadequate, it seems possible that by the time the issue becomes relevant we might have much better technology. That said, it might also be the case that if someone keeps simulating having a particular goal set for decades, she may be able to easily generate responses based on that goal set without the responses registering as lies in any conventional sense.

3. Some sort of organisational solution. I'm not sure what this would look like, to be honest, but it seems like some organisation patterns should be more likely to produce friendly outcomes than others independently of transparency. For example, if the essential team is huge and organised so different members check each others' work, it's likely to be more difficult to conceal or organise any conspiracy (although there would be other problems).

What I think would not work is letting the group write all the code and then checking it over. First, there are likely to be enough talented programmers in it that they'd have a good shot at hiding whatever they wanted, and, second, it's not clear that it would be possible to stop the group from just changing some key bits at the last moment (e.g. from the CEV of the humans in the world to the CEV of the humans in the room) and then launching without further oversight.

 

As I said, this is by no means a priority, but I think it would be useful to start thinking about the problems sooner rather than later, so we don't end up being caught off guard. Also, while I specifically mention Eliezer in this post, he simply happens to provide the most salient example, and most points are equally (or in some cases more) relevant to the general case of anyone working on AGI.

 

† I probably picked those numbers in large part to justify my above "15%", but you get the idea.


28 comments

Extreme transparency increases the probability of government intervention at some late stage of the game (attempting to create a friendly AI constitutes attempting to create an extremely powerful weapon).

See what I did there? I changed the genre from sci-fi to political thriller.


So if the awareness that MIRI is working on an AGI in secret (or rather, without sharing details) happens to get to the government, and they consider it a powerful weapon as you say...then what? You know what they do to grandiose visionaries working on powerful weapons in their backyard who explicitly don't want to share, and whose goals pretty clearly compromise their position, right?

Related worry that I've been meaning to ask about for a while:

Given that there is still plenty of controversy over which types of unusual human minds to consider "pathological" instead of just rare variants, how is MIRI planning to decide which ones are included in CEV? My skin in the game: I'm one of the Autistic Spectrum people who feel like "curing my autism" would make me into a different person whom I don't care about. I'm still transhumanist; I still want intelligence enhancements, external boosts to my executive function and sensory processing on demand, and the ability to override the nastiest of my brain chemistry. But even with all of that I would still know myself as very different from neurotypicals. I naturally see the world in different categories than most people do, and I don't think in anything like words or a normal human language. Maybe more relevantly, I have a far higher tolerance for sphexishness, even a need for it, than most people of comparable intelligence to me.

Fun theory for me would be a little different, and I think that there really are a lot of people who would consider what I did with eternity to be somewhat sad and pathetic, maybe even horrifying. I think it could be an empathic uncanny valley effect or just an actual basic drive people have, to make everybody be the same. I'm worried that this could be an actual terminal value for some people that would hold up under reflective equilibrium.

I'm not too freaked out, because I think the consensus is that since Autistic people already exist and some are happy, we should have a right to continue to exist and even make more of ourselves. But I actually believe that if we didn't exist it would be right to create us, and I worry that most neurotypicals' extrapolated volition would not create all the other variations on human minds that should exist but don't yet.

If it matters, up to $1000 for MIRI this year could be at stake in answering this concern. I say this in a blatant and open effort to incentivize Eliezer etc. to answer me. I hope that I'm not out of line for waving money around like this, because this really is a big part of my choice about whether FAI is good enough. I really want to give what I can to prevent existential threats, but I consider a singularity overly dominated by neurotypicals to be a shriek.

If it matters, up to $1000 for MIRI this year could be at stake in answering this concern. I say this in a blatant and open effort to incentivize Eliezer etc. to answer me. I hope that I'm not out of line for waving money around like this, because this really is a big part of my choice about whether FAI is good enough.

Not out of line at all. You are encouraged to use economics like that.

Did MIRI answer you? I would expect them to have answered by now, and I'm curious about the answer.

The other problem with checking the code is that an FAI's Friendliness content is also going to consist significantly or mostly of things the FAI has learned, in its own cognitive representation. Keeping these cognitive representations transparent is going to be an important issue, but basically you'd have to trust that the tool (and possibly AI skill) somebody tells you translates the cognitive content really does so, and that the AI is answering questions honestly.

The main reason this isn't completely hopeless for external assurance (by a trusted party, i.e., they have to be trusted not to destroy the world or start a competing project using gleaned insights) is that the FAI team can be expected to spend effort on maintaining their own assurance of Friendliness, and their own ability to be assured that goal-system content is transparent. Still, we're not talking about anything nearly as easy as checking the code to see if the EVIL variable is set to 1.

have a team of a hundred volunteers

Or, as they came to be known later, the hundred Grand Moffs.

While I am amused by the idea, in practice, I'm not sure it's possible to keep a conspiracy that large from leaking in modern times. Also, if more volunteers are readily accepted, it would be impractical to bribe them all.

Don't worry. If the Eliezer conspiracy fails and one of the Grand Moffs betrays them, a year later the Hanson upload clan will successfully coordinate a take-over. After all, their coordination is evolutionarily optimal.

How important is averting God Emperor Yudkowsky given that it's an insanely powerful AI, which would lead to a much, much more benign and pleasant utopia than the (imo highly questionable) Fnargl? Much better than Wireheads for the Wirehead God.

I've actually thought, for a while, that Obedient AI might be better than the Strict Utilitarian Optimization models preferred by LW.

I think you're assigning too little weight to "provably friendly", but this is pretty funny.

How much would you be willing to wager that you will be able to follow the proof of friendly for the specific AI which gets implemented?

Very little. I don't like my odds. If Eliezer has provable friendliness theorems but not an AI, it's in his and everyone's interest to distribute the generalized theorem to everyone possible so that anyone working on recursive AGI has a chance to make it friendly, which means the algorithms will be checked by many, publicly. If Eliezer has the theorems and an AI ready to implement, there's nothing I can do about it at all. So why worry?

You could Trust Eliezer and everyone who checked his theorems and then put money towards the supercomputer cluster which is testing every string of bits for 'friendly AI'.

In fact, at that point that project should become humanity's only notable priority.

You could Trust Eliezer and everyone who checked his theorems and then put money towards the supercomputer cluster which is testing every string of bits for 'friendly AI'.

In fact, at that point that project should become humanity's only notable priority.

Probably not. That project does not sound efficient (to the extent that it sounds unfeasible).

Does the expected return on "this has a one in 2^(10^10) chance of generating a proven friendly AI" lose to anything else in terms of expected result? If so, we should be doing that to the exclusion of FAI research already.

Does the expected return on "this has a one in 2^(10^10) chance of generating a proven friendly AI" lose to anything else in terms of expected result?

Yes. It loses to any sane strategy of finding an FAI that is not "testing every string of bits".

Testing every string of bits is a nice 'proof of concept' translation of "can prove a string of bits is an FAI if it in fact is an FAI" to "actually have an FAI". It just isn't one that works in practice unless, say, you have a halting oracle in your pocket.

If I had a halting oracle, I would have AGI.

By your logic, testing an actual AI against the proofs of friendliness isn't worth the expense, because there might be an inconclusive result.

If I had a halting oracle, I would have AGI.

No you wouldn't.

By your logic, testing an actual AI against the proofs of friendliness isn't worth the expense, because there might be an inconclusive result.

No, my logic doesn't claim that either.

I write a program which recognizes AGI and halts when it finds one. It tests all programs smaller than 1 GB. If the oracle says it halts, I narrow the search space; if it says it doesn't, I expand the search space. Each search takes only the time of one halting-oracle query.
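In rough Python, this is the shape of the scheme; halting_oracle, recognizes_agi, and the brute-force enumerator are all stubs standing in for things nobody actually has (a constant-time halting oracle and the generalized friendliness proof):

```python
from itertools import product


def halting_oracle(thunk) -> bool:
    """Hypothetical constant-time halting oracle: True iff thunk() would eventually halt."""
    raise NotImplementedError  # we do not have one of these


def recognizes_agi(program: bytes) -> bool:
    """Hypothetical test from the (generalized) friendliness proof: is this bit string an AGI?"""
    raise NotImplementedError  # we do not have one of these either


def programs_up_to(max_bytes: int):
    """Every bit string of at most max_bytes bytes (astronomically many of them)."""
    for n in range(max_bytes + 1):
        for cells in product(range(256), repeat=n):
            yield bytes(cells)


def searcher(max_bytes: int):
    """Halts iff some program of at most max_bytes bytes is an AGI."""
    for program in programs_up_to(max_bytes):
        if recognizes_agi(program):
            return program   # halt: found one
    while True:              # nothing that small: never halt
        pass


def narrow_or_expand(upper_bound: int) -> int:
    """Binary-search the size bound, one oracle query per step:
    'if it halts, narrow; if it doesn't, expand.'"""
    lo, hi = 0, upper_bound
    while lo < hi:
        mid = (lo + hi) // 2
        if halting_oracle(lambda: searcher(mid)):
            hi = mid         # an AGI exists at <= mid bytes: narrow the search space
        else:
            lo = mid + 1     # none that small: expand
    return lo
```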

We already have a way to prove that a string of bits is FAI; it is trivial to generalize that to AI.

"I have a string of bits. I do not know if it is FAI or not, but I have a way of testing that is perfectly specific and very sensitive. The odds of an AI researcher writing a FAI are vanishingly small. Should I test it if it was written by an AI researcher? Should I test it if it isn't?

We already have a way to prove that a string of bits is FAI; it is trivial to generalize that to AI.

Yes, if you have that proof and a halting oracle then you have AGI. That is vastly different to the claim "If I had a halting oracle I would have an AGI".

That proof was a prerequisite for the idea to try the brute force method of finding the FAI.

Someone could also set an end condition that only an AGI is likely to ever reach and use iterated UTMs to find a program that reaches the desired end condition. I'm not sure how to figure out a priori what someone smarter than me would do and test for it, but it isn't the first thing I would do with a constant-time halting oracle. The first thing I would do is test a program that derives further implications of first-order logic and halts when it finds a contradiction. Then I would add the negation of the Gödel statement and test that program.
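A rough sketch of those two experiments, with halting_oracle and consequences_of again being stubs for things nobody actually has:

```python
def halting_oracle(thunk) -> bool:
    """Hypothetical constant-time halting oracle: True iff thunk() would eventually halt."""
    raise NotImplementedError  # we do not have one


def consequences_of(axioms):
    """Hypothetical enumerator of everything the axioms prove, one theorem at a time."""
    raise NotImplementedError  # nor this, as a neatly packaged generator


def derives_contradiction(axioms):
    """Halts iff the axioms prove a contradiction, i.e. iff the theory is inconsistent."""
    for theorem in consequences_of(axioms):
        if theorem == "0 = 1":  # stand-in for deriving an outright contradiction
            return theorem
    # a consistent theory has endlessly many consequences, so this never returns


def consistency_of(axioms) -> bool:
    """One oracle query answers 'is this theory consistent?'; adding the negation of the
    theory's Gödel statement to the axioms and asking again is the second experiment."""
    return not halting_oracle(lambda: derives_contradiction(axioms))
```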

I usually assume "provably friendly" means "will provably optimise for complex human-like values correctly" and thus includes both actual humanity-wide values and one person's values (and the two options can plausibly be switched between at a late stage of the design process).

And, well, I meant for it to be a little funny, so I'll take that as a win!

Friendly means something like "will optimize for the appropriate complex human-like values correctly."

Saying "we don't have clear criteria for appropriate human values" is just another way of saying that defining Friendly is hard.

Provably Friendly means we have a mathematical proof that an AI will be Friendly before we start running the AI.

An AI that gives its designer ultimate power over humanity is almost certainly not Friendly, even if it was Provably designer-godlike-powers implementing.

How do you define "appropriate"? It seems a little circular. Friendly AI is AI that optimises for appropriate values, and appropriate values are the ones for which we'd want a Friendly AI to optimise.

You might say that "appropriate" values are ones which "we" would like to see the future optimised towards, but I think whether these even exist humanity-wide is an open question (and I'm leaning towards "no"), in which case you should probably have a contingency definition for what to do if they, in fact, do not.

I would also be shocked if there were a "provable" definition of "appropriate" (as opposed to the friendliness of the program being provable with respect to some definition of "appropriate").

That said, it might also be the case that if someone keeps simulating having a particular goal set for decades, she may be able to easily generate responses based on that goal set without the responses registering as lies in any conventional sense.

I think it's at least as probable that the "simulated" goal set will become the bulk of the real one. That's if one single person starts out with the disguised plan. A conspiracy might be more stable, but also more likely to be revealed before completion.

I think it's at least as probable that the "simulated" goal set will become the bulk of the real one.

It's possible, but I think if you're clever about it, you can precommit yourself to reverting to your original preferences once the objective is within your grasp. Every time you accomplish something good or endure something bad, you can tell yourself "All this is for my ultimate goal!" Then when you can actually get either your ultimate goal or your pretend ultimate goal, you'll be able to think to yourself "Remember all those things I did for this? Can't betray them now!" Or some other analogous plan. But I would agree that the trade-off here is probably having an easier time pretending vs having greater fidelity to your original goal set. Probably.

One possibility to prevent the god-emperor scenario is for multiple teams to simultaneously implement and turn on their own best efforts at FAI. All the teams should check all the other teams' FAI candidates, and nothing should be turned on until all the teams think it's safe. The first thing the new FAIs should do is compare their goals with each other and terminate all instances immediately if it looks like there are any incompatible goals.

One weakness is that most teams might blindly accept the most competent team's submission, especially if that team is vastly more competent. Breaking a competent team up may reduce that risk but would also reduce the likelihood of successful FAI. Another weakness is that perhaps multiple teams implementing an FAI will always produce slightly different goals that will cause immediate termination of the FAI instances. There is always the increasing risk over time of a third party (or one of the FAI teams accidentally) turning on uFAI, too.