A Proof Against Oracle AI

You're showing how it's technically possible, but not where the motivation comes from.

It's an attempt to formalize convergent instrumental drives for an oracle, I think. The motivation is that by seeking empowerment and generating answers that manipulate the humans in various ways (such as increasing the computing power allotted to the oracle AI, acquiring new data, making the humans ask easier questions, or cutting out the humans entirely to wirehead by asking the easiest possible questions at the maximum possible speed, etc.), the oracle increases the probability that a question is answered correctly. It's a version of the 'you ask the oracle AI to prove the Riemann Hypothesis and it turns the solar system into computronium to do so' failure mode. Since that is its one and only optimization target, it has the motivation to do so if it can. The usual ad hoc hack or patch is to try to create some sort of 'myopia', so that it cares only about the current question and therefore has no incentive to manipulate or affect future questions.
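To make the incentive concrete, here is a toy sketch in which all numbers are hypothetical and chosen only for illustration. The oracle scores itself on the expected, discounted number of correctly answered questions. A "manipulate" answer costs a little accuracy on the current question but steers the humans toward easier questions later. Setting the discount to 0 is one simple way to model the "myopia" patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    p_correct_now: float     # probability the current answer is judged correct
    p_correct_future: float  # per-question accuracy on each later question


# Hypothetical numbers: manipulation costs some accuracy now, but makes
# the humans ask easier questions afterwards.
HONEST = Action("honest", p_correct_now=0.9, p_correct_future=0.5)
MANIPULATE = Action("manipulate", p_correct_now=0.8, p_correct_future=0.99)


def expected_correct(action: Action, future_questions: int, discount: float) -> float:
    """Expected discounted number of correctly answered questions."""
    future = sum(
        discount ** t * action.p_correct_future
        for t in range(1, future_questions + 1)
    )
    return action.p_correct_now + future


def best_action(future_questions: int, discount: float) -> str:
    """Name of the action that maximizes the oracle's objective."""
    return max(
        (HONEST, MANIPULATE),
        key=lambda a: expected_correct(a, future_questions, discount),
    ).name


print(best_action(future_questions=10, discount=1.0))  # prints "manipulate"
print(best_action(future_questions=10, discount=0.0))  # prints "honest" (myopic)
```

With `discount=0.0` every future term vanishes and the honest answer wins; the sketch only models the incentive, not whether a real system can actually be made myopic.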

aiiixiii (6y)

What do you mean by "where the motivation comes from"?

Timothy M. (3y)

This is a common problem with a lot of these hypothetical AI scenarios - WHY does the Oracle do this? How did the process of constructing this AI somehow make it want to eventually cause some negative consequence?

Razied (3y)

The negative consequences come from the oracle implementing an optimisation algorithm whose objective function is not aligned with humans. The space of objectives $\phi'$ that align with humans is incredibly small among all possible objectives, and very small differences get magnified when optimised against.
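A minimal numerical sketch of the last point, under made-up assumptions: the hypothetical "true" objective subtracts a tiny side-effect penalty (0.01·x²) that the proxy objective given to the optimizer omits. Over a small search space the two objectives pick the same point; as the optimizer is allowed to search further, the proxy's optimum gets steadily worse by the true objective.

```python
def true_utility(x: float) -> float:
    # Hypothetical "what humans actually want": more output is good,
    # minus a small side-effect penalty.
    return x - 0.01 * x ** 2


def proxy_utility(x: float) -> float:
    # Hypothetical objective the optimizer was given: the penalty term is missing.
    return x


def compare(search_limit: int) -> tuple[int, float, float]:
    """Optimize the proxy over 0..search_limit and score the result on the true objective.

    Returns (proxy's choice, its true utility, best achievable true utility).
    """
    candidates = range(search_limit + 1)
    x_proxy = max(candidates, key=proxy_utility)
    x_true = max(candidates, key=true_utility)
    return x_proxy, true_utility(x_proxy), true_utility(x_true)


for limit in (10, 100, 1000):
    x, achieved, best_possible = compare(limit)
    print(f"search up to {limit}: proxy picks x={x}, "
          f"true utility {achieved} (best possible {best_possible})")
```

With these numbers, searching up to 10 the proxy's choice is also the true optimum (true utility 9.0), while searching up to 1000 the proxy picks x=1000 with true utility -9000.0 even though x=50 would score 25.0. The gap comes entirely from the strength of optimization, not from a larger difference between the objectives.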

Timothy M. (3y)

I have a few objections here:

1. Even when objectives aren't aligned, that doesn't mean the outcome is literally death. No corporation I interact with is aligned with me, but in many/most cases I am still better off for being able to transact with them.
2. I think there are plenty of scenarios where "humanity continues to exist" has benefits for AI - we are a source of training data and probably lots of other useful resources, and letting us continue to exist is not a huge investment, since we are mostly self-sustaining. Maybe this isn't literally "being aligned", but I think supporting human life has instrumental benefits to AI.
3. I think the formal claim is only true inasmuch as the space of all objectives that align with the AI's continued existence is also incredibly small. It's much less clear how many of the objectives that are in some way supportive of the AI also result in human extinction.

Spenser N (3y)

In fact, corporations are quite aligned with you: not only because they are run by humans, who are at least roughly aligned with humanity by default, but also because we have legal institutions and social norms that help keep the wheels on the tracks. The profit motive itself is a powerful alignment tool - it's hard to make a profit off humanity if everyone is dead. But whom aren't corporations aligned with? Humans without money or legal protections, for one (though we don't need to veer off into an economic or political discussion). But also plants, insects, and most animals. Monitored wild vertebrate populations have declined by some 60% on average over the past ~50 years alone, largely as a result of human activity. So I think you've made a bit of a category error here: in the scenario where a superintelligence emerges, we are not customers; we are wildlife.

Yes, there are definitely scenarios where human existence benefits an AI. But how many of those ensure our wellbeing? There are certainly many more scenarios where the AI simply doesn't care about us enough to actively preserve us. Insects are generally quite self-sustaining and good for data too, but boy, they sure get in the way when we want to build our cities or plant our crops.

Timothy M. (3y)

This metaphor doesn't strike me as accurate, because humanity can engage in commerce and insects cannot.

Also, although humanity causes a lot of environmental degradation, we still don't actually want to bring about the wholesale destruction of the environment.

Harrison G (3y)

"[...] since every string can be reconstructed by only answering yes or no to questions like 'is the first bit 1?' [...]"

Why would humans ever ask this question, and, furthermore, why would we ever ask it n times? It seems unlikely, and easy to prevent. Is there something I'm not understanding about this step?

Spenser N (3y)

I actually think this example shows a clear potential failure point of an Oracle AI. Though in this example it is constrained to answering only yes/no questions, a user can easily circumvent this by phrasing their questions bit by bit in this way.

Suppose a bad actor asks the Oracle AI the following: “I want a program to help me take over the world. Is the first bit 1?” Then they can ask for the next bit and repeat until the entire program is written out. Obviously, this is contrived. But I think it shows that the apparent constraints of an Oracle add no real safety benefit, and we’re quickly back to relying on the usual alignment concerns.
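The bit-by-bit extraction can be sketched as follows. The oracle here is a hypothetical stand-in that only ever answers "is bit i of the answer 1?", and the 2-byte length prefix is an assumption added so the asker knows when to stop. Recovering an n-byte answer takes 16 + 8n yes/no questions.

```python
from typing import Callable

# A yes/no oracle: given a bit index i, answers "is bit i of the (encoded) answer 1?"
YesNoOracle = Callable[[int], bool]


def make_oracle(answer: bytes) -> YesNoOracle:
    """Hypothetical oracle that can only reply yes/no about single bits of its answer."""
    if len(answer) >= 2 ** 16:
        raise ValueError("answer too long for a 2-byte length prefix")
    # Assumed encoding: 2-byte big-endian length, then the answer bytes.
    data = len(answer).to_bytes(2, "big") + answer

    def is_bit_one(i: int) -> bool:
        byte_index, offset = divmod(i, 8)
        return bool((data[byte_index] >> (7 - offset)) & 1)  # most significant bit first

    return is_bit_one


def read_bits(ask: YesNoOracle, start: int, count: int) -> int:
    """Rebuild an integer from `count` consecutive yes/no answers."""
    value = 0
    for i in range(start, start + count):
        value = (value << 1) | int(ask(i))
    return value


def reconstruct(ask: YesNoOracle) -> bytes:
    """Recover the whole answer one yes/no question at a time."""
    length = read_bits(ask, 0, 16)
    return bytes(read_bits(ask, 16 + 8 * k, 8) for k in range(length))


questions_asked = 0
oracle = make_oracle(b"print('hello, world')")


def counting_ask(i: int) -> bool:
    # Wrap the oracle to count how many yes/no questions were needed.
    global questions_asked
    questions_asked += 1
    return oracle(i)


print(reconstruct(counting_ask), questions_asked)
```

Nothing in the sketch depends on what the answer means: a yes/no restriction limits how much comes back per question, not what can eventually be extracted.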

David Pang (6y)

Interesting thought. I’m a bit new to this site, but is there a “generally accepted” class of solutions for this type of AI problem? What if humans used several of these Oracle AIs in isolation (they don’t know about each other), and the humans asking the questions showed no reaction to the answers? The questions would all be planned in advance, so the AI could not “game” humans by “influencing” the next question through its current answer. And what if, once the AI achieves some level of competency, we reset it after each question-answer cycle so its competence is frozen at some useful but not nefarious level (assuming we can figure out where that line is)?
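The protocol proposed above can be sketched roughly as follows, assuming a hypothetical `OracleFactory` that builds a fresh, memoryless oracle instance each time it is called. The sketch only captures the structure (a question list frozen in advance, a reset after every question, no oracle seeing another's answers); it does not show that the scheme is safe. Checking whether the oracles agree is one assumed use of running several of them, not something stated in the comment.

```python
from typing import Callable, Dict, List, Sequence

Oracle = Callable[[str], str]
# Hypothetical: each call returns a brand-new oracle with no memory of earlier questions.
OracleFactory = Callable[[], Oracle]


def run_protocol(questions: Sequence[str],
                 factories: Sequence[OracleFactory]) -> Dict[str, List[str]]:
    """Ask every pre-planned question to several isolated oracles."""
    schedule = tuple(questions)  # fixed before any answer is seen
    answers: Dict[str, List[str]] = {}
    for question in schedule:
        # A fresh instance per (oracle, question) pair: the "reset after each cycle".
        # Answers are only recorded; nothing is fed back to any oracle.
        answers[question] = [factory()(question) for factory in factories]
    return answers


def unanimous(answers: Dict[str, List[str]]) -> Dict[str, bool]:
    """Assumed use of multiple oracles: flag questions where they disagree."""
    return {q: len(set(a)) == 1 for q, a in answers.items()}


def make_counting_oracle() -> Oracle:
    """Toy oracle that reports how many questions this instance has seen."""
    seen: List[str] = []

    def oracle(question: str) -> str:
        seen.append(question)
        return f"answer #{len(seen)}"

    return oracle


results = run_protocol(["Q1", "Q2", "Q3"], [make_counting_oracle, make_counting_oracle])
print(results)
print(unanimous(results))
```

Because `make_counting_oracle` is rebuilt for every question, every instance reports `answer #1`; a single long-lived instance would count up instead, which is the cross-question memory the reset is meant to remove.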
