The value achievement dilemma is a way of framing the value alignment problem in a larger context. It emphasizes both that there might be possible solutions besides AI, and that any such solution must meet a high bar of potency or efficacy: it must resolve our basic dilemmas the way a sufficiently value-aligned and cognitively powerful AI could resolve them.
The point of value alignment can be seen in the suggestion by Eliezer Yudkowsky
and Nick Bostrom that we can see Earth-originating intelligent life as having two possible stable states, superintelligence and extinction. If intelligent life goes extinct, especially if it damages or destroys the ecosphere, it will probably stay extinct. If it becomes superintelligent, it will presumably expand through the universe and stay superintelligent for as long as physically possible. Eventually, our civilization is bound to wander into one of these attractors or the other. Furthermore, by the Gandhi stability argument, any sufficiently advanced cognitive agent will be stable in its motivations or preference framework. So if and when life wanders into the superintelligence attractor, it will either end up in a stable state of, e.g., fun-loving or the reflective equilibrium of its creators' civilization, and hence achieve lots of value; or it will go on maximizing paperclips forever.

Among the dilemmas we face in getting into the high-value-achieving attractor, rather than the extinction attractor or the equivalence class of paperclip maximizers, are:
Other events or capacities seem like they could prompt entry into the high-value-achieving superintelligence attractor:
On the other hand, consider someone who proposes that "Rather than building AIs that act in the world, we should build Oracle AIs that just answer questions," and who then, after further exposure to the concept of the AI-Box Experiment and cognitive uncontainability, narrows their specification: an Oracle running in three layers of sandboxed simulation must output only formal proofs of given theorems in Zermelo-Fraenkel set theory, and a heavily sandboxed, provably correct verifier will look over the output proof and signal 1 if it proves the target theorem and 0 otherwise, doing so at some fixed time to avoid timing attacks. This is meant to ensure that the Oracle can only influence a single binary bit of our world in any predictable way.
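To make the intended information bottleneck concrete, here is a minimal sketch of what the verifier's one-bit, fixed-time interface could look like. This is purely illustrative: `check_zf_proof` is a hypothetical stand-in for a formally verified ZF proof checker (assumed rather than implemented), and the wiring below is one guess at the arrangement described above, not a specification from the proposal.

```python
# Illustrative sketch only (hypothetical names): a verifier that releases
# exactly one bit, at a pre-agreed time, about an untrusted proof produced
# by the sandboxed Oracle.

import time


def check_zf_proof(target_theorem: str, proof: bytes) -> bool:
    """Stand-in for a provably correct ZF proof checker (assumed, not implemented here)."""
    raise NotImplementedError


def one_bit_verdict(target_theorem: str, untrusted_proof: bytes,
                    release_time: float) -> int:
    """Return 1 if the Oracle's output proves the target theorem, else 0.

    The bit is released only at `release_time`, regardless of how long
    verification actually took, so that the Oracle cannot smuggle extra
    information out through timing.
    """
    try:
        ok = check_zf_proof(target_theorem, untrusted_proof)
    except Exception:
        ok = False  # malformed or hostile output counts as a failed proof
    # Hold the single-bit result until the pre-agreed release time.
    time.sleep(max(0.0, release_time - time.time()))
    return 1 if ok else 0
```

On this arrangement, the only causal channel from the Oracle to the outside world that is meant to survive is the single bit returned by `one_bit_verdict`, delivered at a time the Oracle does not control.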
Barring operation through truly unexpected causal domains, this set of precautions may plausibly constrain even a superintelligence to the point where it can no longer plan any rich causal influence on our world, though it can still refuse to prove certain theorems that are in fact provable.
But it doesn't resolve the larger value achievement dilemma, because there's no obvious thing we can do with this proof Oracle that resolves our larger problems. There's no plan that would save the world if only we could take some suspected theorems of ZF and learn which of them had formal proofs (without being able to inspect the proofs).
On the other hand, suppose someone proposes, as an intended Relevant Limited AI, a non-self-modifying Genie agent that is only allowed to model non-cognitive material systems and has been constructed and injuncted not to model other agents, whether those agents are human or other hypothetical minds.
If we buy that this degree of restriction is pragmatically possible and makes the safe construction of the Genie easier, we next have the question of whether we can use a Genie like this to do anything strongly relevant to our larger dilemma - anything that changes the nature of the game.
An obvious target strategy for a limited Genie is to ask it to create nanotechnology and use that tech to gently shut down all other AI projects, e.g. by copying the software and then sealing the hardware. But this putative Genie can't model agents, so it may not be able to single out potential AI projects in the first place. We could use such a Genie to build nanotechnology and then heal the sick or create lots of food for the hungry, but while this is a conventional good, we haven't yet identified any path to victory that stops other projects from building AI, or that lets us create intelligence-enhanced humans (unless this can be done without modeling human minds or other agents at all).
When we consider the strategy "Build a Genie that can only model material problems but not other agents, and set it to create nanotechnology that is used to synthesize food", executing this strategy would still leave us facing the value achievement dilemma, even if some of the hungry had been fed. No fix to the larger problem, no way out of the hole, would yet have been suggested.
On the other hand, suppose that we have previously considered the case of a "Genie that can only model material problems but not other agents, smart enough to create nanotechnology but not to model and fool its programmers" and made the case for this as a plausible development target whose use restrictions actually make safe development easier. In this case we are just looking for any relevant use of a Genie like that. Suppose someone suggests using the Genie to melt all computers (except the Genie's own), while providing synthesized food and nanotechnological medical assistance, and taking other steps to avoid casualties from the resultant disruption to human civilization - effectively putting the rest of human civilization on hold. This doesn't get us all the way to a happy intergalactic civilization, but it creates space and time for further value alignment work on more sophisticated AIs, while (putatively) having averted all of the killer problems that would usually prevent us from taking our time.
Success at this strategy would leave us in a situation with a clear path to victory, which would be for the creators of the original Genie to work, at their leisure and without fearing interruption by outside existential catastrophe, on more advanced Genies that could carry out human intelligence enhancement, or on human uploads, or even on a Friendly Sovereign if such could be made trustworthy. With the time pressure removed, we would have an excellent chance of solving these problems if any humanly achievable solution to any one of them existed.
If we accept the above argument and agree that at least one relevant strategy would putatively exist for "a Genie that could only carry out material tasks but not model other agents", then a Genie like this would be a Relevant Limited AI, unlike the proof Oracle, which is limited but not relevant.
The thrust of considering a larger 'value achievement dilemma' is that while imaginable alternatives to value-aligned AI exist, as well as imaginable Limited AIs, they must pass a double test: the proposal must be pragmatically achievable, with its restrictions actually making safe development easier; and its success must open a clear path to victory on the larger dilemma.
Any strategy whose success does not putatively open a clear path to victory is not a plausible policy alternative to trying to solve the value alignment problem, or to doing something else whose success would leave us a clear path to victory. Trying to solve the value alignment problem is intended to leave us a clear path to achieving almost all of the achievable value for the future and its astronomical stakes. Anything that doesn't open a clear path to getting there is not an alternative solution for getting there.