This post is a response to the recent Astral Codex Ten post, “CHAI, Assistance Games, And Fully-Updated Deference”.

A brief summary of the context, for any readers who are not subscribed to ACX or familiar with the shutdown problem:

The Center for Human-Compatible Artificial Intelligence (CHAI) is a research group at UC Berkeley. Their researchers have published on the shutdown problem, showing that “propose an action to humans and wait for approval, allowing shutdown” strictly dominates “take that action unilaterally” as well as “shut self down unilaterally” for agents satisfying certain assumptions.
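The dominance claim can be checked numerically. The following is a minimal sketch, not CHAI's own code. It assumes the standard idealization that the human is rational and vetoes exactly when the action's true utility $U$ is negative, and that shutting down is worth 0. The function name `option_values` and the Gaussian belief are hypothetical choices made for illustration.

```python
import random

def option_values(utility_samples):
    """Expected value of each option under the AI's belief about U.

    utility_samples: draws from the AI's belief over the action's true utility.
    Assumption (not from the post): the human vetoes exactly when u < 0.
    """
    n = len(utility_samples)
    act = sum(utility_samples) / n                          # take the action unilaterally
    shut_down = 0.0                                         # shut self down: utility 0 by convention
    defer = sum(max(u, 0.0) for u in utility_samples) / n   # propose, then accept the human's veto
    return {"act": act, "shut_down": shut_down, "defer": defer}

# An uncertain AI: its belief puts weight on both good and bad outcomes.
random.seed(0)
uncertain_belief = [random.gauss(0.5, 1.0) for _ in range(10_000)]
uncertain = option_values(uncertain_belief)

# A certain AI: every sample says the action is good.
certain = option_values([2.0, 3.0])
```

Under these assumptions, `defer` beats both alternatives whenever the belief includes both positive and negative utilities, because `max(u, 0)` is at least `u` and at least `0` sample by sample. In the `certain` case the advantage vanishes and `defer` equals `act`, which is the situation the counterexample below relies on.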

MIRI discusses a counterexample: a toy case in which the AI has a finite number of policy options available. They argue that “learn which of that finite set of options is best according to humans, then execute without allowing humans to shut it down” can dominate “propose an action to humans and wait for approval.”

I claim that the AI being “larger” than its value-space is a critical ingredient in its being able to conclude that it has reached its terminal point in value-space. I posit that, given a value-space that is “larger” than the AI, the AI will accept shutdown. Here I present an argument that, for at least one AI architecture and structure of value-space, the “propose action and allow shutdown” option should dominate much of the time.

Assume that a current AI model $A$ contains a deep neural net of a specified, finite size (such as 16 layers with 1024 nodes each), connected to some decision procedure.

Then assume that human values are best represented by some ideal AI $Z$ with the same structure^[1] but of unknown size (e.g. $Z$ includes a neural net with a currently unknown number of layers and of nodes in each layer).

Further assume that $A$'s action-space is to set the weights of its existing nodes, then propose actions to humans, who can allow the actions or veto them by shutting down $A$.

This search space over all possible neural net sizes is infinite-dimensional. In particular, the “correct” value can always be an architecture which has additional layers or additional nodes in each layer, beyond what $A$ is capable of expressing.

This space admits a well-formed prior probability distribution, and also presents a well-formed alternative...
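To make the existence of such a prior concrete, here is a minimal sketch of one proper distribution over an unbounded set of sizes. It is an illustrative assumption, not the prior this post relies on: it covers only the layer count of $Z$ (not widths or weights) and uses a geometric distribution with a hypothetical decay parameter `r`. The point is only that the total mass sums to 1 while every finite cutoff, including $A$'s own size, leaves positive probability on something larger.

```python
def geometric_prior(k, r=0.9):
    """P(Z has exactly k layers) for k = 1, 2, 3, ...

    Hypothetical prior chosen for illustration; r is an assumed decay rate.
    """
    return (1 - r) * r ** (k - 1)

def mass_beyond(n_layers, r=0.9):
    """P(Z has more than n_layers layers): the closed-form geometric tail."""
    return r ** n_layers

A_LAYERS = 16  # the example size given for A above

# Probability that Z fits within A's layer count, summed term by term.
within = sum(geometric_prior(k) for k in range(1, A_LAYERS + 1))

# Probability that Z needs more layers than A has.
beyond = mass_beyond(A_LAYERS)
```

Here `within + beyond` equals 1 up to floating-point error, and `mass_beyond(n)` stays positive for any finite `n`, so under this assumed prior $A$ can never rule out that the correct architecture lies beyond what it can express.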

AI-Assisted Alignment
