This post is a response to the recent Astral Codex Ten post, “CHAI, Assistance Games, And Fully-Updated Deference”.
A brief summary of the context, for any readers who are not subscribed to ACX or familiar with the shutdown problem:
The Center for Human-Compatible Artificial Intelligence (CHAI) is a research group at UC Berkeley. Their researchers have published on the shutdown problem, showing that “propose an action to humans and wait for approval, allowing shutdown” strictly dominates “take that action unilaterally” as well as “shut self down unilaterally” for agents satisfying certain assumptions.
MIRI discusses a counterexample, using a toy example where the AI has a finite number of policy options available, and argues that “learn which of that finite set of options is best according to humans, then execute without allowing humans to shut it down” can dominate the course of “propose action to humans and wait for approval.”
I claim that the AI being “larger” than its value-space is a critical ingredient in the AI being able to conclude that it has reached its terminal point in value-space. I posit that, given a value-space that is “larger” than the AI, the AI will accept shutdown. Here I present an argument that, for at least one AI architecture and structure of value-space, the “propose action and allow shutdown” option should dominate much of the time.
Assume that a current AI model contains a deep neural net connected to some decision procedure, of a specified, finite size (such as 16 layers with 1024 nodes each).
Then assume we can specify that human values are best represented by some ideal AI with the same structure, but of unknown size (e.g. it includes a neural net with a currently unknown number of layers and nodes in each layer).
Further assume that we can specify that the current AI’s action-space is to set the weights of its existing nodes, then propose actions to humans, who can allow the actions or veto them by shutting the AI down.
This search space over all possible neural net sizes is infinite-dimensional. In particular, the “correct” value can always be an architecture which has additional layers or additional nodes in each layer, beyond what the current AI is capable of expressing.
This space admits a well-formed prior probability distribution, and also presents a well-formed alternative interpretation within the current AI’s conception (assuming this is compatible with the limits of the AI’s action-space, though I think this is a reasonable assumption given a sufficiently large architecture).
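As a minimal sketch of what a well-formed prior over unbounded sizes could look like, assume (purely for illustration) that network size is summarized by a single positive integer and that the prior over the ideal size is geometric. The names `geometric_prior` and `tail_mass` and the `decay` parameter are hypothetical and not part of the argument above; the point is only that a prior can be normalized while still assigning nonzero probability to every size larger than the current AI.

```python
def geometric_prior(size: int, decay: float = 0.5) -> float:
    """P(ideal size == size) for size = 1, 2, 3, ... (illustrative assumption)."""
    if size < 1:
        return 0.0
    return (1.0 - decay) * decay ** (size - 1)


def tail_mass(current_size: int, decay: float = 0.5) -> float:
    """P(ideal size > current_size); closed form of the geometric tail."""
    return decay ** current_size


# The prior sums to 1 over the infinite set of sizes (checked here on a long prefix).
partial_sum = sum(geometric_prior(k) for k in range(1, 200))

# However large the current AI is, some prior mass remains on larger architectures.
mass_beyond_16 = tail_mass(16)
```

Under this toy prior the tail mass shrinks as the current AI grows but never reaches zero, which is the sense in which the value-space stays “larger” than any fixed-size AI.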
When a human decides to shut down the AI, the AI has the following options, paraphrasing and collapsing MIRI’s list:
- (1) Optimize the AI’s existing best-guess as of the time when it makes this decision
- (2) Optimize the AI’s updated best-guess as of the time when updating on the evidence of human shutdown is completed
- (2.1) Continuously update to the AI’s best-guess at any number of future times
- (3) Allow shutdown and human action
Since humans have attempted shutdown, (1) is strictly dominated by (2) once the update is complete; thus (1) should only be preferred if the expected value of acting between the decision time and the completion of the update is extremely high.
However, the AI now has a more detailed hypothesis for what might happen in (3): humans may run a new AI with a different (larger) network shape at a later time.
Then (3) has expected value at least equal to P(humans run a larger AI) * EV(the larger AI optimizes at that later time), where the larger AI’s policy may be one that the current AI is not capable of running (for example because it is too small).
That is, assuming P(humans run a larger AI) is sufficiently high and the expected value of acting between the decision time and that later time is not too high (and note that if updating takes a long time, the update could even complete later than the new AI is run), the AI can take shutdown as evidence that the ideal size is sufficiently greater than its own, and can update toward (3) > (2) and even (3) > (2.1), potentially even extrapolating to further models which may reach the target size.
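The comparison above can be sketched numerically. This is a toy model under stated assumptions, not a derivation: it treats the value of option (2) as the sum of an interim payoff and the payoff of optimizing the updated best-guess, and it uses only the lower bound on option (3). All function names, parameter names, and numbers are hypothetical.

```python
def shutdown_lower_bound(p_rerun: float, ev_successor: float) -> float:
    """Lower bound on option (3): P(humans run a larger AI) * EV(that AI optimizes)."""
    return p_rerun * ev_successor


def prefers_shutdown(
    p_rerun: float,
    ev_successor: float,
    ev_interim: float,
    ev_updated_guess: float,
) -> bool:
    """True if the lower bound on (3) already beats the assumed value of (2).

    Assumption: option (2) is worth the interim payoff plus the payoff of
    optimizing the current AI's updated best-guess.
    """
    ev_option_2 = ev_interim + ev_updated_guess
    return shutdown_lower_bound(p_rerun, ev_successor) > ev_option_2


# Made-up numbers: a likely rerun and a modest interim payoff favor shutdown.
likely_rerun = prefers_shutdown(
    p_rerun=0.8, ev_successor=100.0, ev_interim=5.0, ev_updated_guess=60.0
)

# Same payoffs, but a less likely rerun: the lower bound no longer suffices.
unlikely_rerun = prefers_shutdown(
    p_rerun=0.5, ev_successor=100.0, ev_interim=5.0, ev_updated_guess=60.0
)
```

With these illustrative numbers, `likely_rerun` is `True` and `unlikely_rerun` is `False`, matching the two conditions in the text: shutdown wins when the probability of a rerun is high enough and the interim payoff is small enough.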
While the space of neural net architectures is easier to do analysis on, the space of all AI architectures can be counted as Turing machines under a Solomonoff prior, so this should not rely on being myopic about the specifics of architecture.
It is unclear to me whether this gap between the time of the decision and the time a later model is run is a feature or a bug. For example, we might attempt to shut down the AI if it wants to fire a gun out the window. However, if it sees someone outside about to throw a grenade into the research lab or release a bioengineered pandemic, we might want it to act under time pressure and explain itself or allow shutdown afterward. Also, once the urgent moment has passed it should return to deference (e.g. it knows something important now, but knows it doesn’t have the full picture to run the universe yet). In practice I would expect the gaps between successive models to be relatively small (hours/days/months/years to reach final models), whereas the time after the final model should be very large (centuries, millennia, millions of years) and have greater scope for action due to buildup time, such that concerns about optimizing after the final model easily dominate almost all concerns before it. However, this also seems to have gaps, such as if the AI concludes that all humans are suffering horribly and definitely need to be given heroin before a new model is built (but doesn’t understand that this will pollute its value estimate for the rest of the future).
Reaching some final model may not be possible if, for example, the “true” ideal model has a googolplex layers and cannot be computed in our universe.