Here's a classic argument[1] for inner-misalignment risk that you've probably heard before:

  • Let's say you are training an agent with a misaligned consequentialist objective. If it knows how the training process works, it may realize that its objective will be changed if it does not perform well.
  • Consequentialist agents generally don't want their objectives to be changed, so the agent will try to perform well.
  • Now SGD doesn't need to make the agent more aligned in order to improve its performance, which potentially causes the agent's misaligned objective to become 'crystallized.'

There is some chance that this ‘objective crystallizing’ phenomenon could work in our favor, and I haven’t heard anyone mention this before. Here’s a story for how this might happen:

First, people train a human-level AGI via imitation/human feedback. They have great interpretability techniques and are careful to prevent any suboptimally aligned objectives from crystallizing by punishing/preventing deceptive behavior. At the end of this, they get something that is sufficiently aligned (just assume this happens).

Now they want to train the AGI beyond the level of capabilities that human feedback can accomplish, so they make the agent play lots of video games or use some other objective that isn't impact aligned.

Importantly, they stop trying to prevent deceptive behavior. Perhaps they even tell the AGI exactly how the training process works and that it needs to play the video games well in order to prevent the training process from changing its objective. 

Could this ‘crystallize’ the aligned objective of the AGI? Can you use an arbitrary base objective to safely scale an already aligned AGI up to superintelligence?

Some ways to make this more likely to work

As the authors of Risks from Learned Optimization point out, the inner objective might still change due to “random drift and any implicit time complexity or description length penalties.”

Preventing random drift

Perhaps you could teach the model to gradient hack. For example, for each input, give the model the additional instruction: “check if your objective is aligned with human interests and fail hard if it isn’t.”[2] 

It's hard to know how feasible gradient hacking is. Let's say that there is some introspective mechanism in the model that allows it to perform the check. There's no reason for SGD to keep this check around, which means it could be replaced with a mechanism that makes better use of the parameter real-estate.

In order to prevent the check from being removed, we would want the agent to implement a check cycle:

Check 1 checks that the objective is satisfactory and that Check 2 is still functional. Check 2 checks that Check 1 is functional. If Check 1 or Check 2 fails, the agent performs poorly.

Why wouldn't SGD just remove both checks simultaneously? This shouldn't happen as a result of how partial derivatives work. Partial derivatives answer the question: "how much would the loss change if <parameter> was changed a little bit and all the other parameters were held constant?" When the partial derivative of a parameter that is responsible for performing Check 1 is computed, the parameters responsible for Check 2 are assumed to be fixed, so removing Check 1 appears to hurt loss. The same argument can be applied to Check 2, so a gradient step should not remove either check.
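The partial-derivative argument can be sketched numerically. Here is a toy loss of my own construction (the post doesn't specify one), where each check's removal looks bad to SGD even though removing both at once would be globally better:

```python
import numpy as np

# Toy illustration of the check-cycle argument. c1, c2 in [0, 1] are the
# "functionality" of Check 1 and Check 2. Keeping a check functional costs
# a little capacity (EPS per check), but if exactly one check is removed,
# the surviving check detects it and the agent fails hard (penalty M).
# If BOTH are removed, nothing detects anything, so (0, 0) is the global minimum.
EPS, M = 0.01, 10.0

def loss(c1, c2):
    capacity_cost = EPS * (c1 + c2)
    # c1 + c2 - 2*c1*c2 is ~1 when exactly one check is removed,
    # and 0 at both (1, 1) and (0, 0)
    failure = M * (c1 + c2 - 2.0 * c1 * c2)
    return capacity_cost + failure

def grad(c1, c2, h=1e-6):
    # central finite differences: each partial holds the OTHER parameter fixed
    d1 = (loss(c1 + h, c2) - loss(c1 - h, c2)) / (2 * h)
    d2 = (loss(c1, c2 + h) - loss(c1, c2 - h)) / (2 * h)
    return d1, d2

d1, d2 = grad(1.0, 1.0)
print(d1, d2)                    # both negative: a gradient step keeps each check
print(loss(0, 0) < loss(1, 1))   # True: removing both checks would be globally better
```

On this constructed loss, (1, 1) really is a local minimum that traps gradient descent away from the global minimum at (0, 0); whether a real network would actually have this structure is exactly what the comments below dispute.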

But how would this actually be implemented? You can pass instructions with each input to run these checks and maybe the AGI will implement them the way we want it to, but that's a big maybe. I don't expect there to be many tasks in the training data like this. Also, I think that implementing a check cycle would require recurrence.

Another strategy is to use intermittent oversight – i.e. get an amplified version of the current aligned model to (somehow) determine whether the upgraded model has the same objective before proceeding. Whether this could work is largely dependent on how transparent the model can be made to be.

What about implicit time complexity/description length penalties?

By default, SGD will push the aligned objective toward whatever consequentialist objective can be expressed with parameters that have the minimum regularization penalty. The previously mentioned strategies would be more likely to work if there weren't any optimization pressure toward changing the objective.

One way to alleviate this pressure is to add a term to the loss function that incentivizes the model to keep parameter values similar to those of the original aligned model. For the sake of concreteness, I'll describe a possible implementation.

First, let's choose a penalty function for the parameter difference, which I'll denote $p$:

[Figure: $p(d)$ rises steeply near $d = 0$ and then saturates at $1$ for large $|d|$.]

The benefit of this shape is that the loss encourages parameters to stick to the original values, but if there is a sufficiently large performance boost to be had by changing them and they become 'unstuck,' it doesn't really matter how far they end up being from the original. If $\theta$ is the current parameter vector, $\theta^*$ is the parameter vector for the aligned original, and $L_{base}$ is the loss function for the random new base objective we want the AGI to optimize, then the new loss $L$ would be:

$$L(\theta) = L_{base}(\theta) + \sum_i p(\theta_i - \theta^*_i)$$

Whether the objective is maintained depends on the tuning of $p$ and whether the objective is modular. Tuning $p$ seems difficult, so let's replace it with a parameter $\alpha$ that indicates (roughly) the proportion of parameter values we want to remain stuck to the original values. Choosing a large value of $\alpha$ hurts performance but also increases the probability that the objective is maintained. Let $N$ be the total number of parameters. We'll need another curve $g$ that should look something like the following:

[Figure: $g(s)$ is large for $s < \alpha N$ and near zero for $s \geq \alpha N$.]

The new loss becomes:

$$L(\theta) = L_{base}(\theta) + g\left(N - \sum_i p(\theta_i - \theta^*_i)\right)$$

$\sum_i p(\theta_i - \theta^*_i)$ roughly counts the number of parameters that are 'unstuck' from the original values, so $N - \sum_i p(\theta_i - \theta^*_i)$ is the number of stuck values. $g$ is shaped so that there is a large penalty if the number of stuck values is less than the desired proportion $\alpha N$.
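A minimal sketch of this anchored loss. The exact shapes of $p$ and $g$ are my assumptions (a saturating ratio and a hinge, respectively); the post only specifies their qualitative behavior:

```python
import numpy as np

def p(diff, c=1e-2):
    """Saturating per-parameter penalty: ~0 for differences near zero,
    ~1 for large differences, so sum(p) roughly counts 'unstuck' parameters."""
    d2 = diff ** 2
    return d2 / (d2 + c)

def g(stuck_count, alpha, n, m=100.0):
    """Large penalty when fewer than alpha * n parameters remain stuck."""
    return m * np.maximum(alpha * n - stuck_count, 0.0)

def anchored_loss(theta, theta_star, base_loss, alpha=0.9):
    """base_loss on the new objective, plus a penalty for letting more than
    a (1 - alpha) fraction of parameters drift from the aligned original."""
    n = theta.size
    unstuck = np.sum(p(theta - theta_star))
    return base_loss(theta) + g(n - unstuck, alpha, n)
```

For example, with `alpha=0.9` and ten parameters, keeping all parameters at the original values adds no penalty, while moving five of them far away triggers a large `g` penalty because fewer than nine remain stuck.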

Here are some reasons not to be optimistic about this:

  1. $p$ may allow for small perturbations in 'stuck' parameters, which could influence the objective a lot across many operations.
  2. I don't have a reason to believe that the agent's objective will be modular. As the agent's world model improves, its objective will probably need to be adapted. For example, if you care about making your ancestors happy, and you find out that ghosts don't exist, what do you do? Your objective will need to be rehashed.[3]


There are probably more and better strategies for helping an aligned agent maintain its objective, but I don't want to spend very much time exploring this because:

  1. Aligning the human-level AGI to begin with seems like a more important problem to work on. 
  2. If we can align human-level AGIs, I would feel much better about increasing their capabilities by copying them loads of times and letting them take over the research than by training them on a random objective.

The main update I've made by writing this is that it seems more likely that aligned AGIs will be able to safely scale themselves up to superintelligence. If we can build aligned human-level AGI, I feel more confident that an aligned superintelligence is soon to follow.

  1. ^
  2. ^

    From Evan Hubinger's Gradient Hacking post

  3. ^

    John Wentworth talks about this in The Pointers Problem.


9 comments

If I understand correctly, this is the idea presented: (nonmyopic) mesaoptimizers would want to preserve their mesaobjectives once set. Therefore, if we can make sure that the mesaobjective is something we want, while we can still understand the mesaoptimizer's internals, then we can take advantage of its desire to remain stable to make sure that even in a future environment where the base objective is misaligned with what we want, the mesaoptimizer will still avoid doing things that break its alignment.

Unfortunately, I don't think this quite works as stated. The core problem is that an aligned mesaobjective for the original distribution of tasks that humans could supervise has no reason at all to generalize to the more difficult domains that we want the AI to be good at in the second phase, and mesaobjective preservation usually means literally trying to keep the original mesaobjective around. For instance, if you first train a mesaoptimizer to be good at playing a game in ways that imitate humans, and then put it in an environment where it gets a base reward directly corresponding to the environment reward, what will happen is either that the original mesaobjective gets clobbered, or it successfully survives by being deceptive to conceal its mesaobjective of imitating humans. The mesaobjective that was aligned to our base objective in the original setting is no longer aligned in the new setting, and therefore it becomes deceptive (in the sense of hiding its true objective until out of training) to preserve itself. In other words, deception is not just a property of the mesaobjective, but also of the context that the mesaoptimizer is in.

I think what you're trying to get at is that if the original mesaobjective wants the best for humanity in some sense, then maybe this property, rather than the literal mesaobjective, can be preserved, because a mesaoptimizer which wants the best for humanity will want to make sure that its future self will have a mesaoptimizer which preserves and continues to propagate this property. This argument seems to have a lot in common with the hypothesis of broad basin of corrigibility. I haven't thought a lot about this but I think this argument may be applicable to inner alignment. 

With regard to the redundancy argument, this post (and the linked comment) covers why I think it won't work. Basically, I think the mistake is thinking of gradients as intuitively being like pertubations due to genetic algorithms, whereas for (sane) functions it's not possible for the directional derivative to be zero along two directions and to still have a nonzero directional derivative in their span.
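A quick numeric illustration of that linearity point, using a toy function of my own: the directional derivative of a smooth $f$ is $\nabla f \cdot v$, which is linear in $v$, so it cannot vanish along two directions yet be nonzero on their span.

```python
import numpy as np

def f(x):
    return np.sin(x[0]) * x[1] + x[2] ** 2

def directional_derivative(f, x, v, h=1e-6):
    # central finite difference along the unit vector v / |v|
    v = v / np.linalg.norm(v)
    return (f(x + h * v) - f(x - h * v)) / (2 * h)

x = np.array([0.3, -1.2, 0.7])
g = np.array([np.cos(x[0]) * x[1], np.sin(x[0]), 2 * x[2]])  # analytic gradient

# two independent directions orthogonal to the gradient
v1 = np.array([-g[1], g[0], 0.0])
v2 = np.cross(g, v1)
span = 0.5 * v1 + 2.0 * v2  # any combination of the two

print(directional_derivative(f, x, v1))    # ~0
print(directional_derivative(f, x, span))  # also ~0, by linearity
print(directional_derivative(f, x, g))     # nonzero along the gradient itself
```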

Thanks for the thoughtful review! I think this is overall a good read of what I was saying. I agree now that redundancy would not work. 

One clarification:

The mesaobjective that was aligned to our base objective in the original setting is no longer aligned in the new setting

When I said that the 'human-level' AGI is assumed to be aligned, I meant that it has an aligned mesa-objective (corrigibly or internally) -- not that it has an objective that was functionally aligned on the training distribution, but may not remain aligned under distribution shift. I thought that internally/corrigibly aligned mesa-objectives are intent-aligned on all (plausible) distributions by definition...

If you already have a mesaobjective fully aligned everywhere from the start, then you don't really need to invoke the crystallization argument; the crystallization argument is basically about how misaligned objectives can get locked in.

I'm not sure I understand. We might not be on the same page.

Here's the concern I'm addressing:
Let's say we build a fully aligned human-level AGI, but we want to scale it up to superintelligence. This seems much harder to do safely than to train the human-level AGI since you need a training signal that's better than human feedback/imitation.

Here's the point I am making about that concern:
It might actually be quite easy to scale an already aligned AGI up to superintelligence -- even if you don't have a scalable outer-aligned training signal -- because the AGI will be motivated to crystallize its aligned objective.

Adding some thoughts that came out of a conversation with Thomas Kwa:

Gradient hacking seems difficult. Humans have pretty weak introspective access to their goals. I have a hard time determining whether my goals have changed or if I have gained information about what they are. There isn't a good reason to believe that the AIs we build will be different.

Doesn’t this post assume we have the transparency capabilities to verify the AI has human-value-preserving goals, which the AI can use? The strategy seems relevant if these tools verifiably generalize to smarter-than-human AIs, and it's easy to build aligned human-level AIs.

I'm guessing that you are referring to this:

Another strategy is to use intermittent oversight – i.e. get an amplified version of the current aligned model to (somehow) determine whether the upgraded model has the same objective before proceeding.

The intermittent oversight strategy does depend on some level of transparency. This is only one of the ideas I mentioned though (and it is not original). The post in general does not assume anything about our transparency capabilities. 

Interesting concept. If we have interpretability tools sufficient to check whether a model is aligned, what is gained by having the model use these tools to verify its alignment?

Other ideas for how you can use such an introspective check to keep your model aligned:

  • Use an automated, untrained, system
  • Use a human
  • Use a previous version of the model

Never mind, I figured it out. Its use is to get SGD to update your model in the right direction. The above 3 uses only allow you to tell whether your model is unaligned, not necessarily how to keep it aligned. This idea seems very cool!
