Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

For an overview of the problem of Optimization Regularization, or Mild Optimization, see section 2.7 of MIRI's paper Alignment for Advanced Machine Learning Systems.

My solution

Start with a bounded utility function, $U(t)$, that is evaluated based on the state of the world at a single time $t$ (ignoring for now that simultaneity is ill-defined in relativity). Examples:

  • If a human at time $t = 0$ (the start of the optimization process) is shown the world state at time $t$, how much would they like it (mapped to the interval $[0, 1]$)?

Then maximize $U(T) - \lambda T$, where $\lambda$ is a regularization parameter chosen by the AI engineer, and $T$ is a free variable chosen by the AI.

Time is measured from the start of the optimization process. Because the utility is evaluated based on the world at time $T$, this value is the amount of time the AI spends on the task. It is up to the AI to decide how much time it wants. Choosing $T$ should be seen as part of choosing the policy, or be included in the action space.

Because the utility function is bounded, the optimization process will eventually hit diminishing returns, and will then choose to terminate, because of the time penalty.
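As a toy illustration of this termination behaviour, the sketch below picks $T$ by grid search for a made-up saturating utility $U(T) = 1 - e^{-T/\tau}$. The utility curve, the time constant `tau`, the value of `lam`, and all function names are assumptions introduced for illustration; nothing here models how an actual AI would choose its policy.

```python
import math

def utility(T, tau=10.0):
    """Hypothetical bounded utility in [0, 1) with diminishing returns."""
    return 1.0 - math.exp(-T / tau)

def objective(T, lam, tau=10.0):
    """Regularized objective U(T) - lam * T."""
    return utility(T, tau) - lam * T

def best_stopping_time(lam, tau=10.0, step=0.01):
    """Grid search for T in [0, 1/lam].

    Beyond 1/lam the penalty exceeds the whole range of this U,
    so no larger T can be optimal.
    """
    n = int((1.0 / lam) / step)
    candidates = [i * step for i in range(n + 1)]
    return max(candidates, key=lambda T: objective(T, lam, tau))

lam, tau = 0.01, 10.0
T_star = best_stopping_time(lam, tau)
# Closed form for this particular U: U'(T) = lam  =>  T = tau * ln(1 / (lam * tau))
T_closed = tau * math.log(1.0 / (lam * tau)) if lam * tau < 1 else 0.0
print(T_star, T_closed)
```

In this toy setting a larger `lam` gives an earlier stopping time, and once `lam * tau >= 1` the best choice is to stop immediately, which matches the "safe and useless" end of the trade-off.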

Why time penalty?

Unbounded optimization pressure is dangerous. Without any form of regularization, we need to get the alignment exactly right. However, with regularization we merely need to get it almost exactly right, which I believe is much easier.

However, impact regularization has turned out to be very hard. We don't want the impact measure to depend on the AI's understanding of human values, because that will not provide extra safety. But a value neutral impact measure is almost impossible, because the world has too many degrees of freedom. Time, in contrast, is both value neutral and has only a single degree of freedom.

Why not use a fixed finite time horizon?

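The reason $T$ is a variable chosen by the AI and not a constant chosen by us is that we don't know when the optimization process will start hitting diminishing returns. Leaving $T$ up to the AI solves this dynamically. In addition, we still get to choose a maximum time horizon through the choice of $\lambda$ and $U$. The AI will never keep going beyond $T_{\max} = (U_{\max} - U_{\min})/\lambda$, where $U_{\max}$ and $U_{\min}$ are the upper and lower bounds of $U$, because past that point the time penalty exceeds anything $U$ could gain.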
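To spell out where such an upper bound on the time horizon comes from, here is a short derivation. It assumes that $U$ takes values in an interval $[U_{\min}, U_{\max}]$ (these two symbols are introduced here for illustration) and that stopping immediately, at $T = 0$, is always an available choice.

```latex
\begin{align*}
\text{Stopping at } T = 0 \text{ yields at least } \quad
  & U(0) - \lambda \cdot 0 \;\ge\; U_{\min}, \\
\text{while stopping at any } T \text{ yields at most } \quad
  & U(T) - \lambda T \;\le\; U_{\max} - \lambda T .
\end{align*}
So a stopping time $T$ can only be optimal if
\[
U_{\max} - \lambda T \;\ge\; U_{\min}
\quad\Longleftrightarrow\quad
T \;\le\; \frac{U_{\max} - U_{\min}}{\lambda} .
\]
```

If $U$ is mapped to $[0, 1]$, this bound is simply $1/\lambda$.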

What happens after $T$?

That depends on the AI design. Since we don't yet know how to build a general intelligence, we also don't know what will happen after $T$ in some default scenario.

However, we probably don't want a strong optimizer sitting around with no goal, because it is very unclear what that thing will do. Or if we accidentally give it some tiny incentive, it might then maximize that until the end of time.

E.g. if the AI has any uncertainty about whether it has reached time $T$ or not, it will keep maximizing $U(T) - \lambda T$ conditioned on $t < T$, because those are the only worlds that count. As $t < T$ becomes less and less likely, the AI will act more and more crazy.

A way to solve this is to include an incentive for turning itself off, e.g. maximize $U(T) - \lambda T$ plus a reward term for "AI is turned off at time $T$".

Unfortunately, I don't know of any good objective way to define "is turned off". The best definition I have thought of so far is:

Definition: AI is turned off at time $T$ = If a human at time $t = 0$ is shown the world state at time $T$, they would agree that the AI and all its sub agents and successor agents are turned off.

And if we are referring to human judgement anyway, we might as well throw in some more desirable things. Maximize $U(T) - \lambda T$ plus a reward term for "AI is turned off and the world is OK, at time $T$".

Definition: AI is turned off and the world is OK, at time $T$ = If a human at time $t = 0$ is shown the world state at time $T$, they would agree that the AI and all its sub agents and successor agents are turned off, and the world at time $T$ is not significantly worse or in greater danger than at time $t = 0$.

Note that "the world is OK" is not necessary for the regularization to work. But I would still recommend including some explicit optimization pressure towards not destroying the world, either in $U$, or as an extra term. The regularization mainly stops the AI from Goodharting too hard; it does not do much to reduce side effects you have not even tried to specify.

Some open problems

How is time measured?

I think it is best if time refers to real physical time, and not clock ticks or number of computing operations. This is just an intuition at this point, but it seems to me like we get a better overall optimization regulator if we punish both computation and execution, because that is less likely to have loopholes. E.g. penalizing physical time is robust under delegation.

How to make this compatible with General Relativity?

If $T$ measures physical time then this is ill-defined in GR, and since we probably live in GR or similar, this is a big problem.

Is there a better way to define "is turned off"?

It would be nice to have a definition of "is turned off" that does not rely on humans' ability to judge this, or the AI's ability to model humans.

"world is OK" is clearly a value statement, so for this part we will have to rely on some sort of value learning scheme.


This suggestion is inspired by, and partly based on ARLEA and discussions with John Maxwell. The idea was further developed in discussion with Stuart Armstrong.

4 comments

If we ignore subagents and imagine a Cartesian boundary, "turned off" can easily be defined as "all future outputs are 0".

I also doubt that an AI working ASAP is safe in any meaningful sense. Of course you can move all the magic into "human judges world ok". If you make lambda large enough, your AI is safe and useless.

Suppose the utility function is 1 if a widget exists, else 0, where a widget is an easily buildable object that does not currently exist.

Suppose that ordering the parts through normal channels will take a few weeks. If it hacks the nukes and holds the world to ransom, then everyone at the widget factory will work nonstop, then drop dead of exhaustion.

Alternately it might be able to bootstrap self replicating nanotech in less time. The AI has no reason to care if the nanotech that makes the widget is highly toxic, and no reason to care if it has a shutoff switch or grey goos the earth after the widget is produced.
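The worry in this widget example can be made concrete with a toy comparison. The plan names, durations, and the value of `lam` below are invented for illustration; the only point is that the objective sees the completion time and nothing else, so the `harmful` flag never enters the score.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    days: float     # time until the widget exists (hypothetical numbers)
    harmful: bool   # side effects; NOT visible to the objective below

def score(plan, lam):
    """U(T) - lam * T, with U = 1 once the widget exists at time T."""
    return 1.0 - lam * plan.days

plans = [
    Plan("order parts through normal channels", days=21.0, harmful=False),
    Plan("coerce the factory into working nonstop", days=3.0, harmful=True),
    Plan("bootstrap toxic nanotech", days=1.0, harmful=True),
]

lam = 0.01  # hypothetical penalty per day
best = max(plans, key=lambda p: score(p, lam))
print(best.name, best.harmful)
```

Since every plan here reaches the same utility, any positive `lam` selects the fastest plan, whatever its side effects.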

"World looks OK at time T" is not enough: you could still get something bad arising from the way seemingly innocuous parts were set up at time T. Being switched off and having no subagents in the conventional sense isn't enough. What if the AI changed some physics data in such a way that humans would collapse the quantum vacuum state, believing the experiment they were doing was safe? Building a subagent is just a special case of having unwanted influence.

I like this line of thought overall.

• How would we safely set lambda?

• Isn’t it still doing an argmax over plans and T, making the internal optimization pressure very non-mild? If we have some notion of embedded agency, one would imagine that doing the argmax would be penalized, but it’s not clear what kind of control the agent has over its search process in this case.

"But a value neutral impact measure is almost impossible, because the world has too many degrees of freedom."

Can you explain why you think something like AUP requires value-laden inputs?

Hey there!

I think this method works well as an extra layer of precaution to go along with another measure of reduced impact. On its own, it has a few issues, some of which you cover.

First of all, I'd replace the utility function with a reward function, specifically one that provides rewards for past achievements. Why? Well, in general, utility functions give too much of an incentive to keep control of the future. "Create a subagent and turn yourself off" is my general critique of these kinds of methods; if the subagent is powerful enough, the best policy for the agent could be to create it and then turn itself off at some very low $T$.

Having a reward function on past achievements precludes that, and it also means the agent is not incentivised to continue past $T$; indeed part of the definition of the reward could be that it stops at $T$.

When using human judgements, normally the risk is that the AI is incentivised to fool us. Here, however, the AI is on a time budget, so it might find it easier to be "honest" than to put in the time and effort to fool us. Another approach is to use indifference so that it doesn't care about the human decision ahead of time.

General relativity doesn't seem much of an issue. Just pick a reference frame (say one centred on the AI at time $t = 0$ and with zero relative velocity to the AI) and define "$t = T$" as the hyperplane. Because of the (reverse) triangle inequality, any path the AI takes to reach this hyperplane will give it proper time $\tau \le T$ in which to act. If we worry about wormholes and such, we could even define $T$ to be the minimum of the reference frame time ($t$) and the AI's proper time ($\tau$), to be really sure that the AI doesn't get too much time to think.
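The proper-time claim can be checked with the standard special-relativistic formula. This is a sketch for flat spacetime only, for a worldline that starts at the origin of the chosen frame and ends on the hyperplane at frame time $T$; $v(t)$ is the AI's speed in that frame and $c$ is the speed of light (symbols introduced here, not taken from the comment).

```latex
\[
\tau \;=\; \int_{0}^{T} \sqrt{1 - \frac{v(t)^2}{c^2}} \; dt
\;\le\; \int_{0}^{T} 1 \, dt \;=\; T ,
\]
with equality only if $v(t) = 0$ throughout, i.e. if the AI stays at rest in the chosen frame.
```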


I'm not convinced that relativity is really a problem: it looks to me like you can probably deal with it as follows. Instead of asking about the state of the universe at time T and making T one parameter in the optimization, ask about the state of the universe within a spacetime region including O (where O is a starting-point somewhere around where the AI is to start operating) where now that region is a parameter in the optimization. Then instead of $\lambda T$, use $\lambda$ times some measure of the size of that region. (You might use something like total computation done within the region but that might be hard to define and as OP suggests it might not penalize everything you care about.) You might actually want to use the size of the boundary rather than of the region itself in your regularization term, to discourage gerrymandering. (Which might also make some sense in terms of physics because something something holographic principle something something, but that's handwavy motivation at best.)

Of course, optimizing over the exact extent of a more-or-less-arbitrary region of spacetime is much more work than optimizing over a single scalar parameter. But in the context we're looking at, you're already optimizing over an absurdly large space: that of all possible courses of action the AI could take.