Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post describes yet another unworkable agent design that uses infinite compute. It outlines an AI that ignores all instrumental values, pursuing only its terminal values.

What does it look like to pursue an instrumental value? It means you can predict that the AI is likely to take action X, even if you have no idea what its utility function is.

This AI doesn't do that. It obeys the simple principle that, if you are clueless about its utility function, then you are clueless about its actions.

Let $\Delta X$ mean the set of probability distributions over the set $X$.

Let's have some set of Observations $O$, Actions $A$ and (bounded?) Utilities $U$.

Then pick some probability distribution over utility functions $g \in \Delta U$, such as uniform or complexity weighted, and some probability distribution $h \in \Delta A$ over the action space.

Then let the AI be a function $f : O \times U \to \Delta A$.

We can formalize our condition as $\forall o \in O,\ \forall a \in A:\ \sum_{u \in U} g(u)\, f(o,u)(a) = h(a)$.

(Note $f(o,u)(a)$ refers to the probability that $f(o,u)$ assigns to $a$.)

Why might we expect such agents to be safe? Because if we picked a random utility function, we would get a safely random output. So all humans need to do is be better than random in our choice of utility function.

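Now let $E(o,u,a)$ be the expected utility, under utility function $u$, of taking action $a$ after observing $o$. Choose $f$ to maximize $\sum_{u \in U} g(u) \sum_{a \in A} f(o,u)(a)\, E(o,u,a)$, the expected utility averaged over the chosen distribution $g$ on utility functions, subject to the condition. As the condition doesn't relate cases with different values of $o$, this optimization only needs to be computed for the particular $o$ observed.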
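For intuition, here is a minimal sketch of that optimization for one fixed observation, under assumptions that are mine rather than the post's: there are exactly n utility functions and n actions, and both the distribution over utility functions and the target distribution over actions are uniform. In that special case the constraint says the joint distribution over (utility function, action) is a scaled doubly stochastic matrix, and a linear objective over those is maximized at a permutation, so the problem reduces to an assignment problem that can be brute-forced for small n. The `expected_utility` table is made up for illustration.

```python
from itertools import permutations

def best_assignment(expected_utility):
    """Brute-force the best one-to-one map from utility functions to actions.

    expected_utility[u][a] is the payoff of action a as judged by utility
    function u (for one fixed observation). Assumes a square table and
    uniform distributions over both utility functions and actions.
    """
    n = len(expected_utility)
    best_perm, best_value = None, float("-inf")
    for perm in permutations(range(n)):
        # Average payoff when utility function u is deterministically given action perm[u].
        value = sum(expected_utility[u][perm[u]] for u in range(n)) / n
        if value > best_value:
            best_perm, best_value = perm, value
    return best_perm, best_value

def action_marginal(perm):
    """Distribution over actions seen by someone who only knows u is uniform."""
    n = len(perm)
    marginal = [0.0] * n
    for action in perm:
        marginal[action] += 1.0 / n
    return marginal

# Hypothetical payoffs: utility functions 0 and 1 both like action 0 best.
expected_utility = [
    [1.0, 0.0, 0.5],
    [0.9, 0.1, 0.2],
    [0.0, 0.3, 1.0],
]
perm, value = best_assignment(expected_utility)
marginal = action_marginal(perm)
```

In this toy table, utility function 1 would prefer action 0, but the constraint forces it onto another action so that the overall action distribution stays uniform. With non-uniform distributions or mismatched sizes the optimum may need to randomize, and a general linear-programming or optimal-transport solver would be needed instead of this brute force.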

Is this AI design safe when repeatedly called with the same utility function?

Imagine the action space just consists of outputting 0 or 1, and there are only two utility functions, paperclips and staples. We could imagine that when run millions of times with a utility of paperclips, this AI outputs code for a paperclip maximizer, and when run millions of times with a goal of staples, it outputs a bitwise negation of the paperclip maximizer (which could possibly be a staple maximizer, if the AI has done some opcode magic).

This would be because the humans' choice of the same utility function each time is worse than random. I think this is an actual problem. At least if the AIs have total knowledge of past and future, then this is one Schelling point for their coordination game. (The "consider all possibilities" approach doesn't quite work with multiple different AIs.)
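The repeated-call failure can be sketched in a few lines. This is an illustrative toy, assuming the AI knows which call it is on (the `call_index` argument is my stand-in for "total knowledge of past and future") and using a short made-up bit string in place of a real program. Each individual call satisfies the condition, since a uniformly random utility function gives a uniformly random bit, yet a fixed utility function across calls yields a fully determined output sequence.

```python
# Hypothetical stand-in for the bits of a paperclip-maximizer program.
PROGRAM_BITS = [1, 0, 1, 1, 0, 0, 1, 0]

def f(utility, call_index):
    """Output bit for one call: the program bit for paperclips, its negation for staples."""
    bit = PROGRAM_BITS[call_index]
    return bit if utility == "paperclips" else 1 - bit

def per_call_marginal(call_index):
    """Probabilities of outputting 0 and 1 on one call when the utility function is uniform."""
    g = {"paperclips": 0.5, "staples": 0.5}
    marginal = [0.0, 0.0]
    for utility, weight in g.items():
        marginal[f(utility, call_index)] += weight
    return marginal

n_calls = len(PROGRAM_BITS)
# The condition holds on every single call...
every_call_looks_random = all(per_call_marginal(i) == [0.5, 0.5] for i in range(n_calls))
# ...but the same utility function on every call spells out a specific program.
paperclip_run = [f("paperclips", i) for i in range(n_calls)]
staple_run = [f("staples", i) for i in range(n_calls)]
```

The condition constrains each call separately, so it says nothing about correlations across calls that share a utility function.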

I have an intuition that something TDT'ish might fix this. 

The other problem that comes to mind is that it isn't easy to whitelist a limited set of instrumental actions you are OK with. You can fake it, though. Suppose you have a robot: instead of piping the output of this AI directly into the motors, you have it call simple hardcoded functions to walk and pick up objects. The result is that, if the AI has a random utility function, instead of thrashing randomly on the floor, it walks to random places and picks up random objects, because someone hardcoded those low-level commands into the robot.
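As a rough sketch of the robot example, the action space can be defined as a whitelist of hardcoded commands rather than raw motor signals. Everything named here (the places, the objects, the `walk_to` and `pick_up` labels) is hypothetical, and the target distribution over actions is assumed to be uniform over the whitelist.

```python
import random

# Hypothetical whitelist of hardcoded low-level routines; names are illustrative only.
PLACES = ["kitchen", "hallway", "desk"]
OBJECTS = ["cup", "book", "pen"]
ACTIONS = [("walk_to", place) for place in PLACES] + [("pick_up", obj) for obj in OBJECTS]

def sample_action(rng):
    """Draw one command from a uniform distribution over the whitelisted actions."""
    return rng.choice(ACTIONS)

# What an observer who knows nothing about the utility function would see:
rng = random.Random(0)
trace = [sample_action(rng) for _ in range(5)]
```

Whatever the utility function, every output is one of the whitelisted commands, so the "random" behaviour is random walking and picking up rather than random motor noise.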

6 comments

Your math can be terser: f:U->ΔA. g:ΔU. h:ΔA. The condition is: f and g combine into h. To fix your first problem, make A policies, not actions.

My biggest problem here is that f depends on how we represent g since U identifies u with 2u. Silence this warning by normalizing U, and f depends on how we normalize.

Ah, we've seen my problem before: Solve bargaining, then make g bargain to choose f.

This seems in danger of being a "sponge alignment" proposal, i.e. the proposed system doesn't do anything useful. https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities#:~:text=sponge 

This current version is dumb, but still exerts some optimization pressure. (Just the bits of optimization out are at most the bits of selection put into its utility function.) 

It could be a conceptual ingredient to something useful. For example, it can select between two plans.