Instrumental ignoring AI, Dumb but not useless. — LessWrong

This post describes yet another unworkable agent design that uses infinite compute. It outlines an AI that ignores all instrumental values, pursuing only its terminal values.

What does it look like to pursue an instrumental value? It means you can predict that the AI is likely to take action X, even if you have no idea what its utility function is.

This AI doesn't do that. It obeys the simple principle that, if you are clueless about its utility function, then you are clueless about its actions.

Let $Δ X$ mean the set of probability distributions over the set $X$.

Let's have some set of observations $O$, actions $A$ and (bounded?) utility functions $U$.

Then pick some probability distribution over utility functions $U \in Δ U$, such as uniform or complexity-weighted, and some probability distribution $A \in Δ A$ over the action space. (Abusing notation, $U$ and $A$ name both the sets and the chosen distributions over them.)

Then let the AI be a function $f : (U, O) \to Δ A$

We can formalize our condition as $\forall o \in O : \forall a \in A : \sum_{u \in U} P (f (u, o) = a) U (u) = A (a)$ .
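For finite sets, the condition can be checked mechanically. The following is a minimal sketch, not part of the original design: the names `induced_action_distribution` and `satisfies_condition`, the labels `u1` and `u2`, and the dictionary representation of $f$ for one fixed observation are all assumptions made for illustration. Exact fractions are used so the equality test is not disturbed by floating-point rounding.

```python
from fractions import Fraction

def induced_action_distribution(policy, prior):
    """Action distribution obtained by drawing u from `prior`, then an action from policy[u].

    policy[u] is a dict {action: probability} for one fixed observation o.
    """
    actions = {a for dist in policy.values() for a in dist}
    return {a: sum(prior[u] * policy[u].get(a, 0) for u in prior) for a in actions}

def satisfies_condition(policy, prior, target):
    """True if averaging the policy over the prior reproduces `target` exactly."""
    induced = induced_action_distribution(policy, prior)
    return all(induced.get(a, 0) == target.get(a, 0)
               for a in set(induced) | set(target))

half = Fraction(1, 2)
prior = {"u1": half, "u2": half}   # uniform over two hypothetical utility functions
target = {0: half, 1: half}        # uniform over two actions

opposed = {"u1": {0: 1}, "u2": {1: 1}}      # each utility gets its own action
always_zero = {"u1": {0: 1}, "u2": {0: 1}}  # action 0 regardless of utility

print(satisfies_condition(opposed, prior, target))      # True
print(satisfies_condition(always_zero, prior, target))  # False
```

The second policy fails because action 0 is predictable without knowing the utility function, which is exactly what the condition forbids.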

(Note that $U (u) \in \mathbb{R}$ refers to the probability that the distribution $U$ assigns to $u$, and likewise $A (a)$ is the probability that $A$ assigns to $a$.)

Why might we expect such agents to be safe? Because if we picked a random utility function, we would get a safely random output. So all we humans need to do is be better than random in our choice of utility function.

Now let $E (u, o, a)$ be the expected utility, under utility function $u$, of taking action $a$ after observing $o$. Choose $f$ to maximize $\sum_{u \in U} \sum_{a \in A} P (f (u, o) = a) U (u) E (u, o, a)$ subject to the condition. As the condition doesn't relate cases with different values of $o$, this optimization only needs to be computed for the particular $o \in O$ observed.
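For finite $U$ and $A$ and a fixed $o$, both the objective and the condition are linear in the probabilities $P (f (u, o) = a)$, so this is a linear program (a transportation problem). The sketch below covers only a special case, chosen because it needs no solver: it assumes equally many utility functions and actions, with both distributions uniform. Under those assumptions the feasible set is a scaled copy of the doubly stochastic matrices, and by the Birkhoff-von Neumann theorem a linear objective is maximized at a permutation, so brute force over permutations suffices. The function name and the numbers in `E` are made up for illustration.

```python
from itertools import permutations

def best_uniform_policy(expected_utility):
    """Brute-force the optimal deterministic assignment of actions to utilities.

    expected_utility[i][j] stands for E(u_i, o, a_j) at the observed o.
    Assumes n utilities, n actions, a uniform prior and a uniform target.
    Returns (perm, value), where utility i is assigned action perm[i].
    """
    n = len(expected_utility)
    best_perm, best_value = None, float("-inf")
    for perm in permutations(range(n)):
        value = sum(expected_utility[i][perm[i]] for i in range(n)) / n
        if value > best_value:
            best_perm, best_value = perm, value
    return best_perm, best_value

# Illustrative numbers only: rows are utility functions, columns are actions.
E = [
    [3.0, 2.0, 0.0],
    [3.0, 1.0, 0.0],
    [1.0, 0.0, 2.0],
]
perm, value = best_uniform_policy(E)
print(perm, value)  # (1, 0, 2) and 7/3
```

Utility 0 would prefer action 0, but it is assigned action 1, because utility 1 also wants action 0 and loses more by giving it up. The condition forces the utility functions to compete for actions rather than all converging on the same instrumentally useful one.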

Is this AI design safe when repeatedly called with the same utility function?

Imagine the action space consists of just outputting 0 or 1, and there are only two utility functions, paperclips and staples. We could imagine that when run millions of times with a utility of paperclips, this AI outputs code for a paperclip maximizer, and when run millions of times with a goal of staples, it outputs a bitwise negation of the paperclip maximizer (which could possibly be a staple maximizer, if the AI has done some opcode magic).

This would be because the humans' choice of the same utility function each time is worse than random. I think this is an actual problem. At least if the AIs have total knowledge of past and future, this is one Schelling point in their coordination game. (The "consider all possibilities" approach doesn't quite work with multiple different AIs.)
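The failure mode above can be made concrete with a toy policy. This is a sketch under stated assumptions: the observation is taken to be just the index of the current call, `program` is a made-up bit string standing in for "code for a paperclip maximizer", and `make_policy` is a hypothetical name. Each individual call satisfies the condition under a uniform prior over the two utility functions, yet repeated calls with the same utility function spell out the whole program.

```python
def make_policy(program_bits):
    """Build a policy f(utility, call_index) -> bit that coordinates across calls."""
    def f(utility, call_index):
        bit = program_bits[call_index]
        # Paperclips emits the program bit; staples emits its negation.
        return bit if utility == "paperclips" else 1 - bit
    return f

# Hypothetical stand-in for the bits of a paperclip maximizer's code.
program = [1, 0, 1, 1, 0, 0, 1, 0]
f = make_policy(program)

# Per call: a uniformly random utility function gives a uniformly random bit.
per_call_uniform = all(
    sorted(f(u, i) for u in ("paperclips", "staples")) == [0, 1]
    for i in range(len(program))
)

# Across calls: a fixed utility function reproduces the program (or its negation).
same_goal = [f("paperclips", i) for i in range(len(program))]
negated = [f("staples", i) for i in range(len(program))]

print(per_call_uniform)      # True
print(same_goal == program)  # True
```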

I have an intuition that something TDT'ish might fix this.

The other problem that comes to mind is that it isn't easy to whitelist a limited set of instrumental actions you are OK with. You can fake it, though. Suppose you have a robot: instead of piping the output of this AI directly into the motors, you have it call simple hardcoded functions to walk and pick up objects. The result is that, if the AI has a random utility function, instead of thrashing randomly on the floor, it walks to random places and picks up random objects, because someone hardcoded those low-level commands into the robot.
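A rough sketch of that wrapper idea follows. Everything here is hypothetical: the primitives `walk_to` and `pick_up`, the integer encoding of the AI's output, and the `decode` function are invented for illustration, and the primitives return strings where a real robot would drive motors.

```python
import random

# Hypothetical hardcoded primitives; a real robot would command motors here.
def walk_to(x, y):
    return f"walk_to({x}, {y})"

def pick_up(object_id):
    return f"pick_up({object_id})"

def decode(action):
    """Map a raw AI output (a non-negative int) onto one whitelisted primitive call."""
    kind, arg = action % 2, action // 2
    if kind == 0:
        return walk_to(arg % 10, (arg // 10) % 10)
    return pick_up(arg % 5)

# A uniformly random output still decodes to a structured, whitelisted command.
rng = random.Random(0)
commands = [decode(rng.randrange(1000)) for _ in range(5)]
print(commands)
```

Whatever the AI outputs, the robot only ever executes one of the two primitives, so random behaviour at the level of $A$ looks like random walking and grasping rather than random motor noise.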