*Here's some thoughts I've had about utility maximizers, heavily influenced by ideas like **FDT** and **Morality as Fixed Computation**.*

## The Description vs The Maths vs The Algorithm (or Implementation)

This is a frame which I think is important. Getting from a description of what we want, to the maths of what we want, to an algorithm which implements that seems to be a key challenge.

I sometimes think of this as a pipeline of development: description maths algorithm. A description is something like "A utility maximizer is an agent-like thing which attempts to compress future world states towards ones which score highly in its utility function". The maths of something like that involves information theory (to understand what we mean by compression), proofs (like the good regulator theorem, the power-seeking theorems) etc. The algorithm is something like RL or AlphaGo.

More examples:

System | Description | Maths | Algorithm |

Addition | "If you have some apples and you get some more, you have a new number of apples" | The rules of arithmetic. We can make proofs about it using ZFS theory. | Whatever machine code/ logic gates are going on inside a calculator. |

Physics | "When you throw a ball, it accelerates downwards under gravity" | Calculus | Frame-by-frame updating of position and velocity vectors |

AI which models the world | "Consider all hypotheses weighted by simplicity, and update based on evidence" | Kolmogorov complexity, Bayesian updating, AIXI | DeepMind's Apperception Engine (But it's not very good) |

A good decision theory | "Doesn't let you get exploited in Termites, while also one-boxing in Newcome" | FDT, concepts like subjunctive dependence | ??? |

Human values | ??? | ??? | ??? |

This allows us to identify three failure points:

- Failure to make an accurate description of what we want (Alternatively, failure to turn an intuitive sense into a description)
- Failure to formalize that description into mathematics
- Failure to implement that mathematics into an algorithm

These failures can be total or partial. DeepMind's Apperception Engine is basically useless because it's a *bad* implementation of something AIXI-like. Failure to implement the mathematics may also happen because the algorithm doesn't *accurately* represent the maths. Deep neural networks are sort-of-like idealized Bayesian reasoning, but a very imperfect version of it.

If the algorithm doesn't accurately represent the maths, then reasoning about the maths doesn't tell you about the algorithm. Proving properties of algorithms is much harder than proving them about the abstracted maths of a system.

(As an aside, I suspect this is actually a crux relating to near-term AI doom arguments: are neural networks and DRL agents similar enough to idealized Bayesian reasoning and utility maximizers to act in ways which those abstract systems will provably act?)

All of this is just to introduce some big classes of reasoners: self-protecting utility maximizers, self-modifying utility maximizers, and thoughts about what a different type of utility-maximizer might look like.

## Self-Protecting Utility Maximizers

On a **description **level: this is a system which chooses actions to maximize the value of a utility function.

**Mathematically** it compresses the world into states which score highly according to a function .

Imagine the following **algorithm **(it's basically a description of an RL agent with direct access to the world state):

Take a world-state vector , a list of actions , and a dynamic matrix . Have a value function . Then output the following .

To train it, update according to basic deep-learning rules to make it more accurate. Also update according to some reward signal.

This is a shallow search over a single action. Now consider updating it to use something like a Monte-Carlo tree search. This will cause it to maximize the value of far into the future.

So what happens if this system is powerful enough to include an accurate model of itself in its model of the world? And let's say it's also powerful enough to edit it's own source code. The answer is pretty clear: it will delete the code which modifies . Then (if it is powerful enough) it will destroy the world.

Why? Well it wants to take the action which maximizes the value of far into the future. If its current is modified to , then it will become an agent which maximizes instead of . This means the future is likely to be less good according to .

This is one of the most obvious problems with utility maximizers, and it was first noticed a long time ago (by AI alignment standards).

## (Fake) Self-Modifying Utility Maximizers

A system which is **described** as wanting to maximize something like "Do whatever makes humans happy".

What this might look like **mathematically** is something which models humans as a utility maximizer, then maximizes whatever it thinks humans want to maximize. The part which does this modelling extracts a new value function from its future model of the world.

So for an example of an **algorithm**, we have our , , and the same as above, but instead of using a fixed , it has a fixed which produces .

Then it chooses futures similarly to our previous algorithm. Like the previous algorithm, it also destroys the world if given a chance.

Why? Well for one reason if depends on , then it will simply change so that gives that a high score. For example, it might modify humans to behave like hydrogen maximizers. Hydrogen is pretty common, so this scores highly.

But another way of looking at this is that is just acting like did in the old algorithm: since only depends on and , together they're just another map from to .

In this case **something which looks like it modifies its own utility function is actually just preserving it at one level down.**

## Less Fake Self-Modifying Utility Maximizers

So what might be a better way of doing this? We want a system which we might **describe** as "Learn about the 'correct' utility function without influencing it".

**Mathematically** this is reminiscent of FDT. The "correct" utility function is something which many parts of the the world (i.e. human behaviour) subjunctively depend on. It influences human behaviour, but cannot be influenced.

This might look like a modification of our first **algorithm **as follows: now returns a series of worlds drawn from a probability distribution over possible results of an action . We begin with our initial estimate of , which is updated according to some updater and each world is evaluated according to the corresponding .

This looks very much like the second system, so we add a further condition. For each we have the production of an associated representing a relative probability of those worlds. So we enforce a new consistency as a **mathematical** property:

This amounts to the **description** that "No action can affect the expected value of a future world state." which is similar to subjunctive dependence from FDT.

This is an old solution to the sophisticated cake or death problem.

There are a few possible ways to implement this consistency: we can have the algorithm modify its own as it considers possible futures. We can have the consistency enforced on the operator so that it updates the value function only in consistent ways. We also have the rather exotic way of *generating the probabilities* by comparing to the various .

The first one looks like basic reasoning, and is the suggested answer to the sophisticated cake or death problem given above. But it incentivises the AI to only think in certain ways, if the AI is able to model itself.

The second one seems to run into the problem that if becomes too accurate, our is unable to update the value function at all.

The third one is weird and requires more thought on my part. The main issue is that it doesn't guard against attempts by the AI to edit its future value function, only makes the AI believe they're less likely to work.

An algorithm which accurately represents the fixed-computation-ness of human morality is still out of reach.