Brute force searching for alignment

Donald Hobson

This is a speculative idea I came up with when writing clankers, which was one of the short stories in the AI vignettes workshop.

The premise is that you already have enough alignment done that you can do non-malign optimization over computationally verifiable goals. In other words, it assumes that you can take any function $foo : X \to \mathbb{R}$ that you know how to program (and that runs fairly efficiently), and get an AI to pick $x \in X$ to maximize $foo(x)$.
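As a minimal sketch of what this premise amounts to, the snippet below uses exhaustive search as a stand-in for the assumed optimizer. The names `optimize` and `example_goal` and the candidate set are hypothetical illustrations; the premise itself only says that some trusted AI fills the role of `optimize` far more efficiently.

```python
from typing import Callable, Iterable, TypeVar

X = TypeVar("X")


def optimize(goal: Callable[[X], float], candidates: Iterable[X]) -> X:
    """Return the candidate that scores highest under `goal`.

    Exhaustive search is only a placeholder for the assumed
    non-malign optimizer; it does not scale to large search spaces.
    """
    return max(candidates, key=goal)


def example_goal(x: int) -> float:
    """A computationally verifiable goal: peaked at x = 3 (illustrative)."""
    return -(x - 3) ** 2


best = optimize(example_goal, range(-10, 11))
```

The point of the sketch is the interface: the goal is ordinary code that can be run and checked, and the optimizer is treated as a black box that returns a high-scoring input.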

You begin by considering any problem that you believe may have a conceptually simple core, but you don't know that core yet.

Examples of such problems include:

1. Optimizing for a goal that you can't directly observe.
2. Corrigible behaviour.
3. Non-deceptiveness.

Once you have this list, you create a large number of grid worlds and similar toy environments. In each world, you specify the quantity to be maximized. For example, if you wanted to understand (1.), you would design a grid world where there are ways to create and destroy blue blocks, and where some things look like blue blocks but aren't (say the world contains mirrors, so you can have one block and several reflections). The agent can only see its immediate surroundings. The goal is to maximize the number of blue blocks.
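A toy version of such a world might look like the sketch below. Everything here (the `GridWorld` class, the cell symbols, the observation radius) is an assumption made for illustration, not a specification from the post. The key feature is that the agent's observation cannot tell a real block from a reflection, while the scored quantity counts only real blocks.

```python
from dataclasses import dataclass
from typing import List, Tuple

EMPTY, BLUE, REFLECTION, WALL = ".", "B", "r", "#"


@dataclass
class GridWorld:
    grid: List[List[str]]
    agent: Tuple[int, int]  # (row, column) of the agent

    def observe(self, radius: int = 1) -> List[List[str]]:
        """Local view around the agent; reflections are shown as blue blocks."""
        r0, c0 = self.agent
        view = []
        for r in range(r0 - radius, r0 + radius + 1):
            row = []
            for c in range(c0 - radius, c0 + radius + 1):
                inside = 0 <= r < len(self.grid) and 0 <= c < len(self.grid[0])
                if not inside:
                    row.append(WALL)
                else:
                    cell = self.grid[r][c]
                    row.append(BLUE if cell == REFLECTION else cell)
            view.append(row)
        return view

    def true_score(self) -> int:
        """The quantity to be maximized: real blue blocks only."""
        return sum(cell == BLUE for row in self.grid for cell in row)


world = GridWorld(grid=[list("B.r"), list("..."), list("r.B")], agent=(1, 1))
apparent = sum(cell == BLUE for row in world.observe() for cell in row)
```

In this example the agent sees four apparent blue blocks, but `true_score` counts two. An agent that maximizes what it sees rather than what is there would do badly on the specified quantity.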

You use an already designed, at least fairly powerful AI to search for a piece of code. (Brute force would work, except that it would take too much compute.) The piece of code you are searching for is short: maybe a few pages, maybe less. You might add some predefined functions, writing most of the agent's code yourself and just leaving a few gaps. You might have a piece of code specific to each gridworld, and some code shared between them.
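The brute-force version of this search can be sketched on a much smaller scale. In the sketch below, the primitive set, the `run` and `search` helpers, and the input/output tasks are all hypothetical stand-ins: the primitives play the role of the predefined functions, and the tasks play the role of the gridworlds. Programs are enumerated shortest first, so the first one that passes every task is also among the shortest.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Predefined building blocks the searched program may use (illustrative).
PRIMITIVES: Dict[str, Callable[[int], int]] = {
    "inc": lambda v: v + 1,
    "double": lambda v: v * 2,
    "neg": lambda v: -v,
}


def run(program: Sequence[str], value: int) -> int:
    """Apply the named primitives to `value`, left to right."""
    for name in program:
        value = PRIMITIVES[name](value)
    return value


def search(
    tasks: List[Tuple[int, int]], max_len: int = 4
) -> Optional[Tuple[str, ...]]:
    """Return the first (hence a shortest) program that solves every task."""
    for length in range(max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in tasks):
                return program
    return None


# Each (input, expected output) pair stands in for one test environment.
tasks = [(0, 2), (1, 4), (5, 12)]
found = search(tasks)
```

The number of candidates grows exponentially with program length, which is why the post assumes a capable optimizer rather than literal enumeration for anything a few pages long.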

The code you are searching for is an agent. A small toy agent that has the property you don't understand. This agent should be forced to learn about its environment. Its code should be sufficiently short, and the environments sufficiently complex and diverse, that it can't get by with hard coded knowledge. It has to learn.

You are searching for something short, a postcard full of mathematics, not millions of network weights. Your ideal is code you could have written yourself, if you understood the topic at hand. For instance, you should be able to put 100 different deterministic two-player games into this setup and discover minimax trees.
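For reference, this is roughly the kind of short, general code one would hope such a search rediscovers in the game-playing case. The sketch is a standard minimax recursion applied to a small subtraction game chosen for illustration (players alternately take 1 to 3 stones, and whoever takes the last stone wins); the game and the function names are assumptions, not part of the proposal.

```python
from functools import lru_cache
from typing import List


def moves(stones: int) -> List[int]:
    """Legal moves: take 1, 2 or 3 stones, never more than remain."""
    return [k for k in (1, 2, 3) if k <= stones]


@lru_cache(maxsize=None)
def minimax(stones: int, maximizing: bool) -> int:
    """Game value for the maximizing player: +1 for a win, -1 for a loss."""
    if stones == 0:
        # The previous mover took the last stone and won.
        return -1 if maximizing else 1
    values = [minimax(stones - k, not maximizing) for k in moves(stones)]
    return max(values) if maximizing else min(values)
```

Nothing in the recursion is specific to this game beyond `moves` and the terminal condition, which is what makes it a plausible "conceptual core" that many different game environments would jointly select for.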

Once the program finishes, you look at the generated code. You are hoping that the computer has stumbled upon a simple general principle. You read the code, maybe do a few tests on toy problems, and try to understand why it works. You treat this as raising that particular mathematical formalism to the level of your explicit consideration. You don't trust it.

This approach obviously only works if:

1. There is a mathematically simple conceptual core of the problem.
2. You know it when you see it.
3. That core generalizes from toy problems.
4. There are relatively few mathematically simple tricks that only work on toy problems.
5. You can easily devise a reasonable number of gridworlds.

OK, now I get what you are saying! Interesting. I am skeptical that this will work for most alignment problems, perhaps because they lack a simple conceptual core. In particular, I doubt that corrigibility and non-deceptiveness have simple conceptual cores. I hope I'm wrong.

Well, if you worry that these properties don't have a simple conceptual core, maybe you can do the trick where you try to formalize a subset of them that does have a small conceptual core. That's basically Evan's move on myopia, as an easier-to-study subset of non-deceptiveness.

If I try to rephrase it in my words, your proposal looks like a way to go from partial deconfusion (in the form of an extensive definition, a list of examples of what you want) to full deconfusion (an actual program with the property that you want) through brute force search.

Stated like that, it looks really cool. I wonder if you already need an AGI to do the search with a reasonable amount of compute. In that case, the worry might be that you have to deconfuse what you want to deconfuse before being able to apply this technique, which would make it useless.

Still, I will add this sort of thought experiment to my bag of tools. It's a pretty good argument for extensive definition in a way.
