In my post on incentive structures I gave a potted summary of how to do incentive structures better when what you are trying to achieve is ill defined:

Improve Models using Measures, use the Model to update Targets.

I would add,

Try to hit Targets. Avoid Tests.

In this post I will recap what I mean by the above, then give an abridged history of the field of AI, showing how it has failed to follow this maxim well and what following it might have looked like.

If we want to create safe AI, we need good incentive structures for the people researching it; we need to get good at this.

Recap

A Model is your causal view of the thing you are trying to achieve. A Measure is something you can apply to your system to let you know whether you are going in the right direction. Targets are things you are trying to do in the world. A Test, in this formalism, is something that is a Measure and a Target all in one: you don't need a Model, because if you do well at the Test, you are doing well.

One of the key differences between a Measure and a Test: if you do better than your Model predicts on a Measure, you should change your Model (and maybe change your Target). If you do better than you expect on a Test, there is no need to change anything.

Measure 1: The Turing test

The most famous Measure in AI is the Turing test.

[it] is a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation is a machine, and all participants would be separated from one another.

From this we got the Loebner prize. This in turn has led to a profusion of chatbots like this one. It might entertain the judges and give them a moment's pause as they try to figure out whether it is a human or not, but you can't get much real work out of it. It can't do maths, run a business, or teach kids. It is a product of treating the Measure as a Test.

However, it is still a good Measure: if something really passed it, we would expect it to be intelligent.

Measure 2: ImageNet

This is not supposed to be a test of full intelligence, but of how well an algorithm detects objects in images. While it is producing algorithms that do better and better at this all the time, it seems unlikely that they are getting close to how humans do object recognition.

For example, the algorithms aren't being designed to take hints about what is in an image to help them locate the object, because that is not what they are being tested on. Why might you want to do so? Because there are lots of good examples of humans being able to process these hints, and it seems like a useful thing to do. You can experience the power of "hint taking" first hand if you read this slatestarcodex article.

If your Measure and your Target are static, you are just trying to make a slightly better algorithm that does better at these Tests. You are not going to change your Model of image recognition to incorporate other types of data, like "hints".

Breaking the Test cycle

So we are, in general, teaching our systems to the Tests. And when the Tests are iterated upon (because they don't capture what people actually want), the people trying to iterate on them get accused of moving the goalposts. There is even a name for this: the AI effect.

If we are to get safe general AI or IA, we need to break this cycle.
We need a Model of what intelligence is and to iterate on that, so that we can get different Targets, not naively try to meet the Tests better or add more and more Tests.

I obviously think separating Measures and Targets will help us make safe AI. So how can we

Improve Models using the Measure, use the Model to update Targets. Hit Targets, Avoid Tests.

The following is an illustrative alternative history of how the development of AI could have gone if we had better separated our Targets from our Measures. It is described as a series of rounds. Each round has an assumed Model, a Target to try to create based on that Model, the Measure (in this case the Turing test, though you could have more Measures), and the result of having performed that measurement. Each round also has a failure mode to avoid.

Round 1:

Model: For simplicity's sake, let us start with a model of human conversation as a fixed input/output mapping.

Target: Create a mapping of input to output that is nice to talk to.
Measure: Use the Turing test as a Measure.

Result: This isn't very good to talk to. It is the same every time, and humans aren't the same every time. Update the Model of an intelligence so that it isn't seen as a fixed mapping.

Failure mode: Create ever more elaborate mappings from input to output, including previous parts of the conversation in the input.
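Round 1's Model can be sketched in a few lines. This is a toy illustration, not any real chatbot; the replies are made up.

```python
# A toy version of Round 1's Model: conversation as a fixed
# input -> output mapping. All replies here are hypothetical.
REPLIES = {
    "hello": "Hi there!",
    "how are you?": "I'm fine, thanks.",
}

def respond(utterance):
    # The same input always produces the same output, which is
    # exactly why a judge can unmask it by repeating themselves.
    return REPLIES.get(utterance.lower().strip(), "Tell me more.")

print(respond("Hello"))  # Hi there!
print(respond("Hello"))  # identical every time
```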

Round 2

Update Model: Humans seem to learn over time. So let us assume an intelligence also learns.

Target: Create a machine learning system that tries to learn a mapping from input to output from data. Provide the system with the data from previous conversations and how well the judges liked those conversations, so that it can update its mapping from input to output.

Measure: Use the Turing test as a Measure.

Result: It was pleasant to talk to, but one judge tried to teach it simple mathematics (adding two numbers together) and it failed. There is no way our current Model of an intelligence could learn mathematics in a single conversation.

Failure Mode: Throw more and more data and processing power at it, creating ever more complex mappings.
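Round 2's Target can be sketched similarly. The conversation logs and judge ratings below are hypothetical; the point is only that the mapping is now learned from rated data rather than fixed in advance.

```python
# A toy version of Round 2's Target: learn the input -> output
# mapping from judge-rated conversation logs (all data hypothetical).
from collections import defaultdict

# (utterance, reply, judge_rating) triples from past conversations.
logs = [
    ("hello", "Hi there!", 1.0),
    ("hello", "Segmentation fault", -1.0),
    ("how are you?", "I'm fine, thanks.", 0.8),
]

ratings = defaultdict(dict)
for utterance, reply, rating in logs:
    # Remember the judges' rating for each (utterance, reply) pair.
    ratings[utterance][reply] = rating

def respond(utterance):
    candidates = ratings.get(utterance.lower().strip())
    if not candidates:
        return "Tell me more."
    # Choose the reply the judges rated highest so far.
    return max(candidates, key=candidates.get)

print(respond("Hello"))  # Hi there!
```

Note the failure mode is visible here too: no amount of extra rated data lets this system learn arithmetic within a single conversation.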

Round 3

Update Model: Humans seem to be able to treat natural language as programs to be interpreted and compiled. They learn language using large amounts of data, but they also use language to help learn other bits of language.

Target: Create a system that can discern good programs from bad ones, then put programs in it that try to compile human language into novel programs. Also have programs that search for novel patterns in the input language and try to hook them up to code generation.

Measure: Use the Turing test as a Measure.

Result:??

Conclusion

I hope I have shown you how the maxim might be useful for thinking about incentive structures for solving a complex problem where we don't quite understand what we are aiming for. While this is just my initial Model of what we should do about incentive structures for human researchers, it is very important to get right (as it could be used inside an AI, as well as to guide how we build it).
So to move forward on the intelligence problem we need to create the best Model of intelligence we can, so that we can find Targets. I think I have one (Round 3), but I would love to get more input into it.

It would also be interesting to think about when having a Model would be a bad idea and it would be better to just use Tests. There are probably some circumstances.

Comments

I think it may be enlightening to consider what happens if you get super-lucky and your first Model is perfectly accurate. I think what happens is that you end up solving the same problem as you would without the process proposed here and in the previous article, and running into the same problems.

It's possible that doing everything via an inaccurate model may help (by reducing overfitting, in some sense) but I don't think it's obvious that it does (since it will increase underfitting, in the same handwavy sense) and my guess is that it will help and hurt in different cases, and that it will be difficult to guess ahead of time which it will be.

Overfitting is the same thing as Goodhart's law. If using an inaccurate model helps, it's because not trying too hard is necessary to avoid goodharting yourself.

That's ... pretty much what I thought I was saying. Was I unclear, or have I misunderstood you somehow?

Ah, I thought "in some sense" meant you weren't sure if you were using the metaphor correctly.

I do not think that overfitting is "the same thing as" Goodhart's law. I think Goodhart's law is more broad. One of the mechanisms by which it works is similar to overfitting, but there is a lot more to Goodhart's law than overfitting. In particular, I think the standard examples of Goodhart's law include adversaries that are trying to break your proxy in a way that does not show up in overfitting. See also https://agentfoundations.org/item?id=1621

If your model is perfectly accurate, yep this process is not needed.

I'd argue, though, that if you had a perfect Model and a Model of the Measure, you wouldn't need the real Measure either. The Measure is just something to help you search. You would just create the good life or an AI or whatever hard-to-define thing you have a definition for.

My point isn't that if the model is perfectly accurate the process isn't needed.

My point is that if the model is perfectly accurate the process doesn't work (at least in the difficult cases) and that the process involves trying to improve the model all the time, so it's liable to push itself into a situation where it doesn't work.

In other words, I don't see that this process is really effective in saving you from Goodhart's law.

Someone reached out and noted the Loebner prize link is a 404 now, and suggested this link. (I'm not sure if this is accurate, haven't looked into it, but passing it along.)

https://www.aresearchguide.com/the-loebner-prize-in-artificial-intelligence.html

For example, the algorithms aren't being designed to take hints about what is in an image to help them locate the object, because that is not what they are being tested on.

Yes they are, yes it is. Convnets learn to use context very heavily. In general, machine learning is entirely focused on measures.

Maybe I should have expanded on what I mean by a hint; I don't think I was clear. It is not the raw context, it is the ability to be spoilered by words. A hint is non-visual information about the picture which allows you to locate the object in question.

Does that make things clearer?

In general, machine learning is entirely focused on measures.

But what stops them from becoming Targets, per Goodhart's law?

http://www.aclweb.org/anthology/W16-3204

But what stops them from becoming Targets, per Goodhart's law?

This is unavoidable. If you optimize something, you're optimizing the thing. Your post suggesting a separation is an attempt at describing something real, but you misunderstood why there's a separation. You only have one true value function, in the low-level circuitry in your head, and everything else is proxies; so you have to check things against your value function. By saying "give your subordinates goals", you are giving them proxies to optimize, which will be goodharted if your subordinates are competent enough. You need a white-box view of what's happening in order to explore and improve the goals you're giving, until you can find ones that accurately represent your internal values.

In machine learning, this just happens by training a model, trying it on a test set, and giving it a new training objective if it does badly on the thing you actually care about. The training objective is just a proxy, and you check it against a test set to ensure that if it is over-optimizing your proxy (also known as overfitting) you'll notice the discrepancy. In value function learning, you'd still do this: you'd have some out-of-sample ethical problems in your test set, and you'd see how your AI does on them. This would be one of the ingredients in making a safe AI based on ML. See "Concrete Problems in AI Safety", Amodei et al. 2016.
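That train/test check can be sketched with made-up data. Everything here (the sine target, the noise level, the polynomial degrees) is invented for illustration; the point is only that the training objective is a proxy, and the held-out set exposes over-optimization of it.

```python
# Minimal sketch: optimising the training-set proxy too hard
# (high-degree polynomial) looks great on the proxy but the
# held-out test set reveals the discrepancy. All data made up.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 10)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial fit on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    print(degree,
          round(mse(coeffs, x_train, y_train), 4),   # proxy (train error)
          round(mse(coeffs, x_test, y_test), 4))     # held-out check
# The degree-9 fit drives the proxy to ~0 (it interpolates the
# training points exactly) while typically doing worse out of
# sample; the train/test gap is the overfitting signal.
```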

It seems to me that your post can be summarized as "one needs continuous metrics to optimize for, not boolean tests". Does that seem wrong in any way?

That is not what I was trying to say at all... Let's try maths notation.

You have a Model M, which is a function of the state of the world s1 and the action a1 that you take at time 1. The Model gives a predicted state of the world at time 2, ps2.

So ps2 = M(s1, a1)

Let's say you have a Measure U() over states of the world, which gives a utility for each state.

U(s1) = u1

You build up a model Mu of what U is, so that you don't have to hit every state to find the U of that state. So you have

Mu(s1) = pu1

A normal optimisation process just adjusts a1 until pu2 is maximised. So

f(a1) = argmax over a1 of Mu(M(s1, a1))

What I'm arguing is that on a poorly defined problem, if you get a poor u, you want to spend a lot of time adjusting M, not just adjusting a.

If your Model M is inaccurate, that is ps2 != s2, you can improve your pu2 by updating your Model to minimise the squared error between its prediction at time t (given the action taken at t) and the actual state at time t+1, st+1:

g(st) = argmin over M of (M(st, at) - st+1)^2

So in machine learning's case, your actions a are the algorithms or systems you are testing, and M is your model of what an intelligence is. If the Turing test is giving your system a bad u, and you can't easily predict real-world intelligences (e.g. humans) with your model of intelligence (that is, M(st, at) - st+1 is large), update your model of what an intelligence should be doing so that you can better predict it. This will enable you to pick a better a.
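The two loops can be sketched in code. This is a toy: the world dynamics, the utility function U, and the single model parameter k are all invented for illustration, not taken from any real system.

```python
# Toy version of the f() and g() loops above. The true dynamics,
# U, and the Model's parameter k are all hypothetical.

def world(s, a):
    """The true (unknown) dynamics: next state given state and action."""
    return s + 2.0 * a  # the 2.0 is the fact the Model must discover

def utility(s):
    """U: utility of a state, here closeness to a goal state of 10."""
    return -abs(s - 10.0)

def make_model(k):
    """M parameterised by k: predicted next state ps2 = s + k * a."""
    return lambda s, a: s + k * a

def best_action(model, s, actions):
    """f(a) = argmax over a of U(M(s, a)); Mu is just U in this toy."""
    return max(actions, key=lambda a: utility(model(s, a)))

actions = [x / 10.0 for x in range(-50, 51)]  # candidate actions
k = 1.0   # initial (wrong) Model parameter
s = 0.0   # initial state

for step in range(20):
    model = make_model(k)
    a = best_action(model, s, actions)
    predicted = model(s, a)
    s = world(s, a)
    error = s - predicted
    # The g() loop: when predictions are off, fix M rather than
    # just re-optimising the action against a wrong Model.
    if abs(error) > 1e-9 and a != 0:
        k += error / a  # one-shot correction for this linear toy

print(round(k, 6), round(s, 6))  # the Model learns k = 2 and reaches the goal state 10
```

Optimising only over a with the wrong k would keep missing the goal; the prediction error is what tells you to fix M first.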

To give an example less close to home: if you are trying to teach kids, don't just try lots of different teaching styles and see which ones produce the best results on your tests. Instead, have principled reasons for trying a particular teaching style, and try to explain why a teaching style was bad. If you can, create an explicitly bad teaching style and see whether it was as bad as you thought it would be. Once you have a good model of how teaching styles map to test results, then pick the optimal teaching style for principled reasons. Otherwise you could just give the kids the answers beforehand; that would optimise the test results.

Does that clarify the difference?

I'll try and figure out formatting this properly in a bit.

Yeah! I think we're actually saying the same things back at each other.

I was objecting to the continuous vs boolean distinction :). I'd boil the article down to:

It is more important to optimise your model of the world than to act in the world, if your model of the world is bad.

It is lucky that this is a continuous function, though.