You want to make things better, or to live in a world that makes things better. But how do you go about actually doing that? There is truth to Drucker's maxim:

"If you can't measure it, you can't improve it."

But you have also heard of Goodhart's law:

"When a measure becomes a target, it ceases to be a good measure."

But you have measures. How can you use them without them becoming a target?

Delegation

If you are delegating work, this becomes even trickier. You can't give the people you are delegating to the measures you use (unless you trust them to use them properly and not turn them into targets). So you give people rules to follow or goals to meet that aren't the measures.

Then evaluate them on how well they follow the rules or meet the goals (not on how well they move the measure), and iterate on the rules and goals. If you have people who follow the rules and achieve the goals *AND* you iterate on those things, you can actually effect change in the world. If you don't iterate, you'll just end up optimising whatever the first set of rules points at, rather than the thing you actually want to achieve.

You also probably want to give people some slack with the rules/goals, so that they have spare energy to look at the world and figure out what they think is best. If people are run ragged trying to meet a goal just to survive, all other considerations fall by the wayside.

Fixing the goals

During the iteration of the rules, how do you avoid Goodhart's law yourself? You want people to lead good, happy lives, but you don't want to end up secretly giving people drugs to make them happy, because that would be short-circuiting things. You also don't want to kill everyone to reduce long-term suffering.

So instead you build yourself a model of what Good looks like. This model is important: it allows you to decouple your measure from your target. An example might be, "It is Good for People to be wealthy as it allows them to do more things". You use your model to generate a target, in this case "make people wealthier". Then you alter the rules and goals to hit that target.

What happens if your model is wrong? Let us say some people are becoming unhappier as they become wealthier, due to pollution causing health issues.

This is where the measure comes in. You use your measures to see if your model is correct. If people aren't becoming happier, less stressed, healthier, etc., as they become wealthier, you change and update your model. In this case they aren't, so you improve your model, find new targets, and therefore give new goals and rules to the people you delegate to.

Any anger at poor performance on your measure should not be taken out on the people you have delegated to (unless they didn't do what you said); it should be taken out on your model of the world, which thought it was a good idea to tell people to do that thing.

You can also improve your model with small-scale studies and by trying to understand the inner workings of humans. This gives you a quicker feedback loop than changing society and seeing what happens.
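To make the shape of that loop concrete, here is a minimal toy sketch in Python. Everything in it is an invented stand-in: the wealth/pollution/happiness numbers, the single rule, and the model update are illustrative assumptions, not a claim about how real policy works.

```python
# Toy sketch of the loop described above: a model of "Good" generates a target,
# the target generates rules for delegates, and the measures are used only to
# check the model, never handed out as targets. All numbers are invented.

def measures(world):
    # The measures: imperfect readings of how people are actually doing.
    return {"happiness": world["happiness"], "health": world["health"]}

def model_predicts(model, world):
    # The model "wealth makes people happier and healthier" predicts the measures.
    return {"happiness": model["happiness_per_wealth"] * world["wealth"],
            "health": model["health_per_wealth"] * world["wealth"]}

def apply_rules(world, target):
    # Delegates follow rules derived from the target; here the only rule is
    # "increase wealth", which (in this toy world) also causes pollution.
    if target == "make people wealthier":
        world["wealth"] += 1.0
        world["pollution"] += 0.5
    world["health"] = max(0.0, 1.0 - 0.3 * world["pollution"])
    world["happiness"] = 0.5 * world["wealth"] * world["health"]
    return world

def update_model(model, predicted, observed):
    # Anger at a disappointing measure updates the model,
    # not the delegates who followed the rules.
    error = sum(abs(predicted[k] - observed[k]) for k in observed)
    if error > 0.5:
        model["happiness_per_wealth"] *= 0.8  # "wealth buys less happiness than we thought"
    return model

world = {"wealth": 1.0, "pollution": 0.0, "happiness": 0.5, "health": 1.0}
model = {"happiness_per_wealth": 0.6, "health_per_wealth": 0.4}

for step in range(5):
    target = "make people wealthier"       # target generated from the model
    world = apply_rules(world, target)
    observed = measures(world)
    predicted = model_predicts(model, world)
    model = update_model(model, predicted, observed)
    print(step, round(observed["happiness"], 2), round(model["happiness_per_wealth"], 2))
```

The structural point is the last two lines of the loop: the measures feed back into the model, and only rules derived from the model are handed to anyone else.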

This is roughly a description of where we are as a society currently, although we suck at the last part: updating our models and changing our targets. We tend to get stuck on the first set of targets we find, such as GDP and IQ, publishing 'high impact' papers, or reducing waiting times at doctors. We don't use measures to say "hey, something's wrong, let's change things". The best we have is democracy, but that is a very blunt instrument and has its own incentive-structure problems.

Comments

Boiling the post down to an aphorism I would say:

Improve models using the measure, use the model to update targets.

This misalignment of model seems like it drives a hidden agency cost - where the agents/people you've delegated to /think/ their interests are aligned with yours but are actually not... and this is hard to account for. Do you know if there is a name for that type of agency cost, driven by the problem you describe?

Not that I know of! The principal-agent problem is when the agent knows its interests aren't aligned, but I don't know of one for a scenario where there is just confusion about how aligned an agent is with the principal.

(I'm posting this in response to this article rather than the follow-up because I think the main argument is here and the follow-up is mostly intended as clarification. If you would prefer discussion to take place there, let me know.)

It feels to me as if you start out by describing one problem (I'll call it the Hard Problem), and then describe a process that encounters an easier problem (I'll call it the Easy Problem, though actually it's also pretty hard), and say how your process addresses the Easy Problem ... but you never get round to showing that it does anything about the Hard Problem, and in fact I think it can't.

The Easy Problem is that when trying to make X happen, you may aim at Y and find to your chagrin that achieving Y isn't enough to achieve X. The Hard Problem is that you may try to make X happen but find to your chagrin that achieving X isn't enough to achieve *the mysterious unknown thing that you really wanted*.

So, in the first paragraph under the heading "Fixing the goals", you give an excellent example of the Hard Problem. X is "make people happy", but if we simply try to achieve X then we may end up making people happy by drugging them into oblivion and that isn't what we really want; but unfortunately we don't know exactly what we really want, which is why we come out with approximations like "make people happy".

You propose a process in which we have a model, and that leads to intermediate goals, but then it might turn out that achieving those goals isn't good enough. This is the Easy Problem, and you give a specific example: we want people to be happy and healthy and unstressed (this is X), we think making them wealthy might achieve that (making them wealthy is Y), but alas it turns out that that isn't a good enough proxy.

So then your process says to adjust the model (it's a bit unclear on how, but let's say it's something like "make it more accurate until it accurately predicts the bad consequences we encountered") and try again. OK, fine, but this isn't changing X and it won't help at all if we are facing the Hard Problem and X doesn't correctly capture what we really care about. And nothing in this process ever changes X; the process is effectively a rather roundabout way of trying to optimize for X, and if it solves the Hard Problem I think it can only be by coincidence.

Now, what I've done here is to make up two problem statements, neither of which appears in your article, and complain that your proposed process doesn't address one of them. Perhaps it was never meant to. Well, what you do explicitly say you're trying to do is to get around Goodhart's law: "when a measure becomes a target it ceases to be a good measure". I regret to say that I don't see that your proposal does that, either. It seems to me that you have two sets of targets, the "outer" X and the "inner" Y; that in the "inner" optimization process you are using Y as both measures and targets, and this does lead to the measures becoming less effective, and from this perspective your prescription amounts to "try to notice when your measures are going bad, and then improve things so you get better measures" -- which is good advice but not exactly novel. What then of your "outer" targets X? You are still using them as measures -- because you adjust your "inner" model in the light of how well the outcomes you get by using them match X! So unless X really truly perfectly matches what you care about you will get the exact same Goodhart effect. It'll just happen on a longer timescale. (And if X does really truly perfectly match, you don't have a problem in the first place.)

Once again, this is because your process is simply a somewhat roundabout way of trying to achieve X, and Goodhart's law is very general: however you try to optimize for X, it's likely that you'll find it diverging from what you really care about.

So then your process says to adjust the model (it's a bit unclear on how, but let's say it's something like "make it more accurate until it accurately predicts the bad consequences we encountered") and try again. OK, fine, but this isn't changing X and it won't help at all if we are facing the Hard Problem and X doesn't correctly capture what we really care about.

Sorry I did not see this earlier. My notifications aren't working.

I think the Model can include X, if you allow the Model to include questions about what goal you should be following.

Let us take a problem whose end goal we know, put an agent in the position of not knowing that end goal and only getting imperfect feedback, and see how the system I describe could be used.

So let's say we have a chess agent that gets rewarded for individual actions based upon what a human judge thinks of each action. The real goal is to win the chess game, not to maximise the reward, but the agent doesn't know that.

So losing a queen might be a bad action in one context, but a good one if it allows checkmate sooner.

Let's say the first Model it has is:
"My goal is to not lose chess pieces"

It compares what the Model predicts about the measure with what the measure actually is; if they are out of whack, it updates the Model.

The first discrepancy it finds is when it gets rewarded well after it accidentally makes a strategic pawn sacrifice, so the next Model is:

"My goal is to not lose low value chess pieces, if if it saves a high value chess piece"

The next update happens when it accidentally sacrifices a high value piece to save a strategically valuable pawn. It then finds enlightenment.

"My goal is to Checkmate the other player".

However, this doesn't allow it to predict the values of the measure perfectly. As the human is fallible, they might assign a bad utility to a specific move.

So it needs to refine its Model to:

"My goal is to Checkmate the other player, but the measure that is used is fallible in these ways: it can't see so far ahead or as well as me".

Importantly, this means it would still have the goal to checkmate the other player.
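As a toy sketch of that story (the moves, rewards, and ladder of Models are all invented for illustration; nothing here is a real chess engine or judge), the update rule looks roughly like this:

```python
# Toy sketch of the chess-agent story above: the agent holds a Model of what the
# per-move reward is "really" pointing at, predicts the judge's reward with it,
# and replaces the Model when the predictions keep missing.

# Each observation: (description of the move, reward given by the human judge)
observations = [
    ("keeps all pieces", 1.0),
    ("sacrifices pawn to win a bishop", 1.0),    # contradicts "never lose pieces"
    ("sacrifices queen for a forced mate", 1.0), # contradicts "only sacrifice low-value pieces"
    ("obvious winning move judge missed", -0.5), # judge fallibility
]

# Ladder of Models from the comment, each predicting the judge's reward for a move.
def never_lose_pieces(move):
    return 1.0 if "sacrifices" not in move else -1.0

def sacrifice_only_low_value(move):
    return -1.0 if "queen" in move else 1.0

def checkmate_with_fallible_judge(move):
    # Goal: checkmate; also models that the judge sometimes misjudges good moves.
    return -0.5 if "judge missed" in move else 1.0

models = [never_lose_pieces, sacrifice_only_low_value, checkmate_with_fallible_judge]

current = 0
for move, reward in observations:
    predicted = models[current](move)
    if abs(predicted - reward) > 0.1 and current + 1 < len(models):
        current += 1  # the measure disagreed with the Model, so revise the Model
print("final model:", models[current].__name__)
```

The final Model keeps "checkmate the other player" as the goal while also modelling the judge's fallibility, which is the point of the last refinement above.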

This can be seen as arguing in a similar vein to the arguments for goal uncertainty.

I don't understand how your hypothetical chess-playing agent is supposed to work out that (1) when the model says "maximize value of pieces" but it gets rewarded for something else, that means that the model needs revising, but (2) when the model says "checkmate opponent" but it gets rewarded for something else, that means the rewards are being allocated wrongly.

You are right, it could end up with the goal:
"I must play chess as badly as a human"

Both it and, "My goal is to Checkmate the other player, but the measure that is used is fallible in these ways: it can't see so far ahead or as well as me", would minimise the expected error of the model predicted utility vs the utility given by the measure.

The benefit of what I am suggesting is that the system can entertain ideas that the measure is wrong.

So it sucks a little less than one that purely tries to optimise the measure.

You can only go so far with incentive structures alone. To improve on that you need communication. Luckily, we can communicate with any systems that we make.

We don't have that when trying to deal with the hard problems; we can't talk to our genes and ask them what they were thinking when they made us like freedom or dislike suffering. But it seems like a good guess that it has something to do with survival/genetic propagation.

What, then, is the system trying to do? Not purely trying to optimize the measure, OK. But what instead? I understand that it is (alternately, or even concurrently) trying to optimize goals suggested by its model and adjusting its model -- but what, exactly, is it trying to adjust its model to achieve? What's the relationship between its goals and the reward signal it's receiving?

It feels like you're saying "A naive measure-optimizing system will do X, which is bad; let's make a system that does Y instead" but I don't see how your very partial description of the system actually leads to it doing Y instead of X, and it seems possible that all the stuff that would have that consequence is in the bits you haven't described.

What, exactly, is it trying to adjust its model to achieve?

Reduce the error between the utility predicted by the model and the utility given by a measure.
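Written out as a formula (my notation, not from the original exchange), that objective is roughly:

$$\min_{\text{Model}} \; \mathbb{E}_{a}\!\left[\big(U_{\text{Model}}(a) - U_{\text{measure}}(a)\big)^{2}\right]$$

where $a$ ranges over actions, $U_{\text{Model}}(a)$ is the utility the Model predicts for an action, and $U_{\text{measure}}(a)$ is the utility the measure actually reports.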

What's the relationship between its goals and the reward signal it's receiving?

In a real world system: Very complicated! I think for a useful system the relationship will have to be comparable to the relationship between human goals and the reward signals we receive.

I'm trying to say: "A naive measure-optimizing system will do X, which is bad; let's explore a system that could possibly not do X".

It is a small step. But a worthwhile one I think.

If your system is (1) trying to achieve goals suggested by its model and (2) trying to reduce the difference between the model's predictions and some measure, then it is optimizing for that measure, just in a roundabout way, and I don't see what will make it any less subject to Goodhart's law than another system trying to optimize for that same measure.

(That doesn't mean I think the overall structure you describe is a bad one. In fact, I think it's a more or less inevitable one. But I don't see that it does what you want it to.)

If your system is (1) trying to achieve goals suggested by its model and (2) trying to reduce the difference between the model's predictions and some measure, then it is optimizing for that measure,

Only if the model creates goals that optimize for the measure, which it doesn't need to do!

Consider a human's choice between different snacks: a carrot, or the sweet sweet dopamine hit of a sugary, fatty chocolate biscuit. The dopamine here is the measure. If the model can predict that eating carrots will not feel great, but will be better at hitting the thing that the measure is actually pointing at, say survival/health, it might decide on the strategy of picking the carrot. It just has to correctly predict that it won't get much positive feedback from the dopamine measure for it, and that it won't be penalised for it.
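A tiny sketch of that choice (the numbers and names are invented for illustration, not a claim about how dopamine actually works):

```python
# Toy sketch of the snack example: the dopamine measure scores the biscuit higher,
# but a model whose goal is "health" (its guess at what the measure is pointing at)
# picks the carrot anyway.

dopamine = {"biscuit": 0.9, "carrot": 0.2}            # the measure
predicted_dopamine = {"biscuit": 0.9, "carrot": 0.2}  # the model's prediction of the measure
health_value = {"biscuit": -0.3, "carrot": 0.7}       # the model's guess at the underlying goal

def choose(options):
    # A pure measure-optimiser would take the max over dopamine; this system
    # optimises its modelled goal instead.
    return max(options, key=lambda o: health_value[o])

def model_needs_update(option):
    # Because the model already predicts the carrot's low dopamine, the low reading
    # is not a surprise and does not count as evidence that the model is wrong.
    return abs(predicted_dopamine[option] - dopamine[option]) > 0.1

snack = choose(["biscuit", "carrot"])
print(snack, model_needs_update(snack))  # -> carrot False
```

The measure is still consulted, but only as evidence about the model, not as the thing being maximised.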

I'm not saying that this is a silver bullet or will solve all our problems.

So are you assuming that we already know "the thing that the measure is actually pointing at"? Because it seems like that, rather than anything to do with the structure of models and measures and so forth, is what's helpful here.

So are you assuming that we already know "the thing that the measure is actually pointing at"?

Nope, I'm assuming that you want to be able to know what the measure is actually pointing at. To do so you need an architecture that can support that type of idea. It may be wrong, but I want the chance that it will be correct.

With dopamine for sugary things, we started our lives without knowing what the measure is actually pointing at, and we managed to get to a state where we think we know what it is pointing at. This would have been impossible if we did not have a system capable of believing it knew better than the measure.

Edit to add: Other ways we could be wrong about what the dopamine measure is pointing to, but still in a useful way, are things like: sweet things are of the devil and you should not eat of them, they are delicious but will destroy your immortal soul; carrots are virtuous but taste bad. This gives the same predictions and actions but is wrong. The system should be able to support this type of thing as well.