You want to make things better, or to live in a world that makes things better. But how do you go about actually doing that? There is truth to Drucker's maxim:

"If you can't measure it, you can't improve it."

But you have also heard of Goodhart's law:

"When a measure becomes a target, it ceases to be a good measure."

But you have measures. How can you use them without them becoming a target?

Delegation

If you are delegating work, this becomes even trickier. You can't give the people you are delegating to the measures you use (unless you trust them to use them properly and not turn them into targets). So you give people rules to follow or goals to meet that aren't the measures.

Then evaluate them on how well they follow the rules or meet the goals (not on how well they move the measure), and iterate on the rules and goals. If you have people who follow the rules and achieve the goals *AND* you iterate on those things, you can actually effect change in the world. If you don't iterate, you'll just end up optimising whatever the first set of rules points at, rather than the thing you actually want to achieve.

You also probably want to give people some slack with the rules/goals, so that they have spare energy to look at the world and figure out what they think is best. If people are run ragged trying to meet a goal just to survive, all other considerations fall by the wayside.

Fixing the goals

During the iteration of the rules, how do you avoid Goodhart's law yourself? You want people to lead good, happy lives, but you don't want to end up secretly giving people drugs to make them happy, because that would be short-circuiting things. You also don't want to kill everyone to reduce long-term suffering.

So instead you build yourself a model of what Good looks like. This model is important: it allows you to decouple your measure from your target. An example might be, "It is Good for People to be wealthy as it allows them to do more things". You use your model to generate a target, in this case "make people wealthier". Then you alter the rules and goals to hit that target.

What happens if your model is wrong? Let us say some people are becoming unhappier as they become wealthier, due to pollution causing health issues.

This is where the measure comes in. You use your measures to see if your model is correct. If people aren't becoming happier, less stressed, healthier, etc., as they become wealthier, you change and update your model. In this case they aren't, so you improve your model, find new targets, and therefore give new goals and rules to the people you delegate to.

Any anger at poor performance on your measure should not be taken out on the people you have delegated to (unless they didn't do what you said); it should be taken out on your model of the world, which thought it was a good idea to tell people to do that thing.

You can also improve your model with small-scale studies and by trying to understand the inner workings of humans. This gives you a quicker feedback loop than changing society and seeing what happens.
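To make the shape of that loop concrete, here is a minimal toy sketch in Python. Everything in it is an invented stand-in: the wealth/pollution/happiness numbers, the single rule, and the model update are illustrative assumptions, not a claim about how real policy works.

```python
# Toy sketch of the loop described above: a model of "Good" generates a target,
# the target generates rules for delegates, and the measures are used only to
# check the model, never handed out as targets. All numbers are invented.

def measures(world):
    # The measures: imperfect readings of how people are actually doing.
    return {"happiness": world["happiness"], "health": world["health"]}

def model_predicts(model, world):
    # The model "wealth makes people happier and healthier" predicts the measures.
    return {"happiness": model["happiness_per_wealth"] * world["wealth"],
            "health": model["health_per_wealth"] * world["wealth"]}

def apply_rules(world, target):
    # Delegates follow rules derived from the target; here the only rule is
    # "increase wealth", which (in this toy world) also causes pollution.
    if target == "make people wealthier":
        world["wealth"] += 1.0
        world["pollution"] += 0.5
    world["health"] = max(0.0, 1.0 - 0.3 * world["pollution"])
    world["happiness"] = 0.5 * world["wealth"] * world["health"]
    return world

def update_model(model, predicted, observed):
    # Anger at a disappointing measure updates the model,
    # not the delegates who followed the rules.
    error = sum(abs(predicted[k] - observed[k]) for k in observed)
    if error > 0.5:
        model["happiness_per_wealth"] *= 0.8  # "wealth buys less happiness than we thought"
    return model

world = {"wealth": 1.0, "pollution": 0.0, "happiness": 0.5, "health": 1.0}
model = {"happiness_per_wealth": 0.6, "health_per_wealth": 0.4}

for step in range(5):
    target = "make people wealthier"       # target generated from the model
    world = apply_rules(world, target)
    observed = measures(world)
    predicted = model_predicts(model, world)
    model = update_model(model, predicted, observed)
    print(step, round(observed["happiness"], 2), round(model["happiness_per_wealth"], 2))
```

The structural point is the last two lines of the loop: the measures feed back into the model, and only rules derived from the model are handed to anyone else.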

This is roughly a description of where we are as a society currently, although we suck at the last part: updating our models and changing our targets. We tend to get stuck on the first set of targets we find, such as GDP and IQ, publishing 'high impact' papers, or reducing waiting times at doctors. We don't use measures to say "hey, something's wrong, let's change things". The best we have is democracy, but that is a very blunt instrument and has its own incentive-structure problems.

Comments

Boiling the post down to an aphorism I would say:

Improve models using the measure, use the model to update targets.

This misalignment of model seems like it drives a hidden agency cost - where the agents/people you've delegated to /think/ their interests are aligned with yours but are actually not... and this is hard to account for. Do you know if there is a name for that type of agency cost, driven by the problem you describe?

Not that I know of! The principal-agent problem is when the agent knows its interests aren't aligned, but I don't know of one for a scenario where there is just confusion about how aligned an agent is with the principal.

(I'm posting this in response to this article rather than the follow-up because I think the main argument is here and the follow-up is mostly intended as clarification. If you would prefer discussion to take place there, let me know.)

It feels to me as if you start out by describing one problem (I'll call it the Hard Problem), and then describe a process that encounters an easier problem (I'll call it the Easy Problem, though actually it's also pretty hard), and say how your process addresses the Easy Problem ... but you never get round to showing that it does anything about the Hard Problem, and in fact I think it can't.

The Easy Problem is that when trying to make X happen, you may aim at Y and find to your chagrin that achieving Y isn't enough to achieve X. The Hard Problem is that you may try to make X happen but find to your chagrin that achieving X isn't enough to achieve *the mysterious unknown thing that you really wanted*.

So, in the first paragraph under the heading "Fixing the goals", you give an excellent example of the Hard Problem. X is "make people happy", but if we simply try to achieve X then we may end up making people happy by drugging them into oblivion and that isn't what we really want; but unfortunately we don't know exactly what we really want, which is why we come out with approximations like "make people happy".

You propose a process in which we have a model, and that leads to intermediate goals, but then it might turn out that achieving those goals isn't good enough. This is the Easy Problem, and you give a specific example: we want people to be happy and healthy and unstressed (this is X), we think making them wealthy might achieve that (making them wealthy is Y), but alas it turns out that that isn't a good enough proxy.

So then your process says to adjust the model (it's a bit unclear on how, but let's say it's something like "make it more accurate until it accurately predicts the bad consequences we encountered") and try again. OK, fine, but this isn't changing X and it won't help at all if we are facing the Hard Problem and X doesn't correctly capture what we really care about. And nothing in this process ever changes X; the process is effectively a rather roundabout way of trying to optimize for X, and if it solves the Hard Problem I think it can only be by coincidence.

Now, what I've done here is to make up two problem statements, neither of which appears in your article, and complain that your proposed process doesn't address one of them. Perhaps it was never meant to. Well, what you do explicitly say you're trying to do is to get around Goodhart's law: "when a measure becomes a target it ceases to be a good measure". I regret to say that I don't see that your proposal does that, either. It seems to me that you have two sets of targets, the "outer" X and the "inner" Y; that in the "inner" optimization process you are using Y as both measures and targets, and this does lead to the measures becoming less effective, and from this perspective your prescription amounts to "try to notice when your measures are going bad, and then improve things so you get better measures" -- which is good advice but not exactly novel. What then of your "outer" targets X? You are still using them as measures -- because you adjust your "inner" model in the light of how well the outcomes you get by using them match X! So unless X really truly perfectly matches what you care about you will get the exact same Goodhart effect. It'll just happen on a longer timescale. (And if X does really truly perfectly match, you don't have a problem in the first place.)

Once again, this is because your process is simply a somewhat roundabout way of trying to achieve X, and Goodhart's law is very general: however you try to optimize for X, it's likely that you'll find it diverging from what you really care about.

So then your process says to adjust the model (it's a bit unclear on how, but let's say it's something like "make it more accurate until it accurately predicts the bad consequences we encountered") and try again. OK, fine, but this isn't changing X and it won't help at all if we are facing the Hard Problem and X doesn't correctly capture what we really care about.

Sorry I did not see this earlier. My notifications aren't working.

I think the Model can include X, if you allow the Model to include questions about what goal you should be following.

Let us take a problem whose end goal we know, put an agent in the position of not knowing that end goal and only getting imperfect feedback, and see how the system I describe could be used.

So let's say we have a chess agent that gets rewarded for individual actions based upon what a human judge thinks of each action. The real goal is to win the chess game, not to maximise the reward, but the agent doesn't know that.

So losing a queen might be a bad action in one context, but a good one if it allows checkmate sooner.

Let's say the first Model it has is:
"My goal is to not lose chess pieces"

It compares what the Model predicts about the measure with what the measure actually is; if they are out of whack, it updates the Model.

The first discrepancy it finds is when it gets rewarded well after it accidentally makes a strategic pawn sacrifice, so the next Model is:

"My goal is to not lose low value chess pieces, if if it saves a high value chess piece"

The next update happens when it accidentally sacrifices a high value piece to save a strategically valuable pawn. It then finds enlightenment.

"My goal is to Checkmate the other player".

However, this doesn't allow it to predict the values of the measure perfectly. As the human is fallible, they might assign a bad utility to a specific move.

So it needs to refine its Model to:

"My goal is to Checkmate the other player, but the measure that is used is fallible in these ways: it can't see so far ahead or as well as me".

Importantly, this means it would still have the goal to checkmate the other player.
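As a toy sketch of that story (the moves, rewards, and ladder of Models are all invented for illustration; nothing here is a real chess engine or judge), the update rule looks roughly like this:

```python
# Toy sketch of the chess-agent story above: the agent holds a Model of what the
# per-move reward is "really" pointing at, predicts the judge's reward with it,
# and replaces the Model when the predictions keep missing.

# Each observation: (description of the move, reward given by the human judge)
observations = [
    ("keeps all pieces", 1.0),
    ("sacrifices pawn to win a bishop", 1.0),    # contradicts "never lose pieces"
    ("sacrifices queen for a forced mate", 1.0), # contradicts "only sacrifice low-value pieces"
    ("obvious winning move judge missed", -0.5), # judge fallibility
]

# Ladder of Models from the comment, each predicting the judge's reward for a move.
def never_lose_pieces(move):
    return 1.0 if "sacrifices" not in move else -1.0

def sacrifice_only_low_value(move):
    return -1.0 if "queen" in move else 1.0

def checkmate_with_fallible_judge(move):
    # Goal: checkmate; also models that the judge sometimes misjudges good moves.
    return -0.5 if "judge missed" in move else 1.0

models = [never_lose_pieces, sacrifice_only_low_value, checkmate_with_fallible_judge]

current = 0
for move, reward in observations:
    predicted = models[current](move)
    if abs(predicted - reward) > 0.1 and current + 1 < len(models):
        current += 1  # the measure disagreed with the Model, so revise the Model
print("final model:", models[current].__name__)
```

The final Model keeps "checkmate the other player" as the goal while also modelling the judge's fallibility, which is the point of the last refinement above.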

This can be seen as arguing in a similar vein to the arguments for goal uncertainty.

I don't understand how your hypothetical chess-playing agent is supposed to work out that (1) when the model says "maximize value of pieces" but it gets rewarded for something else, that means that the model needs revising, but (2) when the model says "checkmate opponent" but it gets rewarded for something else, that means the rewards are being allocated wrongly.

You are right, it could end up with the goal:
"I must play chess as badly as a human"

Both it and, "My goal is to Checkmate the other player, but the measure that is used is fallible in these ways: it can't see so far ahead or as well as me", would minimise the expected error of the model predicted utility vs the utility given by the measure.

The benefit of what I am suggesting is that the system can entertain ideas that the measure is wrong.

So it sucks a little less than one that purely tries to optimise the measure.

You can only go so far with incentive structures alone. To improve on that you need communication. Luckily, we can communicate with any systems that we make.

We don't have that when trying to deal with the hard problems; we can't talk to our genes and ask them what they were thinking when they made us like freedom or dislike suffering. But it seems like a good guess that it has something to do with survival/genetic propagation.

What, then, is the system trying to do? Not purely trying to optimize the measure, OK. But what instead? I understand that it is (alternately, or even concurrently) trying to optimize goals suggested by its model and adjusting its model -- but what, exactly, is it trying to adjust its model to achieve? What's the relationship between its goals and the reward signal it's receiving?

It feels like you're saying "A naive measure-optimizing system will do X, which is bad; let's make a system that does Y instead" but I don't see how your very partial description of the system actually leads to it doing Y instead of X, and it seems possible that all the stuff that would have that consequence is in the bits you haven't described.

What, exactly, is it trying to adjust its model to achieve?

Reduce the error between the utility predicted by the model and the utility given by a measure.
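Written out as a formula (my notation, not from the original exchange), that objective is roughly:

$$\min_{\text{Model}} \; \mathbb{E}_{a}\!\left[\big(U_{\text{Model}}(a) - U_{\text{measure}}(a)\big)^{2}\right]$$

where $a$ ranges over actions, $U_{\text{Model}}(a)$ is the utility the Model predicts for an action, and $U_{\text{measure}}(a)$ is the utility the measure actually reports.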

What's the relationship between its goals and the reward signal it's receiving?

In a real world system: Very complicated! I think for a useful system the relationship will have to be comparable to the relationship between human goals and the reward signals we receive.

I'm trying to say: "A naive measure-optimizing system will do X, which is bad; let's explore a system that could possibly not do X".

It is a small step. But a worthwhile one I think.

If your system is (1) trying to achieve goals suggested by its model and (2) trying to reduce the difference between the model's predictions and some measure, then it is optimizing for that measure, just in a roundabout way, and I don't see what will make it any less subject to Goodhart's law than another system trying to optimize for that same measure.

(That doesn't mean I think the overall structure you describe is a bad one. In fact, I think it's a more or less inevitable one. But I don't see that it does what you want it to.)

If your system is (1) trying to achieve goals suggested by its model and (2) trying to reduce the difference between the model's predictions and some measure, then it is optimizing for that measure,

Only if the model creates goals that optimize for the measure, which it doesn't need to do!

Consider a human's choice between different snacks: a carrot, or the sweet sweet dopamine hit of a sugary, fatty chocolate biscuit. The dopamine here is the measure. If the model can predict that eating carrots will not feel great, but will be better at hitting the thing that the measure is actually pointing at, say survival/health, it might decide on the strategy of picking the carrot. It just has to correctly predict that it won't get much positive feedback from the dopamine measure for it, and that it won't be penalised for it.
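A tiny sketch of that choice (the numbers and names are invented for illustration, not a claim about how dopamine actually works):

```python
# Toy sketch of the snack example: the dopamine measure scores the biscuit higher,
# but a model whose goal is "health" (its guess at what the measure is pointing at)
# picks the carrot anyway.

dopamine = {"biscuit": 0.9, "carrot": 0.2}            # the measure
predicted_dopamine = {"biscuit": 0.9, "carrot": 0.2}  # the model's prediction of the measure
health_value = {"biscuit": -0.3, "carrot": 0.7}       # the model's guess at the underlying goal

def choose(options):
    # A pure measure-optimiser would take the max over dopamine; this system
    # optimises its modelled goal instead.
    return max(options, key=lambda o: health_value[o])

def model_needs_update(option):
    # Because the model already predicts the carrot's low dopamine, the low reading
    # is not a surprise and does not count as evidence that the model is wrong.
    return abs(predicted_dopamine[option] - dopamine[option]) > 0.1

snack = choose(["biscuit", "carrot"])
print(snack, model_needs_update(snack))  # -> carrot False
```

The measure is still consulted, but only as evidence about the model, not as the thing being maximised.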

I'm not saying that this is a silver bullet or will solve all our problems.

So are you assuming that we already know "the thing that the measure is actually pointing at"? Because it seems like that, rather than anything to do with the structure of models and measures and so forth, is what's helpful here.

So are you assuming that we already know "the thing that the measure is actually pointing at"?

Nope, I'm assuming that you want to be able to know what the measure is actually pointing at. To do so you need an architecture that can support that type of idea. It may be wrong, but I want the chance that it will be correct.

With dopamine for sugary things, we started our lives without knowing what the measure is actually pointing at, and we managed to get to a state where we think we know what it is pointing at. This would have been impossible if we did not have a system capable of believing it knew better than the measure.

Edit to add: Other ways we could be wrong about what the dopamine measure is pointing to, but still in a useful way, are things like: sweet things are of the devil and you should not eat of them, they are delicious but will destroy your immortal soul; carrots are virtuous but taste bad. This gives the same predictions and actions but is wrong. The system should be able to support this type of thing as well.