Goodhart Taxonomy: Agreement

Ben Pace

"When a measure becomes a target, it ceases to be a good measure."

-Goodhart’s Law

If you spend a while talking with someone you disagree with and end up agreeing, this is a sign you are both reasoning and communicating well - one of the primary uses of good reasoning is resolving disagreement. However, if you use agreement as your main proxy for good reasoning, some bad things might happen.

Scott Garrabrant has helpfully laid out four different models of how things can go wrong if you optimise for the proxy really hard, a phenomenon known as ‘goodharting’ (based on Goodhart’s Law that any proxy when optimised for hard, stops being a good proxy). I want to take a look at each model and see what it predicts for the real world, in the domain of agreement.

Regressional

First, you can fall prey to regressional goodharting. This is when the proxy you’re optimising for is a good measure of the thing you actually care about, plus some noise (i.e. other uncorrelated variables), and the examples that maximise the sum of these tend to be the examples with the most noise. I can think of three ways this could happen: misunderstanding, spurious correlation, and shared background models.

Misunderstanding is the simple idea that the times when you most agree with someone are disproportionately times when you misunderstood each other (e.g. were using words differently). Especially if it’s an important topic and most people disagree with you, suddenly the one person who gets you seems to be the best reasoner you know (if you’re regressional goodharting).

Spurious correlation is as I described in A Sketch of Good Communication - two people who have different AI timelines can keep providing each other with new evidence until they have the same 50th percentile date, but it may turn out that they have wholly different causal models behind those dates, and thus don’t meaningfully agree about AI x-risk. This is different from misunderstanding, because you heard the person correctly when they stated their belief.

And shared background models happens like this: You decide that the people who are good reasoners and communicators are those who agree with you a lot on complex issues after disagreeing initially. Often this heuristic ends up finding people who are good at understanding your point of view, and updating when you make good arguments. But if you look at the people who agree with you the most of these people, you’ll tend to start finding people who share a lot of background models. “Oh, you’ve got a PhD in economics too? Well obviously you can see that these two elastic goods are on the pareto frontier if there’s an efficient market. Exactly. We're so good at communicating!”
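The regressional mechanism described at the start of this section can be made concrete with a toy simulation. This is only a sketch under assumptions that are mine, not part of the argument above: 'skill' (true reasoning quality) and 'noise' (everything else that produces agreement, such as misunderstanding or shared background models) are independent standard normal variables, and observed agreement is simply their sum. The function name and parameters are hypothetical.

```python
import random
import statistics


def simulate_selection(n_people=10_000, top_fraction=0.01, seed=0):
    """Pick the people with the highest 'agreement' score and report
    how much of that score is skill and how much is noise.

    Assumption (illustrative only): agreement = skill + noise, with
    both terms drawn independently from a standard normal.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n_people):
        skill = rng.gauss(0.0, 1.0)  # the thing you actually care about
        noise = rng.gauss(0.0, 1.0)  # uncorrelated causes of agreement
        people.append((skill + noise, skill, noise))

    # Optimise hard on the proxy: keep only the top agreers.
    people.sort(reverse=True)
    n_top = max(1, int(n_people * top_fraction))
    top = people[:n_top]

    return {
        "population_noise": statistics.fmean(p[2] for p in people),
        "top_agreement": statistics.fmean(p[0] for p in top),
        "top_skill": statistics.fmean(p[1] for p in top),
        "top_noise": statistics.fmean(p[2] for p in top),
    }


result = simulate_selection()
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the average noise across everyone is close to zero, but among the top agreers it is well above zero, so their true skill is noticeably lower than their agreement score suggests. The equal weighting of skill and noise is arbitrary; the more of the proxy's variance is noise, the more the top of the list is made of noise.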

Extremal

Second, you can fall prey to extremal goodharting. This is where the peaks of your heuristic are actually falling out of an entirely different process, and have no bearing at all on the thing you cared about. Here’s some things you might actually get if you followed the heuristic ‘agrees with me’ to its extreme:

A mirror
A service sector worker whose rule is ‘the customer is always right’
Someone who trusts you a lot personally and so believes what you say is true
A partner who likes the sound of your voice and knows saying ‘yes, go on’ causes you to talk a lot
An identical copy of you (e.g. an emulated mind)

While these don’t seem like practical mistakes any of us would make, I suppose it’s a good skill to be able to know the literal maximum of the function you wrote down. It can help you to not build the wrong AGI, for example.

Adversarial

But there is one particularly common type of process that can end up being spuriously high in your proxy: our third type, adversarial goodharting. This is where someone notices that you’ve connected your proxy to a decision over a large amount of resources, thus creating in them an incentive to disguise themselves as maximising your proxy.

You’ll often incentivise the people around you to find ways to agree with you more than finding ways to successfully communicate. If you have a person who you’ve not successfully communicated with who says so, and another who is in the same state but pretends otherwise, then you’ll prefer the liar.

People who are very flexible with their beliefs (i.e. don’t really have models) and good at sounding like they agree with you will be rewarded the most. These are yes-men. They aren’t actually people who know how to update their beliefs on a fundamental level, and their qualities do not correlate with the goal of ‘good communicator’ at all.

Adversarial goodharting can happen even if humans aren’t explicitly trying to be hostile. Sure, the liars looking for power will try to agree with you more, but a perfectly well-intentioned manager goodharting on agreement will try to get more power, entirely because they observe it leads to them agreeing with people more, and that this is a sign of good reasoning.

This is most clear if you are a manager. If you’re the boss, people have to agree with you more. If you’re using agreement as your main proxy for good communication and honestly not attempting to grab power, you’ll nonetheless learn the pattern that taking power causes you to be a good communicator and reasoner. And I don’t think this is at all unlikely. I can well imagine this happening by accident at the level of social moves. “Huh, I notice when I stand up and speak forcefully, people agree with me more. This must be making me a better communicator - I’ll do this more!”

Causal

Fourthly and finally, we come to the juiciest type of goodharting in this domain, causal goodharting. This is where you have the causal model the wrong way around - you notice that basketball players are taller than other people, so you start playing basketball in an effort to get taller.

If you causal goodhart on agreement, you don’t believe that good reasoning and communication cause agreement, but the opposite. You believe that agreement causes good reasoning. And so you try to directly increase the amount of agreement in an attempt to reason better.

It seems to me that causal goodharting is best understood by the beliefs it leads to. Here are three, followed by some bulleted examples of what the beliefs can lead to.

The first belief is if a reasoning process doesn't lead to agreement, that's a bad process.

You’ll consider an extended session of discussion (e.g. two hours, two days) to be a failure if you don’t agree at the end, a success if you do, and not measure things like “I learned a bunch more about how Tom thinks about management” as positive hits or "It turned out we'd been making a basic mistake for hours" as negative hits.

The second belief is if I disagree with someone, I’m bad at reasoning.

If someone has expressed uncertainty about a question, I’ll be hesitant to express a confident opinion, because then we’ll not be in agreement and that means I’m a bad communicator.
If it happens the other way around, and you express a confident opinion after I’ve expressed uncertainty, I’ll feel an impulse to say “Well that’s certainly a reasonable opinion” (as opposed to “That seems like the wrong probability to have”) because then it sounds like we agree at least a bit. In general, when causal goodharting, people will feel uncomfortable having opinions - if you disagree with someone it’s a signal you are a bad communicator.
You’ll only have opinions either when you think the trade-off is worth it (“I see that I might look silly, but no, I actually care that we check the exhaust is not about to catch fire”) or when you have a social standing such that people will defer to you (“Actually, if you are an expert, then your opinion in this domain gets to be right and we will agree with you”) - that way you can be free from the worry of signalling you’re bad at communication and coordination.

In my own life I’ve found that treating someone as an ‘expert’ - whether it's someone treating me, or me treating someone else - lets that person express their opinions more, without obfuscation or fear. It’s a social move that helps people have opinions: “Please meet my friend Jeff, who has a PhD in / has thought very carefully about X.” If I can signal that I defer to Jeff on this topic, then the social atmosphere can make similar moves, which stops Jeff being afraid and lets him actually think.

(My friend notes that sometimes this goes the other way around, that sometimes people are much more cautious when they feel they’re being tested on knowledge they’re supposed to be an expert on. This is true; I was talking about the opposite case, which happens much more when the person is around a group of laymen without their expertise.)

The third belief is if someone isn't trying to agree, they're bad at reasoning.

I’m often in situations where, if at the end of the conversation people don’t say things like “Well you made good points, I’ll have to think about them” or “I suppose I’ll update toward your position then” they’re called ‘bad at communicating’.

(An umeshism I have used many times in the past: If you never end hangouts with friends saying “Well you explained your view many times and I asked clarifying questions but I still don’t understand your perspective” then you’re not trying hard enough to understand the world - and are instead caring too much about signalling agreement.)

If you disagree with someone who signals that they are confident in their belief and are unlikely to change their mind, you’ll consider this a sign of being bad at communication, even if they send other signals of having a good model of why you believe what you believe. Basically, people who are right and confident that they’re right, can end up looking like bad reasoners. The normal word I see used for these people is ‘overconfident’.

(It’s really weird to me how often people judge others as overconfident after having one disagreement with them. Overconfidence is surely something you can only tell about a person after observing 10s of judgements.)

Once you successfully communicate (“Oh, I understand your perspective now”) you’ll expect you also have to agree, rather than have a new perspective that’s different from your or their prior beliefs. “Well, I guess I must agree with you now...”

A general counterargument to many of these points is that all of these are genuine signs of bad reasoning or bad communication. They are more likely to be seen in worlds where you or I are bad reasoners than in worlds where we’re not, so they are Bayesian evidence. But the problem I’m pointing to is that, if your model only uses this heuristic, or if it takes it as the most important heuristic that accounts for 99% of the variance, then it will fail hard on these edge cases.

To take the last example in the scattershot list of causal goodharting, you might assign someone who is reasoning perfectly correctly, as overconfident. To jump back to regressional goodharting, you’ll put someone who just has the same background models as you at the top of your list of good reasoners.

Overall, I have many worries about the prospect of using agreement as a proxy for good reasoning. I’m not sure of the exact remedy, though one rule I often follow is: Respect people by giving them your time, not your deference. If I think someone is a good reasoner, I will spend hours or days trying to understand the gears of their model, and disagreeing with them fiercely if we’re in person. Then at the end, after learning as much as I can, I’ll use whatever moving-parts model I eventually understand, using all the evidence I’ve learned from them and from other sources. But I won’t just repeat something because someone said it.

My thanks to Mahmoud Ghanem for reading drafts.

A few thoughts on using the Goodhart Taxonomy

I wrote this post in part with the hope that others would realise you can take the abstraction Scott offered and apply it to concrete domains of interest. I'd be excited to read posts using this on other domains, like Goodhart Taxonomy: Bayesianism or Goodhart Taxonomy: Management, to make up two examples. Anyway, it's also good to report what it was like on the concrete level, so here are some parts I found hard.

On Power

I have no idea which type of goodharting power really is. Obviously adversarial is where if you have power, this causes folks to deceive you, but there's a less explicit type whereby people aren't explicitly hostile and yet you observe that increasing power causes them to submit to you more in arguments. The world where others are explicitly optimising on you feels adversarial, but the world where you notice that acting higher status leads to people agreeing with you more feels regressional, because you're grabbing power but not for adversarial reasons, just because you found a noise variable. But I'm really not sure.

I originally picked adversarial, then causal, then regressional, and finally back to adversarial. Does someone want to argue to me that it’s extremal?

On Regressional vs Extremal

(Note that I wrote this when power was in regressional. I'll leave it here in case it's useful to anyone.)

It can seem at first like the difference between regressional goodharting and extremal goodharting is merely quantitative, not qualitative. Someone trusting you a lot is maybe part of the noise under regressional goodhart, and is extremal goodhart only when it overwhelms.

I think the key distinction here is that regressional goodhart is over the normal range of experience, and extremal is only noticeable outside of that range.

I feel like I can map them all between the two types without difficulty. In this list, Regressional is marked 'R' and Extremal is marked 'E'.

R: Shared background models -> E: An identical copy of you
R: Power -> E: A dictator of a small country

It seems clear to me that, if you said "I'm going to think about who agrees with me the most on the important issues" it would be useful to say "Take care not to regressional goodhart on that. Don't over-weight agreement with people who already have a lot of similar cognitive tools to you, or people you have power over", but that saying "Be careful not to accidentally become a dictator of a small country or create an identical copy of yourself" would be unhelpful. Yet they are the same type of error, just with a whopping quantitative distinction between them.

I admit that I'm claiming a distinction based in subjective qualities (i.e. the fact we are humans, not any other facts about math), but that doesn't seem reason enough to do away with the distinction in the language.

On Causal vs Regressional

I feel that many of the examples in causal could move back to regressional, but I'm not sure why that is. Trying to say things more like 'I agree with you', or discounting processes that take longer to reach agreement as 'bad processes' when the questions were just deeper and more nuanced, both feel like optimising for noise variables. /shrug

Conclusion

Overall, I didn't come into the post knowing what argument I was going to make, just with the intent to see what each model implied. Extremal ended up being the most surprising (and the least useful, on average, though 'yes-men' was a good thing to notice). Causal ended up being the most useful - for ages it was just a long list of examples, then I noticed that they split into three groups. I'm pretty happy with the result; it clarified my thinking on the topic a bunch.

I understood Regressional vs Extremal like this:

A signal is a victim of regressional Goodhart if it's a good indicator of the thing you care about but stops working the moment you start optimizing for it. For example, an empty email inbox is a decent signal that I've taken care of the things I need to do, but if I do the obvious optimization and set up a filter to delete all incoming email...
A signal is a victim of extremal Goodhart if optimizing for it works well until you get to very high values, at which point it suddenly stops working. For example, if I increased the fraction of time I spend exercising, I expect I would keep getting healthier for a while. But I expect that if I were exercising so much that I needed to take stimulants to move time from sleep to exercise, I'd be hurting myself.

So I think those are mutually exclusive (optimizing a signal can't both fail immediately and keep working for a while).

Given we are not perfect Bayesians and cannot necessarily expect to agree enough via Aumann to classify our belief as "agreeing", is there much reason to think optimizing on agreement even makes much sense? That is, these well thought out lines of reasoning for how optimizing on agreement can fail aside, it seems we should anyway not expect agreement to be a strong measure of reasoning ability.

Empirically I very commonly see people around me (and myself) trying to agree, which is one of the reasons I wrote this post.

And that's not obviously wrong. I'm pro sharing models and info more than trying to nail down perfect agreement on a variable, but it seems fairly sensible to me to use disagreement as a sign of a place where model-sharing has high returns on investment. The reason this post is necessary is not because optimising for agreement is terrible, but because it's useful. Yet I think that many people fall into the failure modes of causal goodharting.