Stuart_Armstrong

Sequences

Anthropic Decision Theory
Subagents and impact measures
If I were a well-intentioned AI...

Comments

The reverse Goodhart problem

Cheers, these are useful classifications.

The reverse Goodhart problem

Almost equally hard to define: V' = 2U - V, so beyond V you just need to define U, which, by assumption, is easy.
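
Spelling that out (my own elaboration, using the post's symbols: V the true values, U the simple proxy, V' its reflection):

```latex
% V' is V reflected through the proxy U, so U sits exactly halfway between them:
\begin{align*}
  V' &= 2U - V, & V &= 2U - V', & \tfrac{1}{2}(V + V') &= U.
\end{align*}
% Each of V and V' is computable from the other plus U, so their
% definitions differ by at most a description of U (simple by assumption).
```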

The reverse Goodhart problem

By Goodhart's law, this set has the property that any V in it will, with probability 1, be uncorrelated with U outside the observed domain.

If we have a collection of variables x_1, ..., x_n, and U = x_1 + ... + x_n, then U is positively correlated in practice with most V expressed simply in terms of the variables.

I've seen Goodhart's law as an observation or a fact about human society - you seem to have a mathematical version of it in mind. Is there a reference for that?
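
For what it's worth, here is one toy reading of the quoted claim - a minimal simulation of my own construction, not from any reference. A proxy U and target V that share only a common cause are well correlated where idiosyncratic noise is small, and nearly uncorrelated where it dominates:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(noise_scale, n_vars=10, n_samples=5_000):
    """Draw (proxy U, target V): one common cause drives every variable;
    U is the first variable, V is the mean of the rest, so any U-V
    correlation flows through the common cause only."""
    common = rng.normal(size=(n_samples, 1))
    noise = rng.normal(size=(n_samples, n_vars)) * noise_scale
    x = common + noise
    return x[:, 0], x[:, 1:].mean(axis=1)

# Small noise stands in for the observed domain; large noise for
# conditions far outside it, where the correlation washes out.
for scale, label in [(0.3, "observed domain"), (10.0, "far outside it")]:
    U, V = sample(scale)
    print(f"{label}: corr(U, V) = {np.corrcoef(U, V)[0, 1]:.2f}")
```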

The reverse Goodhart problem

It seems I didn't articulate my point clearly. What I was saying is that V and V' are equally hard to define, yet we all assume that true human values have a Goodhart problem (rather than a reverse Goodhart problem). This can't be because of complexity (since their complexity is equal), nor because we are maximising a proxy (since both have the same proxy).

So there is something specific about (our knowledge of) human values that causes us to expect Goodhart problems rather than reverse Goodhart problems. It's not too hard to think of plausible explanations (fragility of value, re-expressed in terms of simple underlying variables, gives results like this), but it does need explaining. And it might not always be valid (eg if we used different underlying variables, such as the smooth-mins of the ones we previously used, then fragility of value and Goodhart effects are much weaker), so we may need to worry about them less in some circumstances.
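
A minimal numerical sketch of the fragility-of-value explanation, under assumptions I am adding purely for illustration (true value V = min of the underlying variables, proxy U = their sum, with a smooth-min standing in for the alternative variables mentioned above):

```python
import numpy as np

def smooth_min(x, beta):
    """Soft minimum via log-sum-exp; approaches min(x) as beta grows."""
    x = np.asarray(x, dtype=float)
    return -np.log(np.mean(np.exp(-beta * x))) / beta

# A fixed budget spread across n underlying variables. The proxy
# U = sum(x) cannot tell allocations apart, so a U-maximiser may
# land on an extreme corner of the feasible set.
n, budget = 4, 4.0
balanced = np.full(n, budget / n)   # the allocation humans would endorse
extreme = np.zeros(n)
extreme[0] = budget                 # one proxy-optimal corner

for name, x in [("balanced", balanced), ("extreme (proxy-optimal)", extreme)]:
    print(f"{name:24s} U = {x.sum():.1f}   "
          f"fragile V = min(x) = {x.min():.2f}   "
          f"smoothed V = {smooth_min(x, beta=0.5):.2f}")
```

The fragile V collapses to zero while the proxy is unchanged (a Goodhart problem); the smoothed variant only degrades gracefully, matching the remark that smooth-min variables weaken the effect.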

Power dynamics as a blind spot or blurry spot in our collective world-modeling, especially around AI

Thanks for writing this.

For myself, I know that power dynamics are important, but I've chosen to specialise narrowly on "solve the technical alignment problem for a single entity", and to leave those multi-agent concerns to others (eg the GovAI part of the FHI), except when they ask for advice.

The reverse Goodhart problem

V and V' are symmetric; indeed, you can define V as 2U - V'. Given U, each is exactly as well defined as the other.

The reverse Goodhart problem

The idea that maximising the proxy will inevitably end up reducing the true utility seems a strong implicit part of Goodharting, as the term is used in practice.

After all, if the deviation is upwards, Goodharting is far less of a problem. It's "suboptimal improvement" rather than "inevitable disaster".

SIA is basically just Bayesian updating on existence

SIA is the Bayesian update on learning of your existence (ie if they were always going to ask whether dadadarren existed, and get a yes or no answer). The other effects come from issues like "how did they learn of your existence, and what else could they have learnt instead?" This often does change the impact of learning facts, but that's not specifically an anthropics problem.
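
To illustrate with the standard toy case (my own worked example): two possible people, A and B, and a fair coin; heads creates one of them at random, tails creates both. Conditioning on "A exists" as an ordinary Bayesian update recovers exactly the SIA answer:

```python
from fractions import Fraction

# Worlds: heads -> one of two possible people created at random;
#         tails -> both created.
# The question "does A exist?" was always going to be asked, so
# conditioning on its answer is a legitimate Bayesian update.
prior = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}
p_A_exists = {"heads": Fraction(1, 2), "tails": Fraction(1)}

evidence = sum(prior[w] * p_A_exists[w] for w in prior)
posterior = {w: prior[w] * p_A_exists[w] / evidence for w in prior}

print(posterior)  # {'heads': Fraction(1, 3), 'tails': Fraction(2, 3)}
                  # -- the SIA answer, via plain Bayesian conditioning.
```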
