
Epistemic Status: I need feedback on these ideas, and I've been delaying because I'm not sure I'm on the right track. This is the product of a lot of thinking, but I'm not sure the list is complete, and I may be missing something important. (Note: This is intended to form a large part of a paper to be submitted to the journal special issue here.)

Following up on Scott Garrabrant's earlier post on Goodhart's Law and the resulting paper, I wrote a further discussion of non-adversarial Goodhart, and explicitly deferred discussion of the adversarial case. I've been working on that.

Also note that these are often reformulations or categorizations of other terms (treacherous turn, faulty reward functions, distributional shift, reward hacking, etc.). It might be good to clarify exactly which term maps to which category, but I'm unsure.

To (finally) start, here is Scott's "Quick Reference" for the initial 4 methods, which is useful for this post as well. I've partly replaced the last one with the equivalent cases from the arXiv paper.

# Quick Reference

• Regressional Goodhart - When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.
  • Model: When U is equal to V+X, where X is some noise, a point with a large U value will likely have a large V value, but also a large X value. Thus, when U is large, you can expect V to be predictably smaller than U.
  • Example: height is correlated with basketball ability, and does actually directly help, but the best player is only 6'3", and a random 7' person in their 20s would probably not be as good.
• Causal Goodhart - When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal.
  • Model: If V causes U (or if V and U are both caused by some third thing), then a correlation between V and U may be observed. However, when you intervene to increase U through some mechanism that does not involve V, you will fail to also increase V.
  • Example: someone who wishes to be taller might observe that height is correlated with basketball skill and decide to start practicing basketball.
• Extremal Goodhart - Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed.
  • Model: Patterns tend to break at simple joints. One simple subset of worlds is those worlds in which U is very large. Thus, a strong correlation between U and V observed for naturally occurring U values may not transfer to worlds in which U is very large. Further, since there may be relatively few naturally occurring worlds in which U is very large, extremely large U may coincide with small V values without breaking the statistical correlation.
  • Example: the tallest person on record, Robert Wadlow, was 8'11" (2.72m). He grew to that height because of a pituitary disorder, and he would have struggled to play basketball because he "required leg braces to walk and had little feeling in his legs and feet."
• (See below from the arXiv paper.) Adversarial Goodhart - When you optimize for a proxy, you provide an incentive for adversaries to correlate their goal with your proxy, thus destroying the correlation with your goal.
  • Model: Removed. See below.
  • Example: Removed.
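
The regressional model above (U = V + X) is easy to check numerically. The following is a minimal sketch, not part of the original reference: it assumes V and X are independent standard normal variables, and the function name and parameters are hypothetical. Selecting the top 1% of points by the proxy U should show a mean true value V well below the mean proxy value.

```python
import random
import statistics

def regressional_goodhart(n=100_000, top_fraction=0.01, seed=0):
    """Simulate U = V + X and select the points with the largest proxy U.

    Assumption (not from the post): V and X are independent standard normals.
    Returns (mean U, mean V) among the selected points.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        v = rng.gauss(0.0, 1.0)  # true goal
        x = rng.gauss(0.0, 1.0)  # noise in the proxy
        points.append((v + x, v))
    points.sort(reverse=True)  # largest proxy values first
    top = points[: int(n * top_fraction)]
    mean_u = statistics.fmean(u for u, _ in top)
    mean_v = statistics.fmean(v for _, v in top)
    return mean_u, mean_v

mean_u, mean_v = regressional_goodhart()
print(f"top 1% by proxy: mean U = {mean_u:.2f}, mean V = {mean_v:.2f}")
```

Under these equal-variance assumptions, the selected points' mean V comes out at roughly half their mean U, because the selection picks up large noise values as readily as large true values.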

From the arXiv paper: (Note - I think this is still incomplete, and focuses far too much on the Agent-Regulator framing. See below.)

• Adversarial Misalignment Goodhart - The agent applies selection pressure knowing the regulator will apply different selection pressure on the basis of the metric. The adversarial misalignment failure can occur due to the agent creating extremal Goodhart effects, or by exacerbating always-present regressional Goodhart, or due to causal intervention by the agent which changes the effect of the regulator optimization.
• Campbell’s Law - Agents select a metric knowing the choice of regulator metric. Agents can correlate their metric with the regulator’s metric, and select on their metric. This further reduces the usefulness of selection using the metric for achieving the original goal.
• Normal Cobra Effect - The regulator modifies the agent goal, usually via an incentive, to correlate it with the regulator metric. The agent then acts by changing the observed causal structure due to incompletely aligned goals in a way that creates a Goodhart effect.
• Non-Causal Cobra Effect - The regulator modifies the agent goal to make agent actions aligned with the regulator’s metric. Under selection pressure from the agent, extremal Goodhart effects occur or regressional Goodhart effects are worsened.

# New: 5 Ways Multiple Agents Ruin Everything

To fix that insufficient bullet point above, here is a list of 5 forms of optimization failures that can occur in multi-agent systems. I intend for the new sub-list to be both exhaustive and non-overlapping, but I'm not sure either is true. For obvious reasons, the list is mostly human examples, and I haven't formalized these into actual system models. (Anyone who would like to help me do so would be welcome!)

Note that the list is only discussing things that happen due to optimization failure and interactions. Also note that most examples are 2-party. There may be complex and specific 3-party or N-party failure modes that are not captured, but I can't find any.

1) (Accidental) Steering is when one agent alters the system in ways not anticipated by another agent, creating one of the above-mentioned over-optimization failures for the victim.

This is particularly worrisome when multiple agents have closely related goals, even if those goals are aligned.

Example 1.1 A system may change due to a combination of actors' otherwise benign influences, either putting the system in an extremal state or triggering a regime change.

Example 1.2 In the presence of multiple agents without coordination, manipulation of factors not already being manipulated by other agents is likely to be easier and more rewarding, potentially leading to inadvertent steering.
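
As a rough illustration of Example 1.1, here is a toy sketch in which two agents each push a shared state variable by an amount that is harmless alone, but together they cross into a regime where the goal collapses. Everything here is a hypothetical assumption for illustration: the function names, the piecewise-linear goal, the regime limit of 10, and the push sizes.

```python
def true_value(state, regime_limit=10.0):
    """Hypothetical goal: rises with the state in the normal regime,
    then falls steeply once the state passes the regime limit."""
    if state <= regime_limit:
        return state
    return regime_limit - 5.0 * (state - regime_limit)

def outcome(baseline, pushes):
    """Apply every agent's push to the shared state and evaluate the goal."""
    state = baseline + sum(pushes)
    return state, true_value(state)

baseline = 4.0
push_a, push_b = 4.0, 4.0  # each chosen assuming the other agent does nothing

for label, pushes in [("A alone", [push_a]),
                      ("B alone", [push_b]),
                      ("A and B", [push_a, push_b])]:
    state, value = outcome(baseline, pushes)
    print(f"{label}: state = {state}, goal value = {value}")
```

Each agent's model is correct in isolation: a push of 4 improves the goal from 4 to 8. The failure only appears in the combination, which is the sense in which the steering is accidental.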

2) Coordination Failure occurs when multiple agents clash despite having potentially compatible goals.

Coordination is an inherently difficult task, and can in general be considered impossible\cite{Gibbard1973}. In practice, coordination is especially difficult when the goals of other agents are incompletely known or understood. Coordination failures such as Yudkowsky's inadequate equilibria\cite{Yudkowsky2017} are stable, and coordination to escape from such an equilibrium can be problematic even when agents share goals.

Example 2.1 Conflicting instrumental goals that neither side anticipates may cause wasted resources on contention. For example, both agents are trying to do the same thing in conflicting ways.

Example 2.2 Coordination limiting overuse of public goods is only possible when conflicts are anticipated or noticed and where a reliable mechanism can be devised\cite{Ostrom1990}.
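
The claim that a bad equilibrium can be stable even when agents share goals can be sketched with a standard stag-hunt game. This is an illustrative assumption rather than a model from the post: the payoff numbers and the `pure_nash_equilibria` helper are hypothetical. Both players prefer the (stag, stag) outcome, yet (hare, hare) is also an equilibrium, because neither player gains by deviating alone.

```python
from itertools import product

# Hypothetical stag-hunt payoffs: (row player's payoff, column player's payoff).
PAYOFFS = {
    ("stag", "stag"): (4, 4),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (3, 3),
}
ACTIONS = ("stag", "hare")

def pure_nash_equilibria(payoffs, actions):
    """Return action pairs where neither player gains by deviating unilaterally."""
    equilibria = []
    for a, b in product(actions, repeat=2):
        row_payoff, col_payoff = payoffs[(a, b)]
        # The row player cannot do better by switching while the column player stays.
        row_ok = all(payoffs[(alt, b)][0] <= row_payoff for alt in actions)
        # The column player cannot do better by switching while the row player stays.
        col_ok = all(payoffs[(a, alt)][1] <= col_payoff for alt in actions)
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria

print(pure_nash_equilibria(PAYOFFS, ACTIONS))
```

With these payoffs the function finds two equilibria, (stag, stag) and (hare, hare). Escaping the worse one requires both players to move at once, which is exactly the coordination step that may not be available.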

3) Adversarial misalignment occurs when a victim agent has an incomplete model of how an opponent can influence the system, and the opponent selects for cases where the victim's model performs poorly and/or promotes the opponent's goal.

Example 3.1 Chess engines will choose openings for which the victim is weakest.

Example 3.2 Sophisticated financial actors can dupe victims into buying or selling an asset in order to exploit the resulting price changes.
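
A toy sketch of the selection effect in adversarial misalignment: the victim's model has a zero-mean error on each of many cases (openings, assets, and so on), which is harmless when cases arise at random but costly when an opponent picks the case. The setup is a hypothetical assumption for illustration, including the normally distributed errors, the 50 cases per round, and the function name.

```python
import random
import statistics

def simulate_selection(n_cases=50, n_trials=2000, seed=1):
    """Compare the victim's model error on a random case with the error on
    the case an opponent would choose.

    Assumption: errors are independent standard normals, and the opponent
    can see them and always picks the largest one.
    """
    rng = random.Random(seed)
    random_errors, adversarial_errors = [], []
    for _ in range(n_trials):
        errors = [rng.gauss(0.0, 1.0) for _ in range(n_cases)]
        random_errors.append(rng.choice(errors))  # case chosen by chance
        adversarial_errors.append(max(errors))    # case chosen by the opponent
    return statistics.fmean(random_errors), statistics.fmean(adversarial_errors)

avg_random, avg_adversarial = simulate_selection()
print(f"mean error, random case:      {avg_random:.2f}")
print(f"mean error, adversarial case: {avg_adversarial:.2f}")
```

The victim's model is unbiased on average in this sketch, so the random-case error averages near zero. The opponent's advantage comes entirely from choosing where the model is wrong, not from making the model worse.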

4) Input spoofing and filtering - Filtered evidence can be provided or false evidence can be manufactured and put into the training data stream of a victim agent.

Example 4.1 Financial actors can filter by performing transactions they don't want seen as private transactions or dark pool transactions, or can spoof by creating offsetting transactions with only one half being reported to give a false impression of activity to other agents.

Example 4.2 Rating systems can be attacked by inputting false reviews into a system, or by discouraging reviews by those likely to be the least or most satisfied reviewers.

Example 4.3 Honeypots can be placed or Sybil attacks mounted by opponents in order to fool victims into learning from examples that systematically differ from the true distribution.
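
Example 4.2 can be sketched with a toy rating system. The review counts, the number of fake reviews, the suppression probability, and the `observed_ratings` helper are all hypothetical assumptions, chosen only to show that both spoofing (adding false reviews) and filtering (discouraging dissatisfied reviewers) shift what the victim system observes.

```python
import random
import statistics

def observed_ratings(true_ratings, fake_five_star=0, suppress_below=None,
                     suppress_prob=0.0, seed=0):
    """Return the ratings a victim system would see after filtering and spoofing.

    Filtering: each rating below `suppress_below` is dropped with probability
    `suppress_prob`. Spoofing: `fake_five_star` false 5-star reviews are appended.
    """
    rng = random.Random(seed)
    kept = []
    for rating in true_ratings:
        suppressed = (suppress_below is not None
                      and rating < suppress_below
                      and rng.random() < suppress_prob)
        if not suppressed:
            kept.append(rating)
    return kept + [5] * fake_five_star

# Hypothetical set of 1,000 honest reviews.
true_ratings = [1] * 100 + [2] * 150 + [3] * 250 + [4] * 300 + [5] * 200

honest = statistics.fmean(true_ratings)
spoofed = statistics.fmean(observed_ratings(true_ratings, fake_five_star=300))
filtered = statistics.fmean(
    observed_ratings(true_ratings, suppress_below=3, suppress_prob=0.8))

print(f"honest mean:   {honest:.2f}")
print(f"spoofed mean:  {spoofed:.2f}")
print(f"filtered mean: {filtered:.2f}")
```

In both attacks every review the victim sees may be individually plausible. The damage is to the distribution the victim learns from, which is why these attacks are hard to detect from the data stream alone.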

5) Goal co-option is when an agent directly modifies the victim agent's reward function, or manipulates variables absent from the victim's system model.

The probability of exploitable reward functions increases with the complexity of both the agent and the system it manipulates\cite{Amodei2016}, and exploitation by other agents seems to follow the same pattern.

Example 5.1 Attackers can directly target the system on which an agent runs and modify its goals.

Example 5.2 An attacker can discover exploitable quirks in the goal function to make the second agent optimize for a new goal, as in Manheim and Garrabrant's Campbell's law example.

# Conclusion

I'd love feedback. (I have plenty to say about applications and importance, but I'll talk about that separately.)

# Comments

In general I think working on taxonomizing failure modes is valuable. For proxy divergence, one of the meta-generators of failure modes, it is even more valuable.

Formalization generators: I often find it useful to think about which kinds of distinctions I can make in order to decompose a category. A few high level ones: split into variant and invariant parts, past/future asymmetry, descriptive/prescriptive parts, continuous vs discrete representation, implementation/algorithmic/functional level (Marr's levels), complexity classes (in particular some strategies forcing other strategies into worse complexity classes), breadth vs depth first search spaces, and strategies differing due to beliefs about payoff distribution shape (incl. type 1 and 2 error penalties).

An object-level example of that last one: knowing that the payoff distribution has changed before others do, because you're the one who changed it (caused the proxy to diverge).

I like the generator of how markets might clear under some adversarial conditions and wonder what models quants have of this they might be willing to share.

> I like the generator of how markets might clear under some adversarial conditions and wonder what models quants have of this they might be willing to share.

In the preprint paper - https://arxiv.org/abs/1810.10862 - I discuss a few examples of these failure modes that occur in practice. In finance, most of the discussed failures are ways to create "momentum ignition."

Also, having done policy work on HFT, I found it's really, really hard to get quants to share any details about strategies. I suspect this would be doubly true if it's about manipulative strategies!

(And thanks for the other thoughts. I'm still working through what those generators' failure modes would look like.)