Non-Adversarial Goodhart and AI Risks

Davidmanheim

In a recent paper by Scott Garrabrant and me, we formalized and extended the categories Scott proposed for Goodhart-like phenomena. (If you haven't read either his post or the new paper, it's important background for most of this post.)

Here, I lay out my further intuitions about how and where the non-adversarial categories matter for AI safety. Specifically, I view these categories as particularly critical in preventing accidental superhuman AI, or near-term paperclipping, which makes them especially important in the short term.

I do not think that most of the issues highlighted are new, but I think the framing is useful, and hopefully clearly presents why causal mistakes by Agentic AI are harder problems than I think is normally appreciated.

Epistemic Status: Provisional and open to revision based on new arguments, but arrived at after significant consideration. I believe conclusions 1-4 are restatements of well understood claims in AI safety. I believe conclusions 5 and 6 are less well appreciated.

Side Note: I am deferring discussion of adversarial Goodhart to the other paper and a later post; it is arguably more important, but in very different ways. The deferred topics include most issues with multiple agentic AIs that interact, and issues with pre-specifying a control scheme for a superhuman AI.

Goodhart Effects Review - Read the paper for details!

Regressional Goodhart - When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.
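
The selection effect can be seen in a small simulation. This is only a sketch under an assumption of mine, not taken from the paper: the proxy is the goal plus independent Gaussian noise, and we keep the top 1% of candidates by proxy. All names and numbers are illustrative.

```python
import random
import statistics


def simulate_regressional_goodhart(n=10_000, top_fraction=0.01, seed=0):
    """Select the top candidates by proxy and report what was actually selected for."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n):
        # Assumption: goal and noise are independent standard normals,
        # and the proxy is simply their sum.
        goal = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        candidates.append((goal + noise, goal, noise))

    # Apply selection pressure on the proxy only.
    candidates.sort(key=lambda c: c[0], reverse=True)
    selected = candidates[: max(1, int(n * top_fraction))]

    return {
        "mean_proxy": statistics.fmean(c[0] for c in selected),
        "mean_goal": statistics.fmean(c[1] for c in selected),
        "mean_noise": statistics.fmean(c[2] for c in selected),
    }


result = simulate_regressional_goodhart()
print(result)
```

In this toy setup the selected group scores well on the goal, but a comparable share of its proxy score is noise: the harder you select on the proxy, the more of the gap between proxy and goal you pick up as well.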

Extremal Goodhart - Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the relationship between the proxy and the goal was observed. This occurs in the form of Model Insufficiency, or Change in Regime.
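
A toy illustration of a change in regime, under assumptions of my own: the "true" relationship below is hypothetical, linear over the ordinary range where data was collected and collapsing beyond it. A model fit only on the ordinary range looks perfect there and is badly wrong at an extreme proxy value.

```python
def true_goal(proxy):
    # Hypothetical ground truth: linear in the ordinary regime (proxy <= 10),
    # then collapsing quickly once the proxy is pushed past it.
    return proxy if proxy <= 10 else 10 - (proxy - 10) ** 2


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Observations only cover the ordinary regime, 0 through 10.
observed_proxy = [i * 0.5 for i in range(21)]
observed_goal = [true_goal(x) for x in observed_proxy]
intercept, slope = fit_line(observed_proxy, observed_goal)


def predicted_goal(proxy):
    return intercept + slope * proxy


extreme_proxy = 20.0
print("predicted:", predicted_goal(extreme_proxy))
print("actual:   ", true_goal(extreme_proxy))
```

Nothing in the observed data signals the problem; the divergence only appears once optimization moves the proxy outside the range in which the relationship was measured.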

Causal Goodhart - When the causal path between the proxy and the goal is indirect, intervening can change the relationship between the measure and the goal, and optimizing can then cause perverse effects.

Adversarial Goodhart will not be discussed in this post. It occurs in two ways. Misalignment - The agent applies selection pressure knowing the regulator will apply different selection pressure on the basis of the metric. This allows the agent to hijack the regulator's optimization. Cobra Effect - The regulator modifies the agent goal, usually via an incentive, to correlate it with the regulator metric. The agent then either 1) uses selection pressure to create extremal Goodhart effects or make regressional Goodhart effects more severe, or 2) acts by changing the causal structure due to incompletely aligned goals in a way that creates a Goodhart effect.

Regressional and Extremal Goodhart

The first two categories of Goodhart-like phenomena, regressional and extremal, are over-optimization mistakes, and in my view the mistakes should be avoidable. This is not to say we don't need to treat AI as a cryptographic rocket probe, or that we don't need to be worried about it - just that we know what to be concerned about already. This seems vaguely related to what Scott Alexander calls "Mistake Theory" - the risks would be technically solvable if we could convince people not to do the stupid things that make them happen.

Regressional Goodhart, as Scott Garrabrant correctly noted, is an unavoidable phenomenon when doing unconstrained optimization using a fixed metric which is imperfectly correlated with a goal. To avoid the problems of overoptimization despite the unavoidable phenomenon, a safe AI system must 1) have limited optimization power to allow robustness to misalignment, perhaps via satisficing, low-impact agents, or suspendable / stoppable agents, and/or 2) involve a metric which is adaptive using techniques like oversight or reinforcement learning. This allows humans to realign the AI, and safe approaches should ensure there are other ways to enforce robustness to scaling up.

Conclusion 1 - Don't allow unconstrained optimization using a fixed metric.

In less unavoidable ways, Extremal Goodhart effects are mistakes of overoptimization, and my intuition is that they should be addressable in similar ways. We need to be able to detect the regime changes or increasing misalignment of the metric, but the strategies that address regressional effects should be closely related or useful in the same cases. Again, it's not easy, but it's a well defined problem.

Conclusion 2 - When exploring the fitness landscape, don't jump down the optimization slope too quickly before double checking externally. This is especially true when moving to out-of-sample areas.

Despite the challenges, I think that the divergences between goals and metrics in the first two Goodhart-like effects can be understood and addressed beforehand, and these techniques are being actively explored. In fact, I think this describes at least a large plurality of the current work being done on AI alignment.

The Additional Challenge of Causality

Causal Goodhart, like the earlier two categories, is always a mistake of understanding. Unlike the first two, it seems less easily avoidable by being cautious. The difficulty of inferring causality correctly means that it's potentially easy to accidentally screw up an agentic AI's world model in a way that allows causal mistakes to be made. I'm unsure the approaches being considered for AI Safety are properly careful about this fact. (I am not very familiar with the various threads of AI safety research, so I may be mistaken on that count.)

Accounting for uncertainty about causal models is critical, but given the multiplicity of possible models, we run into the problems of computation seen in AIXI. (And even AIXI doesn't guarantee safety!)

So inferring causal structure is NP-hard. Scott's earlier post claims that "you can try to infer the causal structure of the variables using statistical methods, and check that the proxy actually causes the goal before you intervene on the proxy." The problem is that we can't actually infer causal structure well, even given RCTs, without simultaneously testing the full factorial set of cases. (And even then statistics is hard, and can be screwed up accidentally in complex and unanticipated ways.) Humans infer causality partly intuitively, but in more complex systems, badly. They can be taught to do it better (PDF), but only in narrow domains.
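
To give a rough sense of scale, the sketch below counts two things: the number of possible causal graphs (labeled directed acyclic graphs) over n variables, using Robinson's recurrence, and the number of cells in a full factorial design over n binary factors. This only illustrates how fast the space of candidate structures grows; it is not itself the hardness result, and the function names are mine.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


def full_factorial_cells(n_binary_factors):
    """Number of distinct conditions when every combination of binary factors is tested."""
    return 2 ** n_binary_factors


for n in (1, 2, 3, 4, 5, 10):
    print(n, count_dags(n), full_factorial_cells(n))
```

With only 3, 4, and 5 variables there are already 25, 543, and 29281 candidate graphs, so exhaustively ruling out alternative structures stops being practical long before a system looks complex by real-world standards.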

Conclusion 3 - Getting causality right is an intrinsically computationally hard and sample-inefficient problem, and building AI won't fix that.

As Pearl notes, policy is hard in part because knowing exactly these complex causal factors is hard. This isn't restricted to AI, and it also happens in [insert basically any public policy that you think we should stop already here]. (Politics is the mind-killer, and policy debates often center around claims about causality. No, I won't give contemporary examples.)

We don't even get causality right in the relatively simpler policy systems we already construct - hence Chesterton's fence, Boustead's Iron Law of Intervention, and the fact that intellectuals throughout history routinely start advocating strongly for things that turn out to be bad when actually applied. They never actually apologize for accidentally starving and killing 5% of their population. This, of course, is because their actual idea was good, it was just done badly. Obviously real killing of birds to reduce pests in China has never been tried.

Conclusion 4 - Sometimes the perverse effects of getting it a little bit wrong are really, really bad, especially because perverse effects may only be obvious after long delays.

There are two parts to this issue, the first of which is that mistaken causal structure can lead to regressional or extremal Goodhart. This is not causal Goodhart, and isn't more worrisome than those issues, since the earlier mentioned solutions still apply. The second part is that the action taken by the regulator may actually change the causal structure. They think they are doing something simple like removing a crop-eating predator, but the relationship between crop-eating and birds ignores the fact that the birds eat other pests. This is much more worrisome, and harder to avoid.
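
The bird example can be written as a toy structural model. Everything here is a hypothetical illustration with made-up coefficients: the regulator's model treats insect damage as a constant, while in the "true" model the birds also suppress insects, so the intervention changes a relationship the regulator assumed was fixed.

```python
BASELINE_SPARROWS = 10.0


def insects(sparrows):
    # Hypothetical true structure: sparrows suppress the insect population.
    return max(0.0, 100.0 - 8.0 * sparrows)


def true_crop_loss(sparrows):
    # Crops are eaten both by sparrows and by insects.
    return 1.0 * sparrows + 0.5 * insects(sparrows)


def regulator_predicted_loss(sparrows):
    # Regulator's mistaken model: insect damage stays at its baseline level
    # no matter what is done to the sparrows.
    return 1.0 * sparrows + 0.5 * insects(BASELINE_SPARROWS)


print("loss at baseline:          ", true_crop_loss(BASELINE_SPARROWS))
print("predicted after removal:   ", regulator_predicted_loss(0.0))
print("actual loss after removal: ", true_crop_loss(0.0))
```

At baseline the two models agree exactly, so no amount of passive observation at baseline distinguishes them. They only come apart once the regulator acts, which is the point at which the mistake is already made.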

This second case is causal Goodhart. The mistake can occur as soon as you allow a regulator - Machine Learning, AI, or otherwise - to interact in arbitrary ways with wider complex systems directly to achieve specified goals, without specifying and restricting the methods to be used.

These problems don't show up in currently deployed systems because humans typically choose the set of available actions based on the causal understanding needed. The challenge is also not seen in toy-worlds, since testing domains are usually very well specified, and inferring causality becomes difficult only when the system being manipulated contains complex and not-fully-understood causal dynamics. (A possible counterexample is a story I heard secondhand about OpenAI developing what became the World of Bits system. Giving an RL system access to a random web browser and the mouse led to weird problems including, if I recall correctly, the entire system crashing.)

Conclusion 5 - This class of causal mistake problem should be expected to show up as proto-AI systems are fully deployed, not beforehand when tested in limited cases.

This class of problem does not seem to be addressed by much of the AI-Risk approaches that are currently being suggested or developed. (The only approach that avoids this is using Oracle-AIs.) It seems there is no alternative to using a tentative causal understanding for decision making if we allow any form of Agentic AI. The problems that it causes are not usually obvious to either the observer or the agent until the decision has been implemented.

Note that attempting to minimize the impact of a choice is done based on the same mistaken or imperfect causal model that leads to the decision, so it is not avoidable in this way. Humans providing reinforcement learning based on the projected outcomes of the decision are similarly unaware of the perverse effect, and impact minimization assumes that the projected impact is correct.
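
A minimal sketch of why the penalty does not help, reusing the bird example as a toy world. The impact measure, the penalty weight, and all coefficients are assumptions chosen for illustration, not a proposal from the impact-minimization literature. The agent scores each action as negative crop loss minus a penalty on predicted impact, with both terms computed from the same model.

```python
BASELINE_SPARROWS = 10.0
# Each action is described by the sparrow population it leaves behind.
ACTIONS = {"do_nothing": 10.0, "cull_half": 5.0, "cull_all": 0.0}


def true_insects(sparrows):
    # Hypothetical true structure: sparrows suppress insects.
    return max(0.0, 100.0 - 8.0 * sparrows)


def believed_insects(sparrows):
    # Agent's mistaken model: insects are unaffected by sparrows.
    return true_insects(BASELINE_SPARROWS)


def evaluate(action, insect_model, penalty_weight=0.5):
    """Score an action using one world model for both the objective and the impact penalty."""
    sparrows = ACTIONS[action]
    insect_level = insect_model(sparrows)
    crop_loss = 1.0 * sparrows + 0.5 * insect_level
    # Illustrative impact measure: total predicted change in the tracked state variables.
    impact = abs(sparrows - BASELINE_SPARROWS) + abs(
        insect_level - insect_model(BASELINE_SPARROWS)
    )
    return {
        "crop_loss": crop_loss,
        "impact": impact,
        "score": -crop_loss - penalty_weight * impact,
    }


def choose(insect_model):
    return max(ACTIONS, key=lambda a: evaluate(a, insect_model)["score"])


chosen = choose(believed_insects)
print("chosen under mistaken model:", chosen)
print("believed outcome:", evaluate(chosen, believed_insects))
print("actual outcome:  ", evaluate(chosen, true_insects))
print("choice under true model:", choose(true_insects))
```

In this toy world the penalized agent still chooses the drastic action, because the model that hides the perverse effect also hides most of the impact. The same penalty applied with the correct model would have favored doing nothing.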

Conclusion 6 - Impact minimization strategies do not seem likely to fix the problems of causal mistakes.

Summary

It seems that the class of issues identified in the realm of Goodhart-like phenomena illustrates some potential advantages and issues worth considering in AI safety. The problems identified in part simply restate problems that are already understood, but the framework seems worth further consideration. Most critically, a better understanding of causal mistakes and causal Goodhart effects would potentially be valuable. If the conclusions here are incorrect, understanding why also seems useful for understanding the way in which AI risk can and cannot manifest.

This class of problem does not seem to be addressed by much of the AI-Risk approaches that are currently being suggested or developed.

I think these problems are implicitly addressed in both MIRI's and Paul Christiano's approaches? I guess what we want to end up with is an AI that understands the potential for Causal Goodhart and takes that risk into account when making decisions (and e.g. act conservatively when the risk is high). This seems well within the scope of what a good decision theory would do (in MIRI's approach) or what a large-scale human deliberation would consider (in Paul's approach).

I don't think it does get implicitly addressed, but I've been working on things unrelated to these agendas for a while, and have not focused on what else is being done at MIRI. I'd be happy to be more specifically corrected. (And I'm much less familiar with Paul's approach.)

From what I understand, within the decision theory agenda of MIRI, I think there is an additional hard problem of correctly identifying causality while minimally interfering in the system, and I don't think it's on anyone's agenda - again, I may be wrong.

If I understand the results on the topic correctly, there's another problem: discovering causality is NP-hard, and I don't think there's an obvious way to try safely guessing without determining the answer. I'm unsure that this is as fundamental an issue, or as critical. It may be that AI can safely avoid the problem by identifying areas where causality is not understood, and being extra careful there.

For naturalized induction, I have not followed the work in detail at all, but in Nate Soares's MIRI technical agenda from a couple years back, he says that Solomonoff induction solves the problem for agents outside of a system. For the reasons I outlined above, AIXI/Solomonoff induction is an unsafe utility maximizer if the causal chain is incorrect. To state this differently, the simplest hypothesis that predicts the data is generally predictive in the future, but under regime changes due to causal interaction with the system, that claim seems false. (Yes, this is a significant claim, and should be justified.)

I'm pretty sure that "hard problem of correctly identifying causality" is a major goal of MIRI's decision theory.

In what sense is discovering causality NP-hard? There's the trivial sense in which you can embed a NP-hard problem (or tasks of higher complexity) into the real world, and there's the sense in which inference in Bayesian networks can embed NP-hard problems.

Can you elaborate on why AIXI/Solomonoff induction is an unsafe utility maximizer, even for Cartesian agents?

I will try to edit this to include a more comprehensive reply later, but as this will take me at least another week, I will point to one paper I am already familiar with on hardness of decisions where causality is unclear; https://arxiv.org/pdf/1702.06385.pdf (Again, computational complexity is not my area of expertise - so I may be wrong.)

Re: safety of Solomonoff/AIXI, I am again unsure, but I think we can posit a situation where very early on in the world-model building process, the simpler models, those which are weighted heavily due to simplicity, are incorrect in ways that lead to very dangerous information collection options.

Apologies for not responding more fully - this is an area where I have only a non-technical understanding, but I came to tentative conclusions on these points, and have had discussions with those more knowledgeable than myself who agreed.

Necromantic comment, sorry :P

I might be misinterpreting, but what I think you're saying is that if the humans make a mistake in using a causal model of the world and tell the AI to optimize for something bad-in-retrospect, this is "mistaken causal structure lead[ing] to regressional or extremal Goodhart", and thus not really causal Goodhart per se (by the categories you're intending). But I'm still a little fuzzy on what you mean to be actual factual causal Goodhart.

Is the idea that humans tell the AI to optimize for something that is not bad-in-retrospect, but in the process of changing the world the causal model the AI is using will move outside its domain of validity? Does this only happen if the AI's model of the world is lacking compared to the humans'?

Yes on point number 1, and partly on point number 2.

If humans don't have incredibly complete models for how to achieve their goals, but know they want a glass of water, telling the AI to put a cup of H2O in front of them can create weird mistakes. This can even happen because of causal connections the humans are unaware of. The AI might have better causal models than the humans, but still cause problems for other reasons. In this case, a human might not know the difference between normal water and heavy water, but the AI might decide that since there are two forms, it should have them present in equal amounts, which would be disastrous for reasons entirely beyond the understanding of the human who asked for the glass of water. The human needed to specify the goal differently, and was entirely unaware of what they did wrong - and in this case it will be months before the impacts of the weirdly different than expected water show up, so human-in-the-loop RL or other methods might not catch it.

Inspiring! Decided to try brainstorming about proxies.

What causes proxy drift?

proxy optimized until outside its domain of validity with no legible alarm

landscape changes along dimensions that were not known to impact proxy, or impacted underlying thing desired to be measured without impacting proxy.

both consequences of proxies sharing some but not all dimensions with the underlying quantity.

few shared dimensions: fragile proxies

more shared dimensions: more expensive for adversarial munchkining to disentangle (ie find a lever that moves one only along non-shared dimensions)

key dimensions?: legibility, cost, tightness (overlapping dimensions/relevant dimensions?)

legibility allows easier 'out of domain' alarms but also easier munchkining, requiring greater tightness.

expense and legibility -> security via obscurity

How could parameterizing proxy space fail?

choosing an efficient representation of proxies via an orthogonal carving up of proxy space is potentially bad because it robs you of consilience. Overlapping proxies give you more chances to construct and notice alarms.

quantity>quality of proxies?

orthogonality often considered desirable for modularity ie legibility of side effects.

long feedback cycles increase time for proxy drift.

underrated?: construction of new proxies and adversarial attacking of them for practice.

Brainstorming approaches to working with causal goodhart

Low-impact measures that include the change in the causal structure of the world. It might be possible to form a measure like this which doesn't depend on recovering the true causal structure at any point (i.e. minimizing the difference between predictions of causal structure in state A and B, even if both of those predictions are wrong).
Figure out how to elicit human models of causal structure, and provide the human model of causal structure along with the metric, so that the AI can use this information to figure out whether it's violating the assumptions that the human made.
Causal transparency: have the AI explain the causal structure of how its plans will influence the proxy. This might allow a human to figure out whether the plan will cause the proxy to diverge from the goal. E.g. the true goal is happiness, the proxy is a happiness score as measured by an online psychological questionnaire, and the AI's plan says that it will influence the proxy by hacking into the online psychological questionnaire. You don't need to understand how the AI plans to hack into the server to understand that the plan is diverging the proxy from the goal.

These are interesting ideas. I'm not sure I understand what you mean by the first; causal structure can be arbitrarily complex, so I'm unsure how to mitigate across the plausible structures. (It seems to be an AIXI-like problem.)

2&3, however, require that humans understand the domain, and too often in existing systems we do not. Superhuman AI might be better than us at this, but if causal understanding scales more slowly than capability, it would still fail.

Broken link on the text "real killing of birds to reduce pests in China has never been tried".

Strange and annoying. When I went to edit it, the link was correct - it's only when it was posted that it didn't work. I switched where it pointed.

