Alex Turner, postdoctoral researcher at the Center for Human-Compatible AI. Reach me at turner.alex[at]berkeley[dot]edu.
The same thing that happens if the assessment isn't different from ours: the agent is more likely to take that plan, all else equal.
See the value-child speculative story for detail there. I have specific example structures in mind but don't yet know how to compactly communicate them in an intro.
As usual, you've left a very insightful comment. Strong-upvoted; tentative weak disagree, but I haven't read your linked post yet. Hope to get to that soon.
I think "outer alignment failure" is confusing terminology at this point: it always requires clarification, and then you have to store "oh yeah, 'outer alignment failure' means the wrong thing got rewarded as a matter of empirical fact." Furthermore, words are sticky, and their historical connotations color our thinking. Better to just say "R rewards bad on-training behavior in situations A, B, C" or even "bad action rewarded", which compactly communicates the anticipation-constraining information.
Similarly, "inner alignment failure" -> "undesired inner cognition reinforced when superficially good action performed" (we should probably find a better compact phrase for this one).
Goodhart is a problem with values-execution
I understand this to mean "Goodhart is and historically has been about how an agent with different values can do bad things." I think this isn't true. Goodhart concepts were coined within the grader-optimization/argmax/global-objective-optimization frame:
Throughout the post, I will use V to refer to the true goal and use U to refer to a proxy for that goal which was observed to correlate with V and which is being optimized in some way.
~ Goodhart Taxonomy
This cleanly maps onto the grader-optimization case, where U is the grader and V is some supposed imaginary "true set of goals" (which I'm quite dubious of, actually).
This doesn't cleanly map onto the value shard case. The AI's shards cannot be U, because they aren't being optimized. The shards do the optimizing.
So now we run into a different regime of problems AFAICT, and I push back against calling this "Goodhart." For example, extremal Goodhart has a "global" character, where imperfections in U get blown up in the exponentially-sized plan space where many adversarial inputs lurk. Saying "value shards are vulnerable to Goodhart" makes me anticipate the wrong things. It makes me anticipate that if the shards are "wrong" (whatever that means) in some arcane situation, the agent will do a bad thing by exploiting the error. As explained in this post, that's just not how values work, but it is how grader-optimizers often work.
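To make the "imperfections in U get blown up" point concrete, here's a toy sketch. All names and numbers are my own illustration: V is the "true" value of a plan, U is a correlated proxy with plan-dependent error, and argmaxing U over a large plan space systematically selects for upward proxy error.

```python
import random

random.seed(0)

def true_value(plan):
    # V: the "true goal" score of a plan (toy model).
    return plan["quality"]

def proxy_grade(plan):
    # U: a proxy that correlates with V but carries plan-dependent error.
    return plan["quality"] + plan["proxy_error"]

# A large plan space: the proxy error is small on average,
# but argmax hunts for the tail where it is large and positive.
plans = [
    {"quality": random.uniform(0, 10), "proxy_error": random.gauss(0, 0.5)}
    for _ in range(10_000)
]

best_by_proxy = max(plans, key=proxy_grade)

# The proxy-argmax winner is selected for inflated proxy error,
# so its proxy grade overstates its true value.
print(best_by_proxy["proxy_error"] > 0)
print(proxy_grade(best_by_proxy) > true_value(best_by_proxy))
```

The point of the sketch is the selection effect: the larger the plan space, the more the argmax is drawn to plans where U's error happens to be maximal, which is the "global" character I mean.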
While it's worth considering what value-perturbations are tolerable versus what grader-perturbations are tolerable, I don't think it makes sense to describe both risk profiles with "Goodhart problems."
It is not clear which of values-execution and grader-optimization degrades more gracefully. In particular, I don't think you need to get the values perfect. I just also don't think you need to get the grader perfect in grader-optimization paradigms, and am uncertain about which one ends up being better.
I changed "perfect" to "robust" throughout the text. Values do not have to be "robust" against an adversary's optimization, in order for the agent to reliably e.g. make diamonds. The grader does have to be robust against the actor, in order to force the actor to choose an intended plan.
Alternatively, this could be framed as regularization on plans to be close to some starting distribution. This isn't about low impact, because we still want the AI to search well enough to find clever and novel plans; instead it's about avoiding extrema that are really far from the starting distribution.
I don't think it's naturally framed in terms of distance metrics I can think of. I think a values-agent can also end up considering some crazy impressive plans (as you might agree).
I think you do a disservice by saying things like "actually, humans really care about their goals but grader-optimizers don't," because it sets up this supposed natural category of "grader optimizers" that are totally different from "value executers," and it actually seems like it makes it harder to reason about what mechanistic properties are producing the change you care about.
I both agree and disagree. I think that reasoning about mechanisms and not words is vastly underused in AI alignment, and endorse your pushback in that sense. Maybe I should write future essays with exhortations to track mechanisms and examples while following along.
But also I do perceive a natural category here, and I want to label it. I think the main difference between "grader optimizers" and "value executers" is that grader optimizers are optimizing plans to get high evaluations, whereas value executers find high-evaluating plans as a side effect of cognition. That does feel pretty natural to me, although I don't have a good intensional definition of "value-executers" yet.
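Here's a minimal sketch of the category difference I perceive, with entirely made-up plan strings and a deliberately crude grader. The grader-optimizer's optimization target *is* the grader's output, so it gets pulled toward plans that merely score well; the value-executor's plan generation is steered by the value itself, and any evaluation is a side effect.

```python
def grader(plan):
    # Crude evaluation: count mentions of "diamond" (intentionally exploitable).
    return plan.count("diamond")

plan_space = ["mine diamond", "fake diamond diamond sign", "plant trees"]

# Grader-optimizer: argmax over plans of the grader's output.
grader_optimizer_choice = max(plan_space, key=grader)

def value_executor():
    # Toy stand-in for a "diamond shard": it bids for plans about
    # actually acquiring diamonds; it never searches for high grades.
    return "mine diamond"

print(grader_optimizer_choice)  # "fake diamond diamond sign"
print(value_executor())         # "mine diamond"
```

The grader-optimizer lands on the plan that games the evaluation, while the value-executor's choice never routed through "what would score highest?" at all. That mechanistic difference in what is doing the optimizing is the natural category I'm trying to label.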
Yeah, kinda, although "recipe" implies there's something else deciding to follow the recipe. Values are definitely not utility functions, on my view.
Is it meant to point at the ability of the actor to make the plan more confusing/harder to evaluate? Meaning that you're pointing at the ability for the actor to "obfuscate" its plan in order to get high reward?
No, the point is that the grader can only grade the current plan; it doesn't automatically know what its counterfactual branches output. The grader is scope-limited to its current invocation. This makes consistent grading harder (e.g. the soup-kitchen plan vs. political activism: neither invocation knows what score the other call to the grader would assign, so they can't trivially agree on a consistent scale).
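A toy illustration of the scope-limitation, with hypothetical numbers of my own choosing: each invocation grades its plan relative to whatever alternatives happen to be salient in *that* invocation, so the two scores end up on incompatible scales.

```python
def plan_value(plan):
    # Toy "objective" worth of each plan (activism is actually better here).
    values = {"soup kitchen": 5, "political activism": 7}
    return values[plan]

def grade(plan, salient_comparisons):
    # Each invocation rates the plan against alternatives salient in THIS
    # invocation only; it cannot see what other invocations output.
    baseline = max(salient_comparisons)
    return plan_value(plan) - baseline

# Invocation 1: "soup kitchen" happens to be compared against modest alternatives.
score_soup = grade("soup kitchen", salient_comparisons=[1, 2])
# Invocation 2: "political activism" happens to be compared against ambitious ones.
score_activism = grade("political activism", salient_comparisons=[6, 8])

# Activism is objectively better in this toy model (7 > 5), yet the two
# invocations' scores point the other way, because each used its own
# local baseline rather than a shared scale.
print(score_soup, score_activism)  # 3 -1
```

Nothing here depends on the specific numbers; the failure is structural. Because no single invocation sees both plans, consistency across invocations has to be engineered in rather than assumed.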
Ah, I should have written that question differently. I meant to ask "If we cannot robustly grade expected-diamond-production for every plan the agent might consider, how might we nonetheless design a smart agent which makes lots of diamonds?"
How do you do this?
Anyways, we might train a diamond-values agent like this.