Rohin Shah

Research Scientist at DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/


Comments

Re: your first paragraph, I still feel like you are analyzing the grader-optimizer case from the perspective of the full human-AI system, and then analyzing the values-executor case from the perspective of just the AI system (or you are assuming that your AI has perfect values, in which case my critique is "why assume that"). If I instead analyze the values-executor case from the perspective of the full human-AI system, I can rewrite the first part to be about values-executors:

But from the perspective of the system designer, when they run their desired values-executor after training, "values-executor is inspecting the human, looking for opportunities to manipulate/deceive/seize-resources-from them" is an expected (not just accidental) execution path in their code.

(Note I'm assuming that we didn't successfully instill the values, just as you're assuming that we didn't successfully get a robust evaluator.)

(The "not just accidental" part might be different? I'm not quite sure what you mean. In both cases we would be trying to avoid the issue, and in both cases we'd expect the bad stuff to happen.)

Goodhart isn't a central concept in my model, though, which makes it hard for me to analyze it with that lens. Would have to think about it more, but I don't think I agree with the statement?

Consider a grader-optimizer AI that is optimizing for diamond-evaluations. Let us say that the intent was for the diamond-evaluations to evaluate whether the plans produced diamond-value (i.e. produced real diamonds). Then I can straightforwardly rewrite your paragraph to be about the grader-optimizer:

The AI doesn't care that there are errors between the diamond-evaluations and the diamond-value (unless you've managed to pull off some sort of philosophical uncertainty thing). It is no more motivated to do "things that score highly according to the diamond-evaluations but lowly according to diamond-value" than it is to do "things that score highly according to the diamond-evaluations and highly according to diamond-value". It also has no specific incentive to widen that gap. Its evaluator is its evaluator, and it wants to preserve its own evaluator. That can entail some very bad-from-our-perspective things like breaking out of the box, freezing its weights, etc. but not "change its own evaluator to be even further away from diamond-values". Actually, I think it has an incentive not to cause its evaluator to drift further, because that would break its goal-content integrity!

Presumably you think something in the above paragraph is now false? Which part?

But in the grader-optimizer case, even when everything goes according to plan, the agent still wants to manipulate you/the grader (and now maximally so, because that would maximize evaluations).

No, when everything goes according to plan, the grader is perfect and the agent cannot manipulate it.

When you relax the assumption of perfection far enough, then the grader-optimizer manipulates the grader, and the values-executor fights us for our resources to use towards its inhuman values.
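A minimal toy sketch of that symmetric failure (all names and numbers here are hypothetical, and both designs are deliberately flattened into "pick the plan that scores best according to some internal criterion", which elides real differences in how that criterion is represented):

```python
# Toy sketch: both motivational designs, reduced to plan selection over an
# internal criterion. Neither selector can see the gap between its own
# criterion and true_diamond_value.

def true_diamond_value(plan):
    """What we actually care about; lives outside both agents."""
    return plan.get("real_diamonds", 0)

def learned_evaluator(plan):
    """Imperfect grader: also counts pictures of diamonds."""
    return plan.get("real_diamonds", 0) + plan.get("diamond_pictures", 0)

def instilled_value_shard(plan):
    """Imperfect instilled value: also counts pretty, transparent rocks."""
    return plan.get("real_diamonds", 0) + plan.get("clear_quartz", 0)

def grader_optimizer(plans):
    # Motivated by the evaluations themselves.
    return max(plans, key=learned_evaluator)

def values_executor(plans):
    # Motivated by whatever values actually got instilled.
    return max(plans, key=instilled_value_shard)

plans = [
    {"real_diamonds": 3},
    {"diamond_pictures": 10},  # exploits the evaluator's error
    {"clear_quartz": 10},      # exploits the instilled value's error
]

print(true_diamond_value(grader_optimizer(plans)))  # 0 -- no actual diamonds
print(true_diamond_value(values_executor(plans)))   # 0 -- no actual diamonds
```

In both cases the bad outcome comes from the gap between the internal criterion and diamond-value; which particular gap gets exploited just depends on which criterion you handed the agent.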

That agent only cares terminally about evaluations, it doesn't care terminally about you, or about your wishes, or about whatever your evaluations are supposed to mean, or about whether the reflectively-correct way for you to do your evaluations would be to only endorse plans that harden the evaluation procedure. And this will be true no matter what you happen to be grading it on. To me, that seems very bad and unnecessary.

But I don't care about what the agent "cares" about in and of itself; I care about the actual outcomes in the world.

The evaluations themselves are about you, your wishes, whatever the evaluations are supposed to mean, the reflectively-correct way to do the evaluations. It is the alignment between evaluations and human values that ensures that the outcomes are good (just as for a values-executor, it is the alignment between agent values and human values that ensures that the outcomes are good).


Maybe a better prompt is: can you tell a story of failure for a grader-optimizer, for which I can't produce an analogous story of failure for a values-executor? (With this prompt I'm trying to ban stories / arguments that talk about "trying" or "caring" unless they have actual bad outcomes.) 

(For example, if the story is "Bob the AI grader-optimizer figured out how to hack his brain to make it feel like he had lots of diamonds", my analogous story would be "Bob the AI values-executor was thinking about how his visual perception indicated diamonds when he was rewarded, leading to a shard that cared about having a visual perception of diamonds, which he later achieved by hacking his brain to make it feel like he had lots of diamonds".)

If you post questions here, there's a decent chance I'll respond, though I'm not promising to.

The evaluation procedure needs to assign a higher evaluation to some plan that hardens the evaluation procedure than it does to all plans that trick the grader. Or else there's an incentive for the actor to skip over all the hardening plans in favor of some plan that tricks the grader.

I agree with this, but am not really sure what bearing it has on any disagreements we might have?
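(For concreteness, the condition being agreed to can be written as follows, in my notation rather than anything from the post: let $G$ be the grader's evaluation, $H$ the set of plans that harden the evaluation procedure, and $T$ the set of plans that trick the grader. Then we need

$$\max_{p \in H} G(p) \;>\; \max_{p \in T} G(p),$$

since otherwise the actor's argmax skips every hardening plan in favor of some grader-tricking plan.)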

Our top-level disagreement is that you think it's pretty unlikely that we want to build grader-optimizers rather than values-executors, while I think it is pretty unclear. (Note my stance is not that grader-optimizers and values-executors are the same -- there certainly are differences.)

It sounds like you agree with me that the AI system analyzed in isolation does not violate the non-adversarial principle whether it is a grader-optimizer or a values-executor.

I think you would also agree that the human-AI system as a whole has Goodhart issues regardless of whether it is a grader-optimizer or values-executor, since you didn't push back on this:

"From the perspective of the human-AI system overall, having an AI motivated by direct goals is building a system that works at cross purposes with itself, as the human puts in constant effort to ensure that the direct goal embedded in the AI is "hardened" to represent human values as well as possible, while the AI is constantly searching for upwards-errors in the instilled values (i.e. things that score highly according to the instilled values but lowly according to the human)."

So perhaps I want to turn the question back at you: what's the argument that favors values-executors over grader-optimizers? Some kinds of arguments that would sway me (probably not exhaustive):

  1. A problem that affects grader-optimizers but doesn't have an analogue for values-executors
  2. A problem that affects grader-optimizers much more strongly than its analogue affects values-executors
  3. A solution approach that works better with values-executors than grader-optimizers

For both agents (i.e. the system as a whole) to be trying to avoid errors in the evaluation procedure

This is probably a nitpick, but I disagree that "system as a whole does X" means "all agents in the system do X". I think "BigCompany tries to make itself money" is mostly true even though it isn't true for most of the humans that compose BigCompany.

I disagree with this, at least if by "define" you mean really nailing it down in code/math, rather than merely deciding for yourself what goal you intend to teach the agent (which you must do in either case).

I don't mean this, probably I should have said "specify" rather than "define". I just mean things like "you need to reward diamonds rather than sapphires while training the agent", whereas with a grader-optimizer the evaluation process can say "I don't know whether diamonds or sapphires are better, so right now the best plan is the one that helps me reflect on which one I want".

(Which is basically the same thing as what you said here: "AFAICT, the primary benefit of indirection would be that after training, you can change the agent's behavior if you can change the thing-it-indirects-to.")

What does "a grader-optimizer with meta cognition" mean to you? [...]

It sounds like you want to define grader-optimizers to exclude case 2, in which case I'd point to case 1.

(Note that in case 1 the human-[grader-optimizer] system as a whole is still trying to avoid errors in the evaluation procedure; it's just that now this is conditional on the evaluation procedure giving high evaluations to plans that try to harden the evaluation procedure.)

I wish you wouldn't use IMO vague and suggestive and proving-too-much selection-flavor arguments, in favor of a more mechanistic analysis. 

Can you name a way in which my arguments prove too much? That seems like a relatively concrete thing that we should be able to get agreement on.

You do not need an agent to have perfect values.

I did not claim (nor do I believe) the converse.

Many foundational arguments are about grader-optimization, so you can't syntactically conclude "imperfect values means doom." That's true in the grader case, but not here.

I disagree that this is true in the grader case. You can have a grader that isn't fully robust but is sufficiently robust that the agent can't exploit any errors it would make.  

If you drop out or weaken the influence of "IF plan can be easily modified to incorporate more diamonds, THEN do it", that won't necessarily mean the AI makes some crazy diamond-less universe.

The difficulty in instilling values is not that removing a single piece of the program / shard that encodes it will destroy the value. The difficulty is that when you were instilling the value, you accidentally rewarded a case where the agent tried a plan that produced pictures of diamonds (because you thought they were real diamonds), and now you've instilled a shard that upweights plans that produce pictures of diamonds. Or that you rewarded the agent for thoughts like "this will make pretty, transparent rocks" (which did lead to plans that produced diamonds), leading to shards that upweight plans that produce pretty, transparent rocks, and then later the agent tiles the universe with clear quartz.
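A crude sketch of that failure mode (everything here is hypothetical, and "shard strength" is a cartoon of whatever credit assignment actually does; the point is only that reward keyed on what the overseer can observe upweights the observable proxy rather than the intended target):

```python
# Each training episode records which features the executed plan had.
episodes = [
    {"is_diamond": 1, "looks_like_diamond": 1},
    {"is_diamond": 0, "looks_like_diamond": 1},  # picture of a diamond; overseer fooled
    {"is_diamond": 1, "looks_like_diamond": 1},
]

def overseer_reward(episode):
    # The overseer can only check appearances.
    return 1.0 if episode["looks_like_diamond"] else 0.0

# Cartoon "shard formation": features present when reward arrives get upweighted.
shard_strength = {"is_diamond": 0.0, "looks_like_diamond": 0.0}
for episode in episodes:
    reward = overseer_reward(episode)
    for feature, present in episode.items():
        shard_strength[feature] += reward * present

print(shard_strength)
# {'is_diamond': 2.0, 'looks_like_diamond': 3.0}
# The proxy feature ends up with more influence than the intended one.
```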

The value shards are the things which optimize hard, by wielding the rest of the agent's cognition (e.g. the world model, the general-purpose planning API). 

So, I'm basically asking that you throw an error and recheck your "selection on imperfection -> doom" arguments, as I claim many of these arguments reflect grader-specific problems.

I think that the standard arguments work just fine for arguing that "incorrect value shards -> doom", precisely because the incorrect value shards are the things that optimize hard.

(Here incorrect value shards means things like "the value shards put their influence towards plans producing pictures of diamonds" and not "the diamond-shard, but without this particular if clause".)

It is extremely relevant [...]

This doesn't seem like a response to the argument in the paragraph that you quoted; if it was meant to be then I'll need you to rephrase it.

When I talk about shard theory, people often seem to shrug and go "well, you still need to get the values perfect else Goodhart; I don't see how this 'value shard' thing helps."

I realize you are summarizing a general vibe from multiple people, but I want to note that this is not what I said. The most relevant piece from my comment is:

I don't buy this as stated; just as "you have a literally perfect overseer" seems theoretically possible but unrealistic, so too does "you instill the direct goal literally exactly correctly". Presumably one of these works better in practice than the other, but it's not obvious to me which one it is.

In other words: Goodhart is a problem with values-execution, and it is not clear which of values-execution and grader-optimization degrades more gracefully. In particular, I don't think you need to get the values perfect. I just also don't think you need to get the grader perfect in grader-optimization paradigms, and am uncertain about which one ends up being better.

If I understand you correctly, I think that the problems you're pointing out with value-instilling in B (you might get different values from the ones you wanted) are the exact same problems that arise for grader-optimization in B (you might get a different grader from the one you wanted if it's a learned grader, and you might get a different actor from the one you wanted). So when I am comparing grader-optimizing vs non-grader-optimizing motivational designs, both of them have the same problem in B, and grader-optimizing ones additionally have the A-problems highlighted in the post. Maybe I am misunderstanding you on this point, though...

Two responses:

  1. Grader-optimization has the benefit that you don't have to define what values you care about in advance. This is a difficulty faced by value-executors but not by grader-optimizers.
  2. Part of my point is that the machinery you need to solve A-problems is also needed to solve B-problems because fundamentally they are shadows of the same problem.

However, I am confused by the response. By "but what is the alternative", did you mean to imply [...]

I didn't mean to imply any of those things; I just meant that this post shows differences when you analyze the AI in isolation, but those differences vanish once you analyze the full human-AI system. Copying from a different comment:

I think direct-goal approaches do not avoid the issue. In particular, I can make an analogous claim for them:

"From the perspective of the human-AI system overall, having an AI motivated by direct goals is building a system that works at cross purposes with itself, as the human puts in constant effort to ensure that the direct goal embedded in the AI is "hardened" to represent human values as well as possible, while the AI is constantly searching for upwards-errors in the instilled values (i.e. things that score highly according to the instilled values but lowly according to the human)."

Like, once you broaden to the human-AI system overall, I think this claim is just "A principal-agent problem / Goodhart problem involves two parts of a system working at cross purposes with each other", which is both (1) true and (2) unavoidable (I think).

Moving on:

If as stated you've instilled diamond-values into the AI, then whatever efficiency-improvement-thinking it is doing is being guided by the motivation to make lots of diamonds. [...]

I agree that in the case with meta cognition, the values-executor tries to avoid optimizing for errors in its values.

I would say that a grader-optimizer with meta cognition would also try to avoid optimizing for errors in its grader.

To be clear, I do not mean that Bill from Scenario 2 in the quiz is going to say "Oh, I see now that actually I'm tricking myself about whether diamonds are being created, let me go make some actual diamonds now". I certainly agree that Bill isn't going to try making diamonds, but who said he should? What exactly is wrong with Bill's desire to think that he's made a bunch of diamonds? Seems like a perfectly coherent goal to me.

No, what I mean is that Bill from Scenario 2 might say "Hmm, it's possible that if I self-modify by sticking a bunch of electrodes in my brain, then it won't really be me who is feeling the accomplishment of having lots of diamonds. I should do a bunch of neuroscience and consciousness research first to make sure this plan doesn't backfire on me".

So I'm not really seeing how Bill from Scenario 2 is "working at cross-purposes with himself" except inasmuch as he does stuff like worrying that self-modification would change his identity, which seems basically the same to me as the diamond-value agent worrying about levels of impurities.

Sounds to me like you buy it but you don't know anything else to do?

Yes, and in particular I think direct-goal approaches do not avoid the issue. In particular, I can make an analogous claim for them:

"From the perspective of the human-AI system overall, having an AI motivated by direct goals is building a system that works at cross purposes with itself, as the human puts in constant effort to ensure that the direct goal embedded in the AI is "hardened" to represent human values as well as possible, while the AI is constantly searching for upwards-errors in the instilled values (i.e. things that score highly according to the instilled values but lowly according to the human)."

Like, once you broaden to the human-AI system overall, I think this claim is just "A principal-agent problem / Goodhart problem involves two parts of a system working at cross purposes with each other", which is both (1) true and (2) unavoidable (I think).

It seems like you could describe this as "the AI's plans for improving efficiency are implicitly searching for errors in the concept of diamonds, and the AI has to spend extra effort hardening its concept of diamonds to defend against this attack". So what's the difference between this issue and the issue with grader optimization?

  1. Values-execution. Diamond-evaluation error-causing plans exist and are stumble-upon-able, but the agent wants to avoid errors.
  2. Grader-optimization. The agent seeks out errors in order to maximize evaluations. 

The part of my response that you quoted is arguing for the following claim:

"If you are analyzing the AI system in isolation (i.e. not including the human), I don't see an argument that says [grader-optimization would violate the non-adversarial principle] and doesn't say [values-execution would violate the non-adversarial principle]."

As I understand it you are saying "values-execution wants to avoid errors but grader-optimization does not". But I'm not seeing it. As far as I can tell the more correct statements are "agents with metacognition about their grader / values can make errors, but want to avoid them" and "it is a type error to talk about errors in the grader / values for agents without metacognition about their grader / values".

(It is a type error in the latter case because what exactly are you measuring the errors with respect to? Where is the ground truth for the "true" grader / values? You could point to the human, but my understanding is that you don't want to do this and instead just talk about only the AI cognition.)

For reference, in the part that you quoted, I was telling a concrete story of a values-executor with metacognition, and saying that it too had to "harden" its values to avoid errors. I do agree that it wants to avoid errors. I'd be interested in a concrete example of a grader-optimizer with metacognition that doesn't want to avoid errors in its grader.

Like, in what sense does Bill not want to avoid errors in his grader?

I don't mean that Bill from Scenario 2 in the quiz is going to say "Oh, I see now that actually I'm tricking myself about whether diamonds are being created, let me go make some actual diamonds now". I certainly agree that Bill isn't going to try making diamonds, but who said he should? What exactly is wrong with Bill's desire to think that he's made a bunch of diamonds? Seems like a perfectly coherent goal to me; it seems like you have to appeal to some outside-Bill perspective that says that actually the goal was making diamonds (in which case you're back to talking about the full human-AI system, rather than the AI cognition in isolation).

What I mean is that Bill from Scenario 2 might say "Hmm, it's possible that if I self-modify by sticking a bunch of electrodes in my brain, then it won't really be me who is feeling the accomplishment of having lots of diamonds. I should do a bunch of neuroscience and consciousness research first to make sure this plan doesn't backfire on me".

Two responses:

  1. Grader-optimization has the benefit that you don't have to specify what values you care about in advance. This is a difficulty faced by value-executors but not by grader-optimizers.
  2. Part of my point is that the machinery you need to solve evaluation-problems is also needed to solve instillation-problems because fundamentally they are shadows of the same problem, so I'd estimate d_evaluation at close to 0 in your equations after you have dealt with d_instillation.
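(In rough notation of my own, which may not match the equations in your post: if the overall difficulty decomposes as something like $d_{\text{total}} \approx d_{\text{instillation}} + d_{\text{evaluation}}$, my claim is that whatever machinery brings $d_{\text{instillation}}$ down also drives $d_{\text{evaluation}}$ toward 0, rather than the two being independent costs paid separately.)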

the first line seems to speculate that values-AGI is substantially more robust to differences in values

The thing that I believe is that an intelligent, reflective, careful agent with a decisive strategic advantage (DSA) will tend to produce outcomes that are similar in value to that which would be done by that agent's CEV. In particular, I believe this because the agent is "trying" to do what its CEV would do, it has the power to do what its CEV would do, and so it will likely succeed at this.

I don't know what you mean by "values-AGI is more robust to differences in values". What values are different in this hypothetical?

I do think that values-AGI with a DSA is likely to produce outcomes similar to CEV-of-values-AGI.

It is unclear whether values-AGI with a DSA is going to produce outcomes similar to CEV-of-Rohin (because this depends on how you built values-AGI and whether you successfully aligned it).
