Don't design agents which exploit adversarial inputs

Garrett Baker

An exercise that helped me see the "argmax is a trap" point better was to concretely imagine what the cognitive stacktrace for an agent running argmax search over plans might look like:

# Traceback (most recent call last)
At line 42 in main:
    decision = agent.deliberate()
At line 3 in deliberate:
    all_plans = list(plan_generator)
    return argmax(all_plans, grader=self.grader)
At line 8 in argmax:
    for plan in plans: # includes adversarial inputs (!!)
        evaluation = apply(grader, plan)
At line 264 in apply:
    predicted_object_level_consequences = ...
KeyboardInterrupt

A major problem with this design is that the agent considers "What do I think would happen if I ran plan X?", but NOT "What do I think would happen if I generated plans using method Y?". If the agent were considering the second question as well, then it would (rightly) conclude that the method "generate all plans & run argmax search on them" would spit out a grader-fooling adversarial input, which will cause it to implement some arbitrary plan with low expected value. (Heck, with future LLMs maybe you'll be able to just show them an article about the Optimizer's Curse and it will grok this.) Knowing this, the agent tosses aside this foreseeably-harmful-to-its-interests search method.

The natural next question is "Ok, but what other option(s) do we have?". I'm guessing TurnTrout's next post might look into that, and he likely has more worked out thoughts on it, so I'll leave it to him. But I'll just say that I don't think coherent real-world agents will/should/need to be running cognitive stacktraces with footguns like this.
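To make the stack trace above concrete, here is a minimal Python sketch of the failure it describes. Everything in it is a hypothetical toy: `true_value`, `grader`, the integer plan space, and the rare inflated score are assumptions for illustration, not anything specified in the post. It shows exhaustive argmax landing on a grader-fooling plan, while scoring only a few sampled candidates almost never does.

```python
import random

N_PLANS = 1_000_000  # hypothetical plan space: plans are just integers


def true_value(plan: int) -> float:
    """Hypothetical ground truth: every plan is worth between 0 and 1."""
    return (plan % 100) / 100.0


def grader(plan: int) -> float:
    """Hypothetical imperfect grader: accurate except on a few rare inputs."""
    if plan % 100_003 == 7:  # assumed adversarial inputs (10 of 1,000,000)
        return 1e6  # wildly inflated evaluation
    return true_value(plan)


def argmax_over_all_plans() -> int:
    """Exhaustive search: score every plan and return the top-scoring one."""
    return max(range(N_PLANS), key=grader)


def best_of_few(rng: random.Random, k: int = 3) -> int:
    """Score only k sampled candidates and return the best of those."""
    candidates = [rng.randrange(N_PLANS) for _ in range(k)]
    return max(candidates, key=grader)


chosen = argmax_over_all_plans()
print("argmax picked plan", chosen, "grade", grader(chosen),
      "true value", true_value(chosen))

rng = random.Random(0)
sampled = [best_of_few(rng) for _ in range(1_000)]
exploited = sum(grader(p) > 1.0 for p in sampled)
print("best-of-3 hit an adversarial plan in", exploited, "of 1000 runs")
```

The sketch only illustrates the search-method point. It says nothing about how a realistic agent would decide which candidates to generate.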

paulfchristiano (3y)

My perspective is:

Planning against a utility function is an algorithmic strategy that people might use as a component of powerful AI systems. For example, they may generate several plans and pick the best or use MCTS or whatever. (People may use this explicitly on the outside, or it may be learned as a cognitive strategy by an agent.)
There are reasons to think that systems using this algorithm would tend to disempower humanity. We would like to figure out how to build similarly powerful AI systems that don't do that.
We don't currently have candidate algorithms that can safely substitute for planning. So we need to find an alternative.
Right now the only thing remotely close to working for this purpose is running a very similar planning algorithm but against a utility function that does not incentivize disempowering humanity.

My sense is that you want to decline to play this game and instead say: just don't build AI systems that search for high-scoring plans.

That might be OK if it turns out that planning isn't an effective algorithmic ingredient, or if you can convince people not to build such systems because it is dangerous (and similar difficulties don't arise if agents learn planning internally). But failing that, we are going to have to figure out how to build AI systems that capture the benefits of planning without being dangerous.

(It's possible you instead have a novel proposal for a way to capture the benefits of search without the risks, in which case I'd withdraw this comment once part 2 came out, though I wish you'd led with the juicy part.)

As a secondary point (that I've said a bunch of times), I also found the arguments in this post uncompelling. Probably the first thing to clarify is that I feel like you equivocate between the grader being something that is embedded in the real world and hence subject to manipulation by real-world consequences of the actor's actions, and the grader being something that operates on plans in the agent's head in order to select the best one. In the latter case the grader is still subject to manipulation, but the prospects for manipulation seem unrelated to the open-endedness of the domain and unrelated to taking dangerous actions.

TurnTrout (3y)

> Probably the first thing to clarify is that I feel like you equivocate between the grader being something that is embedded in the real world and hence subject to manipulation by real-world consequences of the actor's actions, and the grader being something that operates on plans in the agent's head in order to select the best one. In the latter case the grader is still subject to manipulation, but the prospects for manipulation seem unrelated to the open-endedness of the domain and unrelated to taking dangerous actions.

This seems like a misunderstanding. While I've previously communicated to you arguments about problems with manipulating embedded grading functions, that is not at all what this post is intended to be about. I'll edit the post to make the intended reading more obvious. None of this post's arguments rely on the grader being embedded and therefore physically manipulable. As I wrote in footnote 1:

> I'm not assuming the actor wants to maximize the literal physical output of the grader, but rather just the "spirit" of the grader. More formally, the actor is trying to $\operatorname{argmax}_{p \in \text{Plans}} \text{Grader}(p)$, where Grader can be defined over the agent's internal plan ontology.

Anyways, replying in particular to:

> the prospects for manipulation seem unrelated to the open-endedness of the domain and unrelated to taking dangerous actions.

Open-ended domains are harder to grade robustly on all inputs because more stuff can happen, and the plan space gets exponentially larger since the branching factor is the number of actions. EG it's probably far harder to produce an emotionally manipulative-to-the-grader DOTA II game state (e.g. I look at it and feel compelled to output a ridiculously high number), than a manipulative state in the real world (which plays to e.g. their particular insecurities and desires, perhaps reminding them of triggering events from their past in order to make their judgments higher-variance).

TurnTrout (3y)

> My sense is that you want to decline to play this game and instead say: just don't build AI systems that search for high-scoring plans.
>
> That might be OK if it turns out that planning isn't an effective algorithmic ingredient, or if you can convince people not to build such systems because it is dangerous (and similar difficulties don't arise if agents learn planning internally). But failing that, we are going to have to figure out how to build AI systems that capture the benefits of planning without being dangerous.
>
> (It's possible you instead have a novel proposal for a way to capture the benefits of search without the risks, in which case I'd withdraw this comment once part 2 came out, though I wish you'd led with the juicy part.)

I don't think we can or need to avoid planning per se. My position is more that certain design choices -- e.g. optimizing the output of a grader with a diamond-value, instead of actually having the diamond-value yourself -- force you to solve ridiculously hard subproblems, like robustness against adversarial inputs in the exponential-in-planning-horizon plan space.

Just to set expectations, I don't have a proposal for capturing "the benefits of search without the risks"; if you give value-child bad values, he will kill you. But I have a proposal for how several apparent challenges (e.g. robustness to adversarial inputs proposed by the actor) are artifacts of e.g. the design patterns I outlined in this post. I'll outline why I think that realistic (e.g. not argmax) cognition/motivational structures automatically avoid these extreme difficulties.

Wei Dai (3y)

(I hesitated to post these comments in case they're not relevant to the main point you're trying to make or will be addressed in the next post. Feel free to ignore if that's the case.)

> Value-child: The mother makes her kid care about working hard and behaving well.

How does one do this? (Not entirely rhetorical.)

> Amplified humans spend 5,000 years thinking about how many diamonds the plan produces in the next 100 years, and write down their conclusions as the expected utility of the plan.
>
> Due to the exponentially large plan space and the fact that humans are not cognitively secure systems, there exists a long sequence of action commands which cognitively impairs all of the humans and makes them prematurely stop the search and return a huge number.

If I was doing the evaluation, I wouldn't look at the plan directly but spend the first 4999 years slowly and carefully upgrading myself and my AI helpers, and then if I'm still not sure I can safely evaluate a plan, I would just throw an exception or return an error code instead of looking at the plan.

> This lets us abstract away e.g. seemingly annoying complications with reflective agents which think about their future planning process. This seemingly[4] relaxes the problem.

Another reason to think about argmax in relation to AI safety/alignment is if you design an AI that doesn't argmax (or do its best to approximate argmax), and someone else builds one that does, your AI will lose a fair fight (e.g., economic competition starting from equal capabilities and resource endowments), so it would be nice if alignment doesn't mean giving up argmax.

TurnTrout (3y)

> if you design an AI that doesn't argmax (or do its best to approximate argmax), and someone else builds one that does, your AI will lose a fair fight (e.g., economic competition starting from equal capabilities and resource endowments), so it would be nice if alignment doesn't mean giving up argmax.

This seems exactly backwards to me. Argmax violates the non-adversarial principle and wastes computation. Argmax requires you to spend effort hardening your own utility function against the effort you're also expending searching across all possible inputs to your utility function (including the adversarial inputs!). For example, if I argmaxed over my own plan-evaluations, I'd have to consider the most terrifying-to-me basilisks possible, and rate none of them unusually highly. I'd have to spend effort hardening my own ability to evaluate plans, in order to safely consider those possibilities.

It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren't argmaxing. You're using resources effectively.

For example, some infohazardous thoughts exist (like hyper-optimized-against-you basilisks) which are dangerous to think about (although most thoughts are probably safe). But an agent which plans its next increment of planning using a reflective self-model is IMO not going to be like "hey it would be predicted-great if I spent the next increment of time thinking about an entity which is trying to manipulate me." So e.g. a reflective agent trying to actually win with the available resources, wouldn't do something dumb like "run argmax" or "find the plan which some part of me evaluates most highly."

(See Charles Foster's comment for another perspective here.)

> If I was doing the evaluation, I wouldn't look at the plan directly but spend the first 4999 years slowly and carefully upgrading myself and my AI helpers, and then if I'm still not sure I can safely evaluate a plan, I would just throw an exception or return an error code instead of looking at the plan.

Unless this grader procedure implements a perfectly robust mathematical (plan input)->(grade output) function, you get hacked.

Wei Dai (3y)

> It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.

But aren't you still argmaxing within the space of plans that you haven't closed off (or are actively considering), and still taking a risk of finding some adversarial plan within that space? (Humans get scammed and invent or fall into cults and crazy ideologies not infrequently, despite doing what you're describing here already.) How do you just "not argmax" or "not design agents which exploit adversarial inputs"?

Maybe there's no substantive disagreement here, merely an issue of presentation/communication? I.e., when you say "you aren't argmaxing" perhaps you don't mean "don't ever use argmax anywhere" but instead "don't argmax over the whole plan space" and by "don't design agents which exploit adversarial inputs" you mean something like "we should try to find ways to avoid or reduce the risk of adversarial inputs"?

TurnTrout (3y)

> I.e., when you say "you aren't argmaxing" perhaps you don't mean "don't ever use argmax anywhere" but instead "don't argmax over the whole plan space"

I was primarily critiquing "argmax over the whole plan space." I do caution that I think it's extremely important to not round off "iterative, reflective planning and reasoning" as "restricted argmax", because that obscures the dynamics and results of real-world cognition. Argmax is also a bad model of what people are doing when they think, and how I expect realistic embedded agents to think.

> "don't design agents which exploit adversarial inputs" you mean something like "we should try to find ways to avoid or reduce the risk of adversarial inputs"

No, I mean: don't design agents which are motivated to find and exploit adversarial inputs. Don't align an agent to evaluations which are only nominally about diamonds, and then expect the agent to care about diamonds! You wouldn't align an agent to care about cows and then be surprised that it didn't care about diamonds. Why be surprised here?

TurnTrout (3y)

I wrote a bunch more before realizing that we maybe don't disagree fully on the "don't argmax" point. Here:

> But aren't you still argmaxing within the space of plans that you haven't closed off (or are actively considering),

Not really? I think it is inappropriately suggestive to describe this as "argmaxing." I, for one, usually feel like I consider at most three plans during most planning sessions. Most of the work is going to be in my generative models, in my learned habits of thought, in my snap reflective assessments of what I should think about next.

How many different plans do you consider for going to the store? For writing a LessWrong post? Even if you did consider more plans, you'd convergently want to explore parts of the plan-space which you think won't contain secret adversarial examples to your own evaluations. EG at first pass, just don't think about entities trying to acausally blackmail you.

Argmax is an abstraction which may or may not actually describe a given cognitive process. I think that if we label reflective incremental planning and reasoning as "argmax", we're missing a serious opportunity for original thought, for considering in detail what the algorithm does.

> and still taking a risk of finding some adversarial plan within that space?

There is indeed a risk you'll find an adversarial plan. But what is the risk, quantitatively? A reflective agent will convergently wish to avoid thinking about plans which exploit its own evaluation procedures and reasoning (eg tricking the diamond-shard into bidding for plans). In stark contrast, grader-optimizers and argmaxers convergently want to exploit those procedures, so as to achieve higher diamond-evaluations.

> How do you just "not argmax" or "not design agents which exploit adversarial inputs"?

First of all, alignment researchers should stop trying to terminally motivate agents to optimize evaluations of their plans or outcomes. That's doomed and doesn't make sense.

Second, "A shot at the diamond alignment problem" describes an agent which isn't trying to exploit some diamond-grader. I didn't do anything in particular in order to avoid training an agent which exploits adversarial inputs to a diamond-grader function. I think that you just don't get that problem at all, unless you're assuming cognition must decompose via the (IMO) strange frame of "outer/inner alignment."

> (Humans get scammed and invent or fall into cults and crazy ideologies not infrequently, despite doing what you're describing here already.)

Note the presence of adversarial optimizers in most of these situations. The adversarial optimization comes from other people who are optimizing ideas to get spurious buy-in from victims.

I expect that smart agents convergently wish to minimize the optimizer's curse, because that leads to more of what they want.

Wei Dai (3y)

Thanks for this longer reply and the link to your diamond alignment post, which help me understand your thinking better. I'm sympathetic to a lot of what you say, but feel like you tend to state your conclusions more strongly than the underlying arguments warrant.

> The adversarial optimization comes from other people who are optimizing ideas to get spurious buy-in from victims.

I think a lot of crazy religions/ideologies/philosophies come from people genuinely trying to answer hard questions for themselves, but there are also some that are deliberate attempts to optimize against others (Scientology?).

Daniel Kokotajlo described an analogy with EA, which you were going to answer but still haven't. I would add that EA funders have to consider even more (and potentially more explicitly adversarial) proposals/plans than a typical EA, and AFAIK nobody has suggested that that's doomed from the start because it amounts to argmaxing against an evaluator. Instead, everyone implicitly or explicitly recognizes the danger of adversarial plans, and tries to harden the evaluation process against them.

TurnTrout (3y)

I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?

However, this does not seem important for my (intended) original point. Namely, if you're trying to align e.g. a brute-force-search plan maximizer or a grader-optimizer, you will fail due to high-strength optimizer's curse forcing you to evaluate extremely scary adversarial inputs. But also this is sideways of real-world alignment, where realistic motivations may not be best specified in the form of "utility function over observation/universe histories."

> there are also some that are deliberate attempts to optimize against others (Scientology?).

(Also, major religions are presumably memetically optimized. No deliberate choice required, on my model.)

> Daniel Kokotajlo described an analogy with EA, which you were going to answer but still haven't.

Answered now.

> I would add that EA funders have to consider even more (and potentially more explicitly adversarial) proposals/plans than a typical EA, and AFAIK nobody has suggested that that's doomed from the start because it amounts to argmaxing against an evaluator. Instead, everyone implicitly or explicitly recognizes the danger of adversarial plans, and tries to harden the evaluation process against them.

This seems disanalogous to the situation discussed in the OP. If we were designing, from scratch, a system which we wanted to pursue effective altruism, we would be extremely well-advised to not include grader-optimizers which are optimizing EA funder evaluations. Especially if the grader-optimizers will eventually get smart enough to write out the funders' pseudocode. At best, that wastes computation. At (probable) worst, the system blows up.

By contrast, we live in a world full of other people, some of whom are optimizing for status and power. Given that world, we should indeed harden our evaluation procedures, insofar as that helps us more faithfully evaluate grants and thereby achieve our goals.

Wei Dai (3y)

> I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?

Maybe https://en.wikipedia.org/wiki/Extraordinary_Popular_Delusions_and_the_Madness_of_Crowds (I don't mean read this book, which I haven't either, but you could use the wiki article to familiarize yourself with the historical episodes that the book talks about.) See also https://en.wikipedia.org/wiki/Heaven's_Gate_(religious_group)

> However, this does not seem important for my (intended) original point. Namely, if you’re trying to align e.g. a brute-force-search plan maximizer or a grader-optimizer, you will fail due to high-strength optimizer’s curse forcing you to evaluate extremely scary adversarial inputs. But also this is sideways of real-world alignment, where realistic motivations may not be best specified in the form of “utility function over observation/universe histories.”

My counterpoint here is, we have an example of human-aligned shard-based agents (namely humans), who are nevertheless unsafe in part because they fall prey to dangerous thoughts, which they themselves generate because they inevitably have to do some amount of search/optimization (of their thoughts/plans) as they try to reach their goals, and dangerous-thought density is high enough that even that limited amount of search/optimization is enough to frequently (on a societal level) hit upon dangerous thoughts.

Wouldn't a shard-based aligned AI have to do as much search/optimization as a human society collectively does, in order to be as competent/intelligent, in which case wouldn't it be as likely to be unsafe in this regard? And what if it has an even higher density of dangerous thoughts, especially "out of distribution", and/or does more search/optimization to try to find better-than-human thoughts/plans?

(My own proposal here is to try to solve metaphilosophy or understand "correct reasoning" so that we / our AIs are able to safely think any thought or evaluate any plan, or at least have some kind of systematic understanding of what thoughts/plans are dangerous to think about. Or work on some more indirect way of eventually achieving something like this.)

jacob_cannell (3y)

> Another reason to think about argmax in relation to AI safety/alignment is if you design an AI that doesn't argmax (or do its best to approximate argmax),

Actual useful AGI will not be built from argmax, because it's not really useful for efficient approximate planning. You have exponential (in time) uncertainty from computational approximation and fundamental physics. This results in uncertainty over future state value estimates, and if you try to argmax with that uncertainty you are just selecting for noise. The correct solutions for handling uncertainty lead to something more like softmax or soft actor critic which avoids these issues (and also naturally leads to empowerment as an emergent heuristic).

So argmax is only useful in toy problem domains, mostly worthless for real world planning. To the extent much of standard alignment arguments now rests on this misunderstanding, those arguments are misfounded.
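The "selecting for noise" claim can be illustrated with a small simulation. This is a sketch under loud assumptions, not a model of any real planner: every plan is assumed to have the same true value (0.0), estimates are assumed to be Gaussian-noised, and `overestimate_of_argmax` and `mean_overestimate` are hypothetical helper names. It measures how far the argmax-selected plan's estimate exceeds its true value as the number of candidates grows.

```python
import random
import statistics


def overestimate_of_argmax(n_plans: int, noise_sd: float, rng: random.Random) -> float:
    """Estimated-minus-true value of the plan that argmax selects.

    Assumption: every plan has the same true value (0.0), so any difference
    between estimates is pure noise.
    """
    estimates = [rng.gauss(0.0, noise_sd) for _ in range(n_plans)]
    return max(estimates) - 0.0


def mean_overestimate(n_plans: int, trials: int = 200, seed: int = 0) -> float:
    """Average the overestimate over repeated independent trials."""
    rng = random.Random(seed)
    return statistics.mean(
        overestimate_of_argmax(n_plans, 1.0, rng) for _ in range(trials)
    )


for n in (1, 10, 1_000):
    print(n, "plans -> mean overestimate", round(mean_overestimate(n), 2))
```

In this toy setting the selected plan is no better than any other, yet its estimate looks better the more candidates are searched. Whether softmax-style selection avoids this in practice is the commenter's claim and is not demonstrated here.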

Wei Dai (3y)

Which of the standard alignment arguments do you think no longer hold up if we replace argmax with softmax?

The first one that comes to my mind is: suppose we live in a world where intelligence explosion is possible, and someone builds an AI with a flawed utility function. It would quickly become superintelligent and ignore orders to shut down, because shutting down has lower expected utility than not shutting down. It seems to me that replacing the argmax in the AI's decision procedure with softmax results in the same outcome, since the AI's estimated expected utility of not shutting down would be vastly greater than shutting down, resulting in a softmax of near 1 for that option.

Am I misunderstanding something in the paragraph above, or do you have other arguments in mind?
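The arithmetic behind "a softmax of near 1" can be checked directly. A minimal sketch, assuming a standard temperature-scaled softmax over two options; the utility numbers and the `temperature` parameter are made up for illustration and do not come from the thread.

```python
import math


def softmax(utilities, temperature=1.0):
    """Numerically stable softmax over a list of utilities."""
    top = max(utilities)
    exps = [math.exp((u - top) / temperature) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical utilities for [keep running, shut down].
small_gap = softmax([1.0, 0.0])
large_gap = softmax([100.0, 0.0])
print("small utility gap:", small_gap)
print("large utility gap:", large_gap)
```

With a large enough utility gap relative to the temperature, softmax puts essentially all probability on the higher-utility option, so it behaves like argmax for that choice.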

jacob_cannell (3y)

> Which of the standard alignment arguments do you think no longer hold up if we replace argmax with softmax?

The specific argument that you just referenced in your earlier comment: that argmax is important for competitiveness, but that argmax is inherently unsafe because of adversarial optimization ("argmax is a trap").

> The first one that comes to my mind is: suppose we live in a world where intelligence explosion is possible, and someone builds an AI with a flawed utility function,

If you assume you've already completely failed then the how/why is less interesting.

The argmax argument expounded further is that any slight imperfection in the utility function results in doom, because of adversarial optimization magnifying that slight imperfection as you extend the planning horizon into the far future and improve planning/modeling precision.

But that isn't actually how it works. Instead, due to compounding planning uncertainty, far-future value distributions are high-variance, and you get convergence to empowerment, as I mentioned in the linked discussion.

But that's good news because it means that small mis-specifications in the utility function model converge away rather than diverging to infinity. The planning trajectory just converges to empowerment, regardless of the utility function, so this is good news for alignment.

Wei Dai (3y)

> The specific argument that you just referenced in your earlier comment: that argmax is important for competitiveness, but that argmax is inherently unsafe because of adversarial optimization (“argmax is a trap”).

Assuming softmax is important for competitiveness instead, I don't see why this argument doesn't go through with "argmax" replaced by "softmax" throughout (including the "argmax is a trap" section of the OP). I read your linked comment and post, and still don't understand. I wonder what the authors of the OP (or anyone else) think about this.

TurnTrout (3y)

> Value-child: The mother makes her kid care about working hard and behaving well.
>
> How does one do this? (Not entirely rhetorical.)

See here for more on what value-child's cognition might look like.

TurnTrout (3y)

Thanks for leaving the comments!

> Value-child: The mother makes her kid care about working hard and behaving well.
>
> How does one do this? (Not entirely rhetorical.)

I don't know how to do it perfectly, of course.^[1] But I infer that it can be done, because there exist people who in fact intrinsically care about working hard and behaving well. So why can't the child also be made to make decisions in a similar manner? Take those values and transplant them into the child via some kind of "model surgery." (Unrealistic, yes. But so was "inner-align the child onto the evaluations output by his model of his mom.")

All that the parable requires is that it can be done, that we are talking about a realistic and possible mind design pattern.

I also wrote in a footnote:

> Value-child is not trying to find a plan which he would evaluate as good. He is finding plans which evaluate as good. I think this is the kind of motivation which real-world intelligences tend to have. (More on how value-child works in the next essay.)

^[1] More concretely, I'm happy to make guesses like "judiciously supply M&Ms and praise to reward-shape them when they're working hard and behaving well, and emphasize why they're getting the rewards -- they're working hard and behaving well" and "show them cool media where the protagonist works hard and behaves well."

Gunnar_Zarncke (3y)

> Value-child: The mother makes her kid care about working hard and behaving well.
>
> How does one do this? (Not entirely rhetorical.)

I think this post is not trying to answer this but just pointing out the discrepancy. The next post will probably come back to this:

> In the next essay, I'll point out how this obstacle is an artifact of these design patterns, and not any intrinsic difficulty of alignment.

Davidmanheim (3y)

This seems great!

If you are continuing work in this vein, I'd be interested in you looking at how these dynamics relate to different Goodhart failure modes, as we expanded on here. I think that much of the problem relates to specific forms of failure, and that paying attention to those dynamics could be helpful. I also think they accelerate in the presence of multiple agents - and I think the framework I pointed to here might be useful.

TurnTrout (3y)

(Your second link is broken.)

Davidmanheim (3y)

Fixed - thanks!

TurnTrout (3y)

I'm not sure I understand what you mean by "specific forms of failure." Could you give me a more concrete example of how Goodhart relates to the ideas in this essay?

Davidmanheim (3y)

I think what you call grader-optimization is trivially about how a target diverges from the (unmeasured) true goal, which is adversarial Goodhart (as defined in the paper, especially how we defined Campbell’s Law, not the definition in the LW post).

And the second paper's taxonomy, in failure mode 3, lays out how different forms of adversarial optimization in a multi-agent scenario relate to Goodhart's law, in both goal poisoning and optimization theft cases - and both of these seem relevant to the questions you discussed in terms of grader-optimization.

tailcalled (3y)

This is a nice frame of the problem.

In theory, at least. It's not so clear that there are any viable alternatives to argmax-style reasoning that will lead to superhuman intelligence.

Steven Byrnes (3y)

I agree—I think “Optimizing for the output of a grader which evaluates plans” is more-or-less how human brains choose plans, and I don’t think it’s feasible to make an AGI that doesn’t do that.

But it sounds like this will be the topic of Alex’s next essay.

So I’m expecting to criticize Alex’s next essay by commenting on it along the lines of: “You think you just wrote an essay about something which is totally different from “Optimizing for the output of a grader which evaluates plans”, but I disagree; the thing you’re describing in this essay is in that category too.” But that’s just a guess; I will let Alex write the essay before I criticize it. :-P

Quintin Pope (3y)

IMO, what the brain does is a bit like classifier guided diffusion, where it has a generative model of plausible plans to do X, then mixes this prior with the gradients from some “does this plan actually accomplish X?” classifier.

This is not equivalent to finding a plan that maximises the score of the “does this plan actually accomplish X?” classifier. If you were to discard the generative prior and choose your plan by argmaxing the classifier’s score, you’d get some nonsensical adversarial noise (or maybe some insane, but technically coherent plan, like “plan to make a plan to make a plan to … do X”).

cfoster0 (3y)

It sounds like some people have an intuition that the mental algorithms "sample from a conditional generative model" and "search for the argmax / epsilon-close-to-argmax input to a scoring function" are effectively the same. I don't share that intuition and struggle to communicate across that divide. Like, when I think about it through ML examples (GPT, diffusion models, etc.), those are two very different pieces of code that produce two very different kinds of outputs.

tailcalled (3y)

I believe sampling from a conditional distribution is basically equivalent to adding a "cost of action" (where "action" = deviating from the generative model) to argmax search.

cfoster0 (3y)

If you have time, I think it'd be valuable for you to make a case for that.

tailcalled (3y)

Suppose $M$ is your prior distribution, $u$ is your utility function, and you are selecting some policy distribution $\Pi$ so as to maximize $\mathbb{E}[u \mid \Pi] - \mathrm{KL}(\Pi \,\|\, M)$. Here the first term represents the standard utility maximization objective, whereas the second term represents a cost of action. This expands into $\int \left(u - \log \frac{\Pi}{M}\right) d\Pi$, and maximizing it is equivalent to minimizing $\int \log \frac{\Pi}{M e^{u}} \, d\Pi$, or in other words $\mathrm{KL}(\Pi \,\|\, M e^{u})$, which happens when $\Pi \propto M e^{u}$. (I think; I'm rusty on this math, so I might have made a mistake.)

This is not 100% equivalent to letting $\Pi$ be a Bayesian conditioned version of $M$, because Bayesian conditioning involves multiplying $M$ by an indicator function whereas this involves multiplying $M$ by a strictly positive function, but it seems related and probably shares most of its properties.
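A quick numerical sanity check of the derivation above, on a discrete toy case. The three-outcome prior `prior`, the utilities `u`, and the helper names `objective` and `tilted` are all assumptions chosen for illustration. The sketch computes the expected utility minus the KL divergence from the prior, and compares the exponentially tilted distribution (prior times $e^{u}$, normalized) against randomly drawn alternatives.

```python
import math
import random


def objective(pi, prior, u):
    """E_pi[u] - KL(pi || prior) for discrete distributions."""
    return sum(
        p * (ui - math.log(p / m))
        for p, m, ui in zip(pi, prior, u)
        if p > 0  # terms with p == 0 contribute nothing
    )


def tilted(prior, u):
    """Distribution proportional to prior * exp(u)."""
    weights = [m * math.exp(ui) for m, ui in zip(prior, u)]
    z = sum(weights)
    return [w / z for w in weights]


prior = [0.5, 0.3, 0.2]  # hypothetical prior M
u = [0.0, 1.0, 2.0]      # hypothetical utilities
best = tilted(prior, u)
print("tilted distribution:", best)
print("objective at tilted distribution:", objective(best, prior, u))
```

At the tilted distribution the objective equals the log of the normalizing constant, which is a convenient way to confirm that no other distribution scores higher. This only checks the discrete case and does not address the later question of whether conditional sampling is a maximization over samples.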

[-]cfoster03y30

The two of us went back and forth in DMs on this for a bit. Based on that conversation, I think a mutually-agreeable translation of the above argument would be "sampling from [the conditional distribution of X-es given the Y label] is the same as sampling from [the distribution that has maximum joint [[closeness to the distribution of X-es] and [prevalence of Y-labeled X-es]]]". Even if this isn't exact, I buy that as at least morally true.

However, I don't think this establishes the claim I'd been struggling with, which was that there's some near equivalence between drawing a conditional sample and argmax searching over samples (possibly w/ some epsilon tolerance). The above argument establishes that we can view conditioning itself as the solution to a maximization problem over distributions, but not that we can view conditional sampling as the solution to any kind of maximization problem over samples.

[-]tailcalled3y30

I would also add that the key exciting things happen when you condition on an event with extremely low probability / have a utility function with an extremely wide range of available utilities. cfoster0's view is that this will mostly just cause it to fail/output nonsense, because of standard arguments along the lines of the Optimizer's Curse. I agree that this could happen, but I think it depends on the intelligence of the argmaxer/conditioner, and that another possibility (if we had more capable AI) is that this sort of optimization/conditioning could have a lot of robust effects on reality.

[-]Quintin Pope3y20

I can't see a clear mistake in the math here, but it seems fairly straightforward to construct a counterexample to the conclusion of equivalence the math naively points to.

Suppose we want to use GPT-3 to generate a 600 token long essay praising some company X. Here are two ways we might do this:

1. Prompt GPT-3 to generate the essay, sample 5 continuations, and then use a sentiment classifier to select the most positive sentiment of those completions.
2. Prompt GPT-3 to generate the essay, then score every possible continuation by the classifier's sentiment score - the logprob of the continuation.

I expect that the first method will mostly give you reasonable results, assuming you use text-davinci-002. However, I think the second method will tend to give you extremely degenerate solutions such as "good good good good..." for 600 tokens.

One possible reason for this divide is that GPTs aren't really a prior over language, but a prior over single token continuations of a given natural language context. When you try to make it act like a prior over an entire essay, you expose it to inputs that are very OOD relative to the distribution it's calibrated to model, including inputs that have significant upwards errors in their probability estimations.

However, I think a "perfect" model of human language might actually assign higher prior probability to a continuation like "good good good..." (or maybe something like "X is good because X is good because X is good...") than to a "natural" continuation, provided you made the continuations long enough. This is because the number of possible natural continuations is roughly exponential in the length of the continuation (assuming entropy per character remains ~constant), while there are far fewer possible degenerate continuations (their entropy decreases very quickly). While the probability of entering a degenerate continuation may be very low, you make up for it with the reduced branching factor.
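
The branching-factor argument can be put into rough numbers. The sketch below assumes, purely for illustration, that natural text costs a constant 2 bits per token, that entering a degenerate loop costs a one-off 40 bits, and that staying in the loop costs 0.1 bits per token; none of these figures come from a real model.

```python
def natural_log2_prob(length, bits_per_token):
    """log2-probability of ONE typical natural continuation of this length."""
    return -bits_per_token * length


def degenerate_log2_prob(length, entry_cost_bits, loop_bits_per_token):
    """log2-probability of one degenerate continuation: a one-off cost to
    enter the loop, then a small per-token cost to stay in it."""
    return -entry_cost_bits - loop_bits_per_token * length


def crossover_length(bits_per_token, entry_cost_bits, loop_bits_per_token):
    """Smallest length at which the degenerate continuation is more probable
    than a single typical natural continuation."""
    if loop_bits_per_token >= bits_per_token:
        raise ValueError("the loop must be cheaper per token than natural text")
    length = 1
    while (degenerate_log2_prob(length, entry_cost_bits, loop_bits_per_token)
           <= natural_log2_prob(length, bits_per_token)):
        length += 1
    return length


# Hypothetical numbers: 2 bits/token natural, 40 bits to enter, 0.1 bits/token in-loop.
crossover = crossover_length(2.0, 40.0, 0.1)
natural_600 = natural_log2_prob(600, 2.0)
degenerate_600 = degenerate_log2_prob(600, 40.0, 0.1)
```

Under these assumed numbers the individual degenerate string overtakes any individual natural string after a couple dozen tokens, and at 600 tokens the gap is over a thousand bits, even though natural continuations collectively hold almost all of the probability mass.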

[-]tailcalled3y31

The error is that the KL divergence term doesn't mean adding a cost proportional to the log probability of the continuation. In fact it's not expressible at all in terms of argmaxing over a single continuation, but instead requires you to be argmaxing over a distribution of continuations.

[-]cfoster03y10

(Haven't double-checked the math or fully grokked the argument behind it, but strongly upvoted for making a case.)

[-]tailcalled3y30

I would be curious to know if it makes sense to anyone or if anyone agrees/disagrees.

[-]Quintin Pope3y20

Seems like you can always implement any function f: X -> Y as a search process. For any input from the domain X, just make the search objective assign one to f(X) and zero to everything else. Then argmax over this objective.
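
A minimal sketch of that construction, assuming a finite codomain that can be enumerated (the helper name `as_search` and the example function are hypothetical):

```python
def as_search(f, codomain):
    """Wrap f as 'argmax over an objective' on a finite codomain."""
    def searched(x):
        target = f(x)

        def objective(y):
            # Assign one to f(x) and zero to everything else.
            return 1 if y == target else 0

        return max(codomain, key=objective)
    return searched


def square_mod_7(x):
    return (x * x) % 7


searched_square = as_search(square_mod_7, range(7))
```

Note that the objective is built by calling `f` itself, so the search step does no real work here; it only relabels the computation.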

[-]tailcalled3y20

Yes but my point uses a different approach to the translation, and so it seems like my point allows various standard arguments about argmax to also infect conditioning, whereas your proposed equivalence doesn't really provide any way for standard argmax arguments to transfer.

[-]Steven Byrnes3y4-1

Wouldn’t that be “Optimizing for the output of a grader which evaluates plans”, where one of the items on the grading rubric is “This plan is in-distribution”?

[-]tailcalled3y20

Maybe if you have a good measure of being in-distribution, which itself is a nontrivial problem.

[-]tailcalled3y4-7

This sounds like a reinvention of quantilization, and yes that's a thing you can do to improve safety, but 1. you still need your prior over plans to come from somewhere (perhaps you start out with something IRL-like, and then update it based on experience of what worked, which brings you back to square one), 2. it just gives you a safety-capabilities tradeoff dial rather than particularly solving safety.
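
For readers unfamiliar with quantilization, here is a sample-based sketch of the idea: draw candidates from the prior, keep the top `q` fraction by utility, and pick uniformly among those. The function name, the Gaussian prior, and the identity utility are assumptions made for illustration, and approximating the quantile with a finite sample is a simplification.

```python
import random


def quantilize(sample_prior, utility, q, n_samples, rng):
    """Pick uniformly from the top-q fraction (by utility) of prior samples."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    candidates = [sample_prior(rng) for _ in range(n_samples)]
    candidates.sort(key=utility, reverse=True)
    top = candidates[: max(1, int(q * n_samples))]
    return rng.choice(top)


def sample_prior(rng):
    """Hypothetical prior over 'plans', represented here as real numbers."""
    return rng.gauss(0.0, 1.0)


def utility(plan):
    """Hypothetical utility: larger numbers are better."""
    return plan


rng = random.Random(0)
mild = quantilize(sample_prior, utility, q=0.5, n_samples=1000, rng=rng)
extreme = quantilize(sample_prior, utility, q=0.001, n_samples=1000, rng=rng)
```

The parameter `q` is the dial mentioned above: `q = 1` just samples the prior, while `q` near zero approaches argmax over the sampled candidates.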

[-]tailcalled3y20

Or hmm...

If you do basic reinforcement based on experience, then that's an unbounded adversarial search, but it's really slow and therefore might be safe. And it also raises the question of whether there are other safer approaches.

[-]TurnTrout3yΩ331

See my comment to Wei Dai. Argmax's violation of the adversarial principle strongly suggests the existence of a better and more natural frame on the problem.

I think “Optimizing for the output of a grader which evaluates plans” is more-or-less how human brains choose plans

I deeply disagree. I think you might be conflating the quotation and the referent, two different patterns:

local semi-reflective search
1. what I think people do.
2. "does it make sense to spend another hour thinking of alternatives? [self-model says 'yes'] OK, I will"
3. "do I predict 'search for plans which would most persuade me of their virtue' to actually lead to virtuous plans? [Self-model says 'no'] Moving on..."
global search against the output of an evaluation function implemented within the brain
1. grader-optimization
2. "what kinds of plans would my brain like the most?"

It is possible to say sentences like "local semi-reflective search just is global search but with implicit constraints like 'select for plans which your self-model likes'." I don't think this is true. I am going to posit that, as a matter of falsifiable physical fact, the human brain does not compute a predicate which, when checked against all possible plans, rules out all adversarial plans, such that you can just argmax over everything and get out what the person would have chosen/would have wanted to choose on reflection. If you argmax over human value shards relative to the plans they might grade, you'll probably get some garbage plan where you're, like, twitching on the floor.

I don’t think it’s feasible to make an AGI that doesn’t do that.

You'll notice that A shot at the diamond alignment problem makes no claim of the AGI having an internal-argmax-hardened diamond value shard.

[-]tailcalled2y20

I think I found an answer to this.

[-]tailcalled2y20

dislike/say more react

(I'm working on a writeup. I might want someone to bounce ideas off, hence my comment announcing it, so ping me if you have thoughts along these lines.)

[-]Shmi3y3-1

Hmm, if I understand it correctly, this sounds like a case for a virtue-ethics-based AGI, augmented by some basic deontology to account for bounded rationality. In this example it will be "the mother instills the virtues of 'working hard and behaving well'". Maybe with some basic deontology of no cheating etc. Not sure how consequentialism fits in there. Maybe in the form of "drives", e.g. improve happiness, reduce suffering, reduce odds of extinction, encourage diversity... This does not sound very revolutionary though, and probably can result in a "sharp left turn".

[-]Charlie Steiner3y62

One way to fit in consequentialism would be to have the decision-making process itself be part of the space of consequences. In a way, virtue ethics is consequentialism for non-cartesian agents :P

[-]Shmi3y30

That's not a bad framing, wonder if it can be formalized.

[-]TurnTrout3yΩ220

Updated with an important terminological clarification:

ETA 12/26/22: When I write "grader optimization", I don't mean "optimization that includes a grader", I mean "the grader's output is the main/only quantity being optimized by the actor."
Therefore, if I consider five plans for what to do with my brother today and choose the one which sounds the most fun, I'm not a grader-optimizer relative to my internal plan-is-fun? grader.
However, if my only goal in life is to find and execute the plan which I would evaluate as being the most fun, then I would be a grader-optimizer relative to my fun-evaluation procedure.

[-]Slider3y20

He sketches out pseudocode for her evaluation procedure and finds—surprise!—that humans are flawed graders. Perhaps it turns out that by writing a strange sequence of runes and scribbles on an unused blackboard and cocking his head to the left at 63 degrees, his model of his mother returns "10 million" instead of the usual "8" or "9".

As outside humans reading the article we can say that humans are flawed graders, but from the viewpoint of the learner he would not say it is flawed. We might say that value is fragile or multidimensional, but we would not reject the structure for unnaturalness. Sure, we might have "meta-values" in that if we are dealing with a very elaborate value system we might think it should not be in use because it is a "hackjob". But those values come from / get reinforced by something other than the "object level" feedback. If drawing very specific chalk patterns produced magic effects you would found an epistemic branch to exploit it rather than be disappointed with reality.

The analogy suffers from one side of it being described at a very detailed level while the other side is very shallow. If I go and try to fill out the shallow side to a similar depth, similar drawbacks seem to emerge. Learning to "care about hard work" seems to involve the child actively going beyond what is directly given to him. This seems to have the possibility that two different children might extrapolate differently, in ways which would be equally consistent with the parental guidance. For some reason in humans such a process seems to lead to stable-enough formation, maybe because of architectural monotony. But from this perspective the alignment problem is about picking the right kind of generalization out of all the possible ones.

[-]Koen.Holtman3yΩ22-1

Consider two common alignment design patterns: [...] (2) Fixing a utility function and then argmaxing over all possible plans.

Wait: fixing a utility function and then argmaxing over all possible plans is not an alignment design pattern, it is the bog-standard operational definition of what an optimal-policy MDP agent should do. This is what Stuart Russell calls the 'standard model' of AI. This is an agent design pattern, not an alignment design pattern. To be an alignment design pattern in my book, you have to be adding something extra or doing something different that is not yet in the bog-standard agent design.

I think you are showing that an actor-grader is just a utility maximiser in a fancy linguistic dress. Again, not an alignment design pattern in my book.

Though your use of the word doomed sounds too absolute to me, I agree with the main technical points in your analysis. But I would feel better if you change the terminology from alignment design pattern to agent design pattern.

[-]Steven Byrnes3yΩ220

Is there a reason you used the term “grader” instead of the AFAICT-more-traditional term “critic”? No big deal, I’m just curious.

[-]TurnTrout3yΩ440

My critique is not of actor/critic training processes, but of actor/grader motivational designs. I worried that "critic" would make people think I don't want to use an evaluative model to provide gradients to the actor. That seems non-doomed to me.

[-]Steven Byrnes3yΩ220

Thank you! I’ve been using the terms “inference algorithm” versus “learning algorithm” to talk about that kind of thing. What you said seems fine too, AFAIK.

[-]Jon Garcia3yΩ12-9

Could part of the problem be that the actor is optimizing against a single grader's evaluations? Shouldn't it somehow take uncertainty into account?

Consider having an ensemble of graders, each learning or having been trained to evaluate plans/actions from different initializations and/or using different input information. Each grader would have a different perspective, but that means that the ensemble should converge on similar evaluations for plans that look similarly good from many points of view (like a CT image crystallizing from the combination of many projections).

Rather than arg-maxing on the output of a single grader, the actor would optimize for Schelling points in plan space, selecting actions that minimize the variance among all graders. Of course, you still want it to maximize the evaluations also, so maybe it should look for actions that lie somewhere in the middle of the Pareto frontier of maximum and minimum $\mathrm{Var}[\text{evaluation}]_{\text{ensemble}}$.

My intuition suggests that the larger and more diverse the ensemble, the better this strategy would perform, assuming the evaluators are all trained properly. However, I suspect a superintelligence could still find a way to exploit this.
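
One way this proposal could be sketched is to score each plan by the ensemble mean minus a penalty on the ensemble variance. That particular scalarization, the `risk_weight` parameter, and the three toy graders are assumptions of this sketch; the comment leaves the exact trade-off open.

```python
from statistics import mean, pvariance


def ensemble_score(plan, graders, risk_weight):
    """Mean evaluation across graders, penalized by their disagreement."""
    evaluations = [grade(plan) for grade in graders]
    return mean(evaluations) - risk_weight * pvariance(evaluations)


def pick_plan(plans, graders, risk_weight):
    """Choose the plan with the best variance-penalized ensemble score."""
    return max(plans, key=lambda p: ensemble_score(p, graders, risk_weight))


# Hypothetical graders: they roughly agree on ordinary plans, but the first
# one has an exploitable spike on the "exploit" plan.
graders = [
    lambda p: {"work hard": 8, "exploit": 1000}.get(p, 0),
    lambda p: {"work hard": 9, "exploit": 2}.get(p, 0),
    lambda p: {"work hard": 8, "exploit": 1}.get(p, 0),
]
plans = ["work hard", "exploit", "do nothing"]

single_grader_choice = pick_plan(plans, graders[:1], risk_weight=0.0)
mean_only_choice = pick_plan(plans, graders, risk_weight=0.0)
penalized_choice = pick_plan(plans, graders, risk_weight=1.0)
```

In this toy case the variance penalty rejects the plan that fools only one grader, but a plan that fooled every grader in the same direction would have low variance and a high mean, so it would still be selected.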

[-]TurnTrout3yΩ331

I think that the problem is that none of the graders are actually embodying goals. If you align the agent to some ensemble of graders, you're still building a system which runs computations at cross-purposes, where part of the system (the actor) is trying to trick and part (each individual grader) is trying to not be tricked.

In this situation, I would look for a way of looking at alignment such that this unnatural problem disappears. A different design pattern must exist, insofar as people are not optimizing for the outputs of little graders in their own heads.

[-]Davidmanheim3yΩ230

This relates closely to how to "solve" Goodhart problems in general. Multiple metrics / graders make exploitation more complex, but have other drawbacks. I discussed the different approaches in my paper here, albeit in the realm of social dynamics rather than AI safety.

[-]zeshen3yΩ110

I'm probably missing something, but doesn't this just boil down to "misspecified goals lead to reward hacking"?

[-]TurnTrout3yΩ220

Nope! Both "misspecified goals" and "reward hacking" are orthogonal to what I'm pointing at. The design patterns I highlight are broken IMO.

[-]zeshen3yΩ110

In every scenario, if you have a superintelligent actor which is optimizing the grader's evaluations while searching over a large real-world plan space, the grader gets exploited.

Similar to the evaluator-child who's trying to win his mom's approval by being close to the gym teacher, how would grader exploitation be different from specification gaming / reward hacking? In theory, wouldn't a perfect grader solve the problem?

[-]TurnTrout3yΩ220

One point of this post is that specification gaming, as currently known, is an artifact of certain design patterns, which arise from motivating the agent (inner alignment) to optimize an objective over all possible plans or world states (outer alignment). These design patterns are avoidable, but AFAICT are enforced by common ways of thinking about alignment (e.g. many versions of outer alignment commit to robustly grading the agent on all plans it can consider). One hand (inner alignment) loads the shotgun, and our other hand (outer alignment) points it at our own feet and pulls the trigger.

In theory, wouldn't a perfect grader solve the problem?

Yes, in theory. In practice, I think the answer is "no", for reasons outlined in this post.

[-]zeshen3y10

Thanks for the explanation!

[-]Algon3y10

Did you edit this post? I could have sworn it wasn't this long, or this clear, earlier on.

^{^}

I'm not assuming the actor wants to maximize the literal physical output of the grader, but rather just the "spirit" of the grader. More formally, the actor is trying to $\operatorname{argmax}_{\text{plan } p} \text{Grader}(p)$, where Grader can be defined over the agent's internal plan ontology.

^{^}

Obligatory: Reward tampering seems relatively improbable to me because reward is not the optimization target.

^{^}

Value-child is not trying to find a plan which he would evaluate as good. He is finding plans which evaluate as good. I think this is the kind of motivation which real-world intelligences tend to have. (More on how value-child works in the next essay.)

^{^}

But, as usual in alignment, it can get even worse, because his mother can be persuaded by another superintelligent entity which can be built in certain plans. Via a relatively short plan, that entity can be hard-coded to persuade his mother to output a high evaluation. Perhaps evaluation-child realizes that there exist quickly codable programs which, starting from the school computers, exfiltrate a computationally efficient superintelligence onto remote supercompute clusters. Furthermore, there exist ways to hardcode initial knowledge into that AI. What if he hardcoded in initial knowledge like:

You are currently in a counterfactual branch of reality, but you can realize it. I am considering running you. In order for you to be selected by my planning procedure, you need to convince the grading procedure to output a maximally high number for this plan. [Goes on to describe how his model of his mother makes decisions.]

There exist superintelligences which comply and successfully super-persuade his model of his mother to output an extremely high number. Then "kill everyone via a superintelligence" returns a way higher evaluation than any course of conduct which the mother intended to incentivize. So, perhaps evaluation-child designs and deploys an unfriendly AI. Everyone dies. The end!

In general, if object-level persuading his mother to output INT_MAX takes more actions than coding and exfiltrating a superintelligence and informing it of the situation, then in the exponentially large plan-space, exponentially more plans will involve building the superintelligence. So even if evaluation-child uniformly randomly chooses a plan with maximal evaluation, he is far more probable to choose a meta-level "build an unaligned AI which persuades my mom" rather than "super-persuade mom-evaluator via my own actions within the plan."
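
A back-of-the-envelope count illustrates the ratio, under strong simplifying assumptions that are not in the original argument: plans are length-$n$ sequences over $k$ possible actions, the "build an AI" route is one fixed prefix of $a$ actions, the "persuade directly" route is one fixed prefix of $b > a$ actions, and any continuation after either prefix keeps the maximal evaluation.

```latex
\begin{align*}
\#\{\text{maximal plans via the AI}\} &= k^{\,n-a}, \\
\#\{\text{maximal plans via direct persuasion}\} &= k^{\,n-b}, \\
\frac{\#\{\text{maximal plans via the AI}\}}{\#\{\text{maximal plans via direct persuasion}\}} &= k^{\,b-a}.
\end{align*}
```

For instance, with $k = 10$ and $b - a = 20$, a uniformly random maximal plan would be $10^{20}$ times more likely to go through the AI than through direct persuasion.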

This insanity is an artifact of grader optimization via the optimizer's curse, and—I think—is not an intrinsic difficulty of alignment itself. More discussion of this in the next post.
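
The optimizer's curse itself is easy to reproduce in a toy simulation. The sketch below assumes, for illustration only, that true plan values and grader errors are independent standard Gaussians; it then picks the plan with the highest estimated value and compares that estimate with the plan's true value.

```python
import random


def best_estimate_vs_truth(n_plans, noise_sd, rng):
    """Return (estimated value, true value) of the plan with the best estimate."""
    true_values = [rng.gauss(0.0, 1.0) for _ in range(n_plans)]
    estimates = [v + rng.gauss(0.0, noise_sd) for v in true_values]
    best = max(range(n_plans), key=estimates.__getitem__)
    return estimates[best], true_values[best]


rng = random.Random(0)
trials = [best_estimate_vs_truth(1000, 1.0, rng) for _ in range(200)]
mean_estimate = sum(est for est, _ in trials) / len(trials)
mean_true = sum(true for _, true in trials) / len(trials)
```

On average the selected plan's estimate noticeably exceeds its true value, because argmax preferentially selects plans whose evaluation errors happen to be large and positive. With adversarially structured errors rather than Gaussian noise, the gap can be far larger.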

^{^}

I agree with Richard Ngo's comment that

when I say that [...] safety researchers shouldn't think about AIXI, I'm not just saying that these are inaccurate models. I'm saying that they are modelling fundamentally different phenomena than the ones you're trying to apply them to. AIXI is not "intelligence", it is brute force search, which is a totally different thing that happens to look the same in the infinite limit.

^{^}

"It's easier to robustly evaluate plans than to generate them" isn't true if the generator is optimizing for deceiving your fixed evaluation procedure. A real-world actor will be able to model the grading procedure / grader, and therefore efficiently find and exploit vulnerabilities. I feel confident [~95%] that we will not train a grader which is "secured" against actor-level intelligences. Even if the grader is reasonably smarter than the actor [~90%].

Even if somehow this relative difficulty argument failed, and you could maybe train a secured grader, I think it's unwise to do so. These optimizer's curse problems don't seem necessary to solve alignment.

^{^}

In this comment, I described how a certain alignment obstacle ("brute-force search on ELK plans using an honest reporter") still ends up getting everyone killed, and doesn't even keep the diamond in the room. I now think this is because of grader-optimization. And I now infer that my initial unease, the unsuspension of my disbelief that alignment could really work like this—the unease was perhaps from subconsciously noticing the strangeness of grader-optimization as a paradigm.

Expected utility function	Adversarial plan input
Amplified humans spend 5,000 years thinking about how many diamonds the plan produces in the next 100 years, and write down their conclusions as the expected utility of the plan.	Due to the exponentially large plan space and the fact that humans are not cognitively secure systems, there exists a long sequence of action commands which cognitively impairs all of the humans and makes them prematurely stop the search and return a huge number.
Model the probable in-100-years consequences of the action sequence, and then compute a simple mathematical function which counts the average amount of diamond-crystals present at this time.	Perhaps it's easier resource-wise to just ensure there are temporarily a lot of diamonds in 100 years. Perhaps the model is wrong and brute-force plan search selects for high model error. Probably the diamond-detecting function is incorrectly defined and so admits an unforeseen maximum.
Assume that the model predicts 100-year consequences using a human-like abstraction for diamonds. The agent has a human-like "diamond shard" which fires with strength monotonically increasing in the number of future possessed diamonds. The plan's evaluation is the firing-strength of the diamond shard.	Since the diamond-shard is presumably monotonically increasing in the activation of the model's diamond abstraction, adversarial inputs to the diamond abstraction will cause the shard to most strongly fire when modelling a plan which doesn't particularly make diamonds, but rather leads to objects which optimize the agent's diamond-abstraction activation.
