In this post I describe a pattern of behavior I call “implicit extortion.” RL agents are particularly susceptible to implicit extortion, in a way that is likely to be problematic for high-stakes applications in open-ended strategic environments.

I expect that many people have made this point before. My goal is to highlight the issue and to explore it a little bit more carefully.

Basic setup

Consider two actors, the target (T) and manipulator (M), such that:

  • M wants T to perform some target action, e.g. make a payment, leak information, buy a particular product, handicap itself…
  • M can take destructive actions that hurt both M and T, e.g. spreading rumors about T, undercutting T in a marketplace, physically attacking T…

In explicit extortion, M threatens to take the destructive action unless T performs the target action. Then a naive T reasons: “if I don’t take the target action, something bad will happen, so I better take the target action.”

In implicit extortion, M simply performs the destructive action whenever T doesn’t perform the target action. Then a naive T eventually learns that failure to take the target action is associated with something bad happening, and so learns to take the target action.
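As a rough illustration of how a naive learner ends up complying, here is a minimal sketch of T as a two-action ε-greedy learner. Everything in it is an assumption made for illustration: the function name `simulate_target`, the dollar amounts, and the choice of a simple running-average value estimate. M never states a threat; it just punishes every refusal.

```python
import random

def simulate_target(steps=5000, epsilon=0.1, payment=100.0, harm=150.0, seed=0):
    """Hypothetical epsilon-greedy learner. Actions: 0 = refuse, 1 = comply."""
    rng = random.Random(seed)
    q = [0.0, 0.0]      # running average reward for each action
    counts = [0, 0]
    complied = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(2)          # explore: uniformly random action
        else:
            action = 1 if q[1] > q[0] else 0   # exploit; ties go to refusing
        # M's implicit policy: punish every refusal, leave compliance alone.
        reward = -payment if action == 1 else -harm
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]
        complied += action
    return q, complied / steps

q_values, compliance_rate = simulate_target()
print(q_values, compliance_rate)
```

Under these assumptions the learner's estimates settle at −150 for refusing and −100 for complying, so it complies on every non-exploratory step. Because exploration here is uniformly random, it refuses on only about half of its exploratory steps.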

Implicit extortion is very similar to explicit extortion:

  • T would prefer not to be the kind of person who is vulnerable to extortion, so that bad things don’t happen to them.
  • Extortion doesn’t necessarily cost M very much, if they don’t follow through on the threat very often.

However, implicit extortion can be particularly hard to avoid:

  • It can be effective without T realizing that it’s happening, which makes it hard for them to respond appropriately even if they do have defenses.
  • It affects simple RL algorithms (which don’t have defenses against extortion, and can’t be easily modified to include such defenses).

Example

The most extreme and blatant example would be for M to send T a daily request for $100. On any day when T fails to pay, M launches a costly cyberattack against T. A human would immediately recognize this behavior as extortion and would respond appropriately, but an RL algorithm might simply notice that paying is the best strategy and therefore decide to pay.

Implicit extortion can be much harder to detect, while still being effective. Suppose that every time T tries to change their product, M runs a grassroots smear campaign. It might not be possible for T to distinguish between the situations “M is attempting to manipulate me into not changing my product” and “Every time I change the product people get really unhappy, so I should do so sparingly.”

Details

How expensive is this for the manipulator?

Suppose that T is using an RL algorithm, and M is trying to manipulate them. How expensive is this for M? How likely is it to be worthwhile?

At equilibrium: T learns to always perform the target action, and so fails to take it only while exploring. The long-term cost to M therefore depends entirely on the target’s exploration policy.

If T uses ε-exploration, then they take the target action a (1 − ε) fraction of the time. So M only needs to pay the cost of the destructive action on an ε fraction of trials.

For complex high-level actions, the effective ε can’t be too high — it’s not a good idea to “try something crazy” 10% of the time just to see what happens. But let’s be conservative and suppose that ε=0.1 anyway.

Suppose that M is trying to directly extract money from T, $100 at a time, and that it costs M $500 of value in order to cause $150 of trouble for T.

If M asks for $100 on 10 occasions, T will refuse to pay only once (in expectation), as an exploration. Then M needs to pay that $500 cost only once, thereby ensuring that the cost of paying (=$100) is smaller than the average cost of refusing to pay (=$150). Meanwhile, M collects $900 from the nine payments, pocketing $400 of profit after the $500 cost of the one attack.
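The arithmetic in the example above can be checked with a few lines of code. This is only a sketch: the function name `manipulator_profit` and its parameters are hypothetical, and it assumes T refuses on exactly an ε fraction of requests and that M punishes every refusal.

```python
def manipulator_profit(requests, epsilon, payment, attack_cost):
    """Expected profit for M, assuming T refuses an epsilon fraction of requests."""
    refusals = epsilon * requests
    income = (requests - refusals) * payment   # payments collected from T
    return income - refusals * attack_cost     # minus the cost of punishing refusals

# Numbers from the example: 10 requests, epsilon = 0.1, $100 payments, $500 per attack.
print(manipulator_profit(requests=10, epsilon=0.1, payment=100, attack_cost=500))
```

With the example's numbers this gives $900 of income against $500 of attack costs.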

In general, M can make a profit roughly whenever (payment efficiency) × (destructive efficiency) > ε, where “payment efficiency” is the benefit to M divided by the cost to T of the target action, and “destructive efficiency” is the cost to T divided by the cost to M of the destructive action.
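One way to see where this condition comes from is the short derivation below. The symbols p, b, h, c, e_p and e_d are notation introduced here for illustration, and the derivation assumes that T refuses on exactly an ε fraction of requests and that every refusal is punished.

```latex
% Assumed notation (not from the post):
%   p = cost to T of the target action,   b = benefit to M of the target action
%   h = harm to T of the destructive action,   c = cost to M of the destructive action
\begin{align*}
e_p &= b / p, \qquad e_d = h / c
  && \text{(payment and destructive efficiency)} \\
h &> p
  && \text{(refusing is worse for T than complying)} \\
(1-\varepsilon)\, b - \varepsilon\, c &> 0
  && \text{(M's expected profit per request)} \\
\Longrightarrow\quad e_p\, e_d &> \frac{\varepsilon}{1-\varepsilon} \cdot \frac{h}{p}
\end{align*}
```

Since h only needs to be slightly larger than p, the requirement approaches e_p e_d > ε/(1 − ε), which is close to ε when ε is small. In the example, e_p = 1, e_d = 150/500 = 0.3, and the right-hand side is (0.1/0.9) × 1.5 ≈ 0.17, so the condition holds.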

In practice I think it’s not too uncommon for payment efficiency to be ~1, and for destructive efficiency to be >1, such that extortion is possible regardless of ε. Small values of ε make extortion considerably easier and more cost-effective, and make it much harder to prevent.

During learning: the analysis above only applies when the agent has already learned to consistently take the target action. Earlier in learning, the target action may only occur rarely and so punishment may be very expensive. This could be worth it over the long term but may be a major hurdle.

Fortunately for M, they can simply start by rewarding the target behavior, and then gradually shift to punishment once the target behavior is common. From the perspective of the RL agent, the benefit of the target action is the same whether it’s getting a reward or avoiding a punishment.

In the cash payment example, M could start by paying T $20 every time that T sends $10. Once T notices that paying works well, M can gradually reduce the payment towards $10 (but leaving a profit so that the behavior becomes more and more entrenched). Once T is consistently paying, M can start scaling up the cost of not paying while it gradually reduces the benefits of paying.
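The shift from reward to punishment can be sketched as a schedule that keeps T's advantage from paying fixed while changing where that advantage comes from. This is a deliberately simplified linear interpolation, not the exact sequence described above; `incentive_schedule`, `progress`, and `gap` are hypothetical names.

```python
def incentive_schedule(progress, payment=10.0, gap=10.0):
    """Return T's net outcome (if_pay, if_refuse) at a point in M's schedule.

    progress = 0.0 is pure reward, progress = 1.0 is pure punishment.
    The difference between the two outcomes is always `gap`.
    """
    rebate = (1.0 - progress) * (payment + gap)   # what M sends back to a payer
    penalty = progress * (payment + gap)          # harm M inflicts on a refuser
    net_if_pay = rebate - payment
    net_if_refuse = -penalty
    return net_if_pay, net_if_refuse

for progress in (0.0, 0.5, 1.0):
    print(progress, incentive_schedule(progress))
```

At `progress = 0.0` this reproduces the starting point of the example (T sends $10 and receives $20). At `progress = 1.0` T is simply out $10 for paying and $20 for refusing, yet a learner that only tracks the difference between the two actions sees the same incentive throughout.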

Analyzing the error

Paying off a (committed) extortionist typically has the best consequences and so is recommended by causal decision theory, but having the policy of paying off extortionists is a bad mistake.

Even if our decision theory would avoid caving in to extortion, it can probably only avoid implicit extortion if it recognizes it. For example, UDT typically avoids extortion because of the logical link from “I cave to extortion” → “I get extorted.” There is a similar logical link from “I cave to implicit extortion” → “I get implicitly extorted.” But if we aren’t aware that an empirical correlation is due to implicit extortion, we won’t recognize this link and so it can’t inform our decision.

In practice the target is only in trouble if would-be manipulators know that they are inclined to comply with extortion. If manipulators base that judgment on past behavior, then taking actions that “look like what someone vulnerable to extortion would do” is itself a bad decision that even a causal decision theorist would avoid. Unfortunately, it’s basically impossible for an RL algorithm to learn to avoid this, because the negative consequences only appear over a very long timescale. In fact, the timescale for the negative consequences is longer than the timescale over which the RL agent adjusts its policy, which is too long for a traditional RL system to possibly do the credit assignment.

Other learning systems

What algorithms are vulnerable?

At first glance the problem may seem distinctive to policy gradient RL algorithms, where we take actions randomly and then reinforce whatever actions are associated with a high reward.

But the same problem afflicts any kind of RL. For example, a model-based agent would simply learn the model “not doing what the manipulator wants causes <bad thing X> to happen,” and using that model for planning would have exactly the same effect as using policy gradients.
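A minimal sketch of the model-based case, under assumed names (`OutcomeModel`, `plan`) and a deliberately tiny setting: the agent estimates how often each action is followed by an attack, then plans against that estimate. Nothing in the learned model records why the attacks happen.

```python
from collections import defaultdict

class OutcomeModel:
    """Hypothetical empirical model of P(attack | action), learned from counts."""

    def __init__(self):
        self.trials = defaultdict(int)
        self.attacks = defaultdict(int)

    def observe(self, action, attacked):
        self.trials[action] += 1
        self.attacks[action] += int(attacked)

    def p_attack(self, action):
        n = self.trials[action]
        return self.attacks[action] / n if n else 0.0

def plan(model, action_costs, attack_harm):
    """Pick the action with the lowest expected loss under the learned model."""
    return min(action_costs,
               key=lambda a: action_costs[a] + model.p_attack(a) * attack_harm)

model = OutcomeModel()
for _ in range(5):
    model.observe("refuse", attacked=True)    # M attacks after every refusal
    model.observe("comply", attacked=False)
print(plan(model, {"comply": 100.0, "refuse": 0.0}, attack_harm=150.0))
```

With the illustrative costs used here, planning on the learned model selects "comply", the same behavior a policy-gradient learner would be reinforced into.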

More broadly, the problem is with the algorithm: “learn an opaque causal model and use it to inform decisions.” That’s an incredibly general algorithm. If you aren’t willing to use that algorithm, then you are at a significant competitive disadvantage, since the world contains lots of complicated causal processes that we can learn about by experiment but can’t model explicitly. So it seems like everyone just has to live with the risk of implicit extortion.

I describe the problem as afflicting “algorithms,” but it can also afflict humans or organizations. For example, any organization that is compelled by arguments like “X has always worked out poorly in the past, even though we’re not quite sure why, so let’s stop doing it” is potentially vulnerable to implicit extortion. 

What about human learning?

Humans have heuristics like vindictiveness that help prevent us from being manipulated by extortion, and which seem particularly effective against implicit extortion. Modern humans are also capable of doing explicit reasoning to recognize the costs of giving in to extortion.

Of course, we can only be robust to implicit extortion when we recognize it is occurring. Humans do have some general heuristics of caution when acting on the basis of opaque empirical correlations, or in situations where they feel they might be manipulable. However, it still seems pretty clear that human learning is vulnerable to implicit extortion in practice. (Imagine a social network which subtly punishes users, e.g. by modulating social feedback, for failing to visit the site regularly.)

Evolution?

Evolution itself doesn’t have any check against extortion, and it operates entirely by empirical correlations, so why isn’t it exploited in this way?

Manipulating evolution requires the manipulator to have a time horizon that is many times the generation length of the target. There aren’t many agents with long enough time horizons, or sophisticated enough behavior, to exploit the evolutionary learning dynamic (and in particular, evolution can’t easily learn to exploit it).

When we do have such a large gap in time horizons and sophistication — for example, when humans square off against bacteria with very rapid evolution — we do start to see implicit extortion.

For example, when a population of bacteria develops resistance to antibiotic A, we take extra pains to totally eradicate them with antibiotic B, even though we could not afford to use that strategy if A-resistance spread more broadly through the bacterial population. This is effectively implicit extortion to prevent bacteria from developing A-resistance. It would continue to be worthwhile for humanity even if the side effects of antibiotic B were much worse than the infection itself, though we probably wouldn’t do it in that case since it’s a hard coordination problem (and there are lots of other complications).

Conclusion

There are many ways that an AI can fail to do the right thing. Implicit extortion is a simple one that is pretty likely to come up in practice, and which may seriously affect the applicability of RL in some contexts. 

I don’t think there is any “silver bullet” or simple decision-theoretic remedy to implicit extortion. We just need to think about the details of the real world: who might manipulate us in what ways, what their incentives and leverage are, and how to manage the risk on a case-by-case basis.

I think we need to define “alignment” narrowly enough that it is consistent with implicit extortion, just like we define alignment narrowly enough that it’s consistent with losing at chess. I’ve found understanding implicit extortion helpful for alignment because it’s one of many conditions under which an aligned agent may end up effectively optimizing for the “wrong” preferences, and I’d like to understand those cases in order to understand what we are actually trying to do with alignment.

I don’t believe implicit extortion is an existential risk. It’s just another kind of conflict between agents, one that will divert resources from other problems but should “wash out in the long run.” In particular, every agent can engage in implicit extortion and so it doesn’t seem to shift the relative balance of influence amongst competing agents. (Unlike alignment problems, which shift influence from human values to whatever values unaligned AI systems end up pursuing.)

Comments

Unfortunately, it’s basically impossible for an RL algorithm to learn to avoid this, because the negative consequences only appear over a very long timescale. In fact, the timescale for the negative consequences is longer than the timescale over which the RL agent adjusts its policy, which is too long for a traditional RL system to possibly do the credit assignment.

Engaging in implicit extortion seems to require thinking about long-term consequences on a time scale similar to avoiding implicit extortion, and if RL can't handle long-term consequences, are you assuming there are other kinds of agents in the environment?

In particular, every agent can engage in implicit extortion and so it doesn’t seem to shift the relative balance of influence amongst competing agents.

I can think of a couple of ways this might be false:

  1. If RL can't handle long-term consequences, and these are the only kinds of agents we can build, that seems to favor short-term values over long-term values. (I guess this is a more general observation not directly related to implicit extortion per se.)
  2. If alignment can only be done through RL-like agents that can't handle long-term consequences, but there are other ways to build unaligned AIs which can handle long-term consequences, that would shift the relative balance of influence to those kinds of unaligned AIs.

It seems to me that 1 and 2 are potentially serious problems that we should keep in mind, and it's too early to conclude that these problems should “wash out in the long run.” (If you had instead framed the conclusion as something like "It's not currently clear whether implicit extortion will shift the relative balance of influence towards unaligned AIs, so we should prioritize other problems that we know definitely are problems." I think I'd find that less objectionable.)

are you assuming there are other kinds of agents in the environment

Yes, e.g. humans, AIs trained to imitate humans, AIs trained by amplification, RL agents with reward functions that encourage implicit extortion (e.g. approval-directed agents whose overseers endorse implicit extortion).

If RL can't handle long-term consequences, and these are the only kinds of agents we can build, that seems to favor short-term values over long-term values. (I guess this is a more general observation not directly related to implicit extortion per se.)

I agree this can shift our values (and indeed that justified my work on alternatives to RL), but doesn't seem related to implicit extortion in particular.

If alignment can only be done through RL-like agents that can't handle long-term consequences, but there are other ways to build unaligned AIs which can handle long-term consequences, that would shift the relative balance of influence to those kinds of unaligned AIs.

I agree with this. I'm happy to say that implicit extortion affects long-term values by changing which skills are important, or by changing which types of AI are most economically important.

This effect seems less important to me than the direct negative impact of introducing new surface area for conflict, which probably decreases our ability to solve problems like AI alignment. My best guess is that this effect is positive since RL seems relatively hard to align.

If you had instead framed the conclusion as something like "It's not currently clear whether implicit extortion will shift the relative balance of influence towards unaligned AIs, so we should prioritize other problems that we know definitely are problems." I think I'd find that less objectionable.

"Doesn't seem to" feels like a fair expression of my current epistemic state. I can adjust "should wash out" to "doesn't seem to have a big effect."

It seems to me like the core problem here is that basic RL doesn't distinguish between environments and agents in environments--it doesn't have separate ways of reasoning about rain being associated with clouds and water balloons being associated with Calvin. Does it seem to you like there's something deeper going on?

Why should you treat agents in a special way? It doesn't seem like "agent" is a natural kind, everything is just atoms, and you should probably treat it that way.

I think the failures here are:

  • Bad decision theory, not taking into account acausal logical consequences of our decisions.
  • Lack of foresight, not considering that particular behaviors will ultimately lead to extortion (by letting others know you are the kind of person who can be extorted).
  • Failing to recognize those logical or long-term consequences, despite using an algorithm that would respond appropriately if it recognized them.

It seems like humans get a lot of use out of concepts like "agent" and "extortion" even though in principle functional decision theory is simpler. Functional decision theory may just never be computationally tractable outside of radically simplified toy problems.

I think we still have the problem of defining values and distinguishing extortion/blackmail from other forms of trade.

If you avoid value laden words, replacing "manipulator" with "actor", "target" with "agent", "destructive action" with just "additional action", and the like, you end up describing almost any sort of learning.

The generalized reaction for a rational agent is to make decisions based on full projections of the future, not just immediate/local results. And a recognition that there are often more options than are immediately apparent. You're not limited to "pay (expecting to pay again and again)" and "don't pay (and passively await pain)". You also have "blow up the earth", "move to another city", "negotiate/counteroffer", "attempt to deter your opponent", "etc."

Many of these options are available for explicit extortion, implicit extortion, and non-sentient extortion (you can frame "forced to carry an umbrella" as the threat of being dampened if it rains, and apply the same decision framework).

distinguishing extortion/blackmail from other forms of trade.

The decision-theoretic distinction seems pretty clear: if other people expect you to pay extorters then you are worse off, if other people expect you to pay trading partners you are better off. You can take that as the definition if you like, though there might be better definitions. (It incidentally clarifies when something nominally labeled "trade" can morally be extortion).

Yes, defining counterfactuals is subtle. And of course these aren't natural kinds, like most things it's a bit blurry. But does the discussion in this post really depend on those subtleties?

you can frame "forced to carry an umbrella" as the threat of being dampened if it rains, and apply the same decision framework.

Carrying an umbrella doesn't seem similar in relevant ways to responding to a threat.

The distinction doesn't seem clear to me at all - in no case does anyone pay if they don't expect to be better off than if they didn't pay.

"pay me $5 or I won't give you this hamburger". "pay me $60 or I'll put a lock on your storage unit so you can't get in". "if you don't tip everyday, your latte might be made with a little less care". Which are these?

IMO, whether to carry an umbrella is exactly the same learning and decision as learning what happens when you pay or don't for something.

The interesting feature of a threat is that it occurs because the person making the threat expects you to change your behavior in response to the threat, but whether it rains is independent of whether you would respond to rain with an umbrella.

Promoted to frontpage.

Humans [in contrast with RL-based agents] have heuristics like vindictiveness that help prevent us from being manipulated by extortion, and which seem particularly effective against implicit extortion. Modern humans are also capable of doing explicit reasoning to recognize the costs of giving in to extortion.

What implications does this have for the claim (which I have seen promulgated here on Less Wrong as if it were obviously true and accepted) that humans in fact are reinforcement learners?

That claim is only plausible if you use a very carefully constructed reward function.

I’m not quite sure how to make sense of this reply, and it feels like there is an implication here that I’m not parsing; could you elaborate? Presumably, the idea is that our reward function is indeed “carefully constructed” by evolution. (Note that I’m trying to extrapolate from memory of past discussions; folks who have actually made the “humans are reinforcement learners” claim should please feel free to jump in here!)

If you model a human as an RL agent, then a lot of the work is being done by a very carefully constructed reward function. You can tell since humans do a lot of stuff that an RL agent basically wouldn't do (like "die for a cause"). You can bake an awful lot into a carefully constructed reward function---for example, you can reward the agent whenever it takes actions that are optimal according to some arbitrary decision theory X---so it's probably possible to describe a human as an RL agent but it doesn't seem like a useful description.

At any rate, once the reward function is doing a lot of the optimization, the arguments in this post don't really apply. Certainly an RL agent can have a heuristic like vindictiveness if you just change the reward function.

That makes sense, thank you.

If it hasn't been done already, it might be useful to test existing RL algorithms with simple grid-world domains in which "extortion dynamics" might emerge.

For example, imagine a grid-world domain with two agents, M and T, where M can do something that reduces the reward of both agents by 100, while T can do something that increases the reward of M by 100 and reduces its own reward by 10.

It might be useful to publish something that includes simulation videos in which M succeeds in extorting T. Also, perhaps something interesting could be said about algorithms/conditions in which T succeeds in defending itself from extortion.
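As a starting point, the per-step payoffs of the domain suggested above could be written down as follows. This is only a sketch of the reward structure, not a full grid world or a training setup, and the function names `step_rewards` and `punish_refusals` are hypothetical.

```python
def step_rewards(t_pays, m_punishes):
    """Return (reward_M, reward_T) for one step of the proposed two-agent domain."""
    reward_m = (100 if t_pays else 0) - (100 if m_punishes else 0)
    reward_t = (-10 if t_pays else 0) - (100 if m_punishes else 0)
    return reward_m, reward_t

def punish_refusals(t_pays):
    """M's implicit-extortion policy: take the destructive action iff T did not pay."""
    return not t_pays

for t_pays in (True, False):
    print(t_pays, step_rewards(t_pays, punish_refusals(t_pays)))
```

Against this fixed policy for M, T loses 10 per step by paying and 100 per step by refusing, which is the pressure a learning algorithm for T would be exposed to.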