This post was made as part of John Wentworth's SERI MATS skill-up phase. I'd like to thank Chu Chen, Stephen Fowler, John Wentworth, and LessWrong's review services for help reviewing.

When I was a kid, I used to play poker^{[1]} with my brothers. I prided myself on thinking of clever strategies to trick them, like creating false tells or making contradictory claims in an effort to confuse them. These strategies usually succeeded on the first one or two tries, but by the third try they'd usually catch on to my game and devastate me for the next 3-5 rounds as I tried to think up another clever trick, meaning they'd usually win. The problem was that I relied on simple, easily learnable heuristics for deception, which cost more time and resources to think up than they repaid by being persistently useful. A better strategy would have been to build a model in my head of my brothers' reactions and base my strategies on that model. That way, I'd have been able to figure out when they'd catch on to my tricks, and to think up new tricks without needing to try them out in play and potentially lose some points. This post attempts to generalize my experience to give insights into the stability properties of these heuristic-based and model-based approaches after repeated interactions between a deceiving agent and a target agent. With some work, the descriptions here can likely be generalized into an equivalent argument about general agents, not just deceptive ones.

Say we are looking in on a poker game between Alice and Bob. Based on Alice’s betting strategy and body language, Bob can clearly tell that Alice has a better hand than he does. However, Alice has been tricked by Bob’s betting strategy and body language into believing that Bob has a far better hand than hers. Alice folds, and Bob wins the hand.

In this post, we’re going to get a feel for why my intuitions lean towards the conjecture that Bob must be modeling Alice in some way. That is, why I believe that information-generating processes which cause deception in an otherwise adequate agent should be running some abstracted version of the target agent, rather than applying a simple inverse.

## Should we expect Bob to have a model of Alice?

The argument made here should be differentiated from the case where Bob is or models an inverse of Alice^{[2]}. If Bob is very good at his job, he can always be considered a certain type of inverse of Alice, since he starts at some conclusion or action he’d like Alice to make, translates this into information he predicts will cause this effect, then sends this information to Alice, who translates it into the conclusion or action Bob wanted her to make.

Bob may not be a perfect inverse, though, since there may be some actions he is unable to make Alice take. For instance, it may be impossible for Bob to convince Alice to give him all her money^{[3]}.

When I say Bob is modeling Alice, I mean something more specific than this. I mean Bob has an explicit model of Alice which he can use to predict Alice’s conclusions and actions after receiving various pieces of information. If you were to take this piece and run it as-is^{[4]} you would get an OK approximation of Alice’s reactions.
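The distinction between "being an inverse" and "having a model" can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `toy_alice`, the signal labels, and `inverse_from_model` are all hypothetical names invented here, and a real Alice is of course not a two-line function. The point is the direction of the arrows: the model maps information to Alice's reaction, and the inverse is obtained by searching that model.

```python
from typing import Callable, Optional, Sequence

Signal = str  # what Bob shows Alice (hypothetical labels)
Action = str  # what Alice then does


def toy_alice(signal: Signal) -> Action:
    """Hypothetical stand-in for Alice: she folds when Bob looks confident."""
    return "fold" if signal == "confident" else "call"


def inverse_from_model(
    model: Callable[[Signal], Action],
    signals: Sequence[Signal],
) -> Callable[[Action], Optional[Signal]]:
    """Build an Alice-inverse by searching an explicit, runnable model of her."""

    def inverse(desired: Action) -> Optional[Signal]:
        for signal in signals:
            if model(signal) == desired:
                return signal
        return None  # no available signal produces this action

    return inverse


bob = inverse_from_model(toy_alice, ["nervous", "confident"])
print(bob("fold"))             # confident
print(bob("hand over money"))  # None: the inverse is not perfect
```

Here the `model` argument is the piece that could be run as-is to approximate Alice's reactions. A heuristic Bob would instead hard-code something like `inverse` directly, with no runnable model inside it.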

Such a model seems necessary when Alice and Bob are acting for extended periods of time, and Alice has access to outside sources of information. We will be playing with the poker game example from before. In this example, there are many different strategies Bob may take. Notably, we do not expect Bob to start with an explicit, simple heuristic, develop this into a complicated heuristic, and then into an internal model; these are just different strategies Bob may decide to implement. Broadly, Bob has three options:

1. Use a simple heuristic to manipulate Alice’s beliefs and actions. For instance, he may figure out that if he acts confident, Alice is more likely to believe he has good cards. He may also try to give Alice a fake tell, like rubbing his elbows each time he has a bad hand, in order to subvert that tell when he has a very good hand.

2. Use a complicated heuristic to manipulate Alice’s beliefs and actions. This corresponds to Bob using his gut to develop strategies to manipulate Alice, rather than any simple, explicit rule. In this example, it wouldn’t be any exact strategy, but complicated, fuzzy gut feelings about how Bob should go about bluffing. This differs from a simple heuristic because it is more complex, and we don’t expect Alice to be able to learn the exact method Bob is using to come to his conclusions. It’s also not obviously better, though this is outside the scope of this blog post.

3. Use an internal model of Alice, which directly simulates her reactions to receiving various bits of information, and choose the information to send based on which reactions are most preferable.

If Bob uses a simple heuristic (1), Alice is very likely to pick up on what he’s doing after a few iterations. It often takes only one deceptive tell to determine that the tell is, in fact, deceptive. Likewise, if Bob consistently acts confident when he has a bad set of cards, Alice will easily be able to connect these concepts together, and no longer be subject to such deception. After exhausting the obvious simple heuristic strategies, Bob will have to start scrambling for new ones.
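To make this concrete, here is a minimal sketch of how quickly a fixed rule gets learned. Everything in it is an assumption made for illustration: the names `CountingAlice` and `simple_heuristic_bob`, the two signal labels, the sequence of hands, and in particular the premise that Alice sees Bob's hand after every round. Alice just keeps smoothed counts of how often each signal accompanied a good hand.

```python
class CountingAlice:
    """Toy Alice: estimates P(good hand | signal) from revealed hands."""

    def __init__(self) -> None:
        # Laplace-smoothed counts per signal: [seen with good hand, seen with bad hand]
        self.counts = {"confident": [1, 1], "quiet": [1, 1]}

    def p_good(self, signal: str) -> float:
        good, bad = self.counts[signal]
        return good / (good + bad)

    def observe(self, signal: str, hand_is_good: bool) -> None:
        self.counts[signal][0 if hand_is_good else 1] += 1


def simple_heuristic_bob(hand_is_good: bool) -> str:
    """Hypothetical fixed rule: act confident exactly when the hand is bad."""
    return "quiet" if hand_is_good else "confident"


alice = CountingAlice()
beliefs = [alice.p_good("confident")]  # Alice's belief before any evidence
for hand_is_good in [False, True, False, False]:
    signal = simple_heuristic_bob(hand_is_good)
    alice.observe(signal, hand_is_good)  # assumption: the hand is revealed
    beliefs.append(alice.p_good("confident"))

print(beliefs)  # falls from 0.5 to 0.2 as the bluffs are exposed
```

After three exposed bluffs, this Alice treats confidence as evidence of a bad hand, and the heuristic has stopped working. A less naive Alice would presumably get there at least as fast.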

If Bob uses a complex heuristic (2), even though his hunches are not easily learned by Alice, Alice will still figure out she’s losing, and decide it’s time for a change of pace in how she decides upon her actions. There is often little Bob can do to rebuild such a gut intuition for how to deceive Alice when she executes this defense other than pure exposure.

If Bob has an internal model (3), he is impervious to the above defenses Alice may deploy, because any change in her strategy can be predicted by Bob’s model. If she figures out that Bob consistently bluffs, and starts taking his bravado as evidence in favor of him having a bad hand, Bob will be able to figure out after how much evidence Alice will come to this conclusion, and invert his strategy at the correct time (using bravado when he has a good hand) such that he is still able to manipulate Alice’s actions.

Stated more generally, and acknowledging certain counterarguments, Bob has three choices for how to be an Alice-inverse: implementing a simple heuristic function^{[5]} to decide on information to present to Alice, implementing a complex heuristic function^{[6]} to decide on information to present to Alice, or actually modeling Alice^{[7]}^{[8]}. The arguments below sketch out reasons to believe that if Bob and Alice interact repeatedly (as when they play multiple hands in a poker game), and Alice has access to information that Bob can’t control (like the winner of a hand, or the cards Bob happens to have), then the only stable strategy is the one in which Bob maintains a model of Alice.

If Bob is implementing a simple heuristic, then Alice can always learn the function the simple heuristic is implementing, and make corrections to how she updates her beliefs or decides upon actions so that the simple heuristic is no longer valid. Bob then has three choices: try to learn a new simple heuristic, against which Alice can launch the same defense; learn a model of how Alice updates her beliefs and decides on actions given information percepts (which for our purposes is a model of Alice); or learn a highly complicated heuristic which Alice can’t easily figure out^{[9]}. Even if Bob adds on to the simple heuristic “<claim I want Alice to believe> and also Alice updates as little as possible in the direction of which simple heuristic I’m using”, if Alice is good enough, she’ll still update a little bit in favor of which heuristic Bob is using, and eventually come to the correct conclusion. This strategy is also subject to the attack Alice may launch in the next paragraph.

If Bob is running a complicated heuristic inverse of Alice, then while there’s little to no chance that Alice will learn that heuristic, she can still render it ineffective by self-modifying, or by seeking out and updating on new information which Bob has no control over, and so coming to policy decisions in a far different way. Bob would then need to go through the previously described process of scrambling to learn a new heuristic or an accurate model of Alice.

If Bob is running a sufficiently accurate model of Alice, whenever Alice updates on information Bob knows about, Bob can just give his model of Alice that same information, watch the updates that model makes, and use the newly created model in place of the old. This means it's far harder for Alice to learn how to not be duped by Bob. Even if there’s some error in Bob’s model, so that its predictions about Alice’s new belief state and policy are imperfect, these predictions will likely be better than random^{[10]}, and can be adjusted easily by observing the divergences between the model’s predictions and Alice’s actual actions^{[11]}.
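The "feed the model the same information" loop can be sketched as follows. This is a toy under strong assumptions, not a claim about how any real agent works: the class names `Alice` and `ModelingBob` are hypothetical, Bob's model is assumed to share Alice's update rule exactly, Bob is assumed to see everything Alice sees (his hand is revealed after each round), and Bob is dealt only bad hands so that he always wants to look strong.

```python
class Alice:
    """Toy target: estimates P(good hand | signal) from revealed hands."""

    def __init__(self) -> None:
        # Smoothed counts per signal: [seen with good hand, seen with bad hand]
        self.counts = {"confident": [1, 1], "quiet": [1, 1]}

    def p_good(self, signal: str) -> float:
        good, bad = self.counts[signal]
        return good / (good + bad)

    def update(self, signal: str, hand_is_good: bool) -> None:
        self.counts[signal][0 if hand_is_good else 1] += 1


class ModelingBob:
    """Bob keeps his own runnable copy of Alice and queries it before acting."""

    def __init__(self) -> None:
        # Assumption: Bob's model happens to match Alice's update rule exactly.
        self.model = Alice()

    def choose_signal(self, want_alice_to_think_good: bool) -> str:
        pick = max if want_alice_to_think_good else min
        # Ties resolve to the first listed signal.
        return pick(["confident", "quiet"], key=self.model.p_good)

    def mirror(self, signal: str, hand_is_good: bool) -> None:
        # Give the model the same information Alice just received.
        self.model.update(signal, hand_is_good)


alice, bob = Alice(), ModelingBob()
signals_sent = []
for hand_is_good in [False, False, False, False]:  # Bob keeps drawing bad hands
    signal = bob.choose_signal(want_alice_to_think_good=not hand_is_good)
    signals_sent.append(signal)
    alice.update(signal, hand_is_good)  # Alice learns from the revealed hand
    bob.mirror(signal, hand_is_good)    # Bob's model learns the same thing

print(signals_sent)  # ['confident', 'quiet', 'confident', 'quiet']
```

Because the model tracks Alice's beliefs, Bob abandons a signal as soon as the model predicts she has started to discount it, rather than finding out by losing. If the model's update rule differed from Alice's, one could imagine correcting it by comparing its predictions against her observed actions, which this sketch does not attempt.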

Thus, it seems the only way Bob can persistently deceive Alice when she has access to non-Bob-controlled information is if he’s running an Alice model. All other methods which I can think of result in Bob, at some point, losing his ability to deceive Alice for a time^{[12]}. Simple heuristics and complex heuristics seem unstable, while accurate modeling seems stable after several rounds of this type of deception game. This same argument also seems like it can be applied to any environment Bob would like to manipulate, which has a transition function which cannot be well-modeled (at least for Bob) as a function of Bob's actions.

^{[1]} Taking artistic liberties here. We played more general games of deception, like general trickery or trying to win against each other in video games. This is a post about deception...

^{[2]}

^{[3]} With reasonable assumptions about Bob’s optimization power and the amount of time he has to think.

^{[4]} With possibly some encoding or decoding of the input and output channels.

^{[5]} Defined as a function not directly modeling Alice which Alice is able to learn.

^{[6]} Defined as a function not directly modeling Alice which Alice is not able to learn.

^{[7]} Defined (as above) as Bob having an explicit model of Alice which he can use as-is (up to some encoding or decoding function) to produce predictions for Alice’s actions and belief updates, given information Alice is presented.

^{[8]} Note that one can choose to implement a heuristic and model Alice in order to verify this heuristic, or to produce new heuristics. This possibility does not break the argument.

^{[9]} Or figure it out at all. If the inverse (or rather checking whether a particular message is the output of the inverse) has a higher computational complexity than the set of hypotheses Alice can learn in the allotted time and given the information she has, then Alice has no chance of learning to differentiate deception from non-deception.

^{[10]} This presents an attack Alice can run on Bob if she suspects deception. Namely, to make her policy chaotic with respect to small alterations. This may sound sub-optimal (if she’s randomizing her strategy, won’t most of those strategies just be bad?), but for decently-good agents, policy space seems likely to be diamond shaped. Though if Alice is not chaotic enough, Bob will be able to keep up with her updates via observation of her actions, and be able to deceive Alice into believing she has succeeded.

^{[11]} It’s also notable that Alice still has the strategy of simply not paying attention to what Bob says, but this doesn’t change our conclusions for 2 reasons: 1) Alice can implement the same strategy regardless of what Bob is doing, so this doesn’t disfavor the modeling approach any more than any other approach, and 2) Modeling allows Bob to easily do lookahead, and see whether his actions now will cause Alice to pay less attention to him later. If they do, then those actions are disfavored.

^{[12]} And possibly forever, if Alice uses this deception-free period to build a model of Bob, and starts running her own deception on him.
