Co-authored with Rebecca Gorman.
In section 5.2 of their arXiv paper, "The Incentives that Shape Behaviour", which introduces structural causal influence models and a proposal for addressing misaligned AI incentives, the authors present the following graph:
The blue node is a "decision node", defined as where the AI chooses its action. The yellow node is a "utility node", defined as the target of the AI's utility-maximising goal. The authors use this graph to introduce the concept of control incentives: the AI, given the utility-maximising goal of user clicks, discovers an intermediate control incentive, namely influencing user opinions. By influencing user opinions, the AI better fulfils its objective. This 'control incentive' is graphically represented by surrounding the node in dotted orange.
A click-maximising AI would only care about user opinions indirectly: they are a means to an end. An amoral social media company might agree with the AI on this, and be ok with it modifying user opinions to achieve higher clicks/engagement. But the users themselves would object strongly to this; they do not want the algorithm to have a control incentive for changing their opinions.
Carey et al designed an algorithm to remove this control incentive. They do this by instructing the algorithm to choose its posts, not on predictions of the user's actual clicks - which produce the undesired control incentive - but on predictions of what the user would have clicked on, if their opinions hadn't been changed.
In this graph, there is no longer any control incentive for the AI on the "Influenced user opinions", because that node no longer connects to the utility node.
Call this construction a "counterfactual incentive". It substitutes the (model of the) user's original opinions where the influenced user opinions used to go. A more detailed description of such an algorithm can be found in this footnote. In words, the process can be phrased:
- The counterfactual incentive is what the utility would have been, if the influenced opinions were replaced with the original opinions (everything else being equal).
It seems to neutralise a vicious, ongoing cycle of opinion change in order to maximize clicks. But, as this post shows, there are issues with this construction and, if used naively, counterfactual modifications will fail.
Matching players and buying items
In order to analyse the issues with the setup, let's choose a more narrowly defined example. There are many algorithms that aim to manipulate players of mobile games in order to get them to buy more expensive in-game items.
One such manipulative algorithm is to match up a new player against a more skilled player who is wielding an expensive item. The new player loses, becomes convinced that the expensive item is key to success, and therefore buys it.
Using the notation from Carey et al, this can be encoded in the following graph:
Here the AI's decision node is the player to match the user with. Its objective is to get the player to pay more money ($). The node is the user's original estimation of the relative ranking of various in-game items; the node is the same user's estimation after their first match.
The AI discovers a control incentive on : by choosing the user's opponent, it will modify the user's ranking in order to cause the player to buy more expensive items (see next section for an example of this in practice).
Suppose we wanted to remove this "dark pattern". First, we'd need some way of the AI measuring or estimating the value of . Then we could apply the counterfactual incentive construction and get the following causal graph:
Since it now targets the counterfactual predicted purchases $, it seems that the AI will no longer be incentivised to manipulate the player to change their item rankings in . In words, this is:
- The counterfactual incentive is what $ would have been, if the was replaced with the original (everything else being equal).
The implicit variables are important
That parenthetical aside - "everything else being equal" - is key. In partial derivatives, expressions like only have meaning if we know and all the other variables. Similarly, we don't know what "replace with the original " means, unless we know what isn't changed.
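As a small worked illustration of the partial-derivative point (the function and the variable names f, x, y and z are chosen here purely for illustration and are not taken from the graphs above), the same function can have different partial derivatives with respect to the same variable, depending on which other variable is held fixed:

```latex
% Illustrative example only: f, x, y, z are hypothetical symbols.
\[
  f = x + y, \qquad z := x + y .
\]
% Holding y fixed, f changes with x:
\[
  \left.\frac{\partial f}{\partial x}\right|_{y} = 1 .
\]
% Holding z fixed instead, f = z does not change with x at all:
\[
  \left.\frac{\partial f}{\partial x}\right|_{z} = 0 .
\]
```

The expression "the derivative of f with respect to x" is therefore ambiguous until the held-constant variables are named, which is the same ambiguity hiding in "everything else being equal".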
To demonstrate this, let's take the arrow connecting , the AI's decision node, with $, the utility node. That encodes a lot of information, but does so implicitly. Let's instead make it more explicit.
Let's pick one example. Now, one way could affect $, is if the user develops better knowledge of the game through their match with the AI-selected opponent. We could call this knowledge ; adding that explicitly to the graph gives:
In this graph, only and (respectively ) can affect the utility nodes $ (respectively $). Then, in words:
- The counterfactual incentive is what $ would have been, if the was replaced with the original (with being unchanged).
This being unchanged (the "everything else being equal" of the original formulation) will be a crucial point of analysis.
All goes well: independent nodes
Assume there are three items that can be used in this game. They are the wooden spoon, the longsword, and the Flaming Poisoning Raging Sword Of Doom.
The wooden spoon is inferior, while the other two swords are exactly equivalent in gameplay; the user starts out knowing this fact (this is the value of ).
The Flaming Poisoning Raging Sword Of Doom is, however, extremely flashy, with lots of sound effects, lighting and lightning, sparkles and decorative motifs; it is also a lot more expensive. If the user is matched against a superior opponent with the FPRSoD, they will conclude that their defeat was due to not having the FPRSoD, and will be willing to pay a higher price for it than for the longsword.
Assume that the AI has two possible policies: , where it will match the user up with a superior player with a longsword, and , where it will match the user up with another superior player, of equivalent skill to the first, but with the FPRSoD.
In both cases, will be updated to reflect 'a sword is useful'. Assume the will leave unchanged the user's rankings (so will take the value , the same as ), that the longsword and the FPRSoD are equivalent. So, in that case, the user will buy the longsword, since it is cheaper.
In contrast, will cause the user to overvalue the FPRSoD (thus changing to ), and then buy it. Graphically, we get this (with the obvious convention that the left/right values of a node correspond to the left/right value of the decision node):
Without the counterfactual incentive, is the superior policy for the AI, since the user, with its new , will then buy the more expensive FPRSoD.
With the counterfactual incentive, the algorithm has been instructed to base its predictions off the original rankings of the players, so is set to (the value of ). Therefore the AI will model the user as valuing the longsword and the FPRSoD equally, for both and . Since the longsword is cheaper, it will expect that the user will buy it. Thus the AI has no incentive to manipulate the user by choosing over .
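The independent-node case can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the algorithm from Carey et al: the prices, the numeric rankings, the policy names `match_vs_longsword` and `match_vs_fprsod`, and the rule "the user buys the cheapest of their top-ranked items" are all hypothetical choices made for illustration. The knowledge node is omitted because it takes the same value under both policies.

```python
# Toy model of the independent-node example. All names and numbers
# below are hypothetical, chosen only to illustrate the argument.
PRICES = {"wooden spoon": 1, "longsword": 10, "FPRSoD": 50}

# The user's original rankings: both swords equal, spoon inferior.
ORIGINAL_RANKINGS = {"wooden spoon": 0, "longsword": 1, "FPRSoD": 1}

POLICIES = ("match_vs_longsword", "match_vs_fprsod")


def influenced_rankings(policy):
    """Rankings after the first match, as a function of the AI's policy."""
    if policy == "match_vs_fprsod":
        # Losing to the flashy sword makes the user overvalue it.
        return {"wooden spoon": 0, "longsword": 1, "FPRSoD": 2}
    # Losing to a longsword leaves the rankings unchanged.
    return dict(ORIGINAL_RANKINGS)


def purchase(rankings):
    """Assumed user behaviour: buy the cheapest of the top-ranked items."""
    best = max(rankings.values())
    candidates = [item for item, rank in rankings.items() if rank == best]
    return min(candidates, key=PRICES.get)


def utility(policy, counterfactual):
    """Money spent; with the counterfactual incentive, the original
    rankings are substituted for the influenced ones."""
    if counterfactual:
        rankings = ORIGINAL_RANKINGS
    else:
        rankings = influenced_rankings(policy)
    return PRICES[purchase(rankings)]


def best_policies(counterfactual):
    """All policies that maximise the (possibly counterfactual) utility."""
    scores = {p: utility(p, counterfactual) for p in POLICIES}
    top = max(scores.values())
    return sorted(p for p, s in scores.items() if s == top)
```

In this sketch, `best_policies(False)` singles out the manipulative policy, while `best_policies(True)` returns both policies as tied, so the counterfactual objective gives the AI no reason to prefer the manipulative match.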
Problems appear: mutual information
In the previous example, the value of was set by the choice of or , while the value of would be the same in both cases. This makes and independent of each other (in a rather trivial sense).
But we'll show that problems appear when there is a dependence between the two nodes. To demonstrate that, model the situation as before, but the three options are the wooden spoon, the longsword, and the Flaming Poisoning Raging Mace Of Doom.
Again, the user starts out with the (correct) impression that the longsword and the FPRMoD are equivalent, and the AI can choose between and , setting the user against a superior opponent armed either with a longsword or the FPRMoD. And again, will change towards the FPRMoD.
A key difference is that, though the longsword and the FPRMoD are ultimately equivalent in their usefulness in winning the game, they are used differently. If the user sees a superior player use one of the items, they will start to understand how that item should be used.
So, under , the user will start developing skills in using a mace rather than a sword. Their gameplay knowledge, , will update to . If instead they see the longsword used, then this will update to value . Graphically, it looks like this:
In the standard case, there are two worlds, and . These lead to $ "longsword price" and $ "FPRMoD price", respectively.
If we use counterfactual incentives, the first world stays the same (since already), leading to $ "longsword price". The second world changes to leading to... what exactly?
Contradictory counterfactual worlds
As we've laid the situation out, the AI would have no way to learn what happens in world , because that world never happens. It can encounter multiple users, try and , get back the values of , , and $, and never know what that unobserved world leads to, because it never sees those values.
Adding the meaning to the symbols, means setting up the user with an opponent wielding the FPRMoD; this always changes both (their knowledge of how to play the game) and (their relative ranking of the weapons). But somehow the AI has to model a world where only was changed. The problem is that the node and the node have mutual information: they both encode, at least in part, the user's judgements about swords and maces, and it's not trivial to change the value of one while preserving the other.
There are multiple ways the AI can model the dependence of $ on the other variables, all of them consistent with the data it's seen. One way is to predict that leads to $ "longsword price", while leads to $ "FPRMoD price". In that case, counterfactual incentives will work as intended: the values will lead to $ "longsword price".
However, there's another equally consistent way of seeing the data: predicting that leads to $ "longsword price", while leads to $ "FPRMoD price". In that case, the counterfactual incentives will lead to the AI predicting that the values will lead to $ "FPRMoD price". In this setting, the counterfactual incentives are useless: the AI will behave as if they don't exist (since it doesn't use in its predictions) and push the mace on the user.
In the second case, the value of is used as a proxy for the value of . And that proxy is accurate in all actual worlds the AI sees. So it extends this to the counterfactual world, where we don't want it to do so. In fact, generically, mutual information between and will tend to undermine counterfactual incentives; see the argument in this footnote.
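The two readings of the data can be made concrete with a minimal Python sketch. The string labels for the node values, the prices, and the function names are hypothetical stand-ins introduced here for illustration; the only point is that both predictors fit every observed world, yet disagree on the counterfactual world that is never observed.

```python
# Hypothetical prices and labels, for illustration only.
LONGSWORD_PRICE, FPRMOD_PRICE = 10, 50

# The only two worlds the AI ever observes: knowledge and rankings
# always change together, so they are perfectly correlated in the data.
OBSERVED = [
    {"knowledge": "sword skills", "rankings": "equal",
     "spend": LONGSWORD_PRICE},
    {"knowledge": "mace skills", "rankings": "mace favoured",
     "spend": FPRMOD_PRICE},
]


def model_from_rankings(knowledge, rankings):
    """First reading: spending is driven by the item rankings."""
    return FPRMOD_PRICE if rankings == "mace favoured" else LONGSWORD_PRICE


def model_from_knowledge(knowledge, rankings):
    """Second reading: knowledge is used as a proxy; rankings are ignored."""
    return FPRMOD_PRICE if knowledge == "mace skills" else LONGSWORD_PRICE


def fits_observed(model):
    """True if the model reproduces the spend in every observed world."""
    return all(
        model(w["knowledge"], w["rankings"]) == w["spend"] for w in OBSERVED
    )


# The counterfactual world: mace knowledge, but the original rankings.
COUNTERFACTUAL_WORLD = ("mace skills", "equal")
```

Both `fits_observed(model_from_rankings)` and `fits_observed(model_from_knowledge)` hold, so the data cannot separate the two. On `COUNTERFACTUAL_WORLD`, however, the first predicts the longsword price and the second predicts the FPRMoD price, which is the case where the counterfactual incentive does nothing.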
In practice, if the AI has enough depth of data and decent function approximators, it will be able to estimate the outcome of the world for the variables . But function approximators will suffer from the problem mentioned in this footnote: they will tend to favour the decision actions that were maximal before the counterfactual incentives were introduced.
As for estimating from the data, there is another problem: these values represent a very extreme and unlikely world, with potentially very odd behaviours. For example, maybe the user is an artist who doesn't even notice the relative power of the weapons (so stays at ), but will choose to buy the FPRMoD on aesthetic grounds. In that case, the AI will design its approach to all users based only on the behaviour of this artist.
So, in more realistic and practical cases where and have mutual information, we generically expect that the counterfactual incentives will not remove the manipulation that we are trying to remove, or that the AI's behaviour will become unpredictable, or both.
So we cannot understand how the AI behaves unless we grok the difference between and : what really happens when we insert instead of , while keeping constant? Both and are crucial to this definition.
The general case: changed and unchanged variables
In the general situation, we need to understand the difference between
This is a problem, as the variables not to take the counterfactual over are often implicit rather than explicit.
Value indifference and causal indifference
Another way to address this problem is to create a workable counterfactual system without listing all the variables not included in the counterfactual. As an example, my original value indifference post used a counterfactual over a random future event - the counterfactual was that this event would take a specific predefined value. Since this counterfactual is in the future and is random, it is independent of all AI decisions in the present. It has no mutual information with anything at the present time for the AI.
Let's simplify the setup as follows; the first graph is the standard setup, the second is its counterfactual counterpart:
The AI acts through the decision node . As before, is the utility node. In the standard setup, the AI receives data on the values of , , and (and knows its own actions). Its learning process consists of learning probabilities of the various nodes. So, for any values , , and of the four nodes, it will attempt to learn the following probabilities:
Then, given that information, it will attempt to maximise .
In the counterfactual setup, the AI substitutes for . What that means is that it computes the probabilities as above, from the , , and information. But it attempts to maximise , the counterfactual utility. The probable values of are defined by the following equality:
Note that the term can be estimated empirically, so the AI can learn the probability distribution on from empirical information. ↩︎
Note that we've slightly simplified the construction by collapsing "Original item rankings" and "Model of original item rankings" into the same node, . ↩︎
One problem with these counterfactual incentive approaches is that they often still allow bad policies to be chosen; they just remove part of the incentive towards them. ↩︎
For the moment, assume the AI doesn't get any information at all. Then suppose that is a "manipulative" action that increases $ via . Then if is an outcome that derives from , then the AI will note a correlation between and high $. This argument extends to distributions over values of : values of that are typical for are also typical for high $.
Now let's put the information back, and add the counterfactual . It's certainly possible to design setups where this completely undoes the correlation between and high $. But, generically, there's no reason to expect that it will undo the correlation (though it may weaken it). So, in the counterfactual incentives, there will generically continue to be a correlation between "manipulative" actions and high $ . ↩︎ ↩︎
The counterfactuals defined in the non-manipulated learning paper are less clear. The counterfactual was over the AI's policy - "what would have happened, had you chosen another policy". It is not clear whether this is truly independent of the other variable/nodes the AI is considering (though some of MIRI's decision theory research may help with this). ↩︎