Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post continues the analysis started in the previous post. Here I will present some examples of algorithms operating in certain environments, see how they model certain things, and use that to conclude facts about their preferences/goals/reward functions.

I'll be looking first at the Poker problem with unknown motivations presented here, and secondly at a variant of the Codenames game.

## Different algorithms, same outputs, different goals

In the Poker example, we are unsure whether Alice wants to win the hand against Bob, for money, or lose the hand to get into Bob's good graces. She herself is unsure what Bob's cards are; Bob has been playing confidently, but there is only one card combination that would allow him to win.

Alice Poker algorithm

    1: Parameters: a, ratio of relative heuristic importance.
    2: Input: Alice_cards, board, Bob_behave
    3: P_win = (a) h_1(Alice_cards, board) + (1 − a) h_2(Bob_behave)
    4: if P_win > 0.5
    5:     return 'call'
    6: else
    7:     return 'fold'
    8: end if

The inputs are Alice's own cards, the five cards on the board, and Bob's behaviour this hand. There are two heuristics called by the algorithm, h_1, which computes the probability of Alice winning by assuming Bob has a random hand, and h_2, which assesses the likelihood of Alice winning by looking at Bob's behaviour (at this point you should start worrying about the suggestive names and descriptions I'm giving to all these elements).

Now, in the situation we find ourselves in, what we want to say is that if a is close to 1, then h_1 dominates and P_win will be high, since Alice beats almost every hand Bob could hold. If a is close to 0, then h_2 dominates and P_win will be low, since Bob has been playing confidently. Since there is a > in line 4, we want to say that this Alice Poker algorithm is trying to win but, depending on the value of a, has different beliefs about which action is likely to maximise expected money.

Similarly, if there were a < in line 4, we'd like to say that the Alice Poker algorithm wants to lose. We'd like to say this even though the (<,a=1) and (>,a=0) algorithms would both fold (while (<,a=0) and (>,a=1) would both call).
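To make the four cases concrete, here is a minimal Python sketch of the Alice Poker algorithm with the comparison in line 4 passed in as a parameter. The heuristic outputs `H1` and `H2` are made-up numbers chosen to match the hand described above (Alice beats almost any random hand, and Bob's confident play suggests she will lose); the function and variable names are illustrative assumptions, not part of the original algorithm.

```python
import operator


def alice_poker(a, h1_value, h2_value, compare=operator.gt):
    """Return 'call' or 'fold' following the Alice Poker algorithm.

    h1_value and h2_value stand in for h_1(Alice_cards, board) and
    h_2(Bob_behave); the numbers passed in below are hypothetical.
    """
    p_win = a * h1_value + (1 - a) * h2_value
    return "call" if compare(p_win, 0.5) else "fold"


# Hypothetical heuristic outputs for the hand described in the post.
H1 = 0.9  # against a random hand, Alice almost always wins
H2 = 0.1  # Bob's confident play suggests Alice will lose

# Run all four (comparison, a) combinations discussed in the text.
decisions = {
    (sign, a): alice_poker(a, H1, H2, compare)
    for sign, compare in ((">", operator.gt), ("<", operator.lt))
    for a in (0, 1)
}

for key, action in decisions.items():
    print(key, action)
```

Under these assumed numbers, the pairs that agree in behaviour are exactly the ones that differ in the sign of the comparison, so the outputs alone cannot tell us which variant is "trying to win".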

But this relies too much on the interpretation of what the terms mean and what the heuristics are meant to do. Now, there isn't much flexibility in interpreting what h_1 is or does: as the quality of Alice's hand increases, relative to a random hand and the given board, h_1's output increases. Thus it is 'clearly' measuring relative hand quality.

But what of h_2? All that we know is that this outputs a number that increases when Bob appears confident, and decreases when he appears worried. I've said that this is supposed to measure how good Bob's hand is, based on how he behaved. But how do we know that? Maybe Alice views Bob as an effective bluffer (or a level 2n+1 meta-bluffer), so that a high h_2 actually means that she expects Bob to have a poor hand. In that case the (>,a=0) would still fold, but would be folding to lose against Bob, not folding to win.

This brings us back to some of the oldest problems in AI (and some of the newest). Namely, what are the semantics and interpretation of what an algorithm is doing? It's been argued that you cannot derive the semantics of an algorithm, only the syntax. I've disagreed with that, arguing that when the internal syntax is sufficiently rich and detailed, and the agent relates well to the real world, then there can be only a few semantic interpretations of the symbols that make any sense. In more modern terms, this can be seen as the problem of algorithm interpretability, especially when it is applied to out-of-training-set distributions. If a certain neuron triggers when seeing photos of dogs, including photos far from its training distribution, then that neuron is at least superficially connected to the concept of "dog" (or at least "dog photo").

So, what would cause us to believe that h_2 is actually trying to predict Bob's cards from his reactions? Well, suppose that h_2 was a learning algorithm, and it updated the correct way: when Bob is revealed to have had a good hand, it updates towards a higher value on that input set, and vice versa. Or suppose that h_2 also took Alice_cards and board as inputs, and was higher if Alice's cards were better. Then it would seem more justified to see that heuristic as actually trying to estimate the probability of Alice's cards beating Bob's.
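The "updates the correct way" test can be sketched as follows. This is only an illustration under stated assumptions: the lookup table, the learning rate, the `sign` parameter and the behaviour label "confident" are all hypothetical, and the stored value follows the paragraph above in rising when Bob is revealed to have had a good hand.

```python
def update_h2(table, behaviour, bob_hand_was_good, rate=0.2, sign=+1):
    """Nudge the stored value for `behaviour` after a showdown.

    sign=+1 moves the value towards the revealed outcome;
    sign=-1 moves it away, clipped to stay inside [0, 1].
    """
    target = 1.0 if bob_hand_was_good else 0.0
    old = table.get(behaviour, 0.5)  # start from an uninformative 0.5
    new = old + sign * rate * (target - old)
    table[behaviour] = min(1.0, max(0.0, new))
    return table


good_learner, bad_learner = {}, {}

# Hypothetical history: five hands in which Bob acted confident and
# was then revealed to have had a good hand.
for _ in range(5):
    update_h2(good_learner, "confident", True, sign=+1)
    update_h2(bad_learner, "confident", True, sign=-1)

print(good_learner["confident"], bad_learner["confident"])
```

The first learner's value for "confident" drifts upwards with the evidence, which supports reading it as an estimate of Bob's hand; the second drifts the other way, which makes that reading much harder to defend.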

## Codenames and semantics

In Codenames, one player on a team (Spymaster) tries to signal a card (or a collection of cards) to their teammates. These cards have words on them, and the Spymaster names another word related to the targets. The Spymaster also gives a number, to say how many cards this word refers to.

Suppose there are four remaining codewords, "Dolphin", "New York", "Comet", and "Lice". The Spymaster is Bob, and has only one card remaining to signal; the other player is Alice, who has to interpret Bob's signal.

Note that before receiving Bob's word, the Alice algorithm probably doesn't have time to run through all the possible words he could give. Therefore, in terms of the first post, in the model fragment Alice has of Bob, she expects to be surprised by his action (it's mainly for this reason that I'm introducing this example).

Anyway, after receiving Bob's message - which we will assume is "Aquatic, 1 card" - Alice runs the following algorithm:

Alice Codewords algorithm

    Input: codewords, Bob_wordmap, Bob_word
    answer = argmin (for w in codewords) Bob_wordmap(w, Bob_word)
    return answer

The algorithm is very simple: it models Bob as having a wordmap in his head, that measures the distance between words and concepts, and assumes that the word he speaks - Bob_word - is closest to the answer among the remaining codewords.

Let's assume that, through saying "Aquatic", Bob is trying to signal "Dolphin". But let's assume that Alice's model of Bob's wordmap is wrong - it computes that "New York" is the closest word to "Aquatic" (New York is a coastal city, after all).
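A minimal Python sketch of the Alice Codewords algorithm shows how a wrong wordmap changes the answer. The distance values are invented for illustration, and `make_wordmap`, `TRUE_DISTANCES` and `ALICE_DISTANCES` are hypothetical names; only the argmin over the remaining codewords comes from the algorithm above.

```python
def alice_codewords(codewords, bob_wordmap, bob_word):
    """Pick the codeword that the wordmap places closest to bob_word."""
    return min(codewords, key=lambda w: bob_wordmap(w, bob_word))


codewords = ["Dolphin", "New York", "Comet", "Lice"]

# Made-up distances to the clue "Aquatic" (smaller means closer).
TRUE_DISTANCES = {"Dolphin": 0.1, "New York": 0.6, "Comet": 0.9, "Lice": 0.8}
ALICE_DISTANCES = {"Dolphin": 0.5, "New York": 0.3, "Comet": 0.9, "Lice": 0.8}


def make_wordmap(distances):
    # Wrap a distance table as a function of (codeword, clue). The clue is
    # ignored because these toy tables only cover the clue "Aquatic".
    return lambda w, clue: distances[w]


intended = alice_codewords(codewords, make_wordmap(TRUE_DISTANCES), "Aquatic")
guessed = alice_codewords(codewords, make_wordmap(ALICE_DISTANCES), "Aquatic")
print(intended, guessed)
```

With these assumed tables, Bob's own wordmap selects "Dolphin" while Alice's model of it selects "New York", even though both runs use exactly the same decision rule.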

Let's further assume that Alice has a learning component to her algorithm, so Bob_wordmap gets updated. But the updating is terrible: it actually moves further away from the true wordmap with every data point.

In view of this, can we still say that Alice is trying to win the game, but just has a terrible model of her teammate?

Perhaps. The data above is not enough to say that - if we renamed Bob_wordmap as Negative_Bob_wordmap, the name would seem more fitting. But suppose that Bob_wordmap was not a standalone algorithm, but a sub-component of a much more complicated routine called Bob_model. And suppose that this component was decently connected to Bob: Alice uses it whenever she thinks about what Bob does, it is activated whenever she sees Bob, and it is closely semantically connected to her experience of interacting with Bob.

Then we might be able to say that, though Bob_model(wordmap) is actually a terrible version of Bob's mental wordmap, the whole Bob_model is sufficiently clearly a model of Bob, and wordmap, in other contexts, is sufficiently clearly a wordmap, that we can say that Alice is trying and failing to model Bob's thinking (rather than deliberately failing the game).

These are the sorts of "interpretability" analyses that we will have to do to figure out what the algorithms in our brains actually want.

## fMRI semantics

A final, minor point: things like fMRIs can help bridge the semantic-syntax gap in humans, as they can, to some extent, literally see how certain concepts, ideas, or images are handled by the brain. This could be the human equivalent of having the algorithm laid out as above.
