Figuring out what Alice wants, part II

Stuart_Armstrong

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post continues the analysis started in the previous post. Here I will present some examples of algorithms operating in certain environments, see how they model certain things, and use that to conclude facts about their preferences/goals/reward functions.

I'll be looking first at the Poker problem with unknown motivations presented here, and secondly at a variant of the Codenames game.

Different algorithms, same outputs, different goals

In the Poker example, we are unsure whether Alice wants to win the hand against Bob, for money, or lose the hand to get into Bob's good graces. She herself is unsure what Bob's cards are; Bob has been playing confidently, but there is only one card combination that would allow him to win.

$$
\begin{matrix}
\textbf{Alice Poker algorithm} \\
\begin{array}{rl}
1: & \textbf{Parameters: } a, \text{ ratio of relative heuristic importance.} \\
2: & \textbf{Input: } Alice_{cards},\ board,\ Bob_{behave} \\
3: & P_{win} = (a)\, h_1(Alice_{cards}, board) + (1 - a)\, h_2(Bob_{behave}) \\
4: & \textbf{if } P_{win} > 0.5 \\
5: & \quad \textbf{return } \text{'call'} \\
6: & \textbf{else} \\
7: & \quad \textbf{return } \text{'fold'} \\
8: & \textbf{end if}
\end{array}
\end{matrix}
$$

The inputs are Alice's own cards, the five cards on the board, and Bob's behaviour this hand. There are two heuristics called by the algorithm: $h_1$, which computes the probability of Alice winning by assuming Bob has a random hand, and $h_2$, which assesses the likelihood of Alice winning by looking at Bob's behaviour (at this point you should start worrying about the suggestive names and descriptions I'm giving to all these elements).

Now, in the situation in which we find ourselves, what we want to say is that if $a$ is close to $1$, then $h_1$ dominates, and $P_{win}$ will be high. If $a$ is close to $0$, then $P_{win}$ will be low, as $h_2$ dominates the expression. Since there is a $>$ in line $4$, we want to say that this Alice Poker algorithm is trying to win, but, depending on the value of $a$, has different beliefs about what action is likely to maximise expected money.

Similarly, if it were a $<$ in line $4$, we'd like to say that the Alice Poker algorithm wants to lose. And this, even though the $(<, a = 1)$ and $(>, a = 0)$ algorithms would both fold (while $(<, a = 0)$ and $(>, a = 1)$ would both call).
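To make the four cases concrete, here is a minimal Python sketch of the algorithm. The heuristic outputs ($0.9$ for $h_1$, $0.1$ for $h_2$) are made-up values chosen to match the hand described above, and the `compare` argument is my own device for switching between the $>$ and $<$ versions of line $4$; none of this is meant as a serious poker player.

```python
import operator


def alice_poker(a, h1_value, h2_value, compare=operator.gt):
    """Return 'call' or 'fold' following the Alice Poker algorithm.

    h1_value and h2_value stand in for the outputs of h_1 and h_2;
    the heuristics themselves are not specified, so plain numbers
    are passed in (an assumption of this sketch).
    """
    p_win = a * h1_value + (1 - a) * h2_value
    return "call" if compare(p_win, 0.5) else "fold"


# Hypothetical values for the hand described above: Alice's cards look
# strong against a random hand (h_1 high), but Bob's confident play
# suggests she is likely to lose (h_2 low).
H1, H2 = 0.9, 0.1

decisions = {
    (">", 1): alice_poker(1, H1, H2, operator.gt),
    (">", 0): alice_poker(0, H1, H2, operator.gt),
    ("<", 1): alice_poker(1, H1, H2, operator.lt),
    ("<", 0): alice_poker(0, H1, H2, operator.lt),
}

for key, action in decisions.items():
    print(key, action)
```

The output alone cannot distinguish $(<, a = 1)$ from $(>, a = 0)$: both fold, and only the internal structure tells them apart.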

But this relies too much on the interpretation of what the terms mean and what the heuristics are meant to do. Now, there isn't much flexibility in interpreting what $h_1$ is or does: as the quality of Alice's hand increases, relative to a random hand and the given board, $h_1$'s output increases. Thus it is 'clearly' measuring relative hand quality.

But what of $h_2$? All that we know is that this outputs a number that increases when Bob appears confident, and decreases when he appears worried. I've said that this is supposed to measure how good Bob's hand is, based on how he behaved. But how do we know that? Maybe Alice views Bob as an effective bluffer (or a level 2n+1 meta-bluffer), so that a high $h_2$ actually means that she expects Bob to have a poor hand. In that case the $(>, a = 0)$ algorithm would still fold, but would be folding to lose against Bob, not folding to win.

This brings us back to some of the oldest problems in AI (and some of the newest). Namely, what are the semantics and interpretation of what an algorithm is doing? It's been argued that you cannot derive the semantics of an algorithm, only the syntax. I've disagreed with that, arguing that when the internal syntax is sufficiently rich and detailed, and the agent relates well to the real world, then there can be only a few semantic interpretations of the symbols that make any sense. In more modern terms, this can be seen as the problem of algorithm interpretability, especially when it is applied to out-of-training-set distributions. If a certain neuron triggers when seeing photos of dogs, including photos far from its training distribution, then that neuron is at least superficially connected to the concept of "dog" (or at least "dog photo").

So, what would cause us to believe that $h_2$ is actually trying to predict Bob's cards from his reactions? Well, suppose that $h_2$ was a learning algorithm, and it updated the correct way: when Bob is revealed to have had a good hand, it updates towards a higher value on that input set, and vice versa. Or suppose that $h_2$ also took $Alice_{cards}$ and $board$ as inputs, and was higher if Alice's cards were better. Then it would seem more justified to see that heuristic as actually trying to estimate the probability of Alice's cards beating Bob's.
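As a rough illustration of what "updating the correct way" could look like, here is a toy learner in Python. The class name, the learning rate, and the encoding of Bob's behaviour as a string are all hypothetical. The sketch follows the hand-strength reading used in this paragraph: it tracks how strong Bob's hand turned out to be after each kind of behaviour, so a win-probability version of $h_2$ would have to be derived from it (for instance as one minus this estimate, which is a further assumption).

```python
class LearnedH2:
    """Toy stand-in for a learning h_2 (hypothetical design)."""

    def __init__(self, learning_rate=0.2, prior=0.5):
        self.learning_rate = learning_rate
        self.prior = prior
        # Maps an observed behaviour to the estimated strength of Bob's hand.
        self.estimates = {}

    def predict(self, bob_behave):
        """Return the current estimate, or the prior for unseen behaviour."""
        return self.estimates.get(bob_behave, self.prior)

    def update(self, bob_behave, revealed_strength):
        """Move the estimate for this behaviour towards what was revealed."""
        old = self.predict(bob_behave)
        step = self.learning_rate * (revealed_strength - old)
        self.estimates[bob_behave] = old + step


h2 = LearnedH2()
before = h2.predict("confident")

# Bob played confidently three times and each time showed a strong hand (1.0).
for _ in range(3):
    h2.update("confident", 1.0)
after = h2.predict("confident")

# One showdown where a worried-looking Bob turned out to hold a weak hand (0.0).
h2.update("worried", 0.0)

print(before, after, h2.predict("worried"))
```

A component whose estimates track Bob's revealed hands in this way is much harder to reinterpret as something unrelated to Bob's cards.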

Codenames and semantics

In Codenames, one player on a team (Spymaster) tries to signal a card (or a collection of cards) to their teammates. These cards have words on them, and the Spymaster names another word related to the targets. The Spymaster also gives a number, to say how many cards this word refers to.

Suppose there are four remaining codewords, "Dolphin", "New York", "Comet", and "Lice". The Spymaster is Bob, and has only one card remaining to signal; the other player is Alice, who has to interpret Bob's signal.

Note that before receiving Bob's word, the Alice algorithm probably doesn't have time to run through all the possible words he could give. Therefore, in terms of the first post, in the model fragment Alice has of Bob, she expects to be surprised by his action (it's mainly for this reason that I'm introducing this example).

Anyway, after receiving Bob's message - which we will assume is "Aquatic, 1 card" - Alice runs the following algorithm:

$$
\begin{matrix}
\textbf{Alice Codewords algorithm} \\
\begin{array}{rl}
1: & \textbf{Input: } codewords,\ Bob_{wordmap},\ Bob_{word} \\
2: & answer = \operatorname{argmin}_{w \in codewords} Bob_{wordmap}(w, Bob_{word}) \\
3: & \textbf{return } answer
\end{array}
\end{matrix}
$$

The algorithm is very simple: it models Bob as having a wordmap in his head that measures the distance between words and concepts, and assumes that the word he speaks, $Bob_{word}$, is closest to the answer among the remaining $codewords$.

Let's assume that, through saying "Aquatic", Bob is trying to signal "Dolphin". But let's assume that Alice's model of Bob's wordmap is wrong - it computes that "New York" is the closest word to "Aquatic" (New York is a coastal city, after all).
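The gap between Bob's actual wordmap and Alice's model of it can be sketched in a few lines of Python. Every distance below is invented for illustration, and `make_wordmap` is a hypothetical helper that turns a lookup table into a distance function; the only part taken from the algorithm is the minimisation over the remaining codewords.

```python
def alice_codewords(codewords, bob_wordmap, bob_word):
    """Return the codeword that the given wordmap places closest to bob_word."""
    return min(codewords, key=lambda w: bob_wordmap(w, bob_word))


def make_wordmap(distances, default=1.0):
    """Build a distance function from a dict of (codeword, clue) -> distance."""
    return lambda w, clue: distances.get((w, clue), default)


codewords = ["Dolphin", "New York", "Comet", "Lice"]

# Hypothetical distances; none of these numbers come from the post.
true_bob_wordmap = make_wordmap({
    ("Dolphin", "Aquatic"): 0.1,
    ("New York", "Aquatic"): 0.6,
    ("Comet", "Aquatic"): 0.9,
    ("Lice", "Aquatic"): 0.8,
})
alice_model_of_bob = make_wordmap({
    ("Dolphin", "Aquatic"): 0.5,
    ("New York", "Aquatic"): 0.3,  # "New York is a coastal city, after all"
    ("Comet", "Aquatic"): 0.9,
    ("Lice", "Aquatic"): 0.8,
})

intended = alice_codewords(codewords, true_bob_wordmap, "Aquatic")
guess = alice_codewords(codewords, alice_model_of_bob, "Aquatic")
print(intended, guess)
```

The same procedure, run on a wrong map, picks the wrong card without any change in what the procedure is "trying" to do.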

Let's further assume that Alice has a learning component to her algorithm, so $Bob_{wordmap}$ gets updated. But the updating is terrible: it actually moves further away from the true wordmap with every data point.

In view of this, can we still say that Alice is trying to win the game, but just has a terrible model of her teammate?

Perhaps. The data above is not enough to say that: if we renamed $Bob_{wordmap}$ as $Negative\_Bob_{wordmap}$, the new name would seem more fitting. But suppose that $Bob_{wordmap}$ was not a standalone algorithm, but a sub-component of a much more complicated routine called $Bob_{model}$. And suppose that this component was decently connected to Bob: Alice uses it whenever she thinks about what Bob does, it is activated whenever she sees Bob, and it is closely semantically connected to her experience of interacting with Bob.

Then we might be able to say that, though $Bob_{model}(wordmap)$ is actually a terrible version of Bob's mental wordmap, the whole $Bob_{model}$ is sufficiently clearly a model of Bob, and $wordmap$, in other contexts, is sufficiently clearly a wordmap, that we can say that Alice is trying and failing to model Bob's thinking (rather than deliberately failing the game).

These are the sort of "interpretability" analyses that we will have to do to figure out what the algorithms in our brains actually want.

fMRI semantics

A final, minor point: tools like fMRI can help bridge the syntax-semantics gap in humans, as they can, to some extent, literally show how certain concepts, ideas, or images are handled by the brain. This could be the human equivalent of having the algorithm laid out as above.
