Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

There's a nice recent paper whose authors did the following:

  1. train a small GPT model on lists of moves from Othello games;
  2. verify that it seems to have learned (in some sense) to play Othello, at least to the extent of almost always making legal moves;
  3. use "probes" (regressors whose inputs are internal activations in the network, trained to output things you want to know whether the network "knows") to see that the board state is represented inside the network activations;
  4. use interventions to verify that this board state is being used to decide moves: take a position in which certain moves are legal, use gradient descent to find changes in internal activations that make the output of the probes look like a slightly different position, and then verify that when you run the network but tweak the activations as it runs the network predicts moves that are legal in the modified position.

In other words, it seems that their token-predicting model has built itself what amounts to an internal model of the Othello board's state, which it is using to decide what moves to predict.

The paper is "Emergent world representations: Exploring a sequence model trained on a synthetic task" by Kenneth Li, Aspen Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg; you can find it at

There is a nice expository blog post by Kenneth Li at

Some details that seem possibly-relevant:

  • Their network has a 60-word input vocabulary (four of the 64 squares are filled when the game starts and can never be played in), 8 layers, an 8-head attention mechanism, and a 512-dimensional hidden space. (I don't know enough about transformers to know whether this in fact tells you everything important about the structure.)
  • They tried training on two datasets, one of real high-level Othello games (about 140k games) and one of synthetic games where all moves are random (about 20M games). Their model trained on synthetic games predicted legal moves 99.99% of the time, but the one trained on real well-played games only predicted legal moves about 95% of the time. (This suggests that their network isn't really big enough to capture legality and good strategy at the same time, I guess?)
  • They got some evidence that their network isn't just memorizing game transcripts by training it on a 20M-game synthetic dataset where one of the four possible initial moves is never played. It still predicted legal moves 99.98% of the time when tested on the full range of legal positions. (I don't know what fraction of legal positions are reachable with the first move not having been C4; it will be more than 3/4 since there are transpositions. I doubt it's close to 99.98%, though, so it seems like the model is doing pretty well at finding legal moves in positions it hasn't seen.)
  • Using probes whose output is a linear function of the network activations doesn't do a good job of reconstructing the board state (error rate is ~25%, barely better than attempting the same thing from a randomly initialized network), but training 2-layer MLPs to do it gets the error rate down to ~5% for the network trained on synthetic games and ~12% for the one trained on championship games, whereas it doesn't help at all for the randomly trained network. (This suggests that whatever "world representation" the thing has learned isn't simply a matter of having an "E3 neuron" or whatever.)

I am not at all an expert on neural network interpretability, and I don't know to what extent their findings really justify calling what they've found a "world model" and saying that it's used to make move predictions. In particular, I can't refute the following argument:

"In most positions, just knowing what moves are legal is enough to give you a good idea of most of the board state. Anything capable of determining which moves are legal will therefore have a state from which the board state is somewhat reconstructible. This work really doesn't tell us much beyond what the fact that the model could play legal moves already does. If the probes are doing something close to 'reconstruct board state from legal moves', then the interventions amount to 'change the legal moves in a way that matches those available in the modified position', which of course will make the model predict the moves that are available in the modified position."

(It would be interesting to know whether their probes are more effective at reconstructing the board state in positions where the board state is closer to being determined by the legal moves. Though that seems like it would be hard to distinguish from "the model just works better earlier in the game", which I suspect it does.)


Ω 29

New Comment
19 comments, sorted by Click to highlight new comments since: Today at 11:17 PM

This suggests that their network isn't really big enough to capture legality and good strategy at the same time, I guess?

I'm not sure this is a network size issue. I think it's also plausible that there are multiple different rules of Othello compatible with high-level Othello play, and so it's not obvious which rules are actually generating the dataset, whereas with random play there's a crisper distinction between legal and illegal moves. (Like, lots of legal moves will never be played in the high-level play dataset, but not in the random dataset.)

Yes, that's also possible.

More than that, it's obviously literally true that e.g. no amount of watching top-level games will let you rule out possibilities like "the actual rules of Othello, except that in this position this particular (stupid) move is illegal". What's not clear to me is how plausible it is that there are alternate "roughly equally plausible to a transformer" rulesets compatible with expert play.

I'm pretty sure that a human being watching (say) a thousand top-level Othello games would infer the rules correctly, though of course humans have all sorts of priors on what game rules tend to be like that GPT doesn't.

Games can be quite complicated. Consider chess: how many grandmaster vs grandmaster games of chess would you have to watch offline before you observed pawns being promoted to not just queens, but rooks, bishops, and knights (and observed it enough times to be certain that pawns couldn't be promoted to anything else, such as pawns or kings, or to the opposite color, or that any other piece could be promoted, that promotion is not just a good idea but mandatory, and that it could happen only on the last rank?) I'm going to predict that it would take much more than 1,000 games. And if you miss any of those wrinkles, regardless of how they never come up in actual play (their rarity is, after all, exactly why you failed to learn them), then a hardnosed critic would be justified in saying you have failed to learn some aspect of 'chess'.

To learn this in a small sample size, you need to either have informative priors from knowing how to play other board games which have promotion mechanics, priors from something like pretraining on natural language describing chess rules, or be doing explicit exploration with online RL (eg MuZero) where you can try to not promote a pawn and discover that's illegal etc.

Once upon a time, the rules of chess didn't make it explicit that you have to promote to a piece of your own colour, hence the amusing problem that's first under the heading "Offbeat interpretations of the rules of chess" at

However, as it happens, the rules of Othello are quite simple, and the rules a human would infer from watching even a rather small number of games are the correct rules. Again, it's not in dispute that other rulesets are possible given what one could learn from watching thousands of high-level games, but I don't think there's any real risk that a human would pick the wrong one. That doesn't mean that GPT wouldn't, of course, and I agree that Vaniver's hypothesis is plausible, but it's not clear to me how likely it actually is.

Note that the figure the paper reports isn't "how accurately do its predictions distinguish legal from illegal moves", exactly, it's "how often is the top prediction legal?". So the failure mode when it learns from championship games is that sometimes it predicts illegal moves. That can't be a matter of successfully learning rules that somehow restrict you to only playing good moves.

So I think something along the lines of "not enough network capacity to learn legality as well as good strategy" is still pretty plausible; but now I think about it some more, there's also a simpler explanation: the synthetic training set is much larger than the championship training set, so maybe it just needs more training to learn not to make illegal moves. It's unfortunate that the paper doesn't try the experiment of training with a synthetic training set the same size as the championship training set.

(Speaking of which, I have just noticed that I misread something; the synthetic training set is 20M games, not 4M as I claim. I'll edit the OP to fix this.)

However, as it happens, the rules of Othello are quite simple, and the rules a human would infer from watching even a rather small number of games are the correct rules.

I suspect that if you watched small children play, who have not played many board games, this would not be the case. There will still be countless rule sets consistent with it. Mathematical/esthetic valuation of simplicity are learned, not innate or a priori to either humans or Transformers.

That can't be a matter of successfully learning rules that somehow restrict you to only playing good moves.

Sure it can. Problems with stochastic sampling aside ('sampling can prove the presence of knowledge but not the absence'), this is something we learned back with our chess GPT-2 work: predicting solely legal moves based on a history of moves is actually quite difficult because your non-recurrent Transformer model needs to reconstruct the board state each time from scratch by internally replaying the game. It wasn't playing chess so much as blindfold blitz chess.

If you make any errors in the internal board state reconstruction, then you can easily make what you think is a legal move, and would in fact be a legal move given your reconstruction, but is not a legal move. (Note the mention of substantial error in attempting to extract a board state from the model.) So it's entirely possible that when they test the legal move prediction in a particular board state by feeding in a history (pg4) which would lead to that board state (and not that board state), you are seeing 100% correct rule learning and that 0.01-0.02% error is when the board state errors trip up the choice of move.

Our own conclusion was that since we didn't really need the chess notation to be compact as PGN (games were always much smaller than the GPT-2 context window), we shouldn't train solely on PGN (ie. [action]) but on (move,FEN) (ie. [(action,state)]), to get better performance and a better idea of what GPT-2 was learning in chess unaffected by its limitations in state reconstruction (which presumably reflected uninteresting things like how deep the arch is and so how many serial steps it could compute before running out of time). You'll notice that Decision Transformer work usually doesn't try to do open-loop 'blind' DRL agents, like OP does, but trains the DT on (action,state) pairs.

I take it that by "if you watched small children play" you mean that the small children are the ones watching the games and inferring the rules. You might be right; it would be an interesting experiment to try. I think it might be difficult to distinguish "they haven't yet acquired the priors I have about what constitutes simplicity and elegance in the ruleset of a game" from "they just aren't very good at thinking because their brains are still developing", though.

I have the feeling that we are talking past one another a bit as regards what's going on with the "championship" training set; much of what you write seems like it's arguing against something I wasn't intending to say, or for something  I wasn't intending to deny. In any case, my current theory is that the difference has much more to do with there being 14x more games in the "synthetic" set than in the "championship" set than with either (1) the latter having less diversity of positions (only ones that good players get into) or (2) the latter having less diversity of moves in any given position (only ones that good players would make).

I bet you're right that if you want a transformer to learn to play chess well you should give it the board state on every move. That wouldn't have been appropriate for the work we're discussing, though, since the whole point was to determine whether a transformer trained only on the moves will learn to have an internal representation of the board state, which in turn is suggestive of whether a much larger transformer trained only on text will learn to have an internal representation of the world that the text is about.

That wouldn't have been appropriate for the work we're discussing, though, since the whole point was to determine whether a transformer trained only on the moves will learn to have an internal representation of the board state, which in turn is suggestive of whether a much larger transformer trained only on text will learn to have an internal representation of the world that the text is about.

Sure, I'm not saying they should've done that instead. In addition, but probably they didn't have the time/energy. My point is just that the illegal-move error rate is ambiguous if you (gjm) are interested in whether it has perfectly learned the rules (which is different from what the authors are going after), because there are sources of error beyond "it has failed to learn the rules", like errors reconstructing the board state leading to misapplication of potentially-perfectly-learned rules. To my eyes, a legal move error rate as low as 0.01% in this setup, given the burden of state reconstruction in a unnatural and difficult way, strongly suggests it's actually doing a great job of learning the rules. I predict that if you set it up in a way which more narrowly targeted rule learning (eg behavior cloning: just mapping full game state->expert-action, no history at all), you would find that its illegal move rate would approach 0% much more closely, and you'd have to find some really strange edge-cases like my chess promotion examples to trip it up, (at which point one would be satisfied, because how would one ever learn those unobserved things offline without priors).

I agree that the network trained on the large random-game dataset shows every sign of having learned the rules very well, and if I implied otherwise then that was an error. (I don't think I ever intended to imply otherwise.)

The thing I was more interested in was the difference between that and the network trained on the much smaller championship-game dataset, whose incorrect-move rate is much much higher -- about 5%. I'm pretty sure that either (1) having a lot more games of that type would help a lot or (2) having a bigger network would help a lot or (3) both; my original speculation was that 2 was more important but at that point I hadn't noticed just how big the disparity in game count was. I now think it's probably mostly 1, and I suspect that the difference between "random games" and "well played games" is not a major factor, and in particular I don't think it's likely that seeing only good moves is leading the network to learn a wrong ruleset. (It's definitely not impossible! It just isn't how I'd bet.)

Vaniver's suggestion was that the championship-game-trained network had learned a wrong ruleset on account of some legal moves being very rare. It doesn't seem likely to me that this (as opposed to 1. not having learned very well because the number of games was too small and/or 2. not having learned very well because the positions in the championship games are unrepresentative) is the explanation for having illegal moves as top prediction 5% of the time.

It looked as if you were disagreeing with that, but the arguments you've made in support all seem like cogent arguments against things other than what I was intending to say, which is why I think that at least one of us is misunderstanding the other.

In particular, at no point was I saying anything about the causes of the nonzero but very small error rate (~0.01%) of the network trained on the large random-game dataset, and at no point was I saying that that network had not done an excellent job of learning the rules.

Thanks for sharing! Sounds like an interesting paper that I might otherwise not have noticed

Link doesn't work because of last character.

I've fixed both links.

Thanks! Sorry about that.

Yes, this is very interesting work. I have been exploring how ChatGPT generates simple stories. It exhibits similar behavior in that it seems to have induced a story grammar and it sticks to it. Think of that story grammar as comparable to the model the Othello GPT has built of the game board (its "external world"). That is, the story grammar places constraints on ChatGPT in the way that the Othello board & rules constrain play.

I chose to work with one particular story that has five "moves" from beginning to end. I prompt it by saying I'm going to give it a short story and I want it to create a new one except that ___. And there I'll list a change. Mostly I change the protagonist, e.g. from princess Aurora to William the Lazy. The new story differs from the old in in various ways which dependent on the difference between the old protagonist and the new one. But the new story ALWAYS has the same five moves.

I've made tables that compare the two stories side-by-side, with changes highlighted in yellow. I have four of those tables in this short document.

Your document was very interesting.

Thanks. I have lots of examples like that, though not all of them are laid out in tables. I have them at my blog at this link, stories.

[-][anonymous]2mo 10

In most positions, just knowing what moves are legal is enough to give you a good idea of most of the board state.

This is a very good point worth pondering over. I think it applies to many areas. You can see in this video that Carlsen has incredible memory. Even for someone is not Carlsen but has played a long time, simply being able to eliminate invalid board arrangement in less than a second of thinking time is enough to eliminate a lot of extra work that you can spend on less tedious work that require more processing power. I think this applies to musical improvisation too. A lot of good improvisers don't really need to think about what notes are valid and what aren't. They've practiced so much that it's second nature.

I feel like neural network are similar in a way that the prior is essentially this second nature memory. Every epoch is the model seeking out new information to add to the existing model and finding out where in the model this new information fits, or maybe even throw it out completely or just in the furthest corners of the model's memory.

It's worth noting that in blindfold chess, you by definition don't see the board state, only the history (if you can remember it), and whether you played a legal move. It has a sort of mirror or dual game in the form of Kriegspiel, you can see your pieces but not the enemy's moves/history, and you again can try to play moves and will be told if they are legal or not (and you are expected to 'probe' with possible moves to gain information from the legality thereof). This demonstrates that human players can play satisfying chess with not much more than legality plus some additional information. (It would be interesting to see if with enough practice, humans could play 'blindfold Kriegspiel' reasonably well or if that winds up being too difficult.)

[-][anonymous]2mo 30

It becomes a probabilistic chess game rather than an expansion of the existing chess game. When you can see the pieces yourself, you don't have to solve the probabilistic problem. Reducing complexity is a strategy in and of itself. Now you have to weigh whether you want to focus on getting better probabilities, and whether that gets in the way of regular chess strategies.

New to LessWrong?