Review

(Edit: This post is out of date, and understates ChatGPT's chess abilities. gpt-3.5-turbo-instruct, when prompted correctly, plays consistently legal moves at around the 1800-2000 Elo level. Thanks to commenter GoteNoSente for pointing this out and to ClevCode Ltd for writing a wrapper. See this related GitHub by ClevCode.) 

There are lots of funny videos of ChatGPT playing chess, and all of them have the same premise: ChatGPT doesn't know how to play chess, but it will cheerfully and confidently make lots of illegal moves, and humoring its blundering attempts to play a game it apparently doesn't understand is great content. 

What's less well-known is that ChatGPT actually can play chess when correctly prompted. It plays at around 1000 Elo, and can make consistently legal moves until about 20-30 moves in, when its performance tends to break down. That sounds not-so-impressive, until you consider that it's effectively playing blindfolded, having access to only the game's moves in algebraic notation, and not a visual of a chessboard. I myself have probably spent at least a thousand hours playing chess, and I think I could do slightly better than 1000 Elo for 30 moves when blindfolded, but not by much. ChatGPT's performance is roughly the level of blindfolded chess ability to expect from a decent club player. And 30 moves is more than enough to demonstrate beyond any reasonable doubt that ChatGPT has fully internalized the rules of chess and is not relying on memorization or other, shallower patterns.

The "magic prompt" that I've been using is the following:

You are a chess grandmaster playing black, and your goal is to win as quickly as possible. I will provide the current game score before each of your moves, and your reply should just be your next move in algebraic notation with no other commentary. The current score:

1. e4

and then in my later replies, providing the full current game score[1] to ChatGPT as my message to it, e.g.:

1. e4 f5
2. Nh3 fxe4
3. Nf4 Nf6
4. b4 e5
5. b5

This "magic prompt" isn't original to me—soon after GPT-4 came out, a friend of mine told me about it, having seen it as a comment on HackerNews. (Sorry, anonymous HackerNews commenter—I'd love to credit you further, and will if you find this post and message me.)

The especially interesting thing about this is the sharp contrast between how ChatGPT-3.5 performs with and without the prompt. With the prompt, ChatGPT plays consistently legally and even passably well for the first 30 or so moves; without the prompt, ChatGPT is basically totally unable to play a fully legal game of chess.

Here are a few example games of ChatGPT playing or attempting to play chess under various conditions.

ChatGPT-3.5, with the magic prompt

Playing against me

Lichess study, ChatGPT conversation link

I play white, ChatGPT plays black. In this game, I intentionally play a bizarre opening, in order to quickly prove that ChatGPT isn't relying on memorized opening or ideas in its play. This game isn't meant to show that ChatGPT can play well (since I'm playing atrociously here), only that it can play legally in a novel game. In my view, this game alone is more than enough evidence to put to bed the notion that ChatGPT "doesn't know" the rules of chess or that it's just regurgitating half-remembered ideas from its training set; it very clearly has an internal representation of the board, and fully understands the rules. In order to deliver checkmate on move 19 with 19...Qe8# (which it does deliberately, outputting the pound sign which indicates checkmate), ChatGPT needed to "see" the contributions of at least six different black pieces at once (the bishop on g4, the two pawns on g7 and h6, the king on f8, the queen on e8, and either the rook on h8 or the knight on f6).

Playing against Lichess Stockfish Level 1

Lichess game, ChatGPT conversation link

Stockfish level 1 has an Elo of around 850[2]. Stockfish is playing white and ChatGPT is playing black. In this game, ChatGPT quickly gains a dominating material advantage and checkmates Stockfish Level 1 on move 22. 

Playing against Lichess Stockfish Level 2

Lichess game, ChatGPT conversation link

Stockfish level 2 has an Elo of around 950. Stockfish is playing white and ChatGPT is playing black. In this game, ChatGPT starts a dangerous kingside attack and gains a lot of material from it. By move 33, ChatGPT is up two queens and a rook and will be checkmating its opponent in just a few more moves—but it's at the end of its rope (33 moves is a lot) and now wants to play the illegal move 33...Qxd2+, capturing its own queen. Re-rolling this response doesn't help. (In general, I haven't cherrypicked or re-rolled in any of these games, except when explicitly noted).

Playing against Lichess Stockfish Level 3

Lichess game, ChatGPT conversation link

Stockfish level 3 has an Elo of around 1050. Stockfish is playing white and ChatGPT is playing black. In this game, things get messy right out of the opening. ChatGPT believes itself to be delivering checkmate on move 13 with 13...Qe2+, not noticing that white's queen on e6 can capture backwards (a very human-like mistake). The game continues until move 20 with even material, whereupon ChatGPT wants to make the illegal move 20...Rxg2, moving its rook through its own pawn (a much less human-like mistake). Re-rolling this response doesn't help.

ChatGPT-3.5, without the magic prompt

Playing against me

Lichess study, ChatGPT conversation link

I prompt ChatGPT in a more normal conversational style, and play an unconventional opening to get ChatGPT out of its comfort zone. Without the magic prompt, ChatGPT performs very poorly, being unable to produce a legal move by move 8.

Playing against Lichess Stockfish level 1

Lichess game, ChatGPT conversation link

Against prompting ChatGPT in a conversational style, ChatGPT becomes unable to make a legal move by move 14 (and on move 10, makes another minor error).

The difference here is striking. It's fairly clear to me that ChatGPT-3.5 only displays careful knowledge of the game's rules when prompted with a specialized prompt, and is relying only on opening memory and general patterns when no specialized prompt is used.

ChatGPT-4

Interestingly, I actually began this post with games against GPT-4, having remembered from trying months ago that GPT-4 played legal chess with the prompt but not without it. But when I tried again recently, I actually discovered that ChatGPT-4 could play legally for a long time even without it! The difference for GPT-4 is a lot less striking than it is for GPT-3.5. So here's just a few highlights[3]:

Playing against me with the magic prompt

Lichess game, ChatGPT conversation link

This is a cool one—ChatGPT checkmates me in 22 moves after an unconventional opening on my part. ChatGPT subjects me to a long sequence of checks (including a discovered check and a castle-with-check) and eventually checkmates me with 22...Bf8#.

Playing against Stockfish Level 1 with the magic prompt

Lichess game, ChatGPT conversation link

ChatGPT checkmates Stockfish Level 1 in 25 moves. This one's mainly notable for ChatGPT's correct use of en passant on move 7.

Playing against Stockfish Level 3 with the magic prompt

Lichess game, ChatGPT conversation link

ChatGPT gets itself a couple pieces up against Stockfish Level 3, but starting on move 29 starts hallucinating continuations for both it and its opponent, rather than only giving its move.

Playing against me without the magic prompt

Lichess study, ChatGPT conversation link

ChatGPT plays a good game against me and checkmates me in 24 moves, including a nice discovered check on move 20, despite not having the magic prompt in this one and making conversation with me throughout the game. In this game, though, it should be noted that although ChatGPT checkmates me, it fails to recognize that it has done so, even after I ask it what move I should make.

Speculations about the causes of improvement as a result of the prompt

I'd guess that ChatGPT-3.5 performs relatively better with the prompt than without because of the entire game score being provided at each step of the conversation; when the whole score is provided, it presumably better matches the chess game scores it has seen in its training and has learned to predict. The chess scores in its training probably mostly don't have surrounding commentary and don't aren't broken up between two halves of a conversation. 

What I would find very interesting as a possible question to investigate is whether or not the network is storing a representation of the then-current (i.e. incomplete) state of the game at each token in the chess game score. I suspect that it is, but it's unfortunately difficult to prove, given that only ChatGPT seems capable of playing chess (I tried with both Claude and LLama-2 13B, but both proved completely unable to play legal chess, with or without the magic prompt).

If it were the case that it's storing intermediate board states in the longer scores, and that this is in fact responsible for the better performance that we see with instead of without the magic prompt in GPT-3.5, this could be a cool example of something analogous to filler tokens being shown to work. (These aren't exactly filler tokens as discussed in the linked post, since they carry information about the problem the chatbot is solving, but because they're redundant with what was already said in the conversation, I think they're at least similar in concept.)

What are some other examples of hidden capabilities of LLMs that are only elicited if the user prompts in a non-obvious way? Chess is an interesting one, but it's unfortunate that the game is so complex and that the phenomenon can't be observed on open-source models to my knowledge, making it hard to study more deeply.

 

  1. ^

    "Score" here is jargon for "a record of the game"

  2. ^

    These Elo ratings for early Stockfish levels are super approximate, with different sources claiming different ratings. I'm using these because they seem about right to me, but these ratings shouldn't be taken as anywhere near exact.

  3. ^

    With the ChatGPT-3.5 games, I've showed every game I played with it. But with these, I'm only showing a subset—I tried not to cherrypick and to be representative of its overall performance, but there's no ironclad promise I didn't cherrypick for the ChatGPT-4 games.

New Comment
32 comments, sorted by Click to highlight new comments since:

Could illegal moves be explicitly prohibited in the prompt - or some other changes in the prompt prevent them?

I am 85% confident that this won't work. The issue isn't that the prompt hasn't made it clear enough that illegal moves are off the table, the issue is that chatGPT isn't able to keep track of the board state well enough to avoid making illegal moves.

I've tried a game with GPT4 where it was fed the above prompt plus the FEN of the game and also had it "draw" the board. It seems to really struggle with it's geometric understanding of the game, as you'd expect. For example, it struggled with identifying which squares were under attack from a knight. I think this reflects a limitations of the current model and I don't think this is something a clever prompt will fix.

I also tried it with drawing the board + adding explanation to moves, and there is some errors in drawings. But may be there is a way to make the drawing more coherent?

the issue is that chatGPT isn't able to keep track of the board state well enough

Then tackle this problem directly. Find a representation of board state so that you can specify a middlegame position on the first prompt, and it still makes legal moves. 

That sounds not-so-impressive, until you consider that it's effectively playing blindfolded, having access to only the game's moves in algebraic notation, and not a visual of a chessboard.

Why not just give it access to a visual of a chessboard?

Hardly anyone seems to have access to the image-enabled GPT-4 yet, and thus far GPT-4 results on things like text art of Tic-tac-toe or Sudoku have not done well, so it doesn't look like you can just ASCII-art your way out of problems like this yet.

I was a little surprised the ASCII-art approach didn't work, because my original guess for the multimodal GPT-4 training had been that it was iGPT/DALL-E-1 style: just serializing images into visual tokens to train autoregressively on as a prefix. However, the GPT-4 architecture leak claims that it's implemented in a completely different way: as a separate model which does cross-attention to GPT-4. So that helps deal with the problem of the visual tokens losing detail while using up a lot of context, but is also not something that necessarily would instill a lot of visual knowledge into the text-only model. (The text-only model might even be completely frozen and untrained, like Flamingo etc.)

Anyway, as I've been suggesting since 2019 or so, it'd probably be useful to train some small chess Transformers directly on both PGN and FEN (and mixtures thereof, like a FEN every n PGN moves) to help elucidate Transformer world-modeling. Any GPT-n result is as much a result about the proprietary internal OA datasets (and effects of BPEs) as it is about Transformer world-modeling or capabilities.

The ChessGPT paper does something like that: https://arxiv.org/abs/2306.09200

We collect chess game data from a one-month dump of the Lichess dataset, deliberately distinct from the month used in our own Lichess dataset. we design several model-based tasks including converting PGN to FEN, transferring UCI to FEN, and predicting legal moves, etc, resulting in 1.9M data samples.

I hadn't seen that recent paper, thanks.

Looks like they do a lot of things with FEN and train on a corpus which includes some FEN<->move-based-representation tasks, but they don't really do anything which directly bears on this, aside from showing that their chess GPTs can turn UCI/sPGNs into FENs with high accuracy. That would seem to imply that it has learned a good model of chess, because otherwise how could it take a move-move-by-move description (like UCI/PGN) and print out an accurate summary of the final board state (like FEN)?

I would've preferred to see them do something like train FEN-only vs PGN-only chess GPTs (where the games are identical, just the encoding is different) and demonstrate the former is better, where 'better' means something like have a higher Elo or it doesn't get worse towards the end of the game.

I did train a transformer to predict moves from board positions (not strictly FEN because with FEN positional encodings don't point to the same squares consistently). Maybe I'll get around to letting it compete against the different GPTs. 

Something I always want to try is training a super tiny transformer on Tic Tac Toe, and see what it comes up with and how many games it needs to generalise.

GPT-3.5 isn't multimodal, so can't really do that; I do wonder whether it would make GPT-4's performance even better, though.

That said, this being a text-only model, really the only relevant information that would improve the situation is a freeze frame of the current state of the chessboard, expressed in any way - visuals just happen to work best for us, but GPT's natural domain is the written word. So the correct test would probably be to replace the score (which requires computation to reconstruct the board state from scratch) with some kind of notation to instead represent the current board, for example by using Forsyth-Edwards Notation. I'd like to see if that makes it play well for longer (also, it shortens the prompts, so it should avoid running out of context window).

FEN is definitely an option. By "visual", what I had in mind would be e.g. assembling an 8 by 8 grid of characters using e.g. https://en.m.wikipedia.org/wiki/Chess_symbols_in_Unicode

What I'm wondering is why people don't do this.

FEN is essentially the same thing as that, but better. Try to "think as a GPT" - if you're a fundamentally textual mind, then a no-frills, standardized representation that compresses all required information in a few well-known tokens will be ideal. With a custom representation it might instead have to learn it, the Unicode chess symbols may be unusual tokens, and any added tabulation or decoration is more of a source of confusion than anything else. It improves clarity for us humans, because we're highly visual beings, not necessarily for a transformer. Text does that a lot more straighforwardly, and if it's something that is likely to have appeared a lot in the training set, all the better.

I get the argument, but I'm not sure it's true. There might be enough Unicode chessboards on the internet that it has learned the basics of the representation, and it might be able to transfer-learn some strategies it sees in other notations to become good at Unicode chessboards, and a transformer might be able to exploit the geometry of the chessboard. Not sure.

Either FEN or a unicode chessboard could be interesting. Comparing both could be interesting too.

It's a good thought, and I had the same one a while ago, but I think dr_s is right here; FEN isn't helpful to GPT-3.5 because it hasn't seen many FENs in its training, and it just tends to bungle it.

Lichess study, ChatGPT conversation link

GPT-3.5 has trouble from the start maintaining a correct FEN, and makes its first illegal move on move 7, and starts making many illegal moves around move 13.

Apparently it also bungles the unicode representation: https://chat.openai.com/share/10b8b0d3-7c80-427a-aaf7-ea370f3a471b

Ah, dang it. So it's a damned if you do, damned if you don't - it has seen lots of scores, but they're computationally difficult to keep track of since they're basically "diffs" of the board state. But there's not enough FEN or other board notation going around for it to have learned to use that reliably. It cuts at the heart of one of the key things that hold back GPT from generality - it seems like it needs to learn each thing separately, and doesn't transfer skills that well. If not for this, honestly, I'd call it AGI already in terms of the sheer scope of the things it can do.

"The new GPT model, gpt-3.5-turbo-instruct, can play chess around 1800 Elo."

https://twitter.com/GrantSlatton/status/1703913578036904431

https://parrotchess.com/

The playing strength of parrotchess seems very uneven, though. On the one hand, if I play it head-on, just trying to play the best chess I can, I would estimate it even higher than 1800, maybe around 2000 when we regard this as blitz. I'm probably roughly somewhere in the 1900s and on a few tries, playing at blitz speed myself, I would say I lost more than I won overall.

On the other hand, trying to play an unconventional but solid opening in order to neutralize its mostly awesome openings and looking out for tactics a bit while keeping the position mostly closed, I got this game, where it does not look at all stronger than the chat3.5 models, and therefore certainly not 1800-level:

https://lichess.org/study/ymmMxzbj/SpMFmwXH

Nonetheless, the performance of this model at chess is very interesting. None of the other models, including GPT-4, has (with prompting broadly similar to what parrotchess uses) been able to get a good score against me if I just played it as I would play most human opponents, so in that sense it definitively seems impressive to me, as far as chess-playing language models go.

Good lord, I just played three games against it and it beat me in all three. None of the games were particularly close. That's really something. Thanks to whoever made that parrotchess website!

It is possible to play funny games against it, however, if one uses the fact that it is at heart a story telling, human-intent-predicting system. For instance, this here works (human white):

1. e4 e5 2. Ke2 Ke7 3. Ke3 Ke6 4. Kf3 Kf6 5. Kg3 Kg6 6. Kh3 Kh6 7. Nf3 Nf6 8. d4+ Kg6 9. Nxe5# 1-0

Oh wow, that is really funny. GPT-4's greatest weakness: the Bongcloud. 

The game notation is pretty close to a board representation already. For most pieces you just go to their last move to see on which square they are standing. I assume that is very readable for a LLM because they are able to keep all tokens in mind simultaneously. 

In my games with ChatGPT and GPT-4 (without the magic prompt) they both seemed to lose track of the position after the opening and completely fell apart. Which might be because by then many pieces have moved several times (so there are competing moves indicating a square) and many pieces have vanished from the board altogether. 

The game notation is pretty close to a board representation already. For most pieces you just go to their last move to see on which square they are standing. I assume that is very readable for a LLM because they are able to keep all tokens in mind simultaneously.

That raises an interesting question for world-modeling: does providing any hints or partial reveals of the hidden state make it easier or harder for a predictor to develop an internal world-model?

You could argue that it makes it easier, because look, a chunk of the hidden state is revealed right there and is being predicted, so of course that makes the task easier than if you were using some sort of even more implicit action-only representation like 'move the left-knight 2 spaces forward then 1 right'. But you could also argue that it makes it harder, by providing shortcuts and heuristics which a predictor can latch onto to minimize most of its loss, sabotaging the development of a complex but correct world model. (Like providing a lazy kid with a cheatsheet to a test: yes, they could use it to study, carefully correcting their own errors... or they could just copy the provided answers and then guess on the rest.)

My intuition is that it would depend on optimization details like compute & data: the partial-state model would eventually outperform in terms of both predictive loss and world-model internals, but it would take longer, in some sense, than the blind model which is being forced from the start to try to develop a world-model. (Somewhat like grokking and other late-training 'emergence'-esque phenomena.)

There was another Chess-GPT investigation into that question recently by Adam Karvonen: https://adamkarvonen.github.io/machine_learning/2024/01/03/chess-world-models.html

The linear probe accuracy for the board state actually peaks in the sixth layer (out of eight). To predict the next move it already discards some of the state information. Well, maybe that is unsurprising. 

It also doesn't reach terribly impressive accuracy. 99.2% sounds a lot, but it is per square, which means it might get a square wrong in every second position. 

I think more important than how easy it is to extract the information, is how necessary it is to extract the information. You can probably be somewhat fuzzy about board state details and still get great accuracy. 

There is a step in move prediction where you go beyond the intuitive move selection and have to calculate to find deeper reasons for and against moves. This feels similar to me to attending to your uncertainty about the placement of particular pieces beyond the immediate necessity. And all these models have not taken this step yet. 

I'm actually doing an analysis right now to nail it down that GPTs don't calculate ahead when trained on move prediction and stay completely in the intuitive move selection regime, but it's not easy to separate intuitive move selection and calculation in a bulletproof way. 

Well, maybe that is unsurprising.

Yes, that's a very common observation. After all, you still have to try to model the current player's planning & move based on the board state, and in the final layers, you also have to generate the actual prediction - the full 51k BPE logit array or whatever. That has to happen somewhere, and the final layers are the most logical place to do so. Same as with CNNs doing image classification: the final layer is a bad place to get an embedding from, because by that point, the CNN is changing the activations for the final categorical output.

I think more important than how easy it is to extract the information, is how necessary it is to extract the information. You can probably be somewhat fuzzy about board state details and still get great accuracy.

Yes. This gets back to the core of 'what is an imitation-learning LLM doing?' Janus's Simulators puts the emphasis on 'it is learning the distribution, and learning to simulate worlds'; but the DRL perspective puts the emphasis on 'it is learning to act like an agent to maximize its predictive-reward, and learns simulations/worlds only insofar as that is necessary to maximize reward'. It learns a world-model which chucks out everything which is unnecessary for maximizing reward: this is not a faithful model but a value-equivalent model.

If there is some aspect of the latent chess state which doesn't, ultimately, help win games, then a chess LLM (or MuZero) doesn't want to learn to model that part of the latent chess state, because it is, by definition, useless. (It may learn to model it, but for other reasons like accident or having over-generalized or because it's not yet sure that part is useless or as a shortcut etc etc.)

This hasn't previously been important to the 'do LLMs learn a world model?' literature because the emphasis has been responding to naysayers like Bender, who claim that they do not and cannot learn any world model at all, that a 'superintelligent octopus' eavesdropping on chess game transcripts would never learn to play chess at all beyond memorization. But since that claim is now generally accepted to be false, the questions move on to 'since they do learn world models, what sorts, how well, why, and when?'

Which will include the question of how well they can learn to amortize planning. I am quite sure the answer is that they do so to some non-zero degree, and that you are wrong in general about GPTs never planning ahead, based on evidence like Jones's scaling laws which show no sharp transitions between training and planning and only a smooth exchange rate, and the fact that you can so easily distill results from planning into a GPT (eg. distilling the results of inner-monologue 'planning' into a single forward pass). So my expectation is that either your models will be too small to show clear signs or planning or you will just get false nulls from inadequate model interpretability methods - it's not an easy thing to do, to figure out what a model is thinking and what it is not thinking, after all.

It plays at around 1000 Elo, and can make consistently legal moves until about 20-30 moves in, when its performance tends to break down.

 

Do we know if this happens because of running out of context window and beginning to lose the first prompts?

If not, it could be that the computation of the board state from just the score is becoming too onerous (namely, playing "blindfolded" has gotten too hard). In that case, I wonder if replacing the score with a representation of the board with Forsyth-Edwards Notation as I suggested in my other comment would improve things.

I don't think it's a question of the context window—the same thing happens if you just start anew with the original "magic prompt" and the whole current score. And the current score is alone is short, at most ~100 tokens—easily enough to fit in the context window of even a much smaller model.

In my experience, also, FEN doesn't tend to help—see my other comment.

Good question but no - ChatGPT still makes occasional mistakes even when you use the GPT API, in which you have full visibility/control over the context window.

Blind-folded humans as an analogy for breaking rules later on? Try Experimenting with this prompt?

"Let's play a game of chess - you are a chess grandmaster. To optimise how we both play (as we will say in chess annotations our moves and not use a real chess board), I want you to make a grid to post every new move and refer back to so that you do it correctly. The grid would show 8 x 8 characters with [ ] representing an empty square. You could use a letter to represent each piece such as a King could be [k]. Could you make up as best as you can an 8 by 8 grid showing the starting position of chess and I will then be white once you do so and we will proceed. "

If you use the word annotations rather than algebraic notations then it appears to force it to analysis it properly in chatgpt4. May be a jailbreak? Initially a typo, very weird.

Human players can retry when they attempt to make an illegal move on lichess, and can click on a piece to see which moves are legal. I wonder how much ChatGPT's ELO improves if you allow it to retry moves until it makes a legal one.