GPT-4 is bad at strategic thinking

[-]roystgnr3y125

It's hard to apply general strategic reasoning to anything in a single forward pass, isn't it? If your LLM has to come up with an answer that begins with the next token, you'd better hope the next token is right. IIRC this is the popular explanation for why LLM output seems to be so much better when you just add something like "Let's think step by step" to the prompt.

Is anyone trying to incorporate this effect into LLM training yet? Add an "I'm thinking" and an "I'm done thinking" to the output token set, and only have the main "predict the next token in a way that matches the training data" loss function grade on tokens that aren't in between those brackets. Then when you hit "What is 45235 + 259719? 304954" in the training set, optimization doesn't have to discourage multi-step reasoning to reproduce that, because "<thinking>5+9=14, so we carry the 1" ... "</thinking>304954" is still worth just as much as an ex nihilo "304954". Chess algorithms could do a brief tree search before outputting their final decision.

Add whatever regularization is needed to keep the "train of thought" in English rather than in whatever-cypher-the-optimizer-hits-on, and this would be an increase in safety, not just in capabilities. The more "internal reasoning" is human-readable text rather than maybe-a-posteriori-interpretable activation patterns, the better. You could even expose it to the end user: ask ChatGPT a question and you get a succinct answer, click on the "expose thoughts" button and you get the chain of support for the answer.

[-]Caleb Biddulph3y-20

This is also basically an idea I had - I actually made a system design and started coding it, but haven't made much progress due to lack of motivation... Seems like it should work, though

[-]harfe3y50

Certain kinds of "thinking ahead" is difficult to do within 1 forward pass. Not impossible, and GPT-4 likely does a lot of thinking ahead within 1 forward pass.

If you have lots of training data on a game, you often can do well without thinking ahead much. But for a novel game, you have to mentally simulate a lot of options how the game could continue. For example, in Connect4, if you consider all your moves and all possible responses, these are 49 possible game states you need to consider. But with experience in this game, you learn to only consider a few of these 49 options.

Maybe this is a reason why GPT-4 is not so good when playing mostly novel strategy games.

[-]nem3y20

Small nitpick with the vocabulary here. There is a difference between 'strategic' and 'tactical', which is particularly poignant in chess. Tactics is basically your ability to calculate and figure out puzzles. Finding a mate in 5 would be tactical. Strategy relates to things too big to calculate. For instance, creating certain pawn structures that you suspect will give you an advantage in a wide variety of likely scenarios, or placing a bishop in such a way that an opponent must play more defensively.

I wasn't really sure which you were referring to here; it seems that you simply mean that GPT isn't very good at playing strategy games in general; ie it's bad at strategy AND tactics. My guess is that GPT is actually far better at strategy; it might have an okay understanding of what board state looks good and bad, but no consistent ability to run any sort of minimax to find a good move, even one turn ahead.

[-]Christopher King3y40

It didn't even seem to understand what the goals of any of the games were, despite being able to explain it in natural language. So it wasn't even at a point I could test a strategy v.s. tactics distinction.

[-]nem3y30

Ha, no kidding. Honestly, it can't even play chess. I just tried to play it, and asked it to draw the board state after each move. It started breaking on move 3, and deleted its own king. I guess I win? Here was its last output.

For my move, I'll play Kxf8:

8 r n b q . b . .
7 p p p p . p p p
6 . . . . . n . .
5 . . . . p . . .
4 . . . . . . . .
3 . P . . . . . .
2 P . P P P P P P
1 R N . Q K B N R
a b c d e f g h

[-]Christopher King3y40

Apparently GPT-4 is only good at chess if it tell it not to explain anything (or show the board as it turns out). This also suggests that the chess part is separate from the rest.

[-]WilliamKiely3y10

Feedback on the title: I don't like the title because it is binary.

Saying X is "good" or "bad" at something isn't very informative.

There are many degrees of goodness. Was it worse than you thought it would be before you played around with it a bit more? Was it worse than some popular article or tweet made you think? Was it worse than some relevant standard?

[+][comment deleted]3y20

LESSWRONG
LW

LESSWRONG
LW

22

GPT-4 is bad at strategic thinking

22

22