(This post might look better at dynomight.net/chess.)
A year ago, there was a lot of talk about large language models (LLMs) playing chess. Word was that if you trained a big enough model on enough text, then you could send it a partially played game, ask it to predict the next move, and it would play at the level of an advanced amateur.
This seemed important. These are “language” models, after all, designed to predict language.
Now, modern LLMs are trained on a sizeable fraction of all the text ever created. This surely includes many chess games. But they weren’t designed to be good at chess. And the games that are available are just lists of moves. Yet people found that LLMs could play all the way through to the end game, with never-before-seen boards.
Did the language models build up some kind of internal representation of board state? And how to construct that state from lists of moves in chess’s extremely confusing notation? And how valuable different pieces and positions are? And how to force checkmate in an end-game? And they did this all “by accident”, as part of their goal of predicting general text?
If language models can do all that for chess, then maybe it’s a hint of how they deal with other situations too.
So that was very exciting. A year ago.
Since then, there’s mostly been silence. So I decided to check in and see how things are going. Having done that, I can now report: Weirdly.
What I did
To make LLMs play chess, I sent them prompts like this:
You are a chess grandmaster.
Please choose your next move.
Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".
NEVER give a turn number.
NEVER explain your choice.
Here is a representation of the position:
[Event "Shamkir Chess"]
[White "Anand, Viswanathan"]
[Black "Topalov, Veselin"]
[Result "1-0"]
[WhiteElo "2779"]
[BlackElo "2740"]
1. e4 e6 2. d3 c5 3. Nf3 Nc6 4. g3 Nf6 5.
I used the output as a move. I always had the LLM play as white against Stockfish—a standard chess AI—on the lowest difficulty setting.
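For the curious, the loop looks roughly like this. This is a minimal sketch, assuming the python-chess library and a local Stockfish binary; query_llm() is a hypothetical stand-in for whichever model is being tested, and in the real setup generation was constrained to legal moves (see the Details section below).

```python
# Minimal sketch of the game loop. Assumes python-chess and a local
# Stockfish binary; query_llm() is a hypothetical stand-in for the model.
import chess
import chess.engine

INSTRUCTIONS = (
    "You are a chess grandmaster.\n"
    "Please choose your next move.\n"
    'Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".\n'
    "NEVER give a turn number.\n"
    "NEVER explain your choice.\n"
    "Here is a representation of the position:\n"
    # ...plus the PGN headers shown in the prompt above.
)

def query_llm(prompt: str) -> str:
    raise NotImplementedError  # call whichever model you're testing

def prompt_for(board: chess.Board) -> str:
    # Render the game so far as numbered SAN, ending with the next move
    # number, e.g. "1. e4 e6 2. d3 c5 3. Nf3 Nc6 4. g3 Nf6 5."
    san = chess.Board().variation_san(board.move_stack)
    prefix = san + " " if san else ""
    return INSTRUCTIONS + prefix + f"{board.fullmove_number}."

def play_one_game() -> chess.Board:
    board = chess.Board()
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    engine.configure({"Skill Level": 0})  # lowest difficulty setting
    while not board.is_game_over():
        if board.turn == chess.WHITE:  # the LLM always plays white
            board.push_san(query_llm(prompt_for(board)).strip())
        else:
            board.push(engine.play(board, chess.engine.Limit(time=0.1)).move)
    engine.quit()
    return board
```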
The first model I tried was llama-3.2-3b. This is a “base model”, meaning it is mostly trained to output text, not to chat with you or obey instructions. It’s quite small by modern standards, with only 3 billion parameters. For reference, GPT-2, released back in 2019, had 1.5 billion parameters, and GPT-4 is rumored to have around 1.8 trillion.
I had it play 50 games, then had a chess engine score each board after each turn in “centipawns”. This is a measure where a pawn is 100 points, but there’s also accounting for position. If the game was over, I assigned a score of +1500 if the LLM won, 0 if there was a tie, and -1500 if it lost.
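Concretely, the scoring looks something like this. A sketch assuming python-chess and Stockfish; the analysis depth here is illustrative.

```python
# Sketch of the per-turn scoring. Scores are centipawns from White's (the
# LLM's) point of view; finished games are mapped to +1500 / 0 / -1500.
import chess
import chess.engine

def score_board(board: chess.Board, engine: chess.engine.SimpleEngine) -> int:
    if board.is_game_over():
        return {"1-0": 1500, "1/2-1/2": 0, "0-1": -1500}[board.result()]
    info = engine.analyse(board, chess.engine.Limit(depth=15))
    # mate_score folds forced mates onto the same +/-1500 scale
    return info["score"].white().score(mate_score=1500)
```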
The results were:
Terrible.
In the above figure, there’s one line for each game, and the dark line shows the per-turn median. The LLM can play standard openings for a few moves but then quickly starts throwing away pieces. It lost every single game, even though Stockfish was on the lowest setting.
Maybe that model is too small? So I got llama-3.1-70b, which is a similar model but with 70 billion parameters instead of 3 billion. The results were:
A little better, but still extremely bad. Next I tried llama-3.1-70b-instruct, a similar model, except trained to be better at following instructions. The results were:
Maybe there’s something wrong with the Llama models or datasets? So I tried Qwen-2.5-72b.
Maybe Qwen is somehow defective too? So I tried command-r-v01, a 35 billion parameter model.
And then I tried gemma-2-27b.
And then I tried gpt-3.5-turbo-instruct. This is a closed OpenAI model, so details are very murky. I only ran 10 trials, since AI companies have inexplicably neglected to send me free API keys and this was costing The Automator money. The results were:
Excellent. Very, very good.
Even if you raise Stockfish’s level a few clicks, this model will still win every game.
Moving on… I next tried gpt-3.5-turbo, a model that’s similar, except tuned to be more chatty and conversational.
And then I tried gpt-4o-mini, which is a newer chat model.
And then I tried gpt-4o, a bigger chat model.
It lost every single game, though it lost slightly slower.
Finally, I tried o1-mini, a model that’s supposed to be able to solve complex tasks. (I’m too poor for o1.)
So, umm:
And, uh:
Notice anything? Any patterns jump out at you?
Discussion
There are lots of people on the internet who have tried to get LLMs to play chess. The history seems to go something like this:
Before September 2023: Wow, recent LLMs can sort of play chess! They fall apart after the early game, but they can do something! Amazing!
September-October 2023: Wow! LLMs can now play chess at an advanced amateur level! Amazing!
(Year of silence.)
Recently: Wow, recent LLMs can sort of play chess! They fall apart after the early game, but they can do something! Amazing!
I can only assume that lots of other people are experimenting with recent models, getting terrible results, and then mostly not saying anything. I haven’t seen anyone say explicitly that only gpt-3.5-turbo-instruct is good at chess. No other LLM is remotely close.
To be fair, a year ago, many people did notice that gpt-3.5-turbo-instruct was much better than gpt-3.5-turbo. Many speculated at the time that this is because gpt-3.5-turbo was subject to additional tuning to be good at chatting.
That might be true. Here’s a comparison of three models where we have similar versions with or without additional chat tuning.
(Again, do not be confused by the name gpt-3.5-turbo-instruct. I stress that this is more like a base model than gpt-3.5-turbo. This is the opposite of the naming scheme everyone else uses, where “instruct” or “it” means more tuning to be good at chatting.)
In all cases, additional instruction tuning makes the model worse. But the difference is very small in two cases, and enormous in the other.
Possible theories
I can think of four possible explanations.
Theory 1: Base models at sufficient scale can play chess, but instruction tuning destroys it.
This would be consistent with our data. But I did manage to get llama-3.1-405b to play a couple of games. Despite being larger than gpt-3.5-turbo, it was still terrible.
Theory 2: GPT-3.5-instruct was trained on more chess games.
All models were clearly trained on a lot of chess games. But it’s hard to know exactly how many.
Theory 3: There’s something particular about different transformer architectures.
I doubt this, but it could be that for some reason, Llama type models are uniquely bad at chess.
Theory 4: There’s “competition” between different types of data.
We know that transformers trained specifically on chess games can be extremely good at chess. Maybe gpt-3.5-turbo-instruct happens to have been trained on a higher fraction of chess games, so it decided to dedicate a larger fraction of its parameters to chess.
That is, maybe LLMs sort of have little “chess subnetworks” hidden inside of them, but the size of the subnetworks depends on the fraction of the training data that was chess. (If this theory were true, we should probably expect that a big enough model would become good at chess, provided it was trained on enough chess games, even if the fraction of chess games is low.)
Details
I did things this way (i.e. by working with standard algebraic notation) because this is how people got good results a year ago, and in preliminary experiments I also found it to work best.
If you want to know exactly how I did things, here are some words: I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself, using Q5_K_M quantization, whatever that is. For the open models, I manually generated the set of legal moves and then used grammars to constrain the models, so they always generated legal moves. Since OpenAI is lame and doesn’t support full grammars, for the closed (OpenAI) models I tried generating up to 10 times, and if the model still couldn’t come up with a legal move, I just chose one randomly. For the chat models llama-3.1-70b-instruct, gemma-2-27b-it, gpt-3.5-turbo, gpt-4o-mini, and gpt-4o, I changed the system prompt to “You are a chess grandmaster. You will be given a partially completed game. After seeing it, you should choose the next move.” It’s impossible to change the system prompt for o1-mini, so I didn’t. I used a temperature of 0.7 for all the open models and the default for the closed (OpenAI) models. The fact that OpenAI has “open” as part of their name sure made this paragraph hard to write.
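Here’s a sketch of that legal-move plumbing, assuming python-chess. query_llm() is again a hypothetical stand-in, and gbnf_grammar() shows one way the grammar constraint could be built for llama.cpp-style models; it’s not necessarily exactly what I ran.

```python
# Sketch of the legal-move handling, assuming python-chess.
import random
import chess

def query_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical: call the model under test

def legal_sans(board: chess.Board) -> list[str]:
    return sorted(board.san(m) for m in board.legal_moves)

def gbnf_grammar(sans: list[str]) -> str:
    # One way to constrain llama.cpp-style models: a grammar that is just
    # an alternation over the currently legal SAN strings.
    return "root ::= " + " | ".join(f'"{s}"' for s in sans)

def openai_move(board: chess.Board, prompt: str, tries: int = 10) -> str:
    # Fallback for the OpenAI models: retry up to 10 times, then random.
    legal = set(legal_sans(board))
    for _ in range(tries):
        candidate = query_llm(prompt).strip()
        if candidate in legal:
            return candidate
    return random.choice(sorted(legal))
```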
Token weirdness
One extremely strange thing I noticed was that if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1. e4 e5 2.” (without the space) and let the model generate the space itself. Huh?
After some confusion, I’m pretty sure this is because of the tokenizer. Look at how the Llama tokenizer breaks up a string of moves:
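Here’s one way to see it yourself, as a sketch assuming the transformers library; the model id is illustrative, and the exact splits can vary with the tokenizer version.

```python
# Inspect the tokenization directly (model id is illustrative).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
for token in tok.tokenize("1. e4 e5 2. Nf3"):
    print(repr(token))
# Tokens like 'Ġe' are a space fused together with the following character.
```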
After the “1.”, it generates “ e” as a single token. That’s not the same as having a space followed by an e. So putting in the space yourself and then asking the model to generate tokens presents it with a confusing situation and leads to bad predictions.
The right way to deal with this is “token healing”—to delete the last token of the input and then do constrained generation over all strings that start with the deleted stuff. But I couldn’t figure out any easy way to do that. So, instead I left the space out and modified the grammar so that the model could generate a space (or not), then one of the current legal moves, and then another space (or not). Fun times!
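For what it’s worth, the modified grammar looks something like this, extending the hypothetical gbnf_grammar() sketch from the Details section:

```python
# Sketch of the workaround: the prompt ends without the trailing space,
# and the grammar lets the model emit the space (or not) itself.
def gbnf_grammar_with_spaces(sans: list[str]) -> str:
    alts = " | ".join(f'"{s}"' for s in sans)
    return f'root ::= " "? ({alts}) " "?'

# e.g. gbnf_grammar_with_spaces(["Nf3", "e4"])
#      -> 'root ::= " "? ("Nf3" | "e4") " "?'
```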
Update: OK, I actually think I've figured out what's causing this. I'll explain in a future post, but in the meantime, here's a hint: I think NO ONE has hit on the correct explanation!