It might be interesting to re-run GPT o1-mini without the line "NEVER explain your choice". I've gotten the sense that models sometimes get better results if they can "think out loud".
Part of the motivation for that is simply to make these experiments cheaper! For all the models except o1-mini, I can limit the output to like 7 output tokens.
Anyway, in the pre-o1 era, people experimented with prompts to encourage thinking quite a bit and (AFAIK) no one ever got close to the performance of gpt-3.5-turbo-instruct on raw chess notation. It's odd because, as you say, getting models to think things through usually helps! I think the theory is that there's just so much chess data in the training corpus that the LLM has some kind of sophisticated machinery to predict the next move. But because it's not used to seeing analysis written out before the moves, it doesn't do as well when asked to produce that analysis first.
In any case, I thought o1 was supposed to kind of "secretly" think even if you don't ask it? I assume that's why we have to pay for thousands of output tokens, even when we only see "e4"...
Another hypothesis: GPT-3.5 was fine-tuned to be good at chess, but training it to be better at chat clobbered this fine-tuning.
This is definitely possible. But this section (https://dynomight.net/chess/#discussion) was supposed to be a bit of an argument against. Basically, if it's the fine-tuning, then why are all the other models without fine tuning also so bad? Maybe they just don't have enough chess data? (Again, possible! But weird.)
Not criticizing the monocausal theories, but LLMs "do a bunch of stuff with a bunch of data" and if you ask them why they did something in particular, you get a hallucination. To be fair, humans will most often give you a moralized post hoc rationalization if you ask them why they did something in particular, so we're not far from hallucination.
To be more specific, the models change BOTH the "bunch of stuff" (training setup and prompts) and the "bunch of data", and those changes interact in deep and chaotic (as in chaos theory) ways.
All of this really makes me think about how we treat other humans. Training an LLM is a one-way operation; you can't really retrain one part of an LLM (as I understand it). You can do prompt engineering, and you can do some more training, but those interact in deep and chaotic ways.
I think you can replace LLM with human in the previous paragraph and not be too far wrong.
I always appreciate a good bit of systematic analysis!
Did you try GPT-4 (not -4o)? I had remembered it being GPT-4 that was originally reported to play a decent game of chess; per https://twitter.com/kenshin9000_/status/1662510532585291779, it was both 3.5 and 4.0. I wonder how they compare in skill.
I think maybe a slight extension of your Theory 1 explains all of the available data? Base models at sufficient scale can play chess, but this capability is fragile and can be destroyed by instruction tuning *or any number of other things*, such as tokenization issues, vagaries in choice of training data, etc.
Some other well-informed folks have suggested trying GPT-4 too. I haven't done that yet, but if the cost isn't too high, we might.
If GPT-4 works, then I think your theory would fit. Although I'd love to understand why llama-3.1-405b seems to be so bad, even though it's bigger than GPT-3.5. (To be fair, I had to run it at insanely slow speed on CPU and only made it through one game. But it was making terrible blunders.)
Yup, for sure would be interesting to understand that.
Did you see the recent thing where someone used interpretability tools to investigate why models keep saying 9.11 > 9.9, and it turned out that the model was activating features that primarily trigger in the context of dates (9/11 comes after 9/9) and Bible verses (9:11 comes after 9:9)?
Confusing and intriguing! I'd also be curious to see results for Claude 3.5 Sonnet; it somehow beats o1 on many reasoning-y benchmarks and I've subjectively found it surprisingly impressive. (Note it got a significant update a few weeks ago, which they stupidly just call the "new Sonnet 3.5" rather than incrementing the number.)
Given that 3.5 is closed source, is there any chance they're just calling out to a chess engine?
I suppose it's possible, but I consider that extremely unlikely, because:
1. They've never mentioned doing anything like that.
2. If they did, why remove it in later models?
3. You could get to the same board state via different sequences of moves. Chess engines don't care because they just look at the board state. But gpt-3.5-turbo-instruct behaves very differently!
4. While the model is quite good by human standards, it's still terrible by chess AI standards. (Maybe 1800 ELO.) If they went to such an effort to add an engine, why stop it at such a low level?
1. I assume there’s lots of secret sauce they don’t describe in all their products, and I imagine a lot of attempts to make it appear that LLMs are more capable than they perhaps are.
2. I assume different models have different architectures in various ways. You could add such a subsystem in one model as an experiment, or because Gary really wanted to, or whatever, and then not bother to include it in future architectures.
3. This is interesting, and is a good argument against. Though it's possible that the imaginary subsystem is a novel AI chess engine that doesn't behave in predictable ways.
4. I didn’t realise this! This is a really good argument against the possibility. I still think there’s a small chance of them having a weird experimental, but not very good, chess engine subsystem. But it does seem unlikely.
My assumption would be that OpenAI noticed that the chess thing was drawing a lot of interest, so they optimised something (training data, fine-tuning, something algorithmic) to make gpt-3.5-turbo-instruct much better at chess. Then they decided that the tradeoffs of that optimisation (cost, degradation of other functions) made it not worth it for future models.
That's a bit vague, but "OpenAI did it on purpose" is my main assumption. I don't know if the timing works out.
How often did the model not satisfy the grammar and so you picked a random move? Ability to satisfy the grammar might be a confounder?
Also did you try using zero temperature?
Just to be clear, the grammars meant that the open models always picked a legal move. It's sort of equivalent to re-prompting an infinite number of times. For the closed (OpenAI) models, I sadly didn't keep detailed numbers. But illegal moves happened quite often in the late game for all the models except... gpt-3.5-turbo-instruct, where it never happened!
(My guess is that imposing a grammar probably doesn't actually improve play very much in practice, because if the model is confused about what moves are legal, it probably doesn't know what moves are good. But it does make these experiments faster!)
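(For anyone curious what that handling amounts to in code, here's a rough sketch using python-chess, as I understand the setup: retry a few times, then fall back to a random legal move. `ask_model` is a hypothetical stand-in for whatever API call you're making, and the retry count is arbitrary.)

```python
import random
import chess  # python-chess

def get_move(board: chess.Board, ask_model, max_tries: int = 10) -> chess.Move:
    """Ask the model for a move in SAN; fall back to a random legal move.

    `ask_model` is a hypothetical callable that takes a board and returns a
    candidate move as a SAN string (e.g. "Nf3").
    """
    for _ in range(max_tries):
        candidate = ask_model(board).strip()
        try:
            # parse_san raises a ValueError subclass on illegal/unparseable moves
            return board.parse_san(candidate)
        except ValueError:
            continue  # ask again
    # Give up and play a uniformly random legal move
    return random.choice(list(board.legal_moves))
```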
BTW, I did try zero temperature with llama-3.1-70b. It seemed to play slightly WORSE that way, oddly. I didn't finish 50 games, but after 30 or so it still lost every time.
1. You're crippling the LLM by forbidding it from talking in the output. Consider asking it _first_ to draw the board state and reason "out loud" for a few sentences about its next move, and _then_ provide its answer at the _end_ of the response. This will fill the context window with helpful information to "steer" the final output tokens where it actually makes a decision. (You can then discard everything except the move at the end; see the sketch after this list.)
(From Jeremy Howard's popular custom instruction https://gist.github.com/siddharthsarda/c58557e21a3bc8aeddf6b2cddc1b325a : "Since you are autoregressive, each token you produce is another opportunity to use computation, therefore you always spend a few sentences explaining background context, assumptions, and step-by-step thinking BEFORE you try to answer a question.")
2. I'm guessing you're running the Q5_K_M variants of local models (5-bit quantization with K-means clustering) to fit them on your computer, but that does degrade the capability of these models a bit. If you can get your hands on a box with 96+ GB of RAM (even if you have to run on CPU instead of GPU), I'd expect 8-bit precision should perform better (though I couldn't tell you if the result will be non-terrible).
3. What about Llama-3.1-Nemotron-70B-Instruct? Ostensibly it's the only weights-available (I hesitate to say "open") LLM that will reliably give correct answers to questions like "How many r in strawberry?", by counting up the letters one by one.
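To make point 1 concrete, here's a minimal sketch of the "reason first, move last" shape against the OpenAI chat API. The prompt wording, the "MOVE:" convention, and the regex that pulls the move off the last line are all placeholders, not anything from the post.

```python
import re
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

# Hypothetical prompt; the "MOVE:" convention just makes the final answer easy to find.
COT_PROMPT = """Here is a chess game in PGN:

{pgn}

It is your move. First, briefly describe the board state and your reasoning.
Then, on the final line, write only your chosen move in SAN, prefixed with
"MOVE: " (for example, "MOVE: Nf6")."""

def reasoned_move(pgn: str, model: str = "gpt-4o-mini") -> str | None:
    """Let the model 'think out loud', then pull the move off the last line."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": COT_PROMPT.format(pgn=pgn)}],
    )
    text = resp.choices[0].message.content
    match = re.search(r"MOVE:\s*([A-Za-z0-9+#=O-]+)\s*$", text)
    return match.group(1) if match else None  # discard everything but the move
```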
1. That's what I tried first. But I was told to try it this way instead, and indeed, it works much better this way. There are lots of people who've tried this with chess and AFAIK they've all found the same thing. Of course, there's a real art to this kind of thing, and it's certainly possible some other trick would work better. But I have tried. (Also, isn't o1 supposed to do this automatically?)
I actually have Nemotron ready to go—I'll have it play a game and report back!
Nemotron rekt:
https://lichess.org/uDpZMebp
https://lichess.org/EkvB1Ftu
https://lichess.org/GzIFwd21
As someone else mentioned, the closed models could be compromised by an inability to produce valid moves; I'm curious what the failure rate was.
If you didn't know, 4o now supports constrained generation with a feature called Structured Outputs, where you encode the constraint as a JSON schema. https://openai.com/index/introducing-structured-outputs-in-the-api/
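If anyone wants to try that for chess, here's a rough, untested sketch of what it might look like. The response_format shape follows my reading of the Structured Outputs docs, so treat it as an assumption, and note the enum of legal moves has to be rebuilt every turn since the legal set changes.

```python
import json
import chess
from openai import OpenAI

client = OpenAI()

def constrained_move(board: chess.Board, pgn: str,
                     model: str = "gpt-4o-2024-08-06") -> str:
    """Constrain the model to the currently legal moves via Structured Outputs.

    The schema is rebuilt every call because the set of legal moves changes.
    (response_format shape is my reading of the Structured Outputs docs.)
    """
    legal_san = [board.san(m) for m in board.legal_moves]
    schema = {
        "type": "object",
        "properties": {"move": {"type": "string", "enum": legal_san}},
        "required": ["move"],
        "additionalProperties": False,
    }
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Here is a chess game in PGN:\n{pgn}\nChoose your next move."}],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "chess_move", "strict": True, "schema": schema},
        },
    )
    return json.loads(resp.choices[0].message.content)["move"]
```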
Theory 5: The people behind gpt-3.5-turbo-instruct noticed that people were excited about LLMs playing chess, so they _built in a redirect to a chess engine_ when it identifies a chess game, in the belief that more hype about how smart LLMs are would be good for business.
Too cynical?
You could actually test this theory: you simply feed the game to a chess engine, and look at how frequently the moves proposed by the LLM match those proposed by the chess engine. This is the same way cheating by human players is often detected. To quote chess.com:
"If the game is entered into an engine with analysis, and the opponents choice has little deviation from the top recommended moves, it leaves one to wonder. Human players can find the computer best move, and certainly the stronger the player, the closer it will often be. But continually hitting the computer recommended best move is a clue. Especially telling but harder to define are nuisance moves that have a computer sense without any threat or underlying reason for the move."
(https://www.chess.com/blog/uptophigh/how-to-tell-if-someone-is-cheating-at-chess-com)
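For what it's worth, here's a rough sketch of that match-rate check using python-chess and a local Stockfish binary. The engine path, search depth, and exact-match criterion are arbitrary choices here, not anything from the post.

```python
import chess
import chess.engine
import chess.pgn

def engine_match_rate(pgn_path: str, color: chess.Color, depth: int = 15,
                      stockfish_path: str = "stockfish") -> float:
    """Fraction of one side's moves that coincide with Stockfish's top choice."""
    with open(pgn_path) as f:
        game = chess.pgn.read_game(f)
    engine = chess.engine.SimpleEngine.popen_uci(stockfish_path)
    board = game.board()
    matches, total = 0, 0
    try:
        for move in game.mainline_moves():
            if board.turn == color:
                best = engine.play(board, chess.engine.Limit(depth=depth)).move
                matches += int(move == best)
                total += 1
            board.push(move)
    finally:
        engine.quit()
    return matches / total if total else 0.0
```

In practice you'd probably also want to count the engine's top two or three moves as matches, the way cheat-detection heuristics do, but exact matches are enough for a first pass.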
I think this is extremely unlikely, but not because it's too cynical. I am pro cynicism! https://dynomight.substack.com/p/chess/comment/77190852
Those are valid points, although the cynic in me says that the answer to (2) is "they realized people were onto them" ;-)
Mind you, although the points you make argue strongly against a redirect to a chess engine, could it still be a redirect to a classification model like the ones you mentioned at https://github.com/sgrvinod/chess-transformers? Those are pretty good models, but they get crushed by Stockfish on anything from level 3 upwards, depending on the data set. That would be consistent with an ELO of around 1800, as you said in your linked comment.
Basically, I am profoundly skeptical of any claims of "emergent" capabilities in LLMs purely from throwing lots of training data at them, and really want to see very strong evidence. Especially given that gpt-3.5-turbo-instruct is so _very_ different in performance from its cousins.
Everyone is missing the real reason for 3.5's surprising performance: an OpenAI employee was working on it in public a while ago, doing fine-tuning with collected data. The obvious next step would've been to throw it into the SFT data pile.
https://twitter.com/willdepue/status/1746384311081930790
My read of that tweet is that you can do fine-tuning on 3.5 after it's been tuned for chatting, and that 100 examples raise 3.5 in chat mode to the performance of 3.5-instruct. This is definitely helpful information, but it doesn't look like the fine-tuning improves over instruct?
I took a look at the transformer models you linked to as being good at chess (and they are indeed pretty good), and I found the interesting descriptions below of the various models. For readers not closely familiar with the jargon: these are transformers, but NOT in the GenAI sense of LLMs; they're closer to classification AI. This project has turned chess prediction into either a classification task or a labeling task, things that we already know transformers can be good at, NOT a generation ("text prediction") task like an LLM.
In other words, these models are good at spotting good chess moves in much the same way that other AIs are good at spotting tumors on medical images.
And in other, other words, no amount of chess games ingested by an LLM will teach it to play chess the way these transformer models play chess.
Here are the relevant quotes from GitHub:
1. "This is essentially a sequence (or image) classification task, where the sequence is the current state of the board, and the classes are the various moves that can be made on a chessboard in UCI notation."
2. "This is essentially a sequence (or image) labeling task, where the sequence is the current state of the chessboard, and each square competes to be labeled as the From or To square."
3. "This is essentially a sequence-to-sequence (or image-to-sequence) task, where the input sequence is the current state of the board, and the output sequence is a string of half-moves that will likely occur on the board from that point onwards."
4. "This is essentially a sequence (or image) labeling task, where the sequence is the current state of the chessboard, and each square competes to be labeled as the From or To square."
Thank you, this is very helpful. I was definitely misinterpreting those results! (I thought they were "normal" LLMs)
There are a few other experiments where people I THINK are training normal-ish LLMs with small numbers of parameters and still getting pretty decent results? E.g.
https://adamkarvonen.github.io/machine_learning/2024/01/03/chess-world-models.html
Any thoughts on that?
I will definitely take a look later. Right now I have to go and hate-watch a soccer game while doing the ironing, just to prove I am still capable of feats no AI can match.
p.s. In fairness to LLMs, I should also add that my critique boils down to "I don't believe that LLMs are any good at things LLMs were not designed to be good at, and I will be really impressed if it turns out that they unexpectedly are" :-)
You might be interested in this chess transformer from Google DeepMind: https://github.com/google-deepmind/searchless_chess
A recent article showed that a trip-planning AI was brittle when it was trained only on routes that an expert driver would choose. When faced with road closings, it hallucinated weird road connections. However, the authors claim that when an AI is trained on the results of random moves, it will better cover the problem space.
The problem could be that the chess games, vacuumed up in training data, did not include sequences with bad moves.
OOOh. Juicy. Thank you.
Very interesting!