Update: OK, I actually think I've figured out what's causing this. I'll explain in a future post, but in the meantime, here's a hint: I think NO ONE has hit on the correct explanation!
Follow-up post here: https://dynomight.net/more-chess/
I heard somewhere (some recent podcast?) that gpt-3.5-turbo-instruct was specifically trained on millions of chess games? I wish I could cite which podcast I heard that on. Perhaps Dwarkesh interviewing Gwern?
I asked SearchGPT this question and the only thing it could point me to on this topic is https://adamkarvonen.github.io/machine_learning/2024/01/03/chess-world-models.html
Gwern has pointed me towards this paper, which gives some details on the GPT-4 dataset. It's the closest thing I could find. https://arxiv.org/pdf/2312.09390#page=29
Looking forward to it, I can't even begin to guess outside of the theories I've read.
Q: Would an LLM be "smart" if it were really good at chess, except when the position was presented with certain spacing?
Possibly the biggest problem with AI is that our extrapolations of what it should be able to do, based on what we see it actually do, are almost completely invalid.
I have seen on multiple occasions that LLMs regularly try to make illegal moves when playing chess. How did you handle this?
https://dynomight.net/chess/#details
thanks
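For anyone who doesn't want to click through: the usual pattern is to validate the model's output against the legal moves and retry. Here's a minimal sketch with python-chess (not necessarily the post's exact method; `ask_model` is a hypothetical stand-in for whatever API call you use):

```python
import chess
import random

def try_push_san(board: chess.Board, san: str) -> bool:
    """Push the move if it parses and is legal in this position."""
    try:
        board.push_san(san.strip())
        return True
    except ValueError:  # illegal or unparseable move text
        return False

def play_model_move(board: chess.Board, ask_model, max_retries: int = 3):
    # ask_model(board) -> SAN string; hypothetical model wrapper.
    for _ in range(max_retries):
        if try_push_san(board, ask_model(board)):
            return
    # Give up and keep the game going with a random legal move.
    board.push(random.choice(list(board.legal_moves)))
```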
fwiw, a year ago, I, along with many others, tried a challenge: make GPT-3.5 or GPT-4 solve an easy sudoku.
Here you can see the final version of my prompt - https://github.com/istinetz/sudoku/blob/main/Sudoku%20gpt%20v5.ipynb.
As you can see, I went to frankly insane lengths trying to make GPT solve sudoku, and my conclusion was that it was impossible (with those models). The feeling was like trying to build a tower out of mud. The model might get a few numbers correct in a row (I think the record was ~10 in a row), but it would eventually make a very, very stupid mistake and fall apart.
I assume it's the same with chess - you can get marginal improvements with various tricks, but ultimately, logical chains are too hard without some kind of error correction.
I'm very interested to find out what you think the issue was!
> The feeling was like trying to build a tower out of mud.
This is a VERY good description of what it feels like to mess around with different prompting strategies! It's like programming against some esoteric API where there's no spec and nothing makes sense, but sometimes there seem to be function calls that do amazing things when all the arguments are tweaked just the right way...
Super interesting! I remember this whole subject... There was a French YouTuber who made a great video on it (though more from the perspective of what this teaches us philosophically about LLMs): https://youtu.be/6D1XIbkm4JE?feature=shared
He links to quite a few articles in the description as well.
In particular, this one https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/ (whose author the YouTuber interviews in the video) runs a lot of experiments very similar to yours and looks at lots of different prompting techniques. He would probably be interested in your post and the one you teased in the comments!
LLMs shouldn't be able to play a good chess game past the opening. It makes no sense at all to expect that.
The results confirm that, and knowing OpenAI's usual approach to ethics and transparency, I'd bet money on the bizarre results from gpt-3.5-turbo-instruct being due to cheating, with the model using function calling to a chess engine.
How did you handle illegal moves? I refuse to believe that didn't happen
https://dynomight.net/chess/#details
Good timing with a new large collaborative study that looked into the implicit world modeling of various transformers: https://news.mit.edu/2024/generative-ai-lacks-coherent-world-understanding-1105
I'd suggest reading through the arXiv preprint to get some new ideas about how to approach this with chess, particularly their use of deterministic finite automata (DFAs). Given their results, I would be skeptical that LLMs are building coherent models of chess.
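(For readers new to the jargon: a DFA is just a set of states plus a transition function, and the paper asks whether a model implicitly tracks the true state. A toy example, nothing chess-specific:)

```python
def run_dfa(transitions, start, symbols):
    """Follow transitions[(state, symbol)] and return the final state."""
    state = start
    for s in symbols:
        state = transitions[(state, s)]
    return state

# Two-state DFA tracking the parity of 'a's seen so far.
T = {("even", "a"): "odd",  ("odd", "a"): "even",
     ("even", "b"): "even", ("odd", "b"): "odd"}
print(run_dfa(T, "even", "aabab"))  # -> 'odd'
```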
BPE tokenisation is so, so stupid. It kind of works, but it gives rise to so many errors that make LLMs look much stupider than they are, e.g. the strawberry examples. Gwern has a good post on BPE.
Is this said post? https://gwern.net/gpt-3#bpes
Yes, thanks for linking. That section has a subsection on ‘fixing BPEs’.
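If you want to see the problem concretely, tiktoken makes it easy to check how spacing changes the way a PGN fragment gets split (cl100k_base shown here; I'm assuming the model you care about uses a similar vocabulary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["1. e4 e5 2. Nf3", "1.e4 e5 2.Nf3"]:
    # Decode each token id individually to see the split.
    pieces = [enc.decode([t]) for t in enc.encode(text)]
    print(repr(text), "->", pieces)
```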
Great post! I’ve also been experimenting with Chess and LLMs but have taken a slightly different approach. Rather than using the LLM as an opponent, I’ve implemented it as a chess tutor to provide feedback on both the user’s and the bot’s moves throughout the game.
Unlike your setup, I opted to input the FEN of the current board plus the subsequent move in standard algebraic notation when requesting feedback, as I found these inputs clearer for the LLM than giving the PGN of the game. Did you try that yourself?
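The idea, as a simplified python-chess sketch (not my production code, and the prompt wording is just illustrative):

```python
import chess

# Replay the game so far, then prompt with FEN + the move under review.
moves = ["e4", "e5", "Nf3", "Nc6", "Bb5"]
board = chess.Board()
for san in moves:
    board.push_san(san)

candidate = "a6"  # the move we want feedback on
prompt = (
    f"Position (FEN): {board.fen()}\n"
    f"Move played: {candidate}\n"
    "Give brief feedback on this move."
)
```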
Btw, the responses vary with the user’s chess level; some find the feedback useful, while others do not. To address this, I’ve integrated a like, dislike, and request new feedback feature into the app, allowing users to actively seek better feedback.
I’m continually exploring ways to enhance the quality of the feedback. I’d appreciate it if you could try it out and share your thoughts.
AI Chess GPT
https://apps.apple.com/tr/app/ai-chess-gpt/id6476107978
https://play.google.com/store/apps/details?id=net.padma.app.chess
Thanks.
One of the key findings I remember from ~a year ago was that the models did *much* better when given a representation of the current board state rather than just a list of moves. I think this is especially important if you're not giving them any chain-of-thought output space before answering (which they could use to build a representation of the current state).
Just as if you asked a human player to tell you their next move after reading a series of moves at them and only gave them a couple of seconds and no paper to write on; you'd be testing something a bit different than actual skill at chess.
I'm really interested to see what happens with gpt4 (the base model) and full o1 if you give them board states. I might have a go at this in the next few days.
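If anyone else wants to try first, here's a minimal sketch of what I mean by "giving them board states" (python-chess prints an ASCII diagram; the exact prompt wording is just a guess at something reasonable):

```python
import chess

moves = ["d4", "Nf6", "c4", "e6", "Nc3", "Bb4"]
board = chess.Board()
for san in moves:
    board.push_san(san)

side = "White" if board.turn == chess.WHITE else "Black"
prompt = (
    "Current position ('.' = empty square, uppercase = White):\n"
    f"{board}\n"
    f"{side} to move. Reply with a single move in SAN."
)
print(prompt)
```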
What dataset did you use for the initial board states?
What do you think of other explorations along these lines, like this one? https://github.com/kagisearch/llm-chess-puzzles
I could have sworn I saw some discussion of LLMs reaching Elo ratings over 2000 at some point, but some cursory web searching fails to turn it up.
A recent article showed that a trip-planning AI was brittle when it was trained only on routes that an expert driver would choose. When faced with road closings, it hallucinated weird road connections. However, the authors claim that when an AI is trained on the results of random moves, it will better cover the problem space.
The problem could be that the chess games, vacuumed up in training data, did not include sequences with bad moves.
You might be interested in this chess transformer from Google DeepMind: https://github.com/google-deepmind/searchless_chess
I took a look at the transformer models you linked to as being good at chess (and they are indeed pretty good). I found the interesting descriptions below of the various transformers. For those readers not closely familiar with the jargon, what this means is that these are transformers, but NOT in the sense of GenAI, like LLMs, but more in the sense of classification AI. This project has turned chess prediction into either a classification task or a labeling task, things that we already know transformers can be good at, NOT a generation ("text prediction") task like an LLM.
In other words, these models are good at spotting good chess moves in much the same way that other AIs are good at spotting tumors on medical images.
And in other, other words, no amount of chess games ingested by an LLM will teach it to play chess the way these transformer models play chess.
Here are the relevant quotes from GitHub:
1. "This is essentially a sequence (or image) classification task, where the sequence is the current state of the board, and the classes are the various moves that can be made on a chessboard in UCI notation."
2. "This is essentially a sequence (or image) labeling task, where the sequence is the current state of the chessboard, and each square competes to be labeled as the From or To square."
3. "This is essentially a sequence-to-sequence (or image-to-sequence) task, where the input sequence is the current state of the board, and the output sequence is a string of half-moves that will likely occur on the board from that point onwards."
4. "This is essentially a sequence (or image) labeling task, where the sequence is the current state of the chessboard, and each square competes to be labeled as the From or To square."
Thank you, this is very helpful. I was definitely misinterpreting those results! (I thought they were "normal" LLMs)
There are a few other experiments where people I THINK are training normal-ish LLMs with small numbers of parameters and still getting pretty decent results? E.g.
https://adamkarvonen.github.io/machine_learning/2024/01/03/chess-world-models.html
Any thoughts on that?
p.s. In fairness to LLMs, I should also add that my critique boils down to "I don't believe that LLMs are any good at things they were not designed to be good at, and I will be really impressed if it turns out that they unexpectedly are" :-)
I will definitely take a look later. Right now I have to go and hate-watch a soccer game while doing the ironing, just to prove I am still capable of feats no AI can match.
Everyone is missing the real reason for 3.5's surprising performance: an OpenAI employee was working on this in public a while ago, fine-tuning with collected data. The obvious next step would've been to throw it into the SFT data pile.
https://twitter.com/willdepue/status/1746384311081930790
My read of that tweet is that you can fine-tune 3.5 after it has been tuned for chatting, and that ~100 examples raise 3.5 in chat mode to the performance of 3.5-instruct. This is definitely helpful information, but it doesn't look like the fine-tuning improves over instruct?