OK, turns out I originally screwed up the examples and used one example with an illegal move! (Yet the examples were helpful anyway? What?)
Anyway, I've re-run those experiments and updated the figures now.
This is not that mysterious. There have been papers written about this effect (that giving wrong examples still helps). The reason seems to be that it isn't really learning much new from the examples, it is just using the examples to comfortably settle into a mode where it will produce output in the right format.
Can the AI accurately describe the board state from the PGN? If yes, then I would suggest having it describe the board state and then hold a self-conversation about candidate moves before picking one. Let me know if you need help with this, I work with AI and games all day.
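For what it's worth, a rough sketch of what that describe-then-choose loop could look like, assuming python-chess and the OpenAI Python client (the model name and prompt wording are placeholders, not anything from the post):

```python
# Rough sketch: replay the PGN with python-chess, ask the model to describe
# the position, then ask it to weigh candidate moves and commit to one.
# Model name and prompts are placeholder assumptions.
import io
import chess
import chess.pgn
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"

def pick_move(pgn_text: str) -> str:
    game = chess.pgn.read_game(io.StringIO(pgn_text))
    board = game.board()
    for move in game.mainline_moves():
        board.push(move)
    legal = [board.san(m) for m in board.legal_moves]

    # Step 1: describe the board state.
    description = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": f"Here is a chess game so far:\n{pgn_text}\n\n"
                              "Describe the current position: material, piece "
                              "placement, and immediate threats."}],
    ).choices[0].message.content

    # Step 2: self-conversation about candidates, then a single move.
    answer = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": f"Position description:\n{description}\n\n"
                              f"Legal moves: {', '.join(legal)}\n"
                              "Discuss two or three candidate moves, then give "
                              "your final choice in SAN alone on the last line."}],
    ).choices[0].message.content
    return answer.strip().splitlines()[-1].strip()
```

In practice you would also want to check the returned move against the legal-move list and retry on a miss, since nothing forces the model to stay in that set.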
great article
Fascinating!
> It’s ridiculously hard to find the optimal combination of prompts and examples and fine-tuning, etc. It’s a very large space, there are no easy abstractions to allow you to search through the space, LLMs are unpredictable and fragile, and these experiments are slow and expensive. ... In many ways, this feels less like engineering and more like a search for spells.
My understanding is that this characterizes ~all current work in LLMs, very much including the development of the models themselves. Lots of fiddly experiments at intermediate scale followed by a big training run and hope-for-the-best. Terms like "alchemy" being thrown around.
"Recent base OpenAI models would be excellent at chess (in completion mode, if we could access them). But the chat models that we actually get access to aren’t"
Yeah, it could be the maiming done by chat fine-tuning, or it could be something else opaque about the architecture. Like, I always do wonder what part of my input actually gets passed to the model. There could be several layers dissecting what I write, reshaping it, then going to the tokenizer, then to the (fine-tuned) model, and then back out through god knows how many safety check layers.
Fascinating stuff. I'm looking into using LLMs to create a chess coach that can explain moves, so this is all solid gold for me.
You've probably already seen this, but if you ever need a source of expert level games, you can download literally billions of games from the lichess database. You could then easily filter these for any rating level.
https://database.lichess.org/
This could let someone check how good a model can get relative to the examples - if you have 1200-1800 games, will you get 1800 or 1200 or neither?
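If anyone wants to try that, here is a minimal sketch of the filtering step with python-chess, assuming an already-decompressed lichess PGN dump (file paths and the rating band are placeholders):

```python
# Minimal sketch: keep only games where both players sit inside a rating
# band. Assumes python-chess and a decompressed lichess PGN dump.
import chess.pgn

def filter_by_rating(in_path, out_path, lo=1200, hi=1800):
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            game = chess.pgn.read_game(src)
            if game is None:                      # end of file
                break
            try:
                white = int(game.headers.get("WhiteElo", "0"))
                black = int(game.headers.get("BlackElo", "0"))
            except ValueError:                    # unrated games use "?"
                continue
            if lo <= white <= hi and lo <= black <= hi:
                print(game, file=dst, end="\n\n")
                kept += 1
    return kept

print(filter_by_rating("lichess_db.pgn", "filtered.pgn"))
```

For the full lichess dumps you would probably want to stream the .zst file rather than decompress it first, but the header check is the same.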
You might be interested in this paper: https://arxiv.org/abs/2406.11741
"search for spells" -- yes!
I play chess at the 1600-1799 level and my training is light on theory and study and heavy on experience.
1800 is very much a watershed in my opinion. It's the point past where basic tactics and pattern matching can take you, and you need a fuller awareness of the game.
This rhymes with models reaching a relatively high level of written rhetoric and knowledge in many areas, but not getting to expert levels in any of them.
I would guess that models are upper-bounded at 1800, not lower-bounded by high quality player data.
That's what people say at every level once they reach their ceiling, whether it's 1000, 1500, or 2000
Haha
I've hit multiple ceilings before, and 1800 is a ceiling I believe I can pass with some practice, though I haven't had the time in the last year
The right answer *is* oolong.
Nice article! The regurgitation trick was pretty neat.
I'll note that my paper “Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models” was on a 25M parameter LLM trained from scratch to predict the next character in chess PGN strings, not on GPT-3.5-turbo-instruct (as I don't have weights access to that model).
Notably, a 25M parameter model is several orders of magnitude smaller than modern LLMs, yet it still learns to play chess well.
Thank you, fixed! (Very nice work by the way.)
Maybe a dumb question, but how do your LLMs know that you want them to win? I.e., to prefer moves that lead to winning? Every published game, except for the occasional draw, has a sequence of moves that led to a win AND a sequence of moves that led to a loss. And when grandmasters play, they usually play grandmasters, i.e. their games offer comparably ranked examples of both winning and losing. Even if the model was trained to distinguish the winner's moves from the loser's, how does it know that you want it to play 'like the winners'?
I think this is a very fair question. Let me rephrase it this way: Why *does* gpt-3.5-turbo-instruct win against Stockfish on level 1? My intuition is that in the training data, players overwhelmingly do sensible things, and if one player makes a huge blunder, they usually lose. Since Stockfish on level 1 isn't that smart, it blunders more than players typically would in the training data, which leads to gpt-3.5-turbo-instruct usually winning. So I think it *doesn't* really know that it's supposed to win, but the fact that it's trained on high-quality games has that effect anyway.
Perhaps you could encourage a model to win by saying something like: "You are a chess grandmaster playing against an international master." I.e., both will be doing very well (might be worth adjusting for Stockfish level 1, idk) but one is significantly more likely to win.
I imagine that with fine tuning, you could change the system prompt to include the information about who the winner is (and then always label the model as the winner when asking it to play).
Generating games where there's a significant Elo difference between the players (which is relatively uncommon in human chess game datasets) might also be fruitful.
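A hypothetical sketch of what one training example might look like under that idea, assuming the OpenAI chat fine-tuning JSONL format, with one move per message and the system prompt always naming the model's side as the winner (the moves shown are placeholders):

```python
# Hypothetical sketch of one fine-tuning example in the OpenAI chat
# fine-tuning JSONL format: the system prompt labels the model's side as
# the eventual winner, one move per message. Moves are placeholders.
import json

def training_example(moves, model_is_white=True):
    side = "white" if model_is_white else "black"
    messages = [{"role": "system",
                 "content": f"You are a chess grandmaster playing {side}. You win this game."}]
    for i, move in enumerate(moves):
        white_to_move = (i % 2 == 0)
        role = "assistant" if white_to_move == model_is_white else "user"
        messages.append({"role": role, "content": move})
    return json.dumps({"messages": messages})

# One JSONL line per game, always written from the winner's perspective.
with open("train.jsonl", "w") as f:
    f.write(training_example(["e4", "e5", "Nf3", "Nc6", "Bb5"]) + "\n")
```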
As someone who played competitive chess about 45 years ago (U15 cadet level, rating <1600), I have to say that even the “excellent” games looked pretty crap. Seriously, the only reasonable conclusion is that LLMs do not, and cannot, play chess well at all. And this is entirely unsurprising.
Did you try with other base models? I haven’t followed closely but surely there’s an open weights base model at the level of gpt-3.5 available?
It might be worth trying the experiment where the prompt is formatted such that each turn of the chess game corresponds to one chat message.
You could also provide an example in the form of a complete game (where again each game turn corresponds to one chat message), separating it from the actual game with something like:
"Let's play another game! You are playing as white this time, to take the first move."
or
"Let's play another game! I am playing as white this time, this is my first move: ..."
Just out of curiosity, did you consider replacing the part of the prompt identifying the LLM as a chess grandmaster with something more precise indicating the expected strength of the player the LLM is impersonating? Assuming most PGNs in the training data will have contained the headers, and the Elo tag pair in particular, I've been wondering if that could potentially steer the prompt outputs towards a desired playing strength or if it has no impact at all.
Well in these comparisons (https://dynomight.net/img/more-chess/gpt-4o-mini-variants.pdf, https://dynomight.net/img/more-chess/gpt-4o-variants.pdf) the "meta data on" variants get a string like this:
[Event "Shamkir Chess"]
[White "Anand, Viswanathan"]
[Black "Topalov, Veselin"]
[Result "1-0"]
[WhiteElo "2779"]
[BlackElo "2740"]
which is supposed to suggest to the AI that it is Viswanathan Anand and has an insane Elo. That seemed to be (strangely) neutral or even slightly harmful. I do agree that with a sufficiently powerful model, it seems like this REALLY should make a difference, at least for base models.
Right, that, and reading out the relative strength of players in the gigantic load of subpar games it must stumble upon online and weighting candidate moves accordingly. On some level it is already pretty amazing for an LLM to figure out how to play from simple PGN accounts of games; I doubt the correlation with the strength of players indicated in the PGN header makes it into the base model at all unless it is a gigantic one. Maybe that information could be specifically fed back via fine-tuning or even examples though (i.e. ditching the full header but keeping the Elo and figuring out a way to teach the model that it correlates with playing strength, which is a general concept it probably already understands well enough). You could then potentially modify the prompt to ask the LLM to play at different Elo strengths and observe whether that ends up being taken into account in any way (and whether there's a natural cap where asking it to play at 1800 or 2200 Elo generates similar outputs). That does sound like some extra work though; I'm mentioning this more out of curiosity than anything. Very nice couple of posts on the topic btw, thanks for the detailed investigations.
Very interesting. I run a small AI-powered nationstates-like game and I have also had the experience that 3.5 is much better for the job of progressing the game than 4o or 4. I'll have to try the examples idea. Sometimes I think 4 is just overthinking what it's asked.
Thanks for this follow up. I have a few architectural comments and some thoughts on prompting.
First, there are some major structural things to keep in mind here that I think would impact how you'd *test* this — since 3.5 came out, OpenAI has moved to an MoE (mixture of experts) model. At least I believe this is one of the GPT-4 architectural changes. Regardless, it impacts your testing.
Here’s why — in an MoE run, the LLM pulls in the prompt, selects the most appropriate ‘expert’ and runs inference on just that expert. There’s lots written online about these experts, but generally they are trained like any LLM — weights of the experts vary and thus their capabilities vary. Open model groups spent a good amount of time in 2023 working on diversification of experts as a way to get more ‘range’ for their LLMs, for instance.
The MoE model adds more *total* weights to the system while not negatively impacting inference speed, because you only load in the expert you need, saving the RAM on GPU for the unused experts.
What does this mean for you? Two things. If you think that chess skills are an emergent behavior of a large number of weights connected up, then GPT-4 is actually an architectural step *backwards* from 3.5, in that they likely use fewer weights per inference run. Second, crucially for replicable testing, the model BATCHES requests into the attention layer FROM ALL API CALLS and then selects ONE expert for that batch. Really. So if your chess query randomly hits a bunch of vapid social media requests, it's likely to be sent in with the social media batch to the expert most suited for social media requests. Which might or might not be the best chess expert. I think you're running enough tests to deal with the variance implied here, but there are some tricks published online to try and get a batch all to yourself; I'm not up to date on the state of the art here.
Second, the other major thing that I believe is impacting you is that in the (very helpful) race to enable longer context windows for input, many LLM architectures have started mucking with the attention layer so that they can take in more data. This mucking has a bunch of public (and probably secret) mechanisms, but fundamentally none of it is a ‘free lunch’: some parts of inputs on modern, longer-context LLMs will be looked at less, and paid less attention to. I think you're seeing this at play when you add context and ask for repeats from the LLM: you are filling more of the context window with “the good stuff”, and the LLM has an easier time selecting what to do in the attention layer as a result. FWIW, 3.5 did not use any of these tricks as far as I know.
Finally, some prompt thoughts. I try and always remember that this architecture was trained and designed to *complete* text sequentially. There are some prompts suggested here and on HN informed by this, but I’m not sure they go far enough. Way back in the dark ages of GPT-3, it was found that just the word difference between “super-intelligent” and “SOTA” (State of the Art) in the prompt for GPT-3 measurably changed its results. To anthropomorphize, it has a theory of mind, and you can trigger that theory of mind very precisely with language. Or hope to.
The sequence *you* want is very high level play resulting in the strongest set of legal moves from the LLM. So, for instance, a prompt: “You’re a high level chess engine with an Elo of 3000” is one way; even then, depending on context or the sequence of events, the LLM is going to be reading intent and history from the moves so far, and pairing that up with the Elo prompt to figure out what’s PLAUSIBLE and LIKELY as the next move, given what’s gone before.
Upshot, I’d suggest you experiment with prompts like: “Bobby Fischer v. Magnus Carlsen, Fischer wins” (whatever the chess version of this prompt is) to start your sequences, then give the engine Fischer’s side. You’ve clearly communicated that you’d like a very specific play style, and a very specific outcome right from the beginning.
A second suggestion that I think is useful is to give the model some space to think, so you might pair this with an ‘interview’ between moves: “Mr. Fischer, what were you thinking about when you played X, and how did your strategy evolve? What was your next move, and why?” Tooling this has a lot of variations but you might just leave one or two in after the sequence of moves depending on context length.
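If it helps, here's a rough sketch of the outcome-declared-up-front framing against the completions endpoint (the header wording is only a guess at "the chess version of this prompt"; the interview variant would interleave question-and-answer text between the moves in the same way):

```python
# Rough sketch: declare players and outcome in PGN headers, then let the
# model complete the winner's next move. Uses the legacy completions
# endpoint with gpt-3.5-turbo-instruct; header wording is an assumption.
from openai import OpenAI

client = OpenAI()

prompt = (
    '[Event "Casual game"]\n'
    '[White "Fischer, Robert J."]\n'
    '[Black "Carlsen, Magnus"]\n'
    '[Result "1-0"]\n'            # the outcome, stated before any moves
    "\n"
    "1. e4 e5 2. Nf3 Nc6 3. "     # game so far; the model plays White's side
)

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=8,
    temperature=0,
)
print(response.choices[0].text)   # the model's continuation for White's third move
```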
All this would be trying to get the best out of the current models, and my gut is that there’s a bit more to be gotten. But I also think that you may be talking to the model with the largest full network weights available simultaneously at OpenAI when you talk to 3.5. If you really want to do more here, it might be interesting to see if the Llama team is interested in working more with you on this; there’s a 400b undertrained model and a 70-80b fairly well trained model to dig in on some of these questions.
Thanks for the write ups, very interesting!
> Second, crucially for replicable testing, the model BATCHES requests into the attention layer FROM ALL API CALLS and then selects ONE expert for that batch. Really.
The expert is selected based on batching with other random API calls from other people!? 😵💫
Regarding your "upshot" suggestion, in these comparisons (https://dynomight.net/img/more-chess/gpt-4o-mini-variants.pdf, https://dynomight.net/img/more-chess/gpt-4o-variants.pdf) the "meta data on" variants get a string like this:
[Event "Shamkir Chess"]
[White "Anand, Viswanathan"]
[Black "Topalov, Veselin"]
[Result "1-0"]
[WhiteElo "2779"]
[BlackElo "2740"]
which is supposed to suggest to the AI that it is Viswanathan Anand and has an insane Elo. That seemed to be (strangely) neutral or even slightly harmful. I do agree that with a sufficiently powerful model, it seems like this REALLY should make a difference, at least for base models.
Re batching, yep. This is why, with later models from OpenAI, temp 0 will return one of a set of 16 or 30 or whatever possibilities — you’ve been batched into a different expert.
If you posit that the LLM might be 1800 but not 2779, yet could *recognize* stronger play, I wonder if it gets confused/upset or decides there’s something weird about the game record. Would be interesting to see what a ‘reach’ Elo would be. Like I said, you could do this forever, and maybe not learn anything more than you have already, though.