OK, turns out I originally screwed up the examples and used one example with an illegal move! (Yet the examples were helpful anyway? What?)
Anyway, I've re-run those experiments and updated the figures now.
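For anyone doing something similar: the simplest guard is to validate every few-shot example with a chess library before it goes into the prompt. A minimal sketch using python-chess (the move lists here are just illustrations):

```python
# Sketch: check that a sequence of SAN moves is a legal game continuation
# before using it as a few-shot example in a prompt.
import chess

def is_legal_line(san_moves):
    board = chess.Board()
    for san in san_moves:
        try:
            board.push_san(san)  # raises a ValueError subclass on illegal SAN
        except ValueError:
            return False
    return True

print(is_legal_line(["e4", "e5", "Nf3", "Nc6"]))  # True
print(is_legal_line(["e4", "e4"]))                # False: Black has no pawn move to e4
```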
great article
Fascinating!
> It’s ridiculously hard to find the optimal combination of prompts and examples and fine-tuning, etc. It’s a very large space, there are no easy abstractions to allow you to search through the space, LLMs are unpredictable and fragile, and these experiments are slow and expensive. ... In many ways, this feels less like engineering and more like a search for spells.
My understanding is that this characterizes ~all current work in LLMs, very much including the development of the models themselves. Lots of fiddly experiments at intermediate scale followed by a big training run and hope-for-the-best. Terms like "alchemy" being thrown around.
"Recent base OpenAI models would be excellent at chess (in completion mode, if we could access them). But the chat models that we actually get access to aren’t"
Yeah, it could be the maiming from the chat fine-tuning they do, or it could be something else opaque about the architecture. Like, I always do wonder what part of my input actually gets passed to the model. There could be several layers dissecting what I write and reshaping it, then the tokenizer, then the (fine-tuned) model, and then back out through god knows how many safety-check layers.
Fascinating stuff. I'm looking into using LLMs to create a chess coach that can explain moves, so this is all solid gold for me.
You've probably already seen this, but if you ever need a source of expert-level games, you can download literally billions of games from the lichess database. You could then easily filter these for any rating level.
https://database.lichess.org/
This could let someone check how good a model can get relative to the examples - if you have 1200-1800 games, will you get 1800 or 1200 or neither?
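To make the filtering concrete, here's a rough sketch using the python-chess library; the file path and rating band are placeholders, and the real dumps are zstd-compressed `.pgn.zst` files you'd decompress (or stream-decompress) first:

```python
# Sketch: stream games from a (decompressed) lichess PGN dump and keep only
# those where both players fall inside a target rating band.
import chess.pgn

MIN_ELO, MAX_ELO = 1200, 1800  # example band from the comment above

def games_in_band(path):
    with open(path, encoding="utf-8") as f:
        while True:
            game = chess.pgn.read_game(f)
            if game is None:          # end of file
                return
            try:
                white = int(game.headers.get("WhiteElo", "0"))
                black = int(game.headers.get("BlackElo", "0"))
            except ValueError:        # unrated games use "?" for the Elo
                continue
            if MIN_ELO <= min(white, black) and max(white, black) <= MAX_ELO:
                yield game

# Example path, not a real filename guarantee.
for game in games_in_band("lichess_db_standard_rated_2024-01.pgn"):
    print(game.headers["White"], "vs", game.headers["Black"])
    break
```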
You might be interested in this paper: https://arxiv.org/abs/2406.11741
"search for spells" -- yes!
I play chess at the 1600-1799 level and my training is light on theory and study and heavy on experience.
1800 is very much a watershed in my opinion. It's the point past where basic tactics and pattern matching can take you, and you need a fuller awareness of the game.
This rhymes with models reaching a relatively high level of written rhetoric and knowledge in many areas, but not getting to expert levels in any of them.
I would guess that models are upper-bounded at 1800, not lower-bounded by high quality player data.
That's what people say at every level once they reach their ceiling, whether it's 1000, 1500, or 2000.
The right answer *is* oolong.
Nice article! The regurgitation trick was pretty neat.
I'll note that my paper “Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models” was on a 25M parameter LLM trained from scratch to predict the next character in chess PGN strings, not on GPT-3.5-turbo-instruct (as I don't have weights access to that model).
Notably, a 25M parameter model is several orders of magnitude smaller than modern LLMs, yet it still learns to play chess well.
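For readers unfamiliar with the setup: character-level next-token prediction over PGN means each position in the string is an input and the following character is the target. A toy illustration (not the paper's actual code; the string and vocabulary are just examples):

```python
# Toy sketch of the character-level setup: each position in a PGN string
# is an input, and the target is simply the next character.
pgn = "1.e4 e5 2.Nf3 Nc6 3.Bb5 a6"

vocab = sorted(set(pgn))                  # tiny character vocabulary
stoi = {ch: i for i, ch in enumerate(vocab)}

# (input character, target character) pairs the model trains on
pairs = [(pgn[i], pgn[i + 1]) for i in range(len(pgn) - 1)]
ids = [(stoi[a], stoi[b]) for a, b in pairs]

print(pairs[:5])  # [('1', '.'), ('.', 'e'), ('e', '4'), ('4', ' '), (' ', 'e')]
```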
Maybe a dumb question, but how do your LLMs know that you want them to win? I.e., to prefer moves that lead to winning? Every published game, except for the occasional draw, has a sequence of moves that led to a win AND a sequence of moves that led to a loss. And when grandmasters play, they usually play other grandmasters, i.e., their games offer comparably ranked examples of both winning and losing. Even if the model was trained to distinguish the winner’s moves from the loser’s, how does it know that you want it to play ‘like the winners’?
I think this is a very fair question. Let me rephrase it this way: Why *does* gpt-3.5-turbo-instruct win against Stockfish on level 1? My intuition is that in the training data, players overwhelmingly do sensible things, and if one player makes a huge blunder, they usually lose. Since Stockfish on level 1 isn't that smart, it blunders more than players typically do in the training data, which leads to gpt-3.5-turbo-instruct usually winning. So I think it *doesn't* really know that it's supposed to win, but being trained on high-quality games has that effect anyway.
Perhaps you could encourage a model to win by saying something like: "You are a chess grandmaster playing against an international master." I.e., both will be playing very well (might be worth adjusting the gap for Stockfish level 1, idk), but one is significantly more likely to win.
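A rough sketch of that framing against the completions endpoint, assuming the openai Python client; the prompt wording, ratings, and moves are all made up, and whether the framing actually buys playing strength over plain PGN would need to be measured:

```python
# Sketch: frame the completion so the model continues play as the stronger side.
# Prompt wording, names, ratings, and moves are illustrative, not tested values.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

game_so_far = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4."  # White to move next

prompt = (
    "You are a chess grandmaster playing White against an international master.\n"
    '[WhiteElo "2750"]\n'
    '[BlackElo "2450"]\n\n'
    + game_so_far
)

resp = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=8,    # just enough tokens for one move
    temperature=0.0, # greedy: take the single most likely continuation
)
print(resp.choices[0].text.strip())  # e.g. "Ba4" (illustrative)
```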