42 Comments
dynomight:

OK, turns out I originally screwed up the examples and used one example with an illegal move! (Yet the examples were helpful anyway? What?)

Anyway, I've re-run those experiments and updated the figures now.

John Lewis:

Can the AI accurately describe the board state from the PGN? If yes, then I would suggest having it describe the board state and then hold a conversation with itself about candidate moves before picking one. Let me know if you need help with this; I work with AI and games all day.
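
A minimal sketch of that describe-then-deliberate idea might look like the following (the model name, prompts, and example game are placeholders, not anything from the original experiments):

```python
# Hypothetical sketch: ask the model to describe the position, then to
# deliberate over candidate moves before committing to one.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
pgn = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"  # made-up game so far; White to move

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Step 1: reconstruct the board state from the PGN.
description = ask(
    f"Here is a chess game so far: {pgn}\n"
    "Describe the resulting position: list where every piece stands."
)

# Step 2: deliberate over candidate moves against that description, then pick.
answer = ask(
    f"Game so far: {pgn}\nPosition description: {description}\n"
    "Discuss three candidate moves for White, then give your chosen move "
    "in standard algebraic notation on the final line."
)
print(answer)
```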

Douglas Summers-Stay:

This is not that mysterious. There have been papers written about this effect (that giving wrong examples still helps). The reason seems to be that the model isn't really learning much new from the examples; it is just using them to comfortably settle into a mode where it will produce output in the right format.

Derek Martin:

Not to anthropomorphize too much, but let's pretend gpt-4o is a person who is an artistic free-thinker with ADHD. What kinds of "assignments" might they enjoy?

They would probably like things where they can exercise their creativity.

They would probably NOT like memorizing things or being told exactly what to do.

Being given examples as inspiration for what to do facilitates their creativity.

Being given fine-tuning instructions would constrain their creativity.

Being given a huge list of legal moves would make them feel boxed in, not opened up.

With all of that input they forget what the heck they were even supposed to be doing.

Regurgitation reminds them of what they were thinking, and gets them back on track.

For best results, use regurgitation + examples (aka remind & inspire).

For worst results, make them memorize legal moves and fine-tuning rules.

Does the golden rule apply to LLMs, too?

Are they performing best when we treat them the way we would like to be treated? 🤔

Derek Martin:

Humans can't play chess. All they're doing is predicting the next move (or series of moves) the opponent is likely to make and countering it. 😬

Joshua Ohlman:

Personally I would say that these two statements are contradictory:

> recent chat models can play chess quite well

> When using this type of prompt, I first gave the model ten tries to repeat the whole sequence and then give a legal move at the end. If none of those tries succeeded, I gave it another ten tries to at least produce a legal move after the new turn number, even if it didn’t repeat the whole game perfectly. If that still didn’t succeed, I chose a move at random.
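
For concreteness, that fallback procedure might look roughly like the sketch below, assuming python-chess for legality checks and a stand-in query_model function (this compresses the two ten-try phases into one loop and is not the author's actual code; the random fallback is presumably drawn from the legal moves):

```python
# Rough sketch of the "retry, then random legal move" fallback described above.
import random
import chess

def choose_move(board: chess.Board, prompt: str, query_model, tries: int = 10) -> chess.Move:
    for _ in range(tries):
        reply = query_model(prompt).strip()            # model's raw text
        candidate = reply.split()[0] if reply else ""  # first token, e.g. "Nf3"
        try:
            return board.parse_san(candidate)          # raises if not a legal SAN move
        except ValueError:
            continue                                   # illegal or unparsable: try again
    return random.choice(list(board.legal_moves))      # last resort: random legal move
```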

If you were playing chess against a human and after 20 tries they couldn't produce a legal move, you'd say they don't know how to play chess, no matter how good their randomly chosen legal moves on other occasions might happen to be.

It seems to me like LLMs, which are pattern matching machines, can do well at predicting which of the mere 384 possible moves are most likely to come next in a game of chess, but this doesn't mean they can play chess. I can write down a list of moves that are very likely to occur in a game of chess relative to other moves; for example, Nf5 is more common than Na8, and O-O is more common than Kd5. And with a little more tuning you could improve this a lot, for example by taking into account whose turn it is and which turn number it is.

It seems likely that ChatGPT is playing chess something like this:

It has a list of all chess games played so far.

It can check for similarity between two games.

It can 'read' chess notation.

When you give it a game it finds the n most similar games and chooses a random (weighted?) move from one of them at the position of the current game.

With a sufficiently large data set I don't see why this wouldn't produce results which are apparently meaningful but don't actually take the rules of chess into consideration.
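
Just to make that hypothesis concrete (and to be clear, this is a toy caricature for illustration, not how a transformer actually works), a retrieval-style move picker over a list of PGN strings might look like:

```python
# Toy caricature of the "find similar games, copy a move" hypothesis.
# `games` is assumed to be a list of PGN move strings like "1. e4 e5 2. Nf3 ...".
import difflib
import random

def copycat_move(history: str, games: list[str], n: int = 5) -> str:
    # Rank stored games by how closely their opening text matches the current game.
    ranked = sorted(
        games,
        key=lambda g: difflib.SequenceMatcher(None, history, g[: len(history)]).ratio(),
        reverse=True,
    )
    # From the n closest games, take whatever move follows the shared prefix.
    continuations = []
    for game in ranked[:n]:
        tail = game[len(history):].split()
        if tail:
            continuations.append(tail[0])
    # Choose one of those continuations at random (unweighted, for simplicity).
    return random.choice(continuations) if continuations else ""
```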

Ben:

This is incredible! I remember being mystified about the same exact thing and I’m glad someone capable did this kind of systematic analysis! I also remember when 3.5-turbo-instruct came out and everyone was convinced it was fine tuned specifically for chess or something because of how good it was.

There were a bunch of grifters who claimed they were gonna use it to make a chess app that could beat stockfish (like 2700 elo stockfish) and I made a ton of fake money on Manifold betting that it wouldn’t work.

Brecht Savelkoul:

Gotta admit I'm struggling with the dissonance between these two statements: "LLMs can definitely play chess, and we need to make peace with this", and "LLMs sometimes struggle to give legal moves. In these experiments, I try 10 times and if there’s still no legal move, I just pick one at random."

Surely the bare minimum requirement to support the first statement would be that the LLM is able to pick a legal move on the first attempt in the vast majority of cases. From what you've written in the piece it does seem that gpt-3.5-turbo-instruct does reasonably well on this (though no numbers are provided), but reading between the lines I'm under the impression that the other models do not. It would be interesting to see some numbers comparing how capable different models are at picking legal moves. Because simply hacking around the legal moves issue in order to engineer a matchup with Stockfish does feel like it's skipping a rather important step.

dynomight:

There are numbers in the references at the end for how often gpt-3.5-turbo-instruct chooses illegal moves. Check them; it's amazing how successful it is. The main reasons I didn't measure this myself are that (1) I am lazy, (2) it's already been done, and (3) when models don't know how to make legal moves, this overwhelmingly leads to defeat, because in most chess positions a lot of moves are blunders, so I feel like it's being fairly accounted for in loss rates. Choosing a random move is a huge penalty!

Brecht Savelkoul:

Ok, I found the numbers in Mathieu Acher's post. I'll just summarize them here for the benefit of other commenters:

- gpt-3.5-turbo-instruct makes an illegal move in 16% of games. The majority of those can be read charitably as the model basically saying "I resign"; if we accept that reading, it makes a truly illegal move in only 4% of games. Not bad indeed, even though it's still far higher than any human player rated 1000+.

- gpt-4 makes an illegal move in 32% of games. All of those seem actually illegal, with no charitable explanation available.

So there's still a mysterious discrepancy between the models in what I personally consider the most important metric.

(When I say most important, I'm not strictly thinking of chess, but trying to evaluate LLM decision making in general. Ignoring the laws of the game in chess is fairly innocuous compared to ignoring the laws of nature in real-life decision making.)

Felix Neumann:

I'm under the impression that we're seeing the randomized nature of LLMs here. So instead of saying "ChatGPT 4o can play chess", it might be more accurate to say "something in ChatGPT 4o can play chess", and the problem then becomes how to steer the LLM toward those specific areas/paths.

This is reinforced by another comment here on Substack saying that in "Mixture of Experts" LLMs like ChatGPT 4o, your prompt might actually get executed on a non-ideal expert model (for efficiency reasons). The comment even suggested that this happens with a pretty high probability.

Matthew Farrell:

There's a better way to get it to regurgitate the whole sequence correctly: you can add an "assistant" message at the end, after your "user" prompt, with the beginning of the assistant response, and it will pick up where that message ends. So you can use that to "prime" the LLM with the exact move history, and it should just reply with the next move. This is documentation from Anthropic, but it works with ChatGPT:

https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prefill-claudes-response
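
For reference, the prefill trick with the Anthropic API looks roughly like this sketch (the model name and game are placeholders):

```python
# Sketch of "prefilling" the assistant turn so the model continues the PGN
# instead of starting a fresh reply.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
pgn_so_far = "1. e4 e5 2. Nf3 Nc6 3. Bb5"  # made-up game; Black to move

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder model name
    max_tokens=8,
    messages=[
        {"role": "user", "content": "We are playing chess. Continue the game record."},
        # Ending with an assistant message makes the model pick up exactly here.
        {"role": "assistant", "content": pgn_so_far},
    ],
)
print(response.content[0].text)  # the continuation, e.g. " a6 4. Ba4"
```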

dynomight:

I think that using this kind of prefill of the agent response is an excellent idea. But to the best of my knowledge, the OpenAI API does not offer exactly this functionality. (You could sorta kinda hack something using their JSON grammar stuff maybe?)

Matthew Farrell:

You're right; I did some tests and they are doing something else. Sometimes it partially works, but not really. Have you tried Anthropic or Google models? They seem to let you really initialize the response with a prefix.

PV:

Thanks for this follow up. I have a few architectural comments and some thoughts on prompting.

First, there are some major structural things to keep in mind here that I think would impact how you'd *test* this: since 3.5 came out, OpenAI has moved to an MoE (mixture of experts) model. At least I believe this is one of the GPT-4 architectural changes. Regardless, it impacts your testing.

Here’s why — in an MoE run, the LLM pulls in the prompt, selects the most appropriate ‘expert’ and runs inference on just that expert. There’s lots written online about these experts, but generally they are trained like any LLM — weights of the experts vary and thus their capabilities vary. Open model groups spent a good amount of time in 2023 working on diversification of experts as a way to get more ‘range’ for their LLMs, for instance.

The MoE model adds more *total* weights to the system while not negatively impacting inference speed, because you only load in the expert you need, saving the RAM on GPU for the unused experts.

What does this mean for you? Two things. If you think that chess skills are an emergent behavior of a large number of weights connected up, then GPT-4 is actually an architectural step *backwards* from 3.5, in that it likely uses fewer weights per inference run. Second, crucially for replicable testing, the model BATCHES requests into the attention layer FROM ALL API CALLS and then selects ONE expert for that batch. Really. So if your chess query randomly hits a bunch of vapid social media requests, it's likely to be sent in with the social media batch to the expert most suited for social media requests. Which might or might not be the best chess expert. I think you're running enough tests to deal with the variance implied here, but there are some tricks published online to try and get a batch all to yourself; I'm not up to date on the state of the art here.

Second, the other major thing that I believe is impacting you is that in the (very helpful) race to enable longer context windows for input, many LLM architectures have started mucking with the attention layer so that they can take in more data. This mucking has a bunch of public (and probably secret) mechanisms, but fundamentally none of it is a 'free lunch', in that some parts of the input on modern, longer-context LLMs will be looked at less and paid less attention to. I think you're seeing this at play: when you add context and ask for repeats from the LLM, you are filling more of the context window with "the good stuff," and the LLM has an easier time selecting what to do in the attention layer as a result. FWIW, 3.5 did not use any of these tricks as far as I know.

Finally, some prompt thoughts. I try and always remember that this architecture was trained and designed to *complete* text sequentially. There are some prompts suggested here and on HN informed by this, but I’m not sure they go far enough. Way back in the dark ages of GPT-3, it was found that just the word difference between “super-intelligent” and “SOTA” (State of the Art) in the prompt for GPT-3 measurably changed its results. To anthropomorphize, it has a theory of mind, and you can trigger that theory of mind very precisely with language. Or hope to.

The sequence *you* want is very high level play resulting in the strongest set of legal moves from the LLM. So, for instance, a prompt like "You're a high level chess engine with an ELO of 3000" is one way; even then, depending on context or the sequence of events, the LLM is going to be reading intent and history from the moves so far, and pairing that up with the ELO prompts to figure out what's PLAUSIBLE and LIKELY as the next move, given what's gone before.

Upshot, I'd suggest you experiment with prompts like: "Bobby Fischer v. Magnus Carlsen, Fischer wins" (whatever the chess version of this prompt is) to start your sequences, then give the engine Fischer's side. You've clearly communicated that you'd like a very specific play style, and a very specific outcome, right from the beginning.

Second suggestion that I think is useful is to give the model some space to think, so you might pair this with an 'interview' between moves: "Mr. Fischer, what were you thinking about when you played X, and how did your strategy evolve? What was your next move, and why?" Tooling this has a lot of variations, but you might just leave one or two of these in after the sequence of moves, depending on context length.

All this would be trying to get the best out of the current models, and my gut is that there's a bit more to be gotten. But I also think that you may be talking to the model with the largest full network weights available simultaneously at OpenAI when you talk to 3.5. If you really want to do more here, it might be interesting to see if the Llama team is interested in working more with you on this; there's a 400b undertrained model and a 70-80b fairly well trained model to dig in on some of these questions.

Thanks for the write ups, very interesting!

dynomight:

> Second, crucially for replicable testing, the model BATCHES requests into the attention layer FROM ALL API CALLS and then selects ONE expert for that batch. Really.

The expert is selected based on batching with other random API calls from other people!? 😵‍💫

Regarding your "upshot" suggestion, in these comparisons (https://dynomight.net/img/more-chess/gpt-4o-mini-variants.pdf, https://dynomight.net/img/more-chess/gpt-4o-variants.pdf) the "meta data on" variants get a string like this:

[Event "Shamkir Chess"]

[White "Anand, Viswanathan"]

[Black "Topalov, Veselin"]

[Result "1-0"]

[WhiteElo "2779"]

[BlackElo "2740"]

which is supposed to suggest to the AI that it is Viswanathan Anand and has an insane Elo. That seemed to be (strangely) neutral or even slightly harmful. I do agree that with a sufficiently powerful model, it seems like this REALLY should make a difference, at least for base models.

PV:

Re batching, yep. This is why with later models from OpenAI temp 0 will return one of a set of 16 or 30 or whatever possibilities — you’ve been batched into a different expert.

If you posit that the LLM might be 1800 but not 2779, yet could *recognize* stronger play, I wonder if it gets confused/upset/decides that there's something weird about the game record. Would be interesting to see what a 'reach' ELO would be. Like I said, you could do this forever, and maybe not learn anything more than you have already, though.

Louise:

Very interesting. I run a small AI-powered nationstates-like game and I have also had the experience that 3.5 is much better for the job of progressing the game than 4o or 4. I'll have to try the examples idea. Sometimes I think 4 is just overthinking what it's asked.

QP:

Just out of curiosity, did you consider replacing the part of the prompt identifying the LLM as a chess grandmaster with something more precise indicating the expected strength of the player the LLM is impersonating? Assuming most PGNs in the training data will have contained the headers, and the Elo tag pair in particular, I've been wondering if that could potentially steer the prompt outputs towards a desired playing strength or if it has no impact at all.

dynomight:

Well in these comparisons (https://dynomight.net/img/more-chess/gpt-4o-mini-variants.pdf, https://dynomight.net/img/more-chess/gpt-4o-variants.pdf) the "meta data on" variants get a string like this:

[Event "Shamkir Chess"]

[White "Anand, Viswanathan"]

[Black "Topalov, Veselin"]

[Result "1-0"]

[WhiteElo "2779"]

[BlackElo "2740"]

which is supposed to suggest to the AI that it is Viswanathan Anand and has an insane Elo. That seemed to be (strangely) neutral or even slightly harmful. I do agree that with a sufficiently powerful model, it seems like this REALLY should make a difference, at least for base models.

QP:

Right, that and in reading out the relative strength of players from the gigantic load of subpar games it must stumble upon online, and weighing candidate moves accordingly. On some level, it is already pretty amazing for an LLM to figure out how to play from simple PGN accounts of games; I doubt the correlation with the strength of players as indicated in the PGN header makes it anywhere even into the base model unless it is a gigantic one. Maybe that information could be specifically fed back via fine-tuning or even examples, though (i.e. ditching the full header but keeping the Elo and figuring out a way to teach the model that it correlates with playing strength, which is a general concept it probably already understands well enough). You could then potentially modify the prompt to ask the LLM to play at different Elo strengths and observe whether that ends up being taken into account in any way (and whether there's a natural cap where asking it to play at 1800 or 2200 Elo generates similar outputs). That does sound like some extra work though; I'm mentioning this more out of curiosity than anything. Very nice couple of posts on the topic btw, thanks for the detailed investigations.

palmik:

It might be worth trying the experiment where the prompt is formatted such that each turn of the chess game corresponds to one chat message.
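
Something like the following sketch, perhaps (using the OpenAI chat API; the moves and model name are made up, and the layout is the point):

```python
# Sketch: encode each half-move as its own chat message, with the model playing Black.
from openai import OpenAI

client = OpenAI()
moves = ["e4", "e5", "Nf3", "Nc6", "Bb5"]  # made-up game so far

messages = [{
    "role": "system",
    "content": "We are playing chess. I am White, you are Black. "
               "Reply with only your next move in standard algebraic notation.",
}]
for i, move in enumerate(moves):
    role = "user" if i % 2 == 0 else "assistant"  # White's moves are "user" turns
    messages.append({"role": role, "content": move})

resp = client.chat.completions.create(model="gpt-4o", messages=messages)
print(resp.choices[0].message.content)  # hopefully a single move, e.g. "a6"
```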

palmik:

You could also provide an example in the form of a complete game (where again each game turn corresponds to one chat message), separating it from the actual game with something like:

"Let's play another game! You are playing as white this time, to take the first move."

or

"Let's play another game! I am playing as white this time, this is my first move: ..."

Corentin Smith:

Did you try with other base models? I haven’t followed closely but surely there’s an open weights base model at the level of gpt-3.5 available?

Ian Olthof:

As someone who played competitive chess about 45 years ago (U15 cadet level, rating <1600), I have to say that even the “excellent” games looked pretty crap. Seriously, the only reasonable conclusion is that LLMs do not, and cannot, play chess well at all. And this is entirely unsurprising.

Hugh (Spike) McLarty:

Maybe a dumb question, but how do your LLMs know that you want them to win? I.e. to prefer moves that lead to winning? Every published game, except for the occasional draw, has a sequence of moves that led to a win AND a sequence of moves that led to a loss. And when grandmasters play, they usually play grandmasters, i.e. their games offer comparably ranked examples of both winning and losing. Even if the model was trained to distinguish the winner's moves from the loser's, how does it know that you want it to play 'like the winners'?

palmik:

I imagine that with fine tuning, you could change the system prompt to include the information about who the winner is (and then always label the model as the winner when asking it to play).

Generating games where there's significant ELO difference between the players (which is relatively uncommon in human chess game datasets) might also be fruitful.

dynomight:

I think this is a very fair question. Let me rephrase it this way: Why *does* gpt-3.5-turbo-instruct win against Stockfish on level 1? My intuition is that in the training data, players overwhelmingly do sensible things, and if one player makes a huge blunder, they usually lose. Since Stockfish on level 1 isn't that smart, it blunders more than players typically would in the training data, which leads to gpt-3.5-turbo-instruct usually winning. So I think it *doesn't* really know that it's supposed to win, but the fact that it's trained on high-quality games has that effect anyway.

MoltenOak:

Perhaps you could encourage a model to win by saying something like: "You are a chess grandmaster playing against a chess international master." I.e. both will be doing very well (might be worth adjusting for Stockfish level 1, idk), but one is significantly more likely to win.

Adam Karvonen:

Nice article! The regurgitation trick was pretty neat.

I'll note that my paper “Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models” was on a 25M parameter LLM trained from scratch to predict the next character in chess PGN strings, not on GPT-3.5-turbo-instruct (as I don't have weights access to that model).

Notably, a 25M parameter model is several orders of magnitude smaller than modern LLMs, yet it still learns to play chess well.

dynomight:

Thank you, fixed! (Very nice work by the way.)

Jay:

The right answer *is* oolong.
