OK, turns out I originally screwed up the examples and used one example with an illegal move! (Yet the examples were helpful anyway? What?)
Anyway, I've re-run those experiments and updated the figures now.
This is not that mysterious. There have been papers written about this effect (that giving wrong examples still helps). The reason seems to be that it isn't really learning much new from the examples, it is just using the examples to comfortably settle into a mode where it will produce output in the right format.
Can the AI accurately describe the board state from the PGN? If yes, then I would suggest having it describe the board state and then hold a self-conversation about candidate moves before picking one. Let me know if you need help with this, I work with AI and games all day.
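For what it's worth, a rough sketch of what that describe-then-choose loop could look like, assuming python-chess and the OpenAI Python client (the model name and prompt wording are placeholders, not anything from the post):

```python
# Rough sketch: replay the PGN with python-chess, ask the model to describe
# the position, then ask it to weigh candidate moves and commit to one.
# Model name and prompts are placeholder assumptions.
import io
import chess
import chess.pgn
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"

def pick_move(pgn_text: str) -> str:
    game = chess.pgn.read_game(io.StringIO(pgn_text))
    board = game.board()
    for move in game.mainline_moves():
        board.push(move)
    legal = [board.san(m) for m in board.legal_moves]

    # Step 1: describe the board state.
    description = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": f"Here is a chess game so far:\n{pgn_text}\n\n"
                              "Describe the current position: material, piece "
                              "placement, and immediate threats."}],
    ).choices[0].message.content

    # Step 2: self-conversation about candidates, then a single move.
    answer = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": f"Position description:\n{description}\n\n"
                              f"Legal moves: {', '.join(legal)}\n"
                              "Discuss two or three candidate moves, then give "
                              "your final choice in SAN alone on the last line."}],
    ).choices[0].message.content
    return answer.strip().splitlines()[-1].strip()
```

In practice you would also want to check the returned move against the legal-move list and retry on a miss, since nothing forces the model to stay in that set.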
great article
Fascinating!
> It’s ridiculously hard to find the optimal combination of prompts and examples and fine-tuning, etc. It’s a very large space, there are no easy abstractions to allow you to search through the space, LLMs are unpredictable and fragile, and these experiments are slow and expensive. ... In many ways, this feels less like engineering and more like a search for spells.
My understanding is that this characterizes ~all current work in LLMs, very much including the development of the models themselves. Lots of fiddly experiments at intermediate scale followed by a big training run and hope-for-the-best. Terms like "alchemy" being thrown around.
"Recent base OpenAI models would be excellent at chess (in completion mode, if we could access them). But the chat models that we actually get access to aren’t"
Yeah, it could be the maiming done by chat fine-tuning, or it could be something else opaque about the architecture. Like, I always do wonder what part of my input actually gets passed to the model. There could be several layers dissecting what I write, reshaping it, then going to the tokenizer, then to the (fine-tuned) model, and then back out through god knows how many safety check layers.
Fascinating stuff. I'm looking into using LLMs to create a chess coach that can explain moves, so this is all solid gold for me.
You've probably already seen this, but if you ever need a source of expert level games, you can download literally billions of games from the lichess database. You could then easily filter these for any rating level.
https://database.lichess.org/
This could let someone check how good a model can get relative to the examples - if you have 1200-1800 games, will you get 1800 or 1200 or neither?
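If anyone wants to try that, here is a minimal sketch of the filtering step with python-chess, assuming an already-decompressed lichess PGN dump (file paths and the rating band are placeholders):

```python
# Minimal sketch: keep only games where both players sit inside a rating
# band. Assumes python-chess and a decompressed lichess PGN dump.
import chess.pgn

def filter_by_rating(in_path, out_path, lo=1200, hi=1800):
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            game = chess.pgn.read_game(src)
            if game is None:                      # end of file
                break
            try:
                white = int(game.headers.get("WhiteElo", "0"))
                black = int(game.headers.get("BlackElo", "0"))
            except ValueError:                    # unrated games use "?"
                continue
            if lo <= white <= hi and lo <= black <= hi:
                print(game, file=dst, end="\n\n")
                kept += 1
    return kept

print(filter_by_rating("lichess_db.pgn", "filtered.pgn"))
```

For the full lichess dumps you would probably want to stream the .zst file rather than decompress it first, but the header check is the same.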
You might be interested in this paper: https://arxiv.org/abs/2406.11741
"search for spells" -- yes!
I play chess at the 1600-1799 level and my training is light on theory and study and heavy on experience.
1800 is very much a watershed in my opinion. It's the point past where basic tactics and pattern matching can take you, and you need a fuller awareness of the game.
This rhymes with models reaching a relatively high level of written rhetoric and knowledge in many areas, but not getting to expert levels in any of them.
I would guess that models are upper-bounded at 1800, not lower-bounded by high quality player data.
That's what people say at every level once they reach their ceiling, whether it's 1000, 1500, or 2000
Haha
I've hit multiple ceilings before, and 1800 is a ceiling I believe I can pass with some practice, though I haven't had the time in the last year
The right answer *is* oolong.
Nice article! The regurgitation trick was pretty neat.
I'll note that my paper “Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models” was on a 25M parameter LLM trained from scratch to predict the next character in chess PGN strings, not on GPT-3.5-turbo-instruct (as I don't have weights access to that model).
Notably, a 25M parameter model is several orders of magnitude smaller than modern LLMs, yet it still learns to play chess well.
Thank you, fixed! (Very nice work by the way.)
Maybe a dumb question, but how do your LLMs know that you want them to win? I.e., to prefer moves that lead to winning? Every published game, except for the occasional draw, has a sequence of moves that led to a win AND a sequence of moves that led to a loss. And when grandmasters play, they usually play grandmasters, i.e. their games offer comparably ranked examples of both winning and losing. Even if the model was trained to distinguish the winner's moves from the loser's, how does it know that you want it to play 'like the winners'?
I think this is a very fair question. Let me rephrase it this way: Why *does* gpt-3.5-turbo-instruct win against Stockfish on level 1? My intuition is that in the training data, players overwhelmingly do sensible things, and if one player makes a huge blunder, they usually lose. Since Stockfish on level 1 isn't that smart, it blunders more than players typically would in the training data, which leads to gpt-3.5-turbo-instruct usually winning. So I think it *doesn't* really know that it's supposed to win, but the fact that it's trained on high-quality games has that effect anyway.
Perhaps you could encourage a model to win by saying something like: "You are a chess grandmaster playing against an international master." I.e., both will be doing very well (might be worth adjusting for Stockfish level 1, idk) but one is significantly more likely to win.
I imagine that with fine tuning, you could change the system prompt to include the information about who the winner is (and then always label the model as the winner when asking it to play).
Generating games where there's a significant Elo difference between the players (which is relatively uncommon in human chess game datasets) might also be fruitful.
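A hypothetical sketch of what one training example might look like under that idea, assuming the OpenAI chat fine-tuning JSONL format, with one move per message and the system prompt always naming the model's side as the winner (the moves shown are placeholders):

```python
# Hypothetical sketch of one fine-tuning example in the OpenAI chat
# fine-tuning JSONL format: the system prompt labels the model's side as
# the eventual winner, one move per message. Moves are placeholders.
import json

def training_example(moves, model_is_white=True):
    side = "white" if model_is_white else "black"
    messages = [{"role": "system",
                 "content": f"You are a chess grandmaster playing {side}. You win this game."}]
    for i, move in enumerate(moves):
        white_to_move = (i % 2 == 0)
        role = "assistant" if white_to_move == model_is_white else "user"
        messages.append({"role": role, "content": move})
    return json.dumps({"messages": messages})

# One JSONL line per game, always written from the winner's perspective.
with open("train.jsonl", "w") as f:
    f.write(training_example(["e4", "e5", "Nf3", "Nc6", "Bb5"]) + "\n")
```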
As someone who played competitive chess about 45 years ago (U15 cadet level, rating <1600), I have to say that even the “excellent” games looked pretty crap. Seriously, the only reasonable conclusion is that LLMs do not, and cannot, play chess well at all. And this is entirely unsurprising.
Did you try with other base models? I haven’t followed closely but surely there’s an open weights base model at the level of gpt-3.5 available?
It might be worth trying the experiment where the prompt is formatted such that each turn of the chess game corresponds to one chat message.
You could also provide an example in the form of a complete game (where again each game turn corresponds to one chat message), separating it from the actual game with something like:
"Let's play another game! You are playing as white this time, to take the first move."
or
"Let's play another game! I am playing as white this time, this is my first move: ..."
Just out of curiosity, did you consider replacing the part of the prompt identifying the LLM as a chess grandmaster with something more precise indicating the expected strength of the player the LLM is impersonating? Assuming most PGNs in the training data will have contained the headers, and the Elo tag pair in particular, I've been wondering if that could potentially steer the prompt outputs towards a desired playing strength or if it has no impact at all.
Well in these comparisons (https://dynomight.net/img/more-chess/gpt-4o-mini-variants.pdf, https://dynomight.net/img/more-chess/gpt-4o-variants.pdf) the "meta data on" variants get a string like this:
[Event "Shamkir Chess"]
[White "Anand, Viswanathan"]
[Black "Topalov, Veselin"]
[Result "1-0"]
[WhiteElo "2779"]
[BlackElo "2740"]
which is supposed to suggest to the AI that it is Viswanathan Anand and has an insane Elo. That seemed to be (strangely) neutral or even slightly harmful. I do agree that with a sufficiently powerful model, it seems like this REALLY should make a difference, at least for base models.
Right, that, and reading out the relative strength of players in the gigantic load of subpar games it must stumble upon online and weighting candidate moves accordingly. On some level it is already pretty amazing for an LLM to figure out how to play from simple PGN accounts of games; I doubt the correlation with the strength of players indicated in the PGN header makes it into the base model at all unless it is a gigantic one. Maybe that information could be specifically fed back via fine-tuning or even examples though (i.e. ditching the full header but keeping the Elo and figuring out a way to teach the model that it correlates with playing strength, which is a general concept it probably already understands well enough). You could then potentially modify the prompt to ask the LLM to play at different Elo strengths and observe whether that ends up being taken into account in any way (and whether there's a natural cap where asking it to play at 1800 or 2200 Elo generates similar outputs). That does sound like some extra work though; I'm mentioning this more out of curiosity than anything. Very nice couple of posts on the topic btw, thanks for the detailed investigations.
Very interesting. I run a small AI-powered nationstates-like game and I have also had the experience that 3.5 is much better for the job of progressing the game than 4o or 4. I'll have to try the examples idea. Sometimes I think 4 is just overthinking what it's asked.
Thanks for this follow up. I have a few architectural comments and some thoughts on prompting.
First, there are some major structural things to keep in mind here that I think would impact how you'd *test* this — since 3.5 came out, OpenAI has moved to an MoE (mixture of experts) model. At least I believe this is one of the GPT-4 architectural changes. Regardless, it impacts your testing.
Here’s why — in an MoE run, the LLM pulls in the prompt, selects the most appropriate ‘expert’ and runs inference on just that expert. There’s lots written online about these experts, but generally they are trained like any LLM — weights of the experts vary and thus their capabilities vary. Open model groups spent a good amount of time in 2023 working on diversification of experts as a way to get more ‘range’ for their LLMs, for instance.
The MoE model adds more *total* weights to the system while not negatively impacting inference speed, because you only load in the expert you need, saving the RAM on GPU for the unused experts.
What does this mean for you? Two things. If you think that chess skills are an emergent behavior of a large number of weights connected up, then GPT-4 is actually an architectural step *backwards* from 3.5, in that they likely use fewer weights per inference run. Second, crucially for replicable testing, the model BATCHES requests into the attention layer FROM ALL API CALLS and then selects ONE expert for that batch. Really. So if your chess query randomly hits a bunch of vapid social media requests, it's likely to be sent in with the social media batch to the expert most suited for social media requests. Which might or might not be the best chess expert. I think you're running enough tests to deal with the variance implied here, but there are some tricks published online to try and get a batch all to yourself; I'm not up to date on the state of the art here.
Second, the other major thing that I believe is impacting you is that in the (very helpful) race to enable longer context windows for input, many LLM architectures have started mucking with the attention layer so that they can take in more data. This mucking has a bunch of public (and probably secret) mechanisms, but fundamentally none of it is a ‘free lunch’: some parts of inputs on modern, longer-context LLMs will be looked at less, and paid less attention to. I think you're seeing this at play when you add context and ask for repeats from the LLM: you are filling more of the context window with “the good stuff”, and the LLM has an easier time selecting what to do in the attention layer as a result. FWIW, 3.5 did not use any of these tricks as far as I know.
Finally, some prompt thoughts. I try and always remember that this architecture was trained and designed to *complete* text sequentially. There are some prompts suggested here and on HN informed by this, but I’m not sure they go far enough. Way back in the dark ages of GPT-3, it was found that just the word difference between “super-intelligent” and “SOTA” (State of the Art) in the prompt for GPT-3 measurably changed its results. To anthropomorphize, it has a theory of mind, and you can trigger that theory of mind very precisely with language. Or hope to.
The sequence *you* want is very high level play resulting in the strongest set of legal moves from the LLM. So, for instance, a prompt: “You’re a high level chess engine with an Elo of 3000” is one way; even then, depending on context or the sequence of events, the LLM is going to be reading intent and history from the moves so far, and pairing that up with the Elo prompt to figure out what’s PLAUSIBLE and LIKELY as the next move, given what’s gone before.
Upshot, I’d suggest you experiment with prompts like: “Bobby Fischer v. Magnus Carlsen, Fischer wins” (whatever the chess version of this prompt is) to start your sequences, then give the engine Fischer’s side. You’ve clearly communicated that you’d like a very specific play style, and a very specific outcome right from the beginning.
A second suggestion that I think is useful is to give the model some space to think, so you might pair this with an ‘interview’ between moves: “Mr. Fischer, what were you thinking about when you played X, and how did your strategy evolve? What was your next move, and why?” Tooling this has a lot of variations but you might just leave one or two in after the sequence of moves depending on context length.
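If it helps, here's a rough sketch of the outcome-declared-up-front framing against the completions endpoint (the header wording is only a guess at "the chess version of this prompt"; the interview variant would interleave question-and-answer text between the moves in the same way):

```python
# Rough sketch: declare players and outcome in PGN headers, then let the
# model complete the winner's next move. Uses the legacy completions
# endpoint with gpt-3.5-turbo-instruct; header wording is an assumption.
from openai import OpenAI

client = OpenAI()

prompt = (
    '[Event "Casual game"]\n'
    '[White "Fischer, Robert J."]\n'
    '[Black "Carlsen, Magnus"]\n'
    '[Result "1-0"]\n'            # the outcome, stated before any moves
    "\n"
    "1. e4 e5 2. Nf3 Nc6 3. "     # game so far; the model plays White's side
)

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=8,
    temperature=0,
)
print(response.choices[0].text)   # the model's continuation for White's third move
```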
All this would be trying to get the best out of the current models, and my gut is that there’s a bit more to be gotten. But I also think that you may be talking to the model with the largest full network weights available simultaneously at OpenAI when you talk to 3.5. If you really want to do more here, it might be interesting to see if the Llama team is interested in working more with you on this; there’s a 400b undertrained model and a 70-80b fairly well trained model to dig in on some of these questions.
Thanks for the write ups, very interesting!
> Second, crucially for replicable testing, the model BATCHES requests into the attention layer FROM ALL API CALLS and then selects ONE expert for that batch. Really.
The expert is selected based on batching with other random API calls from other people!? 😵💫
Regarding your "upshot" suggestion, in these comparisons (https://dynomight.net/img/more-chess/gpt-4o-mini-variants.pdf, https://dynomight.net/img/more-chess/gpt-4o-variants.pdf) the "meta data on" variants get a string like this:
[Event "Shamkir Chess"]
[White "Anand, Viswanathan"]
[Black "Topalov, Veselin"]
[Result "1-0"]
[WhiteElo "2779"]
[BlackElo "2740"]
which is supposed to suggest to the AI that it is Viswanathan Anand and has an insane Elo. That seemed to be (strangely) neutral or even slightly harmful. I do agree that with a sufficiently powerful model, it seems like this REALLY should make a difference, at least for base models.
Re batching, yep. This is why, with later models from OpenAI, temp 0 will return one of a set of 16 or 30 or whatever possibilities — you’ve been batched into a different expert.
If you posit that the LLM might be 1800 but not 2779, yet could *recognize* stronger play, I wonder if it gets confused/upset or decides there’s something weird about the game record. Would be interesting to see what a ‘reach’ Elo would be. Like I said, you could do this forever, and maybe not learn anything more than you have already, though.