It might be interesting to re-run GPT o1-mini without the line "NEVER explain your choice". I've gotten the sense that models sometimes get better results if they can "think out loud".
Part of the motivation for that is simply to make these experiments cheaper! For all the models except o1-mini, I can limit the output to like 7 output tokens.
Anyway, in the pre-o1 era, people experimented with prompts to encourage thinking quite a bit and (AFAIK) no one ever got close to the performance of gpt-3.5-turbo-instruct on raw chess notation. It's odd because, as you say, getting models to think things through usually helps! I think the theory is that there's just so much chess data in the training corpus that the LLM has some kind of sophisticated machinery to predict the next move. But because it's not used to seeing analysis written out before the moves, it doesn't do as well when asked to produce that analysis first.
In any case, I thought o1 was supposed to kind of "secretly" think even if you don't ask it? I assume that's why we have to pay for thousands of output tokens, even when we only see "e4"...
Another hypothesis: GPT-3.5 was fine-tuned to be good at chess, but training it to be better at chat clobbered this fine-tuning.
This is definitely possible. But this section (https://dynomight.net/chess/#discussion) was supposed to be a bit of an argument against. Basically, if it's the fine-tuning, then why are all the other models without fine tuning also so bad? Maybe they just don't have enough chess data? (Again, possible! But weird.)
Not criticizing the monocausal theories, but LLMs "do a bunch of stuff with a bunch of data" and if you ask them why they did something in particular, you get a hallucination. To be fair, humans will most often give you a moralized post hoc rationalization if you ask them why they did something in particular, so we're not far from hallucination.
To be more specific, the models change BOTH the "bunch of stuff" (training setup and prompts) and the "bunch of data", and those changes interact in deep and chaotic (as in chaos theory) ways.
All of this really makes me think about how we treat other humans. Training an LLM is a one-way operation; you can't really retrain one part of an LLM (as I understand it). You can do prompt engineering, and you can do some more training, but those interact in deep and chaotic ways.
I think you can replace LLM with human in the previous paragraph and not be too far wrong.
I always appreciate a good bit of systematic analysis!
Did you try GPT-4 (not -4o)? I had remembered it being GPT-4 that was originally reported to play a decent game of chess; per https://twitter.com/kenshin9000_/status/1662510532585291779, it was both 3.5 and 4.0. I wonder how they compare in skill.
I think maybe a slight extension of your Theory 1 explains all of the available data? Base models at sufficient scale can play chess, but this capability is fragile and can be destroyed by instruction tuning *or any number of other things*, such as tokenization issues, vagaries in choice of training data, etc.
Some other well-informed folks have suggested trying GPT-4 too. I haven't done that yet, but if the cost isn't too high, we might.
If GPT-4 works, then I think your theory would fit. Although I'd love to understand why llama-3.1-405b seems to be so bad, even though it's bigger than GPT-3.5. (To be fair, I had to run it at insanely slow speed on CPU and only made it through one game. But it was making terrible blunders.)
Yup, for sure would be interesting to understand that.
Did you see the recent thing where someone used interpretability tools to investigate why models keep saying 9.11 > 9.9, and it turned out that the model was activating features that primarily trigger in the context of dates (9/11 comes after 9/9) and Bible verses (9:11 comes after 9:9)?
Confusing and intriguing! I'd also be curious to see results for Claude 3.5 Sonnet; it somehow beats o1 on many reasoning-y benchmarks and I've subjectively found it surprisingly impressive. (Note it got a significant update a few weeks ago, which they stupidly just call the "new Sonnet 3.5" rather than incrementing the number.)
Given that 3.5 is closed source, is there any chance they're just calling out to a chess engine?
I suppose it's possible, but I consider that extremely unlikely, because:
1. They've never mentioned doing anything like that.
2. If they did, why remove it in later models?
3. You could get to the same board state via different sequences of moves. Chess engines don't care because they just look at the board state. But gpt-3.5-turbo-instruct behaves very differently!
4. While the model is quite good by human standards, it's still terrible by chess AI standards. (Maybe 1800 ELO.) If they went to such an effort to add an engine, why stop it at such a low level?
1. I assume there’s lots of secret sauce they don’t describe in all their products, and I imagine a lot of attempts to make it appear that LLMs are more capable than they perhaps are.
2. I assume different models have different architectures in various ways. You could add such a subsystem in one model as an experiment, or because Gary really wanted to, or whatever, and then not bother to include it in future architectures.
3. This is interesting, and is a good argument against. Though it's possible that the imaginary subsystem is a novel AI chess engine that doesn't behave in predictable ways.
4. I didn’t realise this! This is a really good argument against the possibility. I still think there’s a small chance of them having a weird experimental, but not very good, chess engine subsystem. But it does seem unlikely.
My assumption would be that OpenAI noticed that the chess thing was drawing a lot of interest, so they optimised something (training data, fine-tuning, something algorithmic) to make gpt-3.5-turbo-instruct much better at chess. Then they decided that the tradeoffs of that optimisation (cost, degradation of other functions) made it not worth it for future models.
That's a bit vague, but "OpenAI did it on purpose" is my main assumption. I don't know if the timing works out.
How often did the model not satisfy the grammar and so you picked a random move? Ability to satisfy the grammar might be a confounder?
Also did you try using zero temperature?
Just to be clear, the grammars meant that the open models always picked a legal move. It's sort of equivalent to re-prompting an infinite number of times. For the closed (OpenAI) models, I sadly didn't keep detailed numbers. But illegal moves happened quite often in the late game for all the models except... gpt-3.5-turbo-instruct, where it never happened!
(My guess is that imposing a grammar probably doesn't actually improve play very much in practice, because if the model is confused about what moves are legal, it probably doesn't know what moves are good. But it does make these experiments faster!)
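(For anyone curious what that handling amounts to in code, here's a rough sketch using python-chess, as I understand the setup: retry a few times, then fall back to a random legal move. `ask_model` is a hypothetical stand-in for whatever API call you're making, and the retry count is arbitrary.)

```python
import random
import chess  # python-chess

def get_move(board: chess.Board, ask_model, max_tries: int = 10) -> chess.Move:
    """Ask the model for a move in SAN; fall back to a random legal move.

    `ask_model` is a hypothetical callable that takes a board and returns a
    candidate move as a SAN string (e.g. "Nf3").
    """
    for _ in range(max_tries):
        candidate = ask_model(board).strip()
        try:
            # parse_san raises a ValueError subclass on illegal/unparseable moves
            return board.parse_san(candidate)
        except ValueError:
            continue  # ask again
    # Give up and play a uniformly random legal move
    return random.choice(list(board.legal_moves))
```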
BTW, I did try zero temperature with llama-3.1-70b. It seemed to play slightly WORSE that way, oddly. I didn't finish 50 games, but after 30 or so it still lost every time.
1. You're crippling the LLM by forbidding it from talking in the output. Consider asking it _first_ to draw the board state and reason "out loud" for a few sentences about its next move, and _then_ provide its answer at the _end_ of the response. This will fill the context window with helpful information to "steer" the final output tokens where it actually makes a decision. (You can then discard everything except the move at the end; see the sketch after this list.)
(From Jeremy Howard's popular custom instruction https://gist.github.com/siddharthsarda/c58557e21a3bc8aeddf6b2cddc1b325a : "Since you are autoregressive, each token you produce is another opportunity to use computation, therefore you always spend a few sentences explaining background context, assumptions, and step-by-step thinking BEFORE you try to answer a question.")
2. I'm guessing you're running the Q5_K_M variants of local models (5-bit quantization with K-means clustering) to fit them on your computer, but that does degrade the capability of these models a bit. If you can get your hands on a box with 96+ GB of RAM (even if you have to run on CPU instead of GPU), I'd expect 8-bit precision should perform better (though I couldn't tell you if the result will be non-terrible).
3. What about Llama-3.1-Nemotron-70B-Instruct? Ostensibly it's the only weights-available (I hesitate to say "open") LLM that will reliably give correct answers to questions like "How many r in strawberry?", by counting up the letters one by one.
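To make point 1 concrete, here's a minimal sketch of the "reason first, move last" shape against the OpenAI chat API. The prompt wording, the "MOVE:" convention, and the regex that pulls the move off the last line are all placeholders, not anything from the post.

```python
import re
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

# Hypothetical prompt; the "MOVE:" convention just makes the final answer easy to find.
COT_PROMPT = """Here is a chess game in PGN:

{pgn}

It is your move. First, briefly describe the board state and your reasoning.
Then, on the final line, write only your chosen move in SAN, prefixed with
"MOVE: " (for example, "MOVE: Nf6")."""

def reasoned_move(pgn: str, model: str = "gpt-4o-mini") -> str | None:
    """Let the model 'think out loud', then pull the move off the last line."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": COT_PROMPT.format(pgn=pgn)}],
    )
    text = resp.choices[0].message.content
    match = re.search(r"MOVE:\s*([A-Za-z0-9+#=O-]+)\s*$", text)
    return match.group(1) if match else None  # discard everything but the move
```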
1. That's what I tried first. But I was told to try it this way instead, and indeed, it works much better this way. There are lots of people who've tried this with chess and AFAIK they've all found the same thing. Of course, there's a real art to this kind of thing, and it's certainly possible some other trick would work better. But I have tried. (Also, isn't o1 supposed to do this automatically?)
I actually have Nemotron ready to go—I'll have it play a game and report back!
Nemotron rekt:
https://lichess.org/uDpZMebp
https://lichess.org/EkvB1Ftu
https://lichess.org/GzIFwd21
As someone else mentioned, the closed models could be compromised by an inability to produce valid moves; I'm curious what the failure rate was.
If you didn't know, 4o now supports constrained generation with a feature called Structured Outputs, where you encode the constraint as a JSON schema. https://openai.com/index/introducing-structured-outputs-in-the-api/
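If anyone wants to try that for chess, here's a rough, untested sketch of what it might look like. The response_format shape follows my reading of the Structured Outputs docs, so treat it as an assumption, and note the enum of legal moves has to be rebuilt every turn since the legal set changes.

```python
import json
import chess
from openai import OpenAI

client = OpenAI()

def constrained_move(board: chess.Board, pgn: str,
                     model: str = "gpt-4o-2024-08-06") -> str:
    """Constrain the model to the currently legal moves via Structured Outputs.

    The schema is rebuilt every call because the set of legal moves changes.
    (response_format shape is my reading of the Structured Outputs docs.)
    """
    legal_san = [board.san(m) for m in board.legal_moves]
    schema = {
        "type": "object",
        "properties": {"move": {"type": "string", "enum": legal_san}},
        "required": ["move"],
        "additionalProperties": False,
    }
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Here is a chess game in PGN:\n{pgn}\nChoose your next move."}],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "chess_move", "strict": True, "schema": schema},
        },
    )
    return json.loads(resp.choices[0].message.content)["move"]
```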
Theory 5: The people behind gpt-3.5-turbo-instruct noticed that people were excited about LLMs playing chess, so they _built in a redirect to a chess engine_ when it identifies a chess game, in the belief that more hype about how smart LLMs are would be good for business.
Too cynical?
You could actually test this theory: you simply feed the game to a chess engine, and look at how frequently the moves proposed by the LLM match those proposed by the chess engine. This is the same way cheating by human players is often detected. To quote chess.com:
"If the game is entered into an engine with analysis, and the opponents choice has little deviation from the top recommended moves, it leaves one to wonder. Human players can find the computer best move, and certainly the stronger the player, the closer it will often be. But continually hitting the computer recommended best move is a clue. Especially telling but harder to define are nuisance moves that have a computer sense without any threat or underlying reason for the move."
(https://www.chess.com/blog/uptophigh/how-to-tell-if-someone-is-cheating-at-chess-com)
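For what it's worth, here's a rough sketch of that match-rate check using python-chess and a local Stockfish binary. The engine path, search depth, and exact-match criterion are arbitrary choices here, not anything from the post.

```python
import chess
import chess.engine
import chess.pgn

def engine_match_rate(pgn_path: str, color: chess.Color, depth: int = 15,
                      stockfish_path: str = "stockfish") -> float:
    """Fraction of one side's moves that coincide with Stockfish's top choice."""
    with open(pgn_path) as f:
        game = chess.pgn.read_game(f)
    engine = chess.engine.SimpleEngine.popen_uci(stockfish_path)
    board = game.board()
    matches, total = 0, 0
    try:
        for move in game.mainline_moves():
            if board.turn == color:
                best = engine.play(board, chess.engine.Limit(depth=depth)).move
                matches += int(move == best)
                total += 1
            board.push(move)
    finally:
        engine.quit()
    return matches / total if total else 0.0
```

In practice you'd probably also want to count the engine's top two or three moves as matches, the way cheat-detection heuristics do, but exact matches are enough for a first pass.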
I think this is extremely unlikely, but not because it's too cynical. I am pro cynicism! https://dynomight.substack.com/p/chess/comment/77190852
Those are valid points, although the cynic in me says that the answer to (2) is "they realized people were onto them" ;-)
Mind you, although the points you make argue strongly against a redirect to a chess engine, could it still be a redirect to a classification model like the ones you mentioned at https://github.com/sgrvinod/chess-transformers? Those are pretty good models, but they get crushed by Stockfish on anything from level 3 upwards, depending on the data set. That would be consistent with an ELO of around 1800, as you said in your linked comment.
Basically, I am profoundly skeptical of any claims of "emergent" capabilities in LLMs purely from throwing lots of training data at them, and really want to see very strong evidence. Especially given that gpt-3.5-turbo-instruct is so _very_ different in performance from its cousins.
Everyone is missing the real reason for 3.5's surprising performance: an OpenAI employee was working on it in public a while ago, doing fine-tuning with collected data. The obvious next step would've been to throw it into the SFT data pile.
https://twitter.com/willdepue/status/1746384311081930790
My read of that tweet is that you can do fine-tuning on 3.5 after it's been tuned for chatting, and that 100 examples raise 3.5 in chat mode to the performance of 3.5-instruct. This is definitely helpful information, but it doesn't look like the fine-tuning improves over instruct?
I took a look at the transformer models you linked to as being good at chess (and they are indeed pretty good), and I found the interesting descriptions below of the various models. For readers not closely familiar with the jargon: these are transformers, but NOT in the GenAI sense of LLMs; they're closer to classification AI. This project has turned chess prediction into either a classification task or a labeling task, things that we already know transformers can be good at, NOT a generation ("text prediction") task like an LLM.
In other words, these models are good at spotting good chess moves in much the same way that other AIs are good at spotting tumors on medical images.
And in other, other words, no amount of chess games ingested by an LLM will teach it to play chess the way these transformer models play chess.
Here are the relevant quotes from GitHub:
1. "This is essentially a sequence (or image) classification task, where the sequence is the current state of the board, and the classes are the various moves that can be made on a chessboard in UCI notation."
2. "This is essentially a sequence (or image) labeling task, where the sequence is the current state of the chessboard, and each square competes to be labeled as the From or To square."
3. "This is essentially a sequence-to-sequence (or image-to-sequence) task, where the input sequence is the current state of the board, and the output sequence is a string of half-moves that will likely occur on the board from that point onwards."
4. "This is essentially a sequence (or image) labeling task, where the sequence is the current state of the chessboard, and each square competes to be labeled as the From or To square."
Thank you, this is very helpful. I was definitely misinterpreting those results! (I thought they were "normal" LLMs)
There are a few other experiments where people I THINK are training normal-ish LLMs with small numbers of parameters and still getting pretty decent results? E.g.
https://adamkarvonen.github.io/machine_learning/2024/01/03/chess-world-models.html
Any thoughts on that?
I will definitely take a look later. Right now I have to go and hate-watch a soccer game while doing the ironing, just to prove I am still capable of feats no AI can match.
p.s. In fairness to LLMs, I should also add that my critique boils down to "I don't believe that LLMs are any good at things LLMs were not designed to be good at, and I will be really impressed if it turns out that they unexpectedly are" :-)
You might be interested in this chess transformer from Google DeepMind: https://github.com/google-deepmind/searchless_chess
A recent article showed that a trip-planning AI was brittle when it was trained only on routes that an expert driver would choose. When faced with road closings, it hallucinated weird road connections. However, the authors claim that when an AI is trained on the results of random moves, it will better cover the problem space.
The problem could be that the chess games, vacuumed up in training data, did not include sequences with bad moves.
OOOh. Juicy. Thank you.
Very interesting!