64 Comments

Update: OK, I actually think I've figured out what's causing this. I'll explain in a future post, but in the meantime, here's a hint: I think NO ONE has hit on the correct explanation!

Looking forward to it. I can't even begin to guess outside of the theories I've read.

I heard somewhere (some recent podcast?) that gpt-3.5-turbo-instruct was specifically trained on millions of chess games? I wish I could remember which podcast I heard that on. Perhaps Dwarkesh interviewing Gwern?

I asked SearchGPT this question and the only thing it could point me to on this topic is https://adamkarvonen.github.io/machine_learning/2024/01/03/chess-world-models.html

Gwern has pointed me towards this paper, which gives some details on the gpt-4 dataset. It's the closest thing I could find: https://arxiv.org/pdf/2312.09390#page=29

Follow-up post here: https://dynomight.net/more-chess/

Anyways, why use temp = 0.7? Shouldn't 0.0 pick the LLM's preferred move every time? It's not like you need to rely on the sampler to create a non-deterministic strategy.
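For what it's worth, here's a minimal sketch of what temperature does at sampling time (toy numbers, nothing from the actual experiments): at temp = 0 you just take the argmax, while at 0.7 you sharpen the distribution and still sample from it.

```python
import numpy as np

def sample_index(logits, temperature, rng=np.random.default_rng(0)):
    """Toy sampler: temperature 0 is greedy argmax; otherwise scale logits and sample."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0.0:
        return int(np.argmax(logits))          # always the model's top choice
    z = (logits - logits.max()) / temperature  # stabilized, temperature-scaled logits
    p = np.exp(z) / np.exp(z).sum()            # softmax
    return int(rng.choice(len(logits), p=p))

# Made-up logits for three candidate moves, say ["e4", "d4", "Nf3"].
logits = [2.0, 1.5, 0.5]
print(sample_index(logits, 0.0))  # always 0
print(sample_index(logits, 0.7))  # mostly 0, sometimes 1 or 2
```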

I tried temp=0, at least with llama-3.1-70b. It played slightly worse. Other people have reported the same thing, although no one seems to have a good explanation. (That said, I doubt 0.7 is optimal.)

That's very interesting, if only because it makes no effing sense at all.

Also did you janny my top level comment or did substack do that?

It was me. (But let's stay friends.)

Lame, lol. You deserved some banter for that one! (But I'll be your friend!)

The only plausible explanation I can think of would be if the output probabilities are *worse* than pure chance. Which would be uh, impressive in a different way. (Or if sampling @ temp=0.0 is literally bugged in some way.)

OOOh. Juicy. Thank you.

Very interesting!

It might be interesting to re-run GPT o1-mini without the line "NEVER explain your choice". I've gotten the sense that models sometimes get better results if they can "think out loud".

Part of the motivation for that is simply to make these experiments cheaper! For all the models except o1-mini, I can limit the output to like 7 output tokens.

Anyway, in the pre-o1 era, people experimented quite a bit with prompts to encourage thinking, and (AFAIK) no one ever got close to the performance of gpt-3.5-turbo-instruct on raw chess notation. It's odd because, as you say, getting models to think things through usually helps! I think the theory is that there's so much chess data in the training corpus that the LLM has some kind of sophisticated machinery to predict the next move, but since it's not used to seeing moves with analysis beforehand, it doesn't do as well in that setting.

In any case, I thought o1 was supposed to kind of "secretly" think even if you don't ask it? I assume that's why we have to pay for thousands of output tokens, even when we only see "e4"...

Another hypothesis: GPT-3.5 was fine-tuned to be good at Chess, but training it to be better at chat clobbered this fine-tuning.

This is definitely possible. But this section (https://dynomight.net/chess/#discussion) was supposed to be a bit of an argument against. Basically, if it's the fine-tuning, then why are all the other models without fine tuning also so bad? Maybe they just don't have enough chess data? (Again, possible! But weird.)

I don't mean the models are all good at chess but fine-tuning for chatting makes them bad. I mean GPT-3.5 was *specifically* trained to be good at chess, which makes it much better than the others.

Other people have proposed the same hypothesis and provided some evidence, e.g: https://old.reddit.com/r/slatestarcodex/comments/1gredan/something_weird_is_happening_with_llms_and_chess/lx5rvce/

Aha, I see what you're saying. I'm now almost 100% convinced the explanation is something different, but it's because of other experiments I've done since publishing this post. I hope to have better evidence next week!

Not criticizing the monocausal theories, but LLMs "do a bunch of stuff with a bunch of data" and if you ask them why they did something in particular, you get a hallucination. To be fair, humans will most often give you a moralized post hoc rationalization if you ask them why they did something in particular, so we're not far from hallucination.

To be more specific, the models change BOTH the "bunch of stuff" (training setup and prompts) and the "bunch of data", and those changes interact in deep and chaotic (as in chaos theory) ways.

All of this really makes me think about how we treat other humans. Training an LLM is a one-way operation; you can't really retrain just one part of an LLM (as I understand it). You can do prompt engineering, and you can do some more training, but those interact in deep and chaotic ways.

I think you can replace LLM with human in the previous paragraph and not be too far wrong.

I always appreciate a good bit of systematic analysis!

Did you try GPT-4 (not -4o)? I had remembered it being GPT-4 that was originally reported to play a decent game of chess; per https://twitter.com/kenshin9000_/status/1662510532585291779, it was both 3.5 and 4.0. I wonder how they compare in skill.

I think maybe a slight extension of your Theory 1 explains all of the available data? Base models at sufficient scale can play chess, but this capability is fragile and can be destroyed by instruction tuning *or any number of other things*, such as tokenization issues, vagaries in choice of training data, etc.

Some other well-informed folks have suggested trying GPT-4 too. I haven't done that yet, but if the cost isn't too high, we might.

If GPT-4 works, then I think your theory would fit. Although I'd love to understand why llama-3.1-405b seems to be so bad, even though it's bigger than GPT-3.5. (To be fair, I had to run it at insanely slow speed on CPU and only made it through one game. But it was making terrible blunders.)

Yup, for sure would be interesting to understand that.

Did you see the recent thing where someone used interpretability tools to investigate why models keep saying 9.11 > 9.9, and it turned out that the model was activating features that primarily trigger in the context of dates (9/11 comes after 9/9) and Bible verses (9:11 comes after 9:9)?

I can authorize some spend with our API access if you want to reach out privately.

OK, I'll take you up on that! (I actually have a theory for this that I want to confirm.) Can you email me? My email is my username at tutanota.com

Confusing and intriguing! I'd also be curious to see results for Claude 3.5 Sonnet; it somehow beats o1 on many reasoningy benchmarks, and I've subjectively found it surprisingly impressive. (Note it got a significant update a few weeks ago, which they stupidly just call the "new Sonnet 3.5" rather than incrementing the number.)

Given that 3.5 is closed source, is there any chance they're just calling out to a chess engine?

I suppose it's possible, but I consider that extremely unlikely, because:

1. They've never mentioned doing anything like that.

2. If they did, why remove it in later models?

3. You could get to the same board state via different sequences of moves. Chess engines don't care because they just look at the board state. But gpt-3.5-turbo-instruct behaves very differently! (See the sketch after this list.)

4. While the model is quite good by human standards, it's still terrible by chess AI standards. (Maybe 1800 ELO.) If they went to such an effort to add an engine, why stop it at such a low level?
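To make point 3 concrete, here's a small illustration (a sketch using python-chess, which is my tooling assumption, not necessarily what the experiments used): two different move orders reach the identical position, so an engine would respond the same way to both, while an LLM conditioned on the move text sees two different prompts.

```python
import chess  # pip install python-chess

a = chess.Board()
for san in ["d4", "d5", "c4", "e6"]:  # 1. d4 d5 2. c4 e6
    a.push_san(san)

b = chess.Board()
for san in ["c4", "e6", "d4", "d5"]:  # 1. c4 e6 2. d4 d5 (a transposition)
    b.push_san(san)

print(a.fen() == b.fen())            # True: identical board state
print(a.move_stack == b.move_stack)  # False: different move histories, hence different prompts
```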

1. I assume there’s lots of secret sauce they don’t describe in all their products, and I imagine a lot of attempts to make it appear that LLMs are more capable than they perhaps are.

2. I assume different models have different architectures in various ways. You could add such a subsystem in one model as an experiment, or because Gary really wanted to, or whatever and then not bother to include it in other future architectures.

3. This is interesting, and is a good argument against. Though it’s possible that the imaginary subsystem is a novel ai chess engine that doesn’t behave in predictable ways.

4. I didn’t realise this! This is a really good argument against the possibility. I still think there’s a small chance of them having a weird experimental, but not very good, chess engine subsystem. But it does seem unlikely.

My assumption would be that OpenAI noticed that the chess thing was drawing a lot of interest, so they optimised something (training data, fine-tuning, something algorithmic) to make gpt-3.5-turbo-instruct much better at chess, then decided that the tradeoffs of that optimisation (cost, degradation of other functions) made it not worth it for future models.

That's a bit vague, but "OpenAI did it on purpose" is my main assumption. I don't know if the timing works out.

How often did the model not satisfy the grammar and so you picked a random move? Ability to satisfy the grammar might be a confounder?

Also did you try using zero temperature?

Just to be clear, the grammars meant that the open models always picked a legal move. It's sort of equivalent to re-prompting an infinite number of times until a legal move comes out. For the closed (OpenAI) models, I sadly didn't keep detailed numbers. But it happened quite often in the late game for all the models except... gpt-3.5-turbo-instruct, where it never happened!

(My guess is that imposing a grammar probably doesn't actually improve play very much in practice, because if the model is confused about what moves are legal, it probably doesn't know what moves are good. But it does make these experiments faster!)
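For reference, the fallback for the closed models amounts to something like this (a sketch assuming python-chess; the actual harness may differ): try to parse the model's output as a legal SAN move, and if that fails, play a uniformly random legal move.

```python
import random
import chess

def choose_move(board: chess.Board, model_output: str) -> chess.Move:
    """Use the model's move if it parses as a legal SAN move; otherwise pick a random legal move."""
    try:
        return board.parse_san(model_output.strip())
    except ValueError:  # illegal, ambiguous, or unparseable output
        return random.choice(list(board.legal_moves))
```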

BTW, I did try zero temperature with llama-3.1-70b. It seemed to play slightly WORSE that way, oddly. I didn't finish 50 games, but after 30 or so it still lost every time.

1. You're crippling the LLM by forbidding it from talking in the output. Consider asking it _first_ to draw the board state and reason "out loud" for a few sentences about its next move, and _then_ provide its answer at the _end_ of the response. This will fill the context window with helpful information to "steer" the final output tokens where it actually makes a decision. (You can then discard everything except the move at the end; see the sketch after this list.)

(From Jeremy Howard's popular custom instruction https://gist.github.com/siddharthsarda/c58557e21a3bc8aeddf6b2cddc1b325a : "Since you are autoregressive, each token you produce is another opportunity to use computation, therefore you always spend a few sentences explaining background context, assumptions, and step-by-step thinking BEFORE you try to answer a question.")

2. I'm guessing you're running the Q5_K_M variants of local models (5-bit quantization with K-means clustering) to fit them on your computer, but that does degrade the capability of these models a bit. If you can get your hands on a box with 96+ GB of RAM (even if you have to run on CPU instead of GPU), I'd expect 8-bit precision should perform better (though I couldn't tell you if the result will be non-terrible).

3. What about Llama-3.1-Nemotron-70B-Instruct? Ostensibly it's the only weights-available (I hesitate to say "open") LLM that will reliably give correct answers to questions like "How many r in strawberry?", by counting up the letters one by one.
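Re: point 1, here's a minimal sketch of the "reason out loud, then extract the move at the end" pattern. The prompt wording and the MOVE: convention are made up for illustration, not what the post used.

```python
import re

# Hypothetical prompt: reason first, then put the move on a final, machine-readable line.
PROMPT_TEMPLATE = """You are playing White. Here is the game so far in PGN:
{pgn}

First, briefly describe the board state and your reasoning (a few sentences).
Then give your chosen move on a final line formatted exactly as:
MOVE: <move in SAN>"""

def extract_move(response: str):
    """Pull the SAN move off the trailing 'MOVE:' line, discarding the reasoning."""
    m = re.search(r"^MOVE:\s*(\S+)\s*$", response, flags=re.MULTILINE)
    return m.group(1) if m else None

print(extract_move("The center is contested, so I develop a knight.\nMOVE: Nf3"))  # -> Nf3
```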

1. That's what I tried first. But I was told to try it this way instead, and indeed, it works much better this way. There are lots of people who've tried this with chess and AFAIK they've all found the same thing. Of course, there's a real art to this kind of thing, and it's certainly possible some other trick would work better. But I have tried. (Also, isn't o1 supposed to do this automatically?)

I actually have Nemotron ready to go—I'll have it play a game and report back!

As someone else mentioned, the closed models could be compromised by an inability to produce valid moves; I'm curious what the failure rate was.

If you didn't know, 4o now supports constrained generation with a feature they call Structured Outputs, where you describe the output format as a JSON Schema. https://openai.com/index/introducing-structured-outputs-in-the-api/
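Roughly like this, if I remember the API shape correctly (the schema and prompt here are made up for illustration; note this only guarantees the reply is well-formed JSON matching the schema, not that the move is legal):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

move_schema = {
    "type": "object",
    "properties": {"move": {"type": "string", "description": "Next move in SAN, e.g. Nf3"}},
    "required": ["move"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "1. e4 e5 2. Nf3 Nc6 3. -- reply with White's next move."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "chess_move", "strict": True, "schema": move_schema},
    },
)
print(resp.choices[0].message.content)  # e.g. {"move":"Bb5"}
```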

Theory 5: The people behind gpt-3.5-turbo-instruct noticed that people were excited about LLMs playing chess, so they _built in a redirect to a chess engine_ when it identifies a chess game, in the belief that more hype about how smart LLMs are would be good for business.

Too cynical?

You could actually test this theory: you simply feed the game to a chess engine, and look at how frequently the moves proposed by the LLM match those proposed by the chess engine. This is the same way cheating by human players is often detected. To quote chess.com:

"If the game is entered into an engine with analysis, and the opponents choice has little deviation from the top recommended moves, it leaves one to wonder. Human players can find the computer best move, and certainly the stronger the player, the closer it will often be. But continually hitting the computer recommended best move is a clue. Especially telling but harder to define are nuisance moves that have a computer sense without any threat or underlying reason for the move."

(https://www.chess.com/blog/uptophigh/how-to-tell-if-someone-is-cheating-at-chess-com)
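If anyone wants to run that test, here's a rough sketch (assuming python-chess plus a local Stockfish binary on the PATH, both assumptions on my part): walk through a game and count how often the played move equals the engine's top choice.

```python
import chess
import chess.engine

def engine_match_rate(san_moves, engine_path="stockfish", depth=18):
    """Fraction of moves in a game that coincide with the engine's top choice."""
    board = chess.Board()
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    matches = 0
    try:
        for san in san_moves:
            played = board.parse_san(san)
            best = engine.play(board, chess.engine.Limit(depth=depth)).move
            matches += played == best
            board.push(played)
    finally:
        engine.quit()
    return matches / len(san_moves) if san_moves else 0.0

# e.g. engine_match_rate(["e4", "e5", "Nf3", "Nc6", "Bb5"])
```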

I think this is extremely unlikely, but not because it's too cynical. I am pro cynicism! https://dynomight.substack.com/p/chess/comment/77190852

Those are valid points, although the cynic in me says that the answer to (2) is "they realized people were onto them" ;-)

Mind you, although the points you make argue strongly against a redirect to a chess engine, could it still be a redirect to a classification engine like the ones you mentioned at https://github.com/sgrvinod/chess-transformers? Those are pretty good models but get crushed by Stockfish on anything from level 3 upwards, depending on the data set. This would be consistent with an Elo of around 1800, as you said in your linked comment.

Basically, I am profoundly skeptical of any claims of "emergent" capabilities in LLMs purely from throwing lots of training data at them, and really want to see very strong evidence. Especially given that gpt-3.5-turbo-instruct is so _very_ different in performance from its cousins.

Everyone is missing the real reason for 3.5's surprising performance: an OpenAI employee was working on it in public a while ago, doing fine-tuning with collected data. The obvious next step would've been to throw it into the SFT data pile.

https://twitter.com/willdepue/status/1746384311081930790

My read of that tweet is that you can fine-tune 3.5 after it's been tuned for chatting, and that 100 examples raise 3.5 in chat mode to the performance of 3.5-instruct. This is definitely helpful information, but it doesn't look like the fine-tuning improves over instruct?

I took a look at the transformer models you linked to as being good at chess (and they are indeed pretty good). I found the interesting descriptions below of the various transformers. For those readers not closely familiar with the jargon, what this means is that these are transformers, but NOT in the sense of GenAI, like LLMs, but more in the sense of classification AI. This project has turned chess prediction into either a classification task or a labeling task, things that we already know transformers can be good at, NOT a generation ("text prediction") task like an LLM.

In other words, these models are good at spotting good chess moves in much the same way that other AIs are good at spotting tumors on medical images.

And in other, other words, no amount of chess games ingested by an LLM will teach it to play chess the way these transformer models play chess.

Here are the relevant quotes from GitHub:

1. "This is essentially a sequence (or image) classification task, where the sequence is the current state of the board, and the classes are the various moves that can be made on a chessboard in UCI notation."

2. "This is essentially a sequence (or image) labeling task, where the sequence is the current state of the chessboard, and each square competes to be labeled as the From or To square."

3. "This is essentially a sequence-to-sequence (or image-to-sequence) task, where the input sequence is the current state of the board, and the output sequence is a string of half-moves that will likely occur on the board from that point onwards."

4. "This is essentially a sequence (or image) labeling task, where the sequence is the current state of the chessboard, and each square competes to be labeled as the From or To square."

Thank you, this is very helpful. I was definitely misinterpreting those results! (I thought they were "normal" LLMs)

There are a few other experiments where people I THINK are training normal-ish LLMs with small numbers of parameters and still getting pretty decent results? E.g.

https://adamkarvonen.github.io/machine_learning/2024/01/03/chess-world-models.html

Any thoughts on that?

I will definitely take a look later. Right now I have to go and hate-watch a soccer game while doing the ironing, just to prove I am still capable of feats no AI can match.

p.s. In fairness to LLMs, I should also add that my critique boils down to "I don't believe that LLMs are any good at things LLMs were not designed to be good at, and I will be really impressed if it turns out that they unexpectedly are" :-)
