Hi! You might like a book called "The Physics of Filter Coffee". Actually it's out of print. I can send it to you when I'm through with it if you want?
A+
Nice. I've actually done this experiment for a different reason: to test whether thin or thick walls in ceramic cups actually insulate differently. They do not. But, also, the initial drop can be made sense of, since conduction >> convection + radiation + latent cooling. The water first equilibrates with the cup, then cools more slowly. I wonder what the contributions of latent cooling (evaporation) vs. convection vs. radiation are.
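A rough back-of-envelope sketch of the three loss mechanisms from the open surface of the mug. Every parameter here is an assumption (mug size, surface temperature, convection coefficient, evaporation rate), not a measurement, so take the numbers as order-of-magnitude only:

```python
# Back-of-envelope comparison of heat loss from the open surface of a
# mug of hot water. All parameters below are assumed, not measured.
import math

T_water = 90 + 273.15        # K, water surface temperature (assumed)
T_air   = 22 + 273.15        # K, room temperature (assumed)
A       = math.pi * 0.04**2  # m^2, open surface of an ~8 cm diameter mug

# Radiation: Stefan-Boltzmann law, emissivity ~0.95 for water
sigma = 5.67e-8
q_rad = 0.95 * sigma * (T_water**4 - T_air**4) * A

# Free convection from a hot horizontal surface: h ~ 10 W/m^2K is a
# typical textbook value (assumption)
h = 10.0
q_conv = h * (T_water - T_air) * A

# Evaporation: assume roughly 1 g of water evaporates per minute (a
# guess); latent heat of vaporization ~2.26e6 J/kg
m_dot = 1e-3 / 60            # kg/s
q_evap = m_dot * 2.26e6      # W

print(f"radiation:   {q_rad:.1f} W")
print(f"convection:  {q_conv:.1f} W")
print(f"evaporation: {q_evap:.1f} W")
```

With these assumed numbers, evaporation dominates the surface losses, which is why covering the mug matters so much; but the evaporation rate is the shakiest guess of the three.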
Okay I asked AI to run a simulation, and got a poorer result that looks a lot like a single exponential (labeled Newton). https://image2url.com/r2/default/images/1774048113152-135b4f32-eaac-47d7-b0a8-342ccaa8b14a.png
GitHub Copilot CLI running gpt-5.3-codex (medium), same basics as Dynomight's. The agent mostly set everything up on its own, except I had to install OpenFOAM and remind it to use my uv Python. It fixed some compatibility issues with the OpenFOAM version and even downloaded a disk image containing tutorials, which it consulted to figure out how to run everything. The first simulation was going to take 1 hr to simulate 5 min, so it offered a simpler model: a 10x10x20 mesh with dt = 0.1 s, which runs in about 7 s. The plot is for volume-averaged temperature.
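This is not the OpenFOAM setup from the comment, but a minimal 1D explicit finite-difference analogue, just to show what a coarse mesh, a fixed time step, and a "volume-averaged temperature" mean. Cell size, boundary condition (wall pinned at room temperature), and run length are all illustrative assumptions:

```python
# Toy 1D heat-conduction solver: explicit finite differences on a
# coarse grid, crude fixed-temperature walls. Illustrative only.
n = 20            # cells (cf. the 10x10x20 mesh)
dt = 0.1          # s, time step
dx = 0.004        # m, cell size (assumed)
alpha = 1.4e-7    # m^2/s, thermal diffusivity of water
T = [95.0] * n    # initial temperature, deg C
T_wall = 22.0     # boundary held at room temperature (crude assumption)

r = alpha * dt / dx**2  # explicit scheme is stable only for r < 0.5
assert r < 0.5

for step in range(3000):  # simulate 300 s
    Tn = T[:]
    for i in range(n):
        left = T[i - 1] if i > 0 else T_wall
        right = T[i + 1] if i < n - 1 else T_wall
        Tn[i] = T[i] + r * (left - 2 * T[i] + right)
    T = Tn

T_avg = sum(T) / n  # "volume-averaged" temperature over all cells
print(round(T_avg, 1))
```

Pure conduction with pinned walls is far simpler than what OpenFOAM solves (no convection in the fluid, no cup wall, no surface evaporation), which is part of why a coarse setup like this tends to collapse toward a single-exponential-looking curve.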
Although I'm disappointed in the result, I was impressed by CoPilot's ability to set everything up. A human expert would likely need 20 min to do this. Now the model's ready for refinement, where the expert could really go to town. I'm not confident an amateur like myself would be able to generate a decent prediction, but maybe the LLM could guide parameter decisions.
Here's a quick way to think about the speed of cooling between beginning and end. If the coffee were a single thermal mass, its temperature would decay exponentially with time constant tau = RC, where R is the thermal resistance and C is the thermal capacitance.
But the temperature drops much more quickly at the beginning. Now think of the coffee as a bunch of "finite element" thermal masses, each with its own R and C, and a different R on each side (the cup wall vs. another interior element, say). Now there are a bunch of different time constants, and the temperature decays as a weighted sum of exponentials. The initial drop is dominated by the fast time constants, and the later cooling by the slow ones. We don't know the weightings because of the complications Dynomight mentioned, but we do expect a faster initial drop no matter what.
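The weighted-sum-of-exponentials picture can be sketched in a few lines. The weights and time constants here are made up purely for illustration (a fast mode for the water equilibrating with the cup wall, a slow mode for the whole system losing heat to the room), not fitted to any data:

```python
# Single time constant (Newton cooling) vs. a weighted sum of two
# exponentials. All time constants and weights are illustrative.
import math

T_env, T0 = 22.0, 95.0  # room and initial temperatures, deg C

def newton(t, tau=900.0):
    # single thermal mass: T(t) = T_env + (T0 - T_env) * exp(-t/tau)
    return T_env + (T0 - T_env) * math.exp(-t / tau)

def multi(t, modes=((0.2, 60.0), (0.8, 1100.0))):
    # (weight, tau) pairs: fast mode = water equilibrating with the
    # cup wall; slow mode = system cooling to the room. Weights sum
    # to 1 so that multi(0) == T0.
    return T_env + (T0 - T_env) * sum(
        w * math.exp(-t / tau) for w, tau in modes
    )

for t in (0, 60, 300, 900):
    print(t, round(newton(t), 1), round(multi(t), 1))
```

At early times the fast mode pulls the multi-mode curve down much more steeply than the single-tau curve, which is exactly the shape of the initial drop in the measured data.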
Anthropic wins again.
Fabulous!
I predict that if I showed this to my friends they would say something like "well, there is a firmly established physics equation for this. Dynomight claims that there are many free variables, but there really aren't; you can just look this up in a textbook."
I wonder if there are any experiments that would surprise them! I guess I could just ask Claude lol.....
Yeah, in retrospect I think this is an experiment that might appear to be much easier to predict than it actually is. (For people who don't know much physics.)
the real question is whether it's actually predicting or just retrieving similar setups it's seen before... like run something slightly exotic and see if it still works, that's the benchmark that would actually matter
I think they are actually predicting! For Kimi I could read the full chain of thought, and it goes through a billion permutations of different mechanistic models and derives the final (not very accurate) equation from that. For Gemini/GPT/Opus you can only see the summarized chain of thought, but it still shows a LOT of this kind of thinking.
That's genuinely compelling — mechanistic model derivation is much harder to explain as pure retrieval than curve-fitting would be. Though the "not very accurate" endpoint is interesting: does the reasoning process look correct even when the final equation is off, or does it just confidently arrive at the wrong place through plausible-looking steps?
Cool experiments and interesting results!
Could you also share the actual equations of (some of) the models, including their assumptions? I’m curious why two models came to a much lower temperature at t=0
Sure, you can see them all here: https://dynomight.net/coffee/#:~:text=Appendix
My predictions would also have differed from both the LLM predictions and what actually happened, but the main thing that surprised me here was boiling the water in the microwave :p Do you not own a kettle???
I do! (I used it in a previous middle-school science project https://dynomight.net/fahren-height/) But here I wanted to make sure I had exactly 8 oz of water, so this seemed easier since I could verify that I didn't lose anything significant to the boiling and then pour it quickly into the mug.
Ah fair enough I withdraw my critique. Happy to see more progress in the hot drink optimisation field.
I thought this was going to be a very technical point about what LLMs *really* predict
As opposed to "the next token", a guess most readers would make after reading the first part of the title (because they are next-token predictors themselves)
But the actual article is equally lovely
Ha, you're saying you parsed the title as "the thing that LLMs really predict is: my coffee"?
(I realized the title wasn't completely clear, but I figured it was in the right part of feature space...)
You hit the right balance of making me click to see wth you meant without making me disappointed I did
Well it could have been something like "the thing they predict is so broad I could say it's a bunch of things, like my coffee"
I didn't feel misled to be clear
I would have guessed fairly rapid cooling followed by fairly slow cooling because I am a daily coffee drinker and that has been my experience. I suppose there's some lesson about machine cognition vs. human cognition in there.
Also, the fact that you hang out with Dynomight in real life is cool. Good job on the middle-school science experiment.
Now ask them for a prediction for a scenario starting at 95 degrees...
It was 100 degrees (before pouring it into the mug). I checked.
It's interesting that none of them thought that the act of pouring the water would dissipate heat, which is something anyone who has made instant noodles knows (since the very act of pouring generates a lot of steam and cools the water)
In their defense, I did skim the chain of thought for a few of them and they considered a LOT of other modeling options. It's just that they decided a simpler model was safer in the end.
I don't remember them specifically mentioning this, though the chain of thought was often like 10,000 words long, so it wouldn't surprise me.
wow 10k words is a lot. I'm also quite impressed with them here tbh - I don't think I would've been able to do this myself, and I was a physics major in college (I mean I could draw a graph but idk how good it would've been).
Yeah, I think this is one of those problems that's more impressive when you have more knowledge: You need to know a lot of physics to appreciate how hard it is!
Thanks for clarifying!