Hi! You might like a book called "The Physics of Filter Coffee". Actually it's out of print. I can send it to you when I'm through with it if you want?
A+
Nice. I've actually done this experiment for a different reason: to test whether thin or thick walls in ceramic cups actually insulate differently. They do not. But, also, the initial drop can be made sense of, since conduction >> convection + radiation + latent cooling. The water first equilibrates with the cup, then cools more slowly. I wonder what the contributions of latent cooling (evaporation) vs. convection vs. radiation are.
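A rough back-of-envelope sketch of the three loss mechanisms from the open surface of the mug. Every parameter here is an assumption (mug size, surface temperature, convection coefficient, evaporation rate), not a measurement, so take the numbers as order-of-magnitude only:

```python
# Back-of-envelope comparison of heat loss from the open surface of a
# mug of hot water. All parameters below are assumed, not measured.
import math

T_water = 90 + 273.15        # K, water surface temperature (assumed)
T_air   = 22 + 273.15        # K, room temperature (assumed)
A       = math.pi * 0.04**2  # m^2, open surface of an ~8 cm diameter mug

# Radiation: Stefan-Boltzmann law, emissivity ~0.95 for water
sigma = 5.67e-8
q_rad = 0.95 * sigma * (T_water**4 - T_air**4) * A

# Free convection from a hot horizontal surface: h ~ 10 W/m^2K is a
# typical textbook value (assumption)
h = 10.0
q_conv = h * (T_water - T_air) * A

# Evaporation: assume roughly 1 g of water evaporates per minute (a
# guess); latent heat of vaporization ~2.26e6 J/kg
m_dot = 1e-3 / 60            # kg/s
q_evap = m_dot * 2.26e6      # W

print(f"radiation:   {q_rad:.1f} W")
print(f"convection:  {q_conv:.1f} W")
print(f"evaporation: {q_evap:.1f} W")
```

With these assumed numbers, evaporation dominates the surface losses, which is why covering the mug matters so much; but the evaporation rate is the shakiest guess of the three.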
Okay I asked AI to run a simulation, and got a poorer result that looks a lot like a single exponential (labeled Newton). https://image2url.com/r2/default/images/1774048113152-135b4f32-eaac-47d7-b0a8-342ccaa8b14a.png
GitHub Copilot CLI running gpt-5.3-codex (medium), same basics as Dynomight's. The agent mostly set everything up on its own, except I had to install OpenFOAM and remind it to use my uv Python. It fixed some compatibility issues with the OpenFOAM version and even downloaded a disk image containing tutorials, which it consulted to figure out how to run everything. The first simulation was going to take 1 hr to simulate 5 min, so it offered a simpler model: a 10x10x20 mesh with dt = 0.1 s, which runs in about 7 s. The plot is for volume-averaged temperature.
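This is not the OpenFOAM setup from the comment, but a minimal 1D explicit finite-difference analogue, just to show what a coarse mesh, a fixed time step, and a "volume-averaged temperature" mean. Cell size, boundary condition (wall pinned at room temperature), and run length are all illustrative assumptions:

```python
# Toy 1D heat-conduction solver: explicit finite differences on a
# coarse grid, crude fixed-temperature walls. Illustrative only.
n = 20            # cells (cf. the 10x10x20 mesh)
dt = 0.1          # s, time step
dx = 0.004        # m, cell size (assumed)
alpha = 1.4e-7    # m^2/s, thermal diffusivity of water
T = [95.0] * n    # initial temperature, deg C
T_wall = 22.0     # boundary held at room temperature (crude assumption)

r = alpha * dt / dx**2  # explicit scheme is stable only for r < 0.5
assert r < 0.5

for step in range(3000):  # simulate 300 s
    Tn = T[:]
    for i in range(n):
        left = T[i - 1] if i > 0 else T_wall
        right = T[i + 1] if i < n - 1 else T_wall
        Tn[i] = T[i] + r * (left - 2 * T[i] + right)
    T = Tn

T_avg = sum(T) / n  # "volume-averaged" temperature over all cells
print(round(T_avg, 1))
```

Pure conduction with pinned walls is far simpler than what OpenFOAM solves (no convection in the fluid, no cup wall, no surface evaporation), which is part of why a coarse setup like this tends to collapse toward a single-exponential-looking curve.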
Although I'm disappointed in the result, I was impressed by CoPilot's ability to set everything up. A human expert would likely need 20 min to do this. Now the model's ready for refinement, where the expert could really go to town. I'm not confident an amateur like myself would be able to generate a decent prediction, but maybe the LLM could guide parameter decisions.
Here's a quick way to think about the speed of cooling between beginning and end. If the coffee were a single thermal mass, its temperature would decay exponentially with time constant tau = RC, where R is the thermal resistance and C is the thermal capacitance.
But the temperature drops much more quickly at the beginning. Now think of the coffee as a bunch of "finite element" thermal masses, each with its own R and C, and a different R on each side (the cup wall vs. another interior element, say). Now there are a bunch of different time constants, and the temperature decays as a weighted sum of exponentials. The initial drop is dominated by the fast time constants, and the later cooling by the slow ones. We don't know the weightings because of the complications Dynomight mentioned, but we do expect a faster initial drop no matter what.
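The weighted-sum-of-exponentials picture can be sketched in a few lines. The weights and time constants here are made up purely for illustration (a fast mode for the water equilibrating with the cup wall, a slow mode for the whole system losing heat to the room), not fitted to any data:

```python
# Single time constant (Newton cooling) vs. a weighted sum of two
# exponentials. All time constants and weights are illustrative.
import math

T_env, T0 = 22.0, 95.0  # room and initial temperatures, deg C

def newton(t, tau=900.0):
    # single thermal mass: T(t) = T_env + (T0 - T_env) * exp(-t/tau)
    return T_env + (T0 - T_env) * math.exp(-t / tau)

def multi(t, modes=((0.2, 60.0), (0.8, 1100.0))):
    # (weight, tau) pairs: fast mode = water equilibrating with the
    # cup wall; slow mode = system cooling to the room. Weights sum
    # to 1 so that multi(0) == T0.
    return T_env + (T0 - T_env) * sum(
        w * math.exp(-t / tau) for w, tau in modes
    )

for t in (0, 60, 300, 900):
    print(t, round(newton(t), 1), round(multi(t), 1))
```

At early times the fast mode pulls the multi-mode curve down much more steeply than the single-tau curve, which is exactly the shape of the initial drop in the measured data.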
Anthropic wins again.
Fabulous!
I predict that if I showed this to my friends they would say something like "well, there is a firmly established physics equation for this. Dynomight claims that there are many free variables, but there really aren't; you can just look this up in a textbook."
I wonder if there are any experiments that would surprise them! I guess I could just ask Claude lol.....
Yeah, in retrospect I think this is an experiment that might appear to be much easier to predict than it actually is. (For people who don't know much physics.)
the real question is whether it's actually predicting or just retrieving similar setups it's seen before... like run something slightly exotic and see if it still works, that's the benchmark that would actually matter
I think they are actually predicting! For Kimi I could read the full chain of thought, and it goes through a billion permutations of different mechanistic models and derives the final (not very accurate) equation from that. For Gemini/GPT/Opus you can only see the summarized chain of thought, but it still shows a LOT of this kind of thinking.
That's genuinely compelling — mechanistic model derivation is much harder to explain as pure retrieval than curve-fitting would be. Though the "not very accurate" endpoint is interesting: does the reasoning process look correct even when the final equation is off, or does it just confidently arrive at the wrong place through plausible-looking steps?
Cool experiments and interesting results!
Could you also share the actual equations of (some of) the models, including their assumptions? I’m curious why two models came to a much lower temperature at t=0
Sure, you can see them all here: https://dynomight.net/coffee/#:~:text=Appendix
My predictions would also have differed from both the LLM predictions and what actually happened, but the main thing that surprised me here was boiling the water in the microwave :p Do you not own a kettle???
I do! (I used it in a previous middle-school science project https://dynomight.net/fahren-height/) But here I wanted to make sure I had exactly 8 oz of water, so this seemed easier since I could verify that I didn't lose anything significant to the boiling and then pour it quickly into the mug.
Ah fair enough I withdraw my critique. Happy to see more progress in the hot drink optimisation field.
I thought this was going to be a very technical point about what LLMs *really* predict
As opposed to "the next token", a guess most readers would make after reading the first part of the title (because they are next-token predictors themselves)
But the actual article is equally lovely
Ha, you're saying you parsed the title as "the thing that LLMs really predict is: my coffee"?
(I realized the title wasn't completely clear, but I figured it was in the right part of feature space...)
You hit the right balance of making me click to see wth you meant without making me disappointed I did
Well it could have been something like "the thing they predict is so broad I could say it's a bunch of things, like my coffee"
I didn't feel misled to be clear
I would have guessed fairly rapid cooling followed by fairly slow cooling because I am a daily coffee drinker and that has been my experience. I suppose there's some lesson about machine cognition vs. human cognition in there.
Also, the fact that you hang out with Dynomight in real life is cool. Good job on the middle-school science experiment.
Now ask them for a prediction for a scenario starting at 95 degrees...
It was 100 degrees (before pouring it into the mug). I checked.
It's interesting that none of them thought that the act of pouring the water would dissipate heat, which is something anyone who has made instant noodles knows (since the very act of pouring generates a lot of steam and cools the water)
In their defense, I did skim the chain of thought for a few of them and they considered a LOT of other modeling options. It's just that they decided a simpler model was safer in the end.
I don't remember them specifically mentioning this, though the chain of thought was often like 10,000 words long, so it wouldn't surprise me.
wow 10k words is a lot. I'm also quite impressed with them here tbh - I don't think I would've been able to do this myself, and I was a physics major in college (I mean I could draw a graph but idk how good it would've been).
Yeah, I think this is one of those problems that's more impressive when you have more knowledge: You need to know a lot of physics to appreciate how hard it is!
Thanks for clarifying!