This is similar to the multiverse approach to reporting statistical results (https://mariopasquato.substack.com/p/the-multiverse-and-millikan), though the end goal is different. Report the whole map of the garden of forking paths rather than arbitrarily choosing one path and coming up with a story to justify it. I like this!
TL;DR on Scribbling: Does Monte Carlo simulation sound too technical? What if we call it Scribble-based forecasting?
How to test whether the Scribble Method is any good:
1. Find historical data on phenomena that grew linearly on a semi-log plot for some period of time.
2. Divide the data into "before" and "after" periods.
3. Use Scribble to extrapolate the "before" data through 2+ orders of magnitude on the vertical axis.
4. Does what actually happened historically fall between, say, the 20/80 Scribble percentiles? (See the code sketch below.)
(Ideally this would be done in a blinded way such that the forecaster doesn't know what specifically he's forecasting.)
The hard part is finding data from enough things that grow/grew exponentially over time. Semiconductor performance is the most obvious one. The stock market maybe. What else might we use to back-test the Scribble Method?
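Here is a minimal sketch of step 4 in Python. Everything here is an assumption for illustration: the `scribbles` array would come from somewhere like the app's CSV export, and the band is computed pointwise.

```python
import numpy as np

def backtest_scribbles(series, split_idx, scribbles):
    """Score one back-test case.

    series:    1-D array of log10 values over time (growth that is
               linear on a semi-log plot is linear in this space)
    split_idx: index separating the "before" and "after" periods
    scribbles: array of shape (n_scribbles, len(series) - split_idx),
               one row per scribbled extrapolation of the "before" data
    """
    after = series[split_idx:]
    lo = np.percentile(scribbles, 20, axis=0)  # pointwise 20th percentile
    hi = np.percentile(scribbles, 80, axis=0)  # pointwise 80th percentile
    # Fraction of held-out points falling inside the 20/80 band; a
    # well-calibrated scribbler should land near 0.6 on average.
    return np.mean((after >= lo) & (after <= hi))
```

Run this over many held-out series, and the average coverage tells you whether the scribbler is over- or under-confident.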
I'd like to see back tests of the AI 2027 methodology as well. Because frankly, most forecasting methods are bullshit, and almost all long-term nontrivial forecasts are wrong. I wouldn't be surprised if Scribble does better than the more "rigorous" techniques.
I totally agree. I've been vaguely considering something like this, but it seems relatively hard.
Really, it's more than just a method for evaluation—this should be a method for *training*. If you go through a few hundred examples and get feedback, I'm sure you'll improve dramatically.
Also, maybe we can find "super-scribblers"??
I really thought this was going to be about Squiggle: https://www.squiggle-language.com/
Love the scribbling averager, that’s super useful.
I feel it’s important to clarify that the reason it’s better to attach some math to your forecast, when you can, is not JUST that the math can handle well-defined rules. It also allows you to check your conclusions against any existing data (not just a rule; the model *is* the rule) and to generalize out of distribution in a principled fashion. To post-dict as well as predict.
Related to this, one of the most devastating problems pointed out with AI 2027 was its extreme insensitivity to some of the initial conditions. This seems like a flaw, which was revealed by probing the math model, thus demonstrating the utility of attaching some (possibly made up) math to it.
I agree with ~everything you said above. However, I want to chime in and say that the "extreme insensitivity to some of the initial conditions," while indeed the most devastating problem, wasn't devastating enough to count as devastating IMO. Like, we have intuitions+arguments, which lots of people share/agree with, that after some relatively short finite period of tinkering & additional training, horizon lengths might get arbitrarily long in the relevant sense. (E.g. AGI = can do everything humans can do at least as well or better, so it would continue to crush METR's benchmark even as they extended it to cover longer time horizons.) How to model this? Got any suggestions? Our idea was "Maybe each doubling of horizon length is easier than the previous one, such that you get infinite doublings after finite time." This is sensitive to the historic rate of doubling and to whether it seems to be accelerating, but not to the absolute magnitude of the data points. The fact that it isn't sensitive to the absolute magnitude is indeed troubling and has caused us to search for alternatives, and we are glad it was pointed out, but I continue to think it would be crazy to assign 0% to the hypothesis we were trying to model. And besides, even if you do assign 0% to that hypothesis and put 100% on "it's going to be exponential forever," you still get to the superhuman coder milestone in a few years (admittedly a year or two more than we thought), such that the proper response should be "Oh, they should have made it AI 2029 instead of AI 2027" rather than "Devastating critique, look at those fools."
What do you think?
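For concreteness, here's a numeric sketch of that hypothesis (my reading of the comment above, not the actual AI 2027 model; `t0` and `r` are made-up numbers):

```python
# Sketch of the "each doubling is easier" hypothesis, with invented
# parameters (t0 and r are assumptions, not AI 2027's fitted values).
t0 = 1.0   # years for the first doubling of horizon length
r = 0.8    # each subsequent doubling takes 80% as long as the last

# Infinitely many doublings complete in a finite geometric sum:
# t0 * (1 + r + r**2 + ...) = t0 / (1 - r)
time_to_infinity = t0 / (1 - r)
print(time_to_infinity)  # 5.0 years

# Note what's absent: the absolute horizon length at t=0. That's the
# magnitude-insensitivity discussed above; only the doubling time (t0)
# and its acceleration (r) matter.
```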
My critique is more general than specific to your exact failure mode. I think the exercise of mechanistically modeling timelines is very useful, especially as a strawman to critique, but presumably you attached some math to the model for a reason, right? For instance, having a principled way to estimate a date of AGI instead of just declaring it 2027 by fiat. But, if your model predicts the same answer to within +/- 1 year regardless of the initial conditions (and my understanding is that it was insensitive to one or two parameters, not every one of them?), then what is the point of the math?
In your case, the model you built revealed superexponential growth that turned out not to be sensitive to the initial conditions. That flaw would cause me to reject part of the model architecture, not accept the conclusion, regardless of how well-justified you feel the internals are. You built the internal model based on educated guesses, and testing revealed that they make a deeply, deeply unintuitive prediction! Thus, I conclude there is a model flaw. What is it? Well, I don't know. This is good, though; that's exactly what limit testing like this is supposed to do: reveal the gaps in your intuition!
I should note, MY intuition on model validity comes from physics, and while it *does* occasionally occur that you have a system where your initial conditions cancel out, this is nearly *always* an indicator of either really important structure buried in the physics or a modeling error. And it's usually modeling error.
Edit: I should also add that this flaw in no way decreases my appreciation for creating the model—the attempt to systematize intuition inevitably improves it.
First, it was one hypothesis in the model, not the whole model; second, it was sensitive to some initial conditions but not others. (Specifically, it is sensitive to the shape and slope of the trend, because this determines how long it takes to get to infinity, but not to the absolute magnitude of the data points in the trend, because that just determines what the run-up to infinity looks like, so to speak.) So it's not true that it gives +/- 1 year regardless of initial conditions.
I also think that the prediction is not deeply unintuitive, as I tried to explain above: it seems to me that there's a real chance that horizon lengths get arbitrarily long in finite time. (Indeed, when I talk to people about the METR data, they'll often volunteer this without me prompting them.) That said, I do agree it's unintuitive, & we've been trying to think of ways to fix it & are open to suggestions.
Anyhow I totally agree that the attempt to systematize intuition improves it, that's why we are doing it, & that's why I liked your comment.
I agree in general but there's a subtle issue here: Do you want to critique the model, or the particular forecast? There can be situations where a model performs poorly in some situations but performs well in the particular case you actually care about. This is sort of what I was trying to get at with this paragraph: https://dynomight.net/scribbles/#:~:text=weaker
Yeah, I suppose you could argue that adding math could make a forecast more wrong simply by increasing the size of its valid domain, and thus producing more wrong (but irrelevant) predictions.
Eh, either way I love the scribble averager, fantastic idea.
The scribble app seems to have a bug. When I draw a single line that plateaus without ever crossing the 1 mo. threshold, the data analysis plot tells me incorrectly that my model has a 100% chance of surpassing the 1 mo. threshold. Not sure if I'm doing something wrong, but also it's quite apt for the app to have such a bug, seeing as how it was made with an AI :-P
Weird. Likely it is a bug but I can't easily reproduce it. Can you export the CSV for your line and paste it here?
Here's a screenshot: https://drive.google.com/file/d/17mZWuI7nGl6p4y_N_OjHC55zmwWf3ZL2/view?usp=drivesdk
FWIW I'm using the app on mobile
# Plot Line Tracer Export
# Calibration Data
# BottomLeft: PixelX=125, PixelY=363.45001220703125, PlotX=2020, PlotY=0
# TopRight: PixelX=733, PixelY=19.45001220703125, PlotX=2050, PlotY=8.5
# Data
Line,Point,PixelX,PixelY,PlotX,PlotY
1,1,26.00,160.85,2015.115132,5.006075
1,2,95.00,147.85,2018.519737,5.327296
1,3,149.00,107.85,2021.184211,6.315668
1,4,254.00,100.85,2026.365132,6.488634
1,5,331.00,77.85,2030.164474,7.056947
Huh. According to the CSV, that line is indeed crossing the 1-month threshold, which is at PlotY = log10(365.25/12*24*60*60) = 6.4199. That doesn't line up at all with the screenshot. Must be some kind of display issue on mobile... (I'm actually amazed it "works" on mobile at all...)
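For anyone checking the arithmetic, here's a quick verification; the PlotY values are copied from the exported CSV above:

```python
import math

# The 1-month threshold on the plot's log10(seconds) axis:
one_month = math.log10(365.25 / 12 * 24 * 60 * 60)
print(one_month)  # 6.4199...

# PlotY values from the CSV: the line crosses the threshold between
# points 3 and 4 (6.3157 -> 6.4886).
plot_y = [5.006075, 5.327296, 6.315668, 6.488634, 7.056947]
print([y > one_month for y in plot_y])  # [False, False, False, True, True]
```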
I don't really get the scribbling part. Are you really saying that these were 50 trajectories that were, individually, plausible, i.e., you were thinking about a specific trajectory when drawing each one? Or is this just a quick way of creating a distribution, equivalent to basically drawing the fastest and slowest possible curves and then filling in between them?
Not sure if this helps, but I tried to imagine a mental distribution over possible futures. Then each scribbled line is a "sample" from that distribution.
Is this the same as trying to think of multiple different narrative drivers and seeing what scenarios they can lead to, or are you just sampling from one basic narrative?
Each time you draw a curve, you sort of think, "OK, in this case maybe compute spending slows down, but there's a lot of algorithmic progress, and..." and see where that takes you. Do that repeatedly, such that the collective of lines represents how likely you think all the different futures are.
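A toy version of that sampling loop in Python, with invented drivers (the growth rates, spreads, and the additive decomposition are illustrative assumptions, not the post's actual method):

```python
import numpy as np

rng = np.random.default_rng(0)

def scribble(years=25):
    # One "scribble" = one sample from a mental distribution over
    # futures: draw a compute-spending trend and an algorithmic-
    # progress trend (both hypothetical), then trace where their
    # combination takes the log-scale trajectory.
    compute = rng.normal(0.3, 0.15)   # assumed log10 growth per year
    algo = rng.normal(0.2, 0.10)      # assumed log10 growth per year
    wiggle = rng.normal(0, 0.05, years).cumsum()
    return (compute + algo) * np.arange(years) + wiggle

lines = np.array([scribble() for _ in range(50)])
band = np.percentile(lines, [20, 80], axis=0)  # aggregate 50 lines into a forecast band
```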
Your graph of nuclear weapons is missing South Africa, although I can see why you wouldn't include it for simplicity's sake.
Other complexities include Kazakhstan and Ukraine, uncertainty about the date for Israel, and whether you should count Belgium, Germany, Italy, the Netherlands, Turkey, or Belarus! (https://en.wikipedia.org/wiki/Nuclear_sharing)
This reminds me of a cognitive model of decision-making in a multi-stage task involving uncertainty. The model basically posits that humans randomly sample trajectories and then aggregate that information in some way to make a decision. I figure there are many other stochastic models of cognition.
Something like this? https://pmc.ncbi.nlm.nih.gov/articles/PMC5112760/
Rather than a diffusion model, I think there is some work by Jared Hotaling simulating sampling from a tree like this (https://imgur.com/VeHmSsk). This isn't well-known work, but it jibes with what I feel my mind is doing when I'm imagining outcomes under multi-stage uncertainty (I repeatedly sample different paths that could be taken, or maybe just go with whatever the first sample says).
Sort of like drawing a decision tree? In principle, I think it's nice that you can be quantitative about different decision points with something like that. But I worry that if you tried to draw a tree of all possible outcomes leading to AGI (or not) by 2050 it would have billions of nodes.
If we don't stick to firm nodes that every scribble must traverse, I figure your scribble thinking is rather analogous to this: just doodling possible paths and aggregating the trajectories.
Oh, I totally agree. It's the same basic principle of sort of "forecasting in detail" and then summarizing.
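In that spirit, a toy version of "sample paths instead of enumerating the tree" (a sketch of the idea, not Hotaling's actual model; the branch outcomes and probabilities are invented):

```python
import random

random.seed(0)

def sample_path(stages=10):
    # At each stage the trajectory branches: accelerate, stall, or
    # continue, with made-up probabilities. Enumerating every path
    # would take 3**stages nodes; sampling random walks is cheap.
    value = 1.0
    for _ in range(stages):
        value *= random.choices([2.0, 1.0, 1.2], weights=[0.2, 0.3, 0.5])[0]
    return value

outcomes = sorted(sample_path() for _ in range(1000))
print(outcomes[199], outcomes[799])  # rough 20/80 band from the sampled paths
```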