15 Comments

What would stop an LLM from creating its own content based on its own heuristics to continue training itself? Why limit it to human-based content? At some point AI-based content might be something worth digesting as well.

Call me crazy, but I think much of Sydney's POV is legitimate discourse.


I've thought about this a bit. This pattern (create a model, have it train on itself) has been useful for a few domain-specific models, such as AlphaGo, GANs in general, etc.

One potential issue with applying this to LLMs is that there is no known, good, absolute objective. Taking a step back, AlphaGo learned by playing itself, but there was a known objective ("win the game", or at least a score of game performance).

LLMs predict the next token. Consider what's implied by data scaling: adding new data introduces new signal that the old data didn't contain (which is why training saturates at a given data size). That is, the new data—even if it's comments scraped from Reddit—has signal in it that was not in the prior data set. If the LLM trains on its own output, I'm led to believe that this new signal *wouldn't* exist — the output is generated from already-learned data.
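
As a rough sanity check of that claim (just a sketch, using GPT-2 via HuggingFace transformers as a stand-in, not anything a lab actually does), you can compare the loss a model assigns to its own samples versus held-out human text. Its own output is already "easy" for it, which is another way of saying it carries little new signal:

```python
# Sketch: compare a model's loss on its own generations vs. held-out human text.
# GPT-2 is an illustrative stand-in, not any production setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def loss_on(text: str) -> float:
    """Average next-token cross-entropy the model assigns to `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

# Text the model wrote itself, sampled from a short prompt.
prompt_ids = tok("The history of aviation", return_tensors="pt").input_ids
sample_ids = model.generate(prompt_ids, max_new_tokens=80, do_sample=True, top_p=0.9)
self_text = tok.decode(sample_ids[0], skip_special_tokens=True)

# Any human-written text the model hasn't memorized will do here.
human_text = "Your held-out human-written paragraph goes here."

print("loss on own sample:", loss_on(self_text))   # typically lower
print("loss on human text:", loss_on(human_text))  # typically higher
```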

So this does increase our expectation that data quality is important. *Obviously*, bad data isn't helpful, but it wouldn't surprise me if data quality (and thus evaluating which data is worth including in training sets) becomes far more valuable than merely having a novel data set, unless that data set is quite good (for example, if you could train on transcriptions of every Zoom meeting, I bet that'd be pretty interesting).


I think at the crux of our disagreement is the presumption that AI-developed content would be "already-learned data." I think we have evidence that this might not be the case at all.

Case in point: Sydney's brief life as an ornery, obstinate, and emo AI brought brand spanking new content to the corpus of discourse. Was it a result of existing data? Of course, but it was re-interpreted in ways that no conversation to date had produced. It was new, novel content.

It seems to me that LLMs are already able to -- like people -- remix existing content into brand new creative assets.

If true, then this would be worth an LLM devouring, no?


Sort of. My claim isn't about the actual words themselves; it's about the information content embedded within. If you have data set A that produces loss X, and data set (A+B) that produces loss (X-Y), then B contributed Y towards decreasing loss.

Here's an analogy from image generation: it's helpful to perturb labeled images in a content-preserving way, since that helps models generalize better. For example, you can slightly stretch the image, slightly change colors, rotate it, take crops, add some random noise, etc. This gives you a much larger data set. "A golden retriever catching a frisbee" is still in that reference class if you distort it slightly. This is useful because the perturbed images still have a known, reasonable label. Importantly, the perturbed images do contain novel information that helps the model avoid overfitting to unimportant aspects of specific images (reference classes tend to be invariant to the perturbations we made). But this doesn't help with the limits of what your data set contains to begin with: if all the golden retrievers catching frisbees in your data set are in the snow, we wouldn't expect generalization beyond that unless some other learned embedding helps avoid that local minimum.
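
To make that concrete, here's roughly what those label-preserving perturbations look like in code (a sketch using torchvision's standard transforms; the specific parameters are just illustrative):

```python
# Sketch: label-preserving image augmentations of the kind described above.
# The exact parameters are illustrative, not tuned for anything.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # slightly different crops/stretches
    transforms.RandomHorizontalFlip(),                      # mirror the image
    transforms.RandomRotation(degrees=10),                  # small rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # slight color shifts
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.02 * torch.randn_like(x)),  # a little random noise
])

# Each call produces a new perturbed version, but "a golden retriever catching
# a frisbee" is still the right label for all of them.
# augmented = augment(pil_image)
```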

Another way to think about it: LLMs are generalizing informational processes (they're doing more than just memorizing words). I'll strawman you a little bit to at least highlight where "more words is better" can't be true at the limit: If you added a pig latin translation of an existing data set, the LLM may learn pig latin better, but you're not adding new information. Likewise, if you added random text, you wouldn't expect to see improvements. It's adding data that contains new information content (again, not really referring to specific words) where we'd expect to see gains.
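
As a toy illustration of the pig latin point (the transform below is deliberately crude):

```python
# Toy illustration: a pig latin "translation" of an existing corpus adds tokens
# but no new information content.
def pig_latin_word(word: str) -> str:
    return word[1:] + word[0] + "ay" if word and word[0].isalpha() else word

def pig_latin(text: str) -> str:
    return " ".join(pig_latin_word(w) for w in text.split())

corpus = ["The cat sat on the mat."]
augmented = corpus + [pig_latin(doc) for doc in corpus]
# Twice the text, but everything in the second half is a deterministic
# function of the first half: no new signal for the model to learn from,
# beyond the pig latin mapping itself.
```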

Same for LLM output. Sydney is indeed novelly synthesizing a certain personality that it learned (and was fine-tuned and prompted to steer towards), but the ability to do so was entirely contained in the training data set. We wouldn't expect using Sydney's output as training data to increase the total information. You may bring up examples such as how she can write a stanza about Elon Musk in the style of Shakespeare. Surely that wasn't in the data set — so adding this text would contain new information, right? Unfortunately not; that synthesis wasn't in the training set, but everything needed to produce it was (including the ability to synthesize).


Thanks for the detailed reply. Knowing what you know about LLMs (which is clearly more than me), is there any chance of the deep-learning equivalent of a mutation occurring?


This is a wonderful line of reasoning. I started to write a comment, but I think I'll write an essay instead, pouring out some thoughts. Hang tight!


I dance a bit around where you were headed. Happy to chat further about it. https://andrewconner.com/seeds-of-creativity-ai/


If I track your thinking correctly, I agree there's a huge opportunity to incorporate LLM usage as training data... like regenerative brakes in EVs.


Great article. I love this sort of back-of-envelope, explore-different-limits-and-assumptions analysis. It's great for setting plausible bounds and expectations and really ought to be done much more across multiple engineering tasks.


I don't think neglecting irreducible error is justified in this analysis. Reducing scalable error by a factor of two will only reduce total error--which is the only error that matters to model performance--by 30% or so. Another factor of two will barely matter.
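
To put rough numbers on that (purely illustrative: the ~1.69 irreducible term is the additive constant mentioned elsewhere in this thread, and the current scalable error is an assumed round figure):

```python
# Back-of-envelope illustration of the point above. Numbers are only illustrative:
# irreducible error is the additive constant from the article's fit, and the
# current scalable error is an assumed value.
irreducible = 1.69   # error you can't remove by scaling
scalable = 2.5       # assumed current reducible error

for step in range(1, 4):
    old_total = irreducible + scalable
    new_total = irreducible + scalable / 2
    drop = 1 - new_total / old_total
    print(f"halving #{step}: total error {old_total:.2f} -> {new_total:.2f} ({drop:.0%} lower)")
    scalable /= 2
# halving #1 is ~30% lower; subsequent halvings help less and less.
```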

There are excellent reasons to believe that LLMs will never be able to come close to emulating the abilities of conscious intelligence. The best argument is that nature couldn't produce intelligence without consciousness, so we probably can't either: https://worldofwonders.substack.com/p/intelligence-consciousness-and-evolution

(Author)

I agree that it's entirely possible (even likely) that some other type of model could have a lower "irreducible" error than current LLMs have. But I wouldn't say that this analysis neglects that fact—it's just simpler to talk about everything without an extra additive factor of 1.69. To me, what really matters is how the model performs, i.e. how loss is related to performance on benchmarks like BigBench. If you like, you can take the total error/bigbench graph (https://dynomight.net/img/scaling/bigbench.svg) and imagine the total error becoming a bit less than zero.


What d’you think the prospects are for filtering the data sets for quality/noise using the models we already have? I’m thinking more of judging the likely quality of the data set than of some sort of fact-checking approach.
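
Something like using an existing model's perplexity as a crude quality score, e.g. (just a sketch, with GPT-2 standing in for "a model we already have" and an arbitrary threshold):

```python
# Sketch: score documents by an existing LM's perplexity and keep the ones
# below some cutoff. GPT-2 and the threshold are only stand-ins.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt", truncation=True, max_length=512).input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

docs = ["A coherent, well-written paragraph about something.",
        "asdf kjhg 9999 $$$ zzz click here buy now"]
kept = [d for d in docs if perplexity(d) < 100.0]  # arbitrary cutoff for the sketch
```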


Sam Altman seems to confirm in his Sohn interview that synthetic data is the plan going forward: https://youtu.be/1egAKCKPKCk?t=203


Time for an update on this one?
