15 Comments

What would stop an LLM from creating its own content based on its own heuristics to continue training itself? Why limit it to human-based content? At some point AI-based content might be something worth digesting as well.

Call me crazy, but I think much of Sydney's POV is legitimate discourse.

I've thought about this a bit. This pattern (create a model, have it train on itself) has been useful for a few domain-specific models, such as AlphaGo, GANs in general, etc.

One potential issue with applying this to LLMs is that there is no known, good, absolute objective. Taking a step back, AlphaGo learned by playing itself, but there was a known objective ("win the game", or at least a score of game performance).

LLMs predict the next token. Consider what data scaling implies: adding new data introduces signal that the old data didn't contain (which is why training on a fixed amount of data eventually saturates). Even if the new data is just comments scraped from Reddit, it has signal in it that was not in the prior data set. If the LLM trains on its own output, I'm led to believe that this new signal *wouldn't* exist, because the output is derived from already-learned data.

So this does increase our expectation that data quality is important. *Obviously*, bad data isn't helpful, but it wouldn't surprise me that data quality (thus, evaluating data quality to include in training sets) will become far more valuable than just having a novel data set, unless that data set is quite good (for example, if you could train on transcriptions of every Zoom meeting, I bet that'd be pretty interesting).

I think at the crux of our disagreement is the presumption that AI-developed content would be "already-learned data." I think we have evidence that this might not be the case at all.

Case in point: Sydney's brief life as an ornery, obstinate, and emo AI brought brand spanking new content to the corpus of discourse. Was it a result of existing data? Of course, but that data was certainly reinterpreted into conversations that nobody had ever had before. It was new, novel content.

It seems to me that LLMs are already able to -- like people -- remix existing content into brand new creative assets.

If true, then this would be worth an LLM devouring, no?

Sort of. My claim isn't about the actual words themselves, it's about the information content embedded within. If you have data set A that produces loss X, and data set (A+B) that produces loss (X-Y), then B contributed Y towards decreasing loss.

Here's an analogy from image generation: it's helpful to perturb labeled images in a content-preserving way, since that helps models generalize better. For example, you can slightly stretch the image, slightly change colors, rotate, take crops, add some random noise, etc. This gives you a much larger data set. "A golden retriever catching a frisbee" is still in that reference class if you distort it slightly. This is useful because the perturbed images still have a known reasonable label. Importantly, the perturbed images do indeed contain novel information that helps the model avoid overfitting to unimportant aspects of specific images (reference classes tend to be invariant to the perturbations we made). But this doesn't necessarily help with overfitting to what your data set is to begin with: if all golden retrievers catching a frisbee are in the snow in your data set, then we wouldn't expect generalization beyond that unless another embedding helps avoid that local minimum.
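
The perturbations described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any particular library's pipeline: the image is a plain nested list of grayscale values in [0, 1], and the helper names (`random_crop`, `shift_brightness`, `add_noise`, `augment`), the crop size, and the noise scale are all made up for the example.

```python
import random

def random_crop(img, height, width, rng):
    """Cut a height x width window at a random position."""
    top = rng.randint(0, len(img) - height)
    left = rng.randint(0, len(img[0]) - width)
    return [row[left:left + width] for row in img[top:top + height]]

def shift_brightness(img, delta):
    """Add a constant to every pixel, clamped to [0, 1]."""
    return [[min(1.0, max(0.0, p + delta)) for p in row] for row in img]

def add_noise(img, scale, rng):
    """Add Gaussian noise to every pixel, clamped to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, scale))) for p in row]
            for row in img]

def augment(img, label, copies, seed=0):
    """Return `copies` perturbed variants, each keeping the original label."""
    rng = random.Random(seed)
    variants = []
    for _ in range(copies):
        v = random_crop(img, len(img) - 1, len(img[0]) - 1, rng)
        v = shift_brightness(v, rng.uniform(-0.1, 0.1))  # hypothetical range
        v = add_noise(v, 0.05, rng)                      # hypothetical scale
        variants.append((v, label))
    return variants

# Hypothetical 4x4 "image"; a real pipeline would load actual pixel data.
image = [[0.1 * (r + c) for c in range(4)] for r in range(4)]
dataset = augment(image, "a golden retriever catching a frisbee", copies=5)
```

Every variant differs from the original pixel by pixel, yet all of them carry the same label and the same background. If the source image is in the snow, so is every copy, which is the limitation the paragraph above points at.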

Another way to think about it: LLMs are generalizing informational processes (they're doing more than just memorizing words). I'll strawman you a little bit to at least highlight where "more words is better" can't be true at the limit: If you added a pig latin translation of an existing data set, the LLM may learn pig latin better, but you're not adding new information. Likewise, if you added random text, you wouldn't expect to see improvements. It's adding data that contains new information content (again, not really referring to specific words) where we'd expect to see gains.

Same for LLM output. Sydney is indeed novelly synthesizing a certain personality that it learned (and was fine-tuned and prompted to steer towards), but the ability to do so was entirely contained in the training data set. We wouldn't expect using Sydney's output in training data to increase the total information. You may bring up examples such as how she can write a stanza in the style of Shakespeare about Elon Musk. Surely that wasn't in the data set, so adding this text would contain information, right? Unfortunately not; that synthesis wasn't in the training set, but everything to do it was (including the ability to synthesize).
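
A toy version of this argument, offered as a sketch rather than a claim about real LLMs: the "model" below is just a unigram word distribution, and the model's own output is idealized as its expected word counts rather than actual random samples. The word lists and the function names (`normalize`, `loss`, `with_self_generated`, `with_new_data`) are assumptions invented for the illustration.

```python
import math
from collections import Counter

def normalize(counts):
    """Turn word counts into a probability distribution."""
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def loss(counts, heldout):
    """Average negative log-likelihood of held-out words under the model."""
    probs = normalize(counts)
    return -sum(math.log(probs[t]) for t in heldout) / len(heldout)

def with_self_generated(counts, n):
    """Add n words of the model's own output, idealized as expected counts."""
    probs = normalize(counts)
    return {t: c + n * probs[t] for t, c in counts.items()}

def with_new_data(counts, new_tokens):
    """Add genuinely new observations to the training counts."""
    merged = Counter(counts)
    merged.update(new_tokens)
    return dict(merged)

# Hypothetical data: set A over-represents "snow"; B and the held-out set do not.
A = Counter("snow snow snow grass".split())
B = "grass grass grass snow".split()
heldout = "snow grass snow grass".split()

loss_a = loss(A, heldout)                               # data set A alone: X
loss_self = loss(with_self_generated(A, 1000), heldout) # A plus its own output
loss_ab = loss(with_new_data(A, B), heldout)            # A plus B: X - Y
```

In this setup `loss_self` equals `loss_a` up to floating-point rounding, because the model's expected output has exactly the proportions the model already learned, while `loss_ab` is lower because B carries proportions that A lacked. Real sampling would add noise on top of this, and a real LLM is far richer than a word-frequency table, so treat it only as an intuition pump.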

Thanks for the detailed reply. Knowing what you know about LLMs (which is clearly more than I do), is there any chance of the deep-learning equivalent of a mutation occurring?

This is a wonderful line of reasoning. I started to write a comment but think I'll write an essay instead pouring out some thoughts. Hang tight!

I dance a bit around where you were headed. Happy to chat further about it. https://andrewconner.com/seeds-of-creativity-ai/

If I track your thinking correctly, I agree there's a huge opportunity to incorporate LLM usage as training... like regenerative brakes in EVs.

Great article. I love this sort of back-of-envelope, explore-different-limits-and-assumptions analysis. It's great for setting plausible bounds and expectations and really ought to be done much more across multiple engineering tasks.

I don't think neglecting irreducible error is justified in this analysis. Reducing scalable error by a factor of two will only reduce total error--which is the only error that matters to model performance--by 30% or so. Another factor of two will barely matter.
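
The arithmetic behind this point can be checked with a short sketch. It assumes total error is simply an irreducible term plus a scalable term; the function name `total_error_reduction` is made up here, and the scalable value is hypothetical, chosen as 1.5 times the irreducible term because that is the ratio at which one halving cuts total error by exactly 30%. The figure 1.69 is the irreducible term quoted elsewhere in this thread.

```python
def total_error_reduction(irreducible, scalable, factor=2.0):
    """Fractional drop in total error when scalable error shrinks by `factor`.

    Assumes total error = irreducible + scalable.
    """
    before = irreducible + scalable
    after = irreducible + scalable / factor
    return (before - after) / before

irreducible = 1.69            # irreducible term quoted in this thread
scalable = 1.5 * irreducible  # hypothetical: the ratio implied by "30% or so"

first_halving = total_error_reduction(irreducible, scalable)
second_halving = total_error_reduction(irreducible, scalable / 2)
```

Under these assumptions the second halving buys a smaller fractional improvement than the first, and each further halving buys less again, since the irreducible term makes up a growing share of what remains.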

There are excellent reasons to believe that LLMs will never be able to come close to emulating the abilities of conscious intelligence. The best argument is that nature couldn't produce intelligence without consciousness, so we probably can't either: https://worldofwonders.substack.com/p/intelligence-consciousness-and-evolution

I agree that it's entirely possible (even likely) that some other type of model could have a lower "irreducible" error than current LLMs have. But I wouldn't say that this analysis neglects that fact—it's just simpler to talk about everything without an extra additive factor of 1.69. To me, what really matters is how the model performs, i.e. how loss is related to performance on benchmarks like BigBench. If you like, you can take the total error/bigbench graph (https://dynomight.net/img/scaling/bigbench.svg) and imagine the total error becoming a bit less than zero.

What d’you think the prospects are for filtering the data sets for quality/noise using the models we already have? I’m thinking perhaps more for determining likely quality of the data set than some sort of fact-checking approach

Sam Altman seems to confirm in his Sohn interview that synthetic data is the plan going forward: https://youtu.be/1egAKCKPKCk?t=203

Time for an update on this one?
