The real data wall is billions of years of…

Oct 3, 2024

Careful with those human analogies

59 Comments

Have you considered the possibility that humans (or, really, biological intelligences in general) do not use algorithms to think? It seems to me that the weight of the evidence leans so strongly in that direction, and yet I've never seen anyone attempt to explicitly demonstrate why the opposite might be true--they just take the premise as given and run with it, as here.

Expand full comment

Strictly speaking, I guess I've *considered* the possibility that we have immortal souls that tug at our pineal glands or whatever, but I consider it very unlikely. If we assume that all human behavior can in principle be explained by physics then I feel like that would qualify as some generalized version of an "algorithm"?

Expand full comment

Just to clarify I'm not trying to make a point about pineal glands or whatever--how confident are you that all physics can be described by algorithms? How exact can those algorithms be? How much computation must those algorithms require, in relation to the physics that you're trying to describe? (for example, if I'm trying to model the behavior of one electron to a certain degree of confidence, how much physical resources and time might I need? I would assume the answer is much more than "one electron and a few picoseconds")

Expand full comment

I think it's very likely that simulating physics exactly to the tiniest detail would require huge resources. (It depends on how close your computer is to the system being simulated. The best simulation of an electron is another electron. This sounds stupid, but think about how quantum computers will speed up simulations of quantum physics)

So I suspect it might be very expensive to "exactly" simulate a human brain. But I guess the assumption I'm making is that the brain isn't *that* precise. Even with imperfect simulation, the same qualitative behaviors emerge. Or, at least, even with a more coarse simulation, another "algorithm" could be designed that would have qualitatively the same behaviors.

While I do believe that, I think you make a good point that this an assumption most people are making without stating!

Expand full comment

I think that's where we differ--my theory is that brains are so heavily optimized by evolutionary constraints, and so fundamentally built on different, "squishier" hardware, (a neuron is a massively complex biological machine, for example, and a lot of that complexity is lost by representing it as a set of weights and a threshold) that it seems to me that the burden of proof is on those claim that this process can be replicated in a digital algorithm, even in theory. So hey, thanks for hearing me out!

Expand full comment

To be clear, I'm not that confident, and I think your view is reasonable! I think it's still an open question what level of granularity you'd need to simulate a brain, and/or if you can achieve the same level of computation with other (cheaper) hardware. Many moons ago, I actually spent some time working on simulations of groups of neurons. There were some that used like 2 differential equations, and there were others that used tens of thousands. I wondered, how many do you need? The answer, as far as I can tell, was "nobody knows". (And if you need differential equations at all, that would be very expensive with our current hardware.)

Expand full comment

More inquiry in this specific direction might be valuable

Expand full comment

James McDermott

Self-plug related https://open.substack.com/pub/jmmcd/p/no-need-for-innateness-in-ai

Expand full comment

I think the main difference between human and AI learning is that humans and other lifeforms learn by doing. We kick, poke, climb trees, yell... Words are a secondary form of learning. Ai will be able to learn much more with less data when it can run its own experiments and gather experience that integrates sensing and manipulating capabilities.

Expand full comment

Intriguing post, as are most from Dynomight! But I'm surprised that this conversation is so tied to data and algorithms (with the exception of the commenter who says we are "etheric" beings).

Humans understand computers and we use that analogy too freely. Animal consciousness and intelligence is not computational. We don't have an "algorithm" for learning, it's something that happens due to the very nature of the brain, which is beyond our understanding at this point. Arguably something interesting is going on inside neural networks, but whether that has any relation to what brains do is speculation.

Brains do things that are far beyond any LLM, no matter how much training it gets, because the underlying mechanism is something completely different. As much as we've learned about the brain, we don't know where consciousness and subjective experience come from, and it's pretty hard to talk about "how we learn" without understanding those key elements. Evolution is surely part of the answer, since it gave rise to all the creatures on the planet, but evolution might be more of a guiding force. Somehow what arose out of the muck can turn electrical signals into thoughts within a subjective reality. The computational analogy is all we have, but it's a misdirection.

Expand full comment

Forrest Bennett

The same argument was made by Eric B. Baum in his 2003 book, "What is Thought?"

Expand full comment

David Saint-Germain

The genetic code is analogous to the AI model itself, not its training data. It is an extremely compressed and energy efficient model for general survival. DNA encodes complex molecular feedback loops, packing and unpacking genes and RNA to generate hundreds of thousands of 3D molecules (proteins, all RNA types, etc.) interacting to produce complex feedback systems. Reducing the information contained in this system to our four letter code representation completely ignores that the relative positions of these structures in 3D encode a shitload of information. The folding and expression by ribosomes and other structures, what you refer to as the compiler, definitely add orders of magnitude to those numbers.

-- David Saint-Germain

Expand full comment

Humans have an amazing huge associative memory with a sophisticated (though imperfect) search algorithm used by our innate learning methods. I can ask you what color hair the receptionist in your office has and you have a good chance at recalling it. That's missing from current AI efforts, isn't it?

Expand full comment

I actually love to ask random questions of people to obtain odd information. For example, if you ask a man in which grade he first remembers a girl have a nice behind, it is a good estimate that he reached sexual maturity that year. (I find that hair color is one many men will miss, but not an attractive female rear end).

Expand full comment

Here's my guess at the human learning strategy:

Observe other humans/agents behaviors

Mimic other humans/agents behaviors and observe effects

Remember observations and effects of behaviors

Reuse behaviors when the environment matches

Experiment/play when the stakes are low

There's more steps than that, but those are the big pieces. The evolved complexity you talk about in the post above is equivalent to the complexity that is embedded in the LLMs. Are we sure what DNA does? Are we sure what a specific weight in linear algebra does? There's probably value in the answer, but I think we will still be able to make progress in developing more intelligent agents without knowing it. The question will transition from "do we have enough high quality data" to "do we have simulated environments with sufficient resolution". I think that as long as you can create environments that can reward the above learning strategies, and sufficient compute to train the agents that would be trained within them, you will see significant growth in automated intelligence.

Expand full comment

I always had this same thought. There should be some kind of core mimicry in a variety of environments, then the training and vector weights. I always attributed my misunderstanding on ignorance (usually the case), but your comment makes me feel warm and fuzzy inside to think that I was not a moron perhaps in yet another thought.

Expand full comment

Great post! I especially appreciate the insight that "The human learning strategy might be vast and inscrutable". Just as the result of training an LLM on trillions of tokens yields an inscrutable matrix, it seems very easy to believe that millions of years of nervous system evolution would yield an inscrutable learning algorithm. In fact it would be astonishing if it didn't.

Nit: for an apples-to-apples comparison, when you discount language training data as only conveying 2.3 bits/word, I think it would be appropriate to also discount the number of bits per base pair of DNA (according to some notion of the incompressible information content of DNA).

Not a nit, and amplifying your central point: in your first paragraph, you note that transmitting the idea of large data scale back to 2005 would have a big impact – even though that idea can be conveyed in, say, a few hundred bits. Clearly some bits are more important than others. By this logic, the information encoded in our DNA (and epigenetics and physical structures) could in general have a very high impact per Shannon bit.

Expand full comment

I totally agree that the correct apples-to-apples comparison should be to the actual amount of "information" in DNA. While writing the above post, I actually went down a huge rabbit hole trying to estimate that. Eventually I decided it was too complicated and I needed to make it a post of its own. (Compression, algorithms, non-coding RNA genes, introns) But my tentative conclusion is that the actual "information" in DNA is *fairly* close to the capacity, but probably more like 1.8 bits per base pair rather than 2. Though it's hard to be sure since people still don't know exactly how much of DNA is actually functional, and it can get a little philosophical. ("long terminal repeats" are DNA that makes proteins that seem to have no other function except to go insert another copy of the DNA elsewhere. Should those count as information?)

Expand full comment

Wow! I would not have guessed that the genome would be so incompressible. That's very interesting.

Expand full comment

Just compression is ALSO a sort of philosophical question! Some algorithms claim a compression ratio of 100x. But those (sort of) just store a difference to a reference genome. That works very well because something like 98.5% of DNA is shared between people. But I think it's "cheating" to not count that reference genome, if that makes sense. Without the reference, nobody can do much better than than reducing size by around 10%.

Expand full comment

things I noted while reading:

* it was 10^13 in 2022 and it's 10^14 words soon

* ⁠great section on humans with modalities missing; never read an accounting for touch before

* ⁠using lm perplexity or compression capacity as a metric for how many true bits/word there are is probably not correct in some way. at bare minimum I would add the size of the model into the compression ratio's denominator

* modern text datasets have good filtering. https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1 is a good starting point to learn from && you can scroll through the data; much less advertising or other trash than you've naively expect.

* ⁠the evolution-as-huge-data-regime angle is probably correct, but I would put the millions of years as data (that just happens to only get saved through a very tiny and inefficient bottleneck) and the algorithm as incredibly bad compared to the linalg tricks

* ⁠deep learning is unfortunately not going to hit a data wall or algorithm wall within the next 3 years (/ 100x flops relative to gpt4). we ran out of human text data but openai is using majority synthetic data on their current massive run. good luck with the near future

Expand full comment

Can they not use video and audio to learn? I would imagine that must have some value (lectures, movies, television,...). That is a decent volume of data. I am surprised that they were able to get through all text in all human languages that fast. I guess those H100's are pretty good.

Expand full comment

Probably the biggest misconception in the naive model of data scaling is in how much non-text data, and also poor-quality text data, really helps with reasoning.

Audio is an especially instructive example. To a close approximation you'll get as "good" (i.e. horrible since most audio is semantic trash) performance on MMLU out of training on just text transcripts versus burning however many times compute to train on some representation of the actual audio itself (plus some shared text data so it's comparable).

(and if you *do* use pure transcripts, you won't get more than a few trillion extra useful tokens)

The inclusion of other modalities (and other language data) in current models serves the purpose of allowing the model to be capable in those domains && in cross-domain tasks, no more no less.

Expand full comment

James McDermott

So audio is either semantic trash, or is an inefficient representation of text. I actually agree with this for early models. But with good models, when we're talking about already having consumed most of the text there is, I think it changes. Raw audio becomes more usable, because the model becomes smart enough to interpret and learn from audio that was previously semantic trash.

Expand full comment

you spend the same amount of compute regardless of what the model discerns so it doesn't matter. the optimal gradient update from a fixed audio token budget is substantially worse than that of text

Expand full comment

James McDermott

About the compute, agreed of course.

But I'm less sure about the best possible weight update. Audio allows inference about physical properties of the world, in a way text does not.

Expand full comment

substack app swallowed my long reply :/

shortened: yes ceteris paribus worldmodelling +info. but 1) bringing encoded audio token bitrate anywhere close to BPE token bitrate is inherently hard and 2) even equicompressed on semantics, human audio is has dumber content on avg

I think the perspective gap comes from thinking about what audio tokens are. Your case is simply true in the world where there are magic audio tokens that happen to contain fineweb-like semantic density + extra audio-derived world info. No lab has the tech and/or data for that.

Expand full comment

Continue thread →

Oct 4, 2024Edited

BTW thanks for the reference to fineweb. Very helpful to actually be able to look at random bits of data. I agree it's not nearly as bad as you'd naively expect from just random web. There's a decent amount of low-quality stuff, but a significant minority seems quite intelligent. And I guess LLMs can essentially learn that "intelligent text is followed by more intelligent text" so the spam probably doesn't do much damage? (And presumably the companies with billions of VC funding have even better filtering...)

Expand full comment

I was basing the number of words on llama3.1. But checking it now, I think I screwed it up. Llama3.1 was using 15 trillion tokens which should be around 10^13 words, not 10^12 as I claimed. I think this is an important enough error that I should fix it. Do you have a citation for 10^14? I'm sure things are heading there soon but for most of the frontier models the numbers seems increasingly secret.

(BTW if I had to guess, I'd also think that we're not going to hit a data wall, and I'd be quite surprised if it happened in the next 2-3 years. I tend to think of synthetic data more as "algorithmic progress" than "more data" but that's largely just semantics.)

Expand full comment

OK, I (finally) updated the post to 10^13 words.

Expand full comment

I can't help but wonder if there is a way to project the rules in the DNA into a thought skeleton for AI training. Together with the "embodied information" they could help to reduce the possibility state thus improve the quality of learning strategy.

Expand full comment

Adding to your point, evolution isn't "survival of the smartest/strongest", it's the survival of the organisms that are most adaptable to a changing environment. The reason we're the dominant organism isn't just that we have big brains, it's that those brains make us extremely adaptable and let us squeeze all the utility we can out of anything. If we watch someone start a fire, we learn how to start a fire.

Also, I would add an intermediate loop - society "evolves" exponentially and even faster than evolution, which lets us gain from things we don't directly experience. Compare the way information is retained across species:

An ant colony: ants can go outside and leave temporary chemicals.

An ape troop: if one ape learns something, it can inform the others, but information will be lost if it's not important enough to remember.

A human: I can read things written centuries ago. (Graffiti written in Pompeii nearly 2000 years ago tells us that Gaius and Aulus were friends.) If I don't understand something I read, I can read something else, and eventually I'll understand it.

In other words, we can retain information over time, and evolution has programmed us to squeeze all the juice we can out of the information we gather. In contrast, LLMs learn very inefficiently. They struggled with math at first because having verbal descriptions of algorithms doesn't mean it can use those algorithms. (They have gotten better, but I don't know whether that's because they know to call a separate tool to crunch the numbers or not.)

Expand full comment

I thought about including cultural evolution here, but I wasn't sure what to make of it. I basically buy the arguments that this is core to our species' success. But It seems like language data would give them access to much of what we get from culture? So provided that their programming made them equally capable of taking advantage of it as (genetic) evolution made us, I don't see why we'd have much advantage.

Expand full comment

Sorry, I was a bit unclear there. I disagree with your last sentence - I believe that their programming uses the information they gather very inefficiently compared to humans.

I would say evolution has wired us to learn as if we're going to use that information to our advantage, which makes us engage with it more deeply. When a monkey knows how to make a fire, a slight improvement in knowing good/bad situations to apply that information makes it much more likely to survive - it'll have more benefits (protection from predators, eating cooked food) with lower risk (less wildfires).

For a more modern example: we learn Calculus I in ~10^6 words. 500 pgs of a textbook is ~400000 words, 225 minutes of a lecture weekly for 15 weeks is ~540000 words, supplemental materials increase that.

In contrast, I'm sure GPT-3 got fed well over 5 * 10^8 words about math alone and it was still bad at it. It's still unclear whether math skill is a genuine emergent property of LLMs - if so, it only emerged after "reading" _far_ more words about calculus than all the words I'll ever see/hear combined. (It could just be that it calls an external tool anyway.)

LLMs can "learn" the words it reads in the context windows, but it doesn't understand the concepts that those words convey. That happened in my math, CS, and biology classes all the time - I'd learn one thing that put something in a previous class into context in a deeper way.

Expand full comment

Michael Dickens

> I’ve been feeling grumpy about algorithms deciding what we read, so I’ve decided to experiment with links to writing that I think deserves more attention.

I, for one, think this experiment is a good idea. :)

FWIW I did not find your cartoon confusing.

I think it's interesting how neural networks train on vastly more data than humans and (probably) use much simpler strategies, but they still look more human-like than other kinds of computer systems. Neural networks, like humans, occasionally get things wrong that they really ought to get right. On tasks like video games, traditional computer programs can win by being way faster than humans, whereas neural networks win by playing like humans but better.

Expand full comment

for what its worth I think the chart makes a lot of sense. the x-axis is a log scale, and the y-axis probably is too

Expand full comment

I guess technically it would be better linear but with both of the differences greatly exaggerated. But probably best to think of them as being "vibes" scales.

Expand full comment

I love that.

Expand full comment

This is very insightful, IMO. It is kinda sorta along the lines of my thoughts on consciousness. We ignore evolution at our peril. https://www.mattball.org/2022/09/robots-wont-be-conscious.html

Expand full comment

"Even though it comes from evolution, humans are still using some learning algorithm. Just by definition, isn’t it possible to build an AI using the same tricks?"

No Sir, not in my humble opinion. Difference will always be we are etheric (spirit) first, poured in to the physical second. Computers of whatever form are physical first, they will never *have* spirit - unless spirit is defined as electromagnetism ... I guess it probably could be... but AI does't spontaneously form from biological sex via love so they can't possibly 'evolve' ancestrally.

I do love the idea of Joi in Bladerunner 2049, where she - the AI - stores preferences over time and she's so sweet and cool, but... it's a program. There is no authentic, genuine Will to pass along. No subtle EM sheathes and ancestral memory over millennia which define the human state.

The illusion of Joi is just a stick that breaks if it gets stepped on. But it had a form of wisdom... and her sophisticated AI predecessor Rachel gave birth, which is Hollywood using metaphors to spark complex metaphysic ideas...

Expand full comment

Either your "etheric stuff" has a causal influence on the physical world and we could measure it (though no one has ever succeeded), or it has no causal influence (in which case it is just empty words, and AI could, in theory, become indistinguishable from humans)

Expand full comment

That nobody has ever succeeded in measuring etheric causal stuff nor measured it sure makes my face red when you think that we're communicating through the ether via physical mechanisms, using electricity that is measured and paid for.

Expand full comment

So what you call "spirit" is simply electricity. Then, AI already has "spirit"!

By the way, we are not communicating through the ether since the theory of ether was rejected by the experiments more than one century ago.

Expand full comment

p.s. The modern term for ether is plasma.

https://www.youtube.com/watch?v=Q3LKbfRwnr4

Over and out.

Expand full comment

I'm sorry to tell you that the video you linked is pseudoscience. The nineteenth century theory of ether has nothing to do with plasma at all. And even if you want to use the word "ether" for "plasma", our electricity doesn't actually travel on plasma, but on copper wires.

Expand full comment

You are a real buzzkill. What next, cocaine is bad for you?

I focus on astrologic science myself. You know why? So I never, ever need to take classes like Riemannian Geometry ever again.

Expand full comment

Yes.

And we all know how intelligent man has become in the last century.

Expand full comment

Whether humans have become more or less intelligent during the last century has nothing to do with our topic (which was whether AI could use the same learning algorithm than humans).

Expand full comment

#nojs-banner { position: fixed; bottom: 0; left: 0; padding: 16px 16px 16px 32px; width: 100%; box-sizing: border-box; background: red; color: white; font-family: -apple-system, "Segoe UI", Roboto, Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 13px; line-height: 13px; } #nojs-banner a { color: inherit; text-decoration: underline; } This site requires JavaScript to run correctly. Please turn on JavaScript or unblock scripts