* the evolution-as-huge-data-regime angle is probably correct, but I would put the millions of years as data (that just happens to only get saved through a very tiny and inefficient bottleneck) and the algorithm as incredibly bad compared to the linalg tricks
* deep learning is unfortunately not going to hit a data wall or algorithm wall within the next 3 years (/ 100x flops relative to gpt4). we ran out of human text data but openai is using majority synthetic data on their current massive run. good luck with the near future
I was basing the number of words on llama3.1. But checking it now, I think I screwed it up. Llama3.1 was using 15 trillion tokens which should be around 10^13 words, not 10^12 as I claimed. I think this is an important enough error that I should fix it. Do you have a citation for 10^14? I'm sure things are heading there soon but for most of the frontier models the numbers seem increasingly secret.
(BTW if I had to guess, I'd also think that we're not going to hit a data wall, and I'd be quite surprised if it happened in the next 2-3 years. I tend to think of synthetic data more as "algorithmic progress" than "more data" but that's largely just semantics.)
BTW thanks for the reference to fineweb. Very helpful to actually be able to look at random bits of data. I agree it's not nearly as bad as you'd naively expect from just random web. There's a decent amount of low-quality stuff, but a significant minority seems quite intelligent. And I guess LLMs can essentially learn that "intelligent text is followed by more intelligent text" so the spam probably doesn't do much damage? (And presumably the companies with billions of VC funding have even better filtering...)
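The token-to-word correction above (15 trillion tokens is roughly 10^13 words) can be sanity-checked with a few lines of Python. This is only a rough sketch: the 0.75 words-per-token ratio is an assumed rule of thumb, not a figure from this thread, and the real ratio depends on the tokenizer and the text.

```python
import math

# Assumption: ~0.75 English words per token (a common rule of thumb;
# the true ratio varies with the tokenizer and the text).
WORDS_PER_TOKEN = 0.75


def tokens_to_words(n_tokens: float, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert a token count into an approximate word count."""
    return n_tokens * words_per_token


def order_of_magnitude(x: float) -> int:
    """Nearest power of ten, e.g. 1.1e13 -> 13."""
    return round(math.log10(x))


llama_tokens = 15e12  # 15 trillion tokens, as quoted above
llama_words = tokens_to_words(llama_tokens)
print(f"about 10^{order_of_magnitude(llama_words)} words")
```

Under that assumption the estimate lands at 10^13 rather than 10^12, which matches the correction.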
Can they not use video and audio to learn? I would imagine that must have some value (lectures, movies, television,...). That is a decent volume of data. I am surprised that they were able to get through all text in all human languages that fast. I guess those H100s are pretty good.
Probably the biggest misconception in the naive model of data scaling is in how much non-text data, and also poor-quality text data, really helps with reasoning.
Audio is an especially instructive example. To a close approximation, you'll get as "good" (i.e. horrible, since most audio is semantic trash) performance on MMLU from training on just text transcripts as from burning however many times more compute to train on some representation of the actual audio itself (plus some shared text data so it's comparable).
(and if you *do* use pure transcripts, you won't get more than a few trillion extra useful tokens)
The inclusion of other modalities (and other language data) in current models serves the purpose of allowing the model to be capable in those domains && in cross-domain tasks, no more no less.
So audio is either semantic trash, or is an inefficient representation of text. I actually agree with this for early models. But with good models, when we're talking about already having consumed most of the text there is, I think it changes. Raw audio becomes more usable, because the model becomes smart enough to interpret and learn from audio that was previously semantic trash.
you spend the same amount of compute regardless of what the model discerns so it doesn't matter. the optimal gradient update from a fixed audio token budget is substantially worse than that of text
shortened: yes ceteris paribus worldmodelling +info. but 1) bringing encoded audio token bitrate anywhere close to BPE token bitrate is inherently hard and 2) even equicompressed on semantics, human audio has dumber content on avg
I think the perspective gap comes from thinking about what audio tokens are. Your case is simply true in the world where there are magic audio tokens that happen to contain fineweb-like semantic density + extra audio-derived world info. No lab has the tech and/or data for that.
Great post! I especially appreciate the insight that "The human learning strategy might be vast and inscrutable". Just as the result of training an LLM on trillions of tokens yields an inscrutable matrix, it seems very easy to believe that millions of years of nervous system evolution would yield an inscrutable learning algorithm. In fact it would be astonishing if it didn't.
Nit: for an apples-to-apples comparison, when you discount language training data as only conveying 2.3 bits/word, I think it would be appropriate to also discount the number of bits per base pair of DNA (according to some notion of the incompressible information content of DNA).
Not a nit, and amplifying your central point: in your first paragraph, you note that transmitting the idea of large data scale back to 2005 would have a big impact – even though that idea can be conveyed in, say, a few hundred bits. Clearly some bits are more important than others. By this logic, the information encoded in our DNA (and epigenetics and physical structures) could in general have a very high impact per Shannon bit.
I totally agree that the correct apples-to-apples comparison should be to the actual amount of "information" in DNA. While writing the above post, I actually went down a huge rabbit hole trying to estimate that. Eventually I decided it was too complicated and I needed to make it a post of its own. (Compression, algorithms, non-coding RNA genes, introns) But my tentative conclusion is that the actual "information" in DNA is *fairly* close to the capacity, but probably more like 1.8 bits per base pair rather than 2. Though it's hard to be sure since people still don't know exactly how much of DNA is actually functional, and it can get a little philosophical. ("long terminal repeats" are DNA that makes proteins that seem to have no other function except to go insert another copy of the DNA elsewhere. Should those count as information?)
Just compression is ALSO a sort of philosophical question! Some algorithms claim a compression ratio of 100x. But those (sort of) just store a difference to a reference genome. That works very well because something like 98.5% of DNA is shared between people. But I think it's "cheating" to not count that reference genome, if that makes sense. Without the reference, nobody can do much better than reducing size by around 10%.
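To put rough numbers on the estimates above, here is a small Python sketch. The genome length of about 3.1 billion base pairs (one haploid copy) is an assumption added for illustration; the 2 bits, 1.8 bits, and 10% figures come from the comment above.

```python
# Assumption: ~3.1 billion base pairs in one (haploid) copy of the human genome.
BASE_PAIRS = 3.1e9


def genome_megabytes(bits_per_base_pair: float, base_pairs: float = BASE_PAIRS) -> float:
    """Size of the genome in megabytes (10^6 bytes) at a given information density."""
    return bits_per_base_pair * base_pairs / 8 / 1e6


raw_mb = genome_megabytes(2.0)          # 4 letters -> log2(4) = 2 bits per base pair
estimate_mb = genome_megabytes(1.8)     # the tentative 1.8 bits per base pair above
no_reference_mb = raw_mb * (1 - 0.10)   # ~10% reduction without a reference genome

print(f"raw capacity:       {raw_mb:.1f} MB")
print(f"at 1.8 bits/bp:     {estimate_mb:.1f} MB")
print(f"10% smaller than 2: {no_reference_mb:.1f} MB")
```

Note that 1.8 bits per base pair is exactly a 10% reduction from 2 bits, so the tentative estimate and the "around 10%" reference-free compression figure describe the same size.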
1. Mimic other humans'/agents' behaviors and observe effects
2. Remember observations and effects of behaviors
3. Reuse behaviors when the environment matches
4. Experiment/play when the stakes are low
There are more steps than that, but those are the big pieces. The evolved complexity you talk about in the post above is equivalent to the complexity that is embedded in the LLMs. Are we sure what DNA does? Are we sure what a specific weight in linear algebra does? There's probably value in the answer, but I think we will still be able to make progress in developing more intelligent agents without knowing it. The question will transition from "do we have enough high quality data" to "do we have simulated environments with sufficient resolution". I think that as long as you can create environments that reward the above learning strategies, and have sufficient compute to train agents within them, you will see significant growth in automated intelligence.
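The four pieces listed above can be sketched as a toy agent in Python. Everything here is hypothetical and only for illustration: the `Agent` class, its method names, the `stakes` number, and the `low_stakes` threshold are all made up, and a real system would need far more than a lookup table.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Toy learner: mimic, remember, reuse, and experiment when stakes are low."""

    # memory maps a situation to {behavior: best effect (reward) seen so far}
    memory: dict = field(default_factory=dict)
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def remember(self, situation, behavior, effect):
        """Store the observed effect of a behavior in a situation."""
        effects = self.memory.setdefault(situation, {})
        effects[behavior] = max(effect, effects.get(behavior, float("-inf")))

    def mimic(self, situation, demonstrator_behavior, observed_effect):
        """Learn from watching another agent; same bookkeeping as own experience."""
        self.remember(situation, demonstrator_behavior, observed_effect)

    def act(self, situation, options, stakes, low_stakes=0.2):
        """Reuse the best known behavior, or experiment if stakes are low or nothing is known."""
        known = self.memory.get(situation, {})
        if not known or stakes < low_stakes:
            return self.rng.choice(options)  # experiment / play
        return max(known, key=known.get)     # reuse what worked before


agent = Agent()
agent.mimic("cold night", "start fire", observed_effect=1.0)
agent.remember("cold night", "do nothing", effect=-1.0)
choice = agent.act("cold night", ["start fire", "do nothing"], stakes=0.9)
print(choice)
```

With high stakes the agent reuses the behavior with the best remembered effect; with low stakes it picks at random, which stands in for play.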
I always had this same thought. There should be some kind of core mimicry in a variety of environments, then the training and vector weights. I always attributed my misunderstanding to ignorance (usually the case), but your comment makes me feel warm and fuzzy inside to think that I was not a moron perhaps in yet another thought.
Humans have an amazing huge associative memory with a sophisticated (though imperfect) search algorithm used by our innate learning methods. I can ask you what color hair the receptionist in your office has and you have a good chance at recalling it. That's missing from current AI efforts, isn't it?
I actually love to ask random questions of people to obtain odd information. For example, if you ask a man in which grade he first remembers a girl having a nice behind, it is a good estimate that he reached sexual maturity that year. (I find that hair color is one many men will miss, but not an attractive female rear end.)
The genetic code is analogous to the AI model itself, not its training data. It is an extremely compressed and energy-efficient model for general survival. DNA encodes complex molecular feedback loops, packing and unpacking genes and RNA to generate hundreds of thousands of 3D molecules (proteins, all RNA types, etc.) interacting to produce complex feedback systems. Reducing the information contained in this system to our four-letter code representation completely ignores that the relative positions of these structures in 3D encode a shitload of information. The folding and expression by ribosomes and other structures, what you refer to as the compiler, definitely add orders of magnitude to those numbers.
I expect that AI will open the door to philosophical conversations considered vital for most of recorded history, but widely considered “meaningless” for the last century or so. Such as: what does “good” mean?
Obviously there’s some relationship to evolution here, because before we learned the scientific method, the idea of “good” was tied to the idea of vitality. People thought there was a right way for an individual to live and a society to operate. They used a predictive model that said, when persons or societies deviated from that right way of living, Bad Things happened. This looks to me like a conceptual model of long-term evolutionary fitness.
I’m looking forward to seeing people try to train models with explicit value systems, using those same value systems to do things like selectively paying attention to some data and not others. This is something human beings _can_ do, and I think that’s essential for our performance.
Yeah, I think technology is on the way to turning what seemed like idle philosophical questions into real practical things that we face. (The scenarios in Reasons and Persons seem more relevant every day!) What I wonder is how much philosophy will end up shaping our cultural reaction. Will we really "think" about things, or will we all just sort of collectively decide? For example, I suspect that philosophy of consciousness probably won't ultimately have much influence on whether people decide AIs are conscious or not. People will just interact with the AIs and decide what they decide.
Thanks for writing this. It seems obvious to me that evolution accounts for 95+% of human learning and 100% of human intelligence. A better way to think of it is how many bits are needed to store the algorithm for learning and/or intelligence, and I think the answer is that it's probably pretty low. The problem is finding it, which clearly takes enormous optimization pressure and large amounts of data and computation, which is what evolution has been doing for the past billion years.
Another way to phrase the same thing is to consider the size of current models. Clearly most of those parameter values are spent storing very impressive amounts of encyclopedic knowledge that no human comes close to matching. I'm confident the parameters of a future very intelligent model without as much world knowledge can fit on a thumb drive, but actually getting to that specific set of descriptive bits will require a fire hose of data and computation, just as evolution has needed.
Thanks, that's a great summary. I guess I largely agree, although I'm less confident. I certainly agree that evolution has had access to an insane amount of "data" and "compute" that will be very hard to copy. But also... maybe not? Maybe evolution hasn't actually been optimizing for intelligence all that hard and the strategy that brings humans intelligence isn't that complicated? Hard to be sure!
This is a really helpful breakdown of how evolution is a kind of learning — abstract in all the good ways, and without having to rely on words like "ontogenetic"! (My only recommendation for the graph, actually, is to switch where the labels go — I first misread "one short and fleeting lifetime", for example, as being about the y-axis.) D'you have any recommendations for what someone might read who wants to get an even fuller understanding of this?
Can you help me understand what you're thinking with the figure? Something like this: https://ibb.co/xmB6dY1 ?
As for more in-depth stuff, well... Hmmm. There are many different aspects. For biological anchors, maybe this? https://www.cold-takes.com/forecasting-transformative-ai-the-biological-anchors-method-in-a-nutshell/ (as well as Ajeya Cotra's report). For understanding AI scaling, I think this isn't terrible: https://dynomight.net/scaling/
I'll take a look at both of those; thanks! For the figure, I drew up a version that makes more intuitive sense to me (warning, I suck at reading charts, and never succeeded at understanding a single one in that second-to-last page of The Economist): https://tinyurl.com/ykp2rj5u
Some changes:
1. I parallelized the titles of the axes: "amount of data" became "quantity of data"; "quality of learning method" became "quality of learning".
2. I made the colors correspond to the axes, rather than to the areas. (This lets you avoid arrows entirely. If that renders the areas too confusing, you could give each a different monochrome pattern; I can share ideas if that's confusing.)
3. I moved "area = intelligence" to the bottom, where I think it benefits from being a sort of stinging punchline. (It could also be moved to the top.)
If this is useful, use it! I really like your writing.
Thanks! I'll ask some other people what they think. I've been staring at this figure so long I have no idea what's clearest anymore.
I wouldn't rule out proprioception being central to intelligence too hastily
You could be right, but there's one thing I think I forgot to explain. In the paper I mentioned with "KS" (https://doi.org/10.1007/s00221-021-06037-4) apparently "KS" has had since birth both zero touch and zero proprioception. So if that's right, it's at least possible to be reasonably intelligent without it.
Not that this is totally conclusive. It's only one person, mentioned in one paper.
If other modalities are key, my suspicion is that it's important that we get them with interaction. Proprioception surely helps us learn how to use our muscles. KS seems to have been able to do some approximation of that using vision, but still. There's no evidence that human intelligence is possible without being able to "experiment", and the fact that so many animals have such strong "play" instincts seems important.
"Even though it comes from evolution, humans are still using some learning algorithm. Just by definition, isn’t it possible to build an AI using the same tricks?"
No Sir, not in my humble opinion. Difference will always be we are etheric (spirit) first, poured into the physical second. Computers of whatever form are physical first, they will never *have* spirit - unless spirit is defined as electromagnetism ... I guess it probably could be... but AI doesn't spontaneously form from biological sex via love so they can't possibly 'evolve' ancestrally.
I do love the idea of Joi in Blade Runner 2049, where she - the AI - stores preferences over time and she's so sweet and cool, but... it's a program. There is no authentic, genuine Will to pass along. No subtle EM sheaths and ancestral memory over millennia which define the human state.
The illusion of Joi is just a stick that breaks if it gets stepped on. But it had a form of wisdom... and her sophisticated AI predecessor Rachael gave birth, which is Hollywood using metaphors to spark complex metaphysical ideas...
Either your "etheric stuff" has a causal influence on the physical world and we could measure it (though no one has ever succeeded), or it has no causal influence (in which case it is just empty words, and AI could, in theory, become indistinguishable from humans)
That nobody has ever succeeded in measuring etheric causal stuff sure makes my face red when you think that we're communicating through the ether via physical mechanisms, using electricity that is measured and paid for.
So what you call "spirit" is simply electricity. Then, AI already has "spirit"!
By the way, we are not communicating through the ether, since the theory of ether was rejected by experiments more than a century ago.
Yes.
And we all know how intelligent man has become in the last century.
Whether humans have become more or less intelligent during the last century has nothing to do with our topic (which was whether AI could use the same learning algorithm as humans).
p.s. The modern term for ether is plasma.
https://www.youtube.com/watch?v=Q3LKbfRwnr4
Over and out.
I'm sorry to tell you that the video you linked is pseudoscience. The nineteenth-century theory of ether has nothing to do with plasma at all. And even if you want to use the word "ether" for "plasma", our electricity doesn't actually travel through plasma, but through copper wires.
You are a real buzzkill. What next, cocaine is bad for you?
I focus on astrologic science myself. You know why? So I never, ever need to take classes like Riemannian Geometry ever again.
This is very insightful, IMO. It is kinda sorta along the lines of my thoughts on consciousness. We ignore evolution at our peril. https://www.mattball.org/2022/09/robots-wont-be-conscious.html
for what it's worth I think the chart makes a lot of sense. the x-axis is a log scale, and the y-axis probably is too
I guess technically it would be better linear but with both of the differences greatly exaggerated. But probably best to think of them as being "vibes" scales.
I love that.
> I’ve been feeling grumpy about algorithms deciding what we read, so I’ve decided to experiment with links to writing that I think deserves more attention.
I, for one, think this experiment is a good idea. :)
FWIW I did not find your cartoon confusing.
I think it's interesting how neural networks train on vastly more data than humans and (probably) use much simpler strategies, but they still look more human-like than other kinds of computer systems. Neural networks, like humans, occasionally get things wrong that they really ought to get right. On tasks like video games, traditional computer programs can win by being way faster than humans, whereas neural networks win by playing like humans but better.
Adding to your point, evolution isn't "survival of the smartest/strongest", it's the survival of the organisms that are most adaptable to a changing environment. The reason we're the dominant organism isn't just that we have big brains, it's that those brains make us extremely adaptable and let us squeeze all the utility we can out of anything. If we watch someone start a fire, we learn how to start a fire.
Also, I would add an intermediate loop - society "evolves" exponentially and even faster than evolution, which lets us gain from things we don't directly experience. Compare the way information is retained across species:
An ant colony: ants can go outside and leave temporary chemicals.
An ape troop: if one ape learns something, it can inform the others, but information will be lost if it's not important enough to remember.
A human: I can read things written centuries ago. (Graffiti written in Pompeii nearly 2000 years ago tells us that Gaius and Aulus were friends.) If I don't understand something I read, I can read something else, and eventually I'll understand it.
In other words, we can retain information over time, and evolution has programmed us to squeeze all the juice we can out of the information we gather. In contrast, LLMs learn very inefficiently. They struggled with math at first because having verbal descriptions of algorithms doesn't mean they can use those algorithms. (They have gotten better, but I don't know whether that's because they know to call a separate tool to crunch the numbers or not.)
I thought about including cultural evolution here, but I wasn't sure what to make of it. I basically buy the arguments that this is core to our species' success. But it seems like language data would give LLMs access to much of what we get from culture? So provided that their programming made them equally capable of taking advantage of it as (genetic) evolution made us, I don't see why we'd have much advantage.
Sorry, I was a bit unclear there. I disagree with your last sentence - I believe that their programming uses the information they gather very inefficiently compared to humans.
I would say evolution has wired us to learn as if we're going to use that information to our advantage, which makes us engage with it more deeply. When a monkey knows how to make a fire, a slight improvement in knowing good/bad situations to apply that information makes it much more likely to survive - it'll have more benefits (protection from predators, eating cooked food) with lower risk (fewer wildfires).
For a more modern example: we learn Calculus I in ~10^6 words. 500 pages of a textbook is ~400,000 words, 225 minutes of lecture weekly for 15 weeks is ~540,000 words, and supplemental materials increase that.
In contrast, I'm sure GPT-3 got fed well over 5 * 10^8 words about math alone and it was still bad at it. It's still unclear whether math skill is a genuine emergent property of LLMs - if so, it only emerged after "reading" _far_ more words about calculus than all the words I'll ever see/hear combined. (It could just be that it calls an external tool anyway.)
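As a rough sanity check on those numbers, here is a minimal sketch of the arithmetic. The per-page and per-minute rates (800 words/page, 160 words/minute) are assumptions back-derived from the totals quoted above, not measured values, and the GPT-3 figure is just the lower-bound guess from this comment.

```python
# Rough word budget for a human learning Calculus I.
# ASSUMPTION: 800 words/page and 160 spoken words/minute are back-derived
# from the ~400,000 and ~540,000 totals in the comment, not measured.

def calculus_word_budget(pages=500, words_per_page=800,
                         minutes_per_week=225, weeks=15,
                         words_per_minute=160):
    """Return (textbook_words, lecture_words, total_words)."""
    textbook = pages * words_per_page
    lectures = minutes_per_week * weeks * words_per_minute
    return textbook, lectures, textbook + lectures


textbook_words, lecture_words, human_words = calculus_word_budget()

# ASSUMPTION: the commenter's guessed lower bound for math text seen by GPT-3.
GPT3_MATH_WORDS = 5 * 10**8

# How many times more math text the model saw than the student did.
ratio = GPT3_MATH_WORDS / human_words

print(f"human: {human_words:,} words; model/human ratio: {ratio:.0f}x")
```

Under these assumptions the human total comes to 940,000 words before supplemental materials, so the guessed model exposure is several hundred times larger.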
An LLM can "learn" the words it reads in its context window, but it doesn't understand the concepts that those words convey. For me, that kind of understanding came up in my math, CS, and biology classes all the time - I'd learn one thing that put something from a previous class into context in a deeper way.
I can't help but wonder if there is a way to project the rules in the DNA into a thought skeleton for AI training. Together with the "embodied information" they could help to reduce the possibility space and thus improve the quality of the learning strategy.
things I noted while reading:
* it was 10^13 in 2022 and it's 10^14 words soon
* great section on humans with modalities missing; never read an accounting for touch before
* using lm perplexity or compression capacity as a metric for how many true bits/word there are is probably not correct in some way. at bare minimum I would add the size of the model into the compression ratio's denominator
* modern text datasets have good filtering. https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1 is a good starting point to learn from && you can scroll through the data; much less advertising or other trash than you'd naively expect.
* the evolution-as-huge-data-regime angle is probably correct, but I would put the millions of years as data (that just happens to only get saved through a very tiny and inefficient bottleneck) and the algorithm as incredibly bad compared to the linalg tricks
* deep learning is unfortunately not going to hit a data wall or algorithm wall within the next 3 years (/ 100x flops relative to gpt4). we ran out of human text data but openai is using majority synthetic data on their current massive run. good luck with the near future
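One way to make the model-size point in the notes above concrete, as a rough sketch only: charge the size of the model to the compressed output, so the effective bits/word is (compressed text + model) divided by the number of words. The function name and all the numbers below (a 2.3 bits/word rate, a 4e11-parameter model stored at 16 bits per parameter) are illustrative assumptions, not figures for any particular model.

```python
# Effective bits/word when the model itself counts as part of the
# compressed output. All inputs here are hypothetical, for illustration.

def effective_bits_per_word(n_words, model_bits_per_word,
                            n_params, bits_per_param=16):
    """Bits per word after adding the model's own size to the output."""
    compressed_text_bits = n_words * model_bits_per_word
    model_bits = n_params * bits_per_param
    return (compressed_text_bits + model_bits) / n_words


# ASSUMPTION: 10^13 words, 2.3 bits/word, 4e11 parameters at 16 bits each.
large_corpus = effective_bits_per_word(10**13, 2.3, 4 * 10**11)

# Same hypothetical model, but compressing only 10^9 words.
small_corpus = effective_bits_per_word(10**9, 2.3, 4 * 10**11)

print(f"large corpus: {large_corpus:.2f} bits/word")
print(f"small corpus: {small_corpus:.1f} bits/word")
```

With these made-up inputs the correction is modest on a corpus of 10^13 words but dominates on a small one, which is why the size of the model matters for the comparison.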
I was basing the number of words on llama3.1. But checking it now, I think I screwed it up. Llama3.1 was using 15 trillion tokens which should be around 10^13 words, not 10^12 as I claimed. I think this is an important enough error that I should fix it. Do you have a citation for 10^14? I'm sure things are heading there soon but for most of the frontier models the numbers seem increasingly secret.
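A quick check of that conversion, as a rough sketch: the 0.75 words-per-token rate is an assumed rule of thumb for English text, not a number from the Llama 3.1 report, and the real ratio depends on the tokenizer and the language mix.

```python
import math

# ASSUMPTION: ~0.75 words per token is a common rule of thumb for English;
# the true ratio varies by tokenizer and by language.

def tokens_to_words(n_tokens, words_per_token=0.75):
    """Convert a token count to an approximate word count."""
    return n_tokens * words_per_token


words = tokens_to_words(15e12)            # 15 trillion tokens
order_of_magnitude = round(math.log10(words))

print(f"~{words:.3g} words, i.e. about 10^{order_of_magnitude}")
```

Under that assumption 15 trillion tokens is a bit over 10^13 words, so 10^13 is the right order of magnitude and 10^12 was off by a factor of ten.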
(BTW if I had to guess, I'd also think that we're not going to hit a data wall, and I'd be quite surprised if it happened in the next 2-3 years. I tend to think of synthetic data more as "algorithmic progress" than "more data" but that's largely just semantics.)
OK, I (finally) updated the post to 10^13 words.
BTW thanks for the reference to fineweb. Very helpful to actually be able to look at random bits of data. I agree it's not nearly as bad as you'd naively expect from just random web. There's a decent amount of low-quality stuff, but a significant minority seems quite intelligent. And I guess LLMs can essentially learn that "intelligent text is followed by more intelligent text" so the spam probably doesn't do much damage? (And presumably the companies with billions of VC funding have even better filtering...)
Can they not use video and audio to learn? I would imagine that must have some value (lectures, movies, television,...). That is a decent volume of data. I am surprised that they were able to get through all text in all human languages that fast. I guess those H100s are pretty good.
Probably the biggest misconception in the naive model of data scaling is in how much non-text data, and also poor-quality text data, really helps with reasoning.
Audio is an especially instructive example. To a close approximation you'll get as "good" (i.e. horrible since most audio is semantic trash) performance on MMLU out of training on just text transcripts versus burning however many times more compute to train on some representation of the actual audio itself (plus some shared text data so it's comparable).
(and if you *do* use pure transcripts, you won't get more than a few trillion extra useful tokens)
The inclusion of other modalities (and other language data) in current models serves the purpose of allowing the model to be capable in those domains && in cross-domain tasks, no more no less.
So audio is either semantic trash, or is an inefficient representation of text. I actually agree with this for early models. But with good models, when we're talking about already having consumed most of the text there is, I think it changes. Raw audio becomes more usable, because the model becomes smart enough to interpret and learn from audio that was previously semantic trash.
you spend the same amount of compute regardless of what the model discerns so it doesn't matter. the optimal gradient update from a fixed audio token budget is substantially worse than that of text
About the compute, agreed of course.
But I'm less sure about the best possible weight update. Audio allows inference about physical properties of the world, in a way text does not.
substack app swallowed my long reply :/
shortened: yes ceteris paribus worldmodelling +info. but 1) bringing encoded audio token bitrate anywhere close to BPE token bitrate is inherently hard and 2) even equicompressed on semantics, human audio has dumber content on avg
I think the perspective gap comes from thinking about what audio tokens are. Your case is simply true in the world where there are magic audio tokens that happen to contain fineweb-like semantic density + extra audio-derived world info. No lab has the tech and/or data for that.
Great post! I especially appreciate the insight that "The human learning strategy might be vast and inscrutable". Just as the result of training an LLM on trillions of tokens yields an inscrutable matrix, it seems very easy to believe that millions of years of nervous system evolution would yield an inscrutable learning algorithm. In fact it would be astonishing if it didn't.
Nit: for an apples-to-apples comparison, when you discount language training data as only conveying 2.3 bits/word, I think it would be appropriate to also discount the number of bits per base pair of DNA (according to some notion of the incompressible information content of DNA).
Not a nit, and amplifying your central point: in your first paragraph, you note that transmitting the idea of large data scale back to 2005 would have a big impact – even though that idea can be conveyed in, say, a few hundred bits. Clearly some bits are more important than others. By this logic, the information encoded in our DNA (and epigenetics and physical structures) could in general have a very high impact per Shannon bit.
I totally agree that the correct apples-to-apples comparison should be to the actual amount of "information" in DNA. While writing the above post, I actually went down a huge rabbit hole trying to estimate that. Eventually I decided it was too complicated and I needed to make it a post of its own. (Compression, algorithms, non-coding RNA genes, introns) But my tentative conclusion is that the actual "information" in DNA is *fairly* close to the capacity, but probably more like 1.8 bits per base pair rather than 2. Though it's hard to be sure since people still don't know exactly how much of DNA is actually functional, and it can get a little philosophical. ("long terminal repeats" are DNA that makes proteins that seem to have no other function except to go insert another copy of the DNA elsewhere. Should those count as information?)
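To put the 2 versus 1.8 bits per base pair difference in concrete terms, here is a back-of-envelope sketch. The genome length of roughly 3.1 billion base pairs (one haploid copy) is an assumption added here for illustration, and the function names are made up for this example.

```python
# Back-of-envelope genome information content.
# ASSUMPTION: ~3.1e9 base pairs for one haploid human genome.
HAPLOID_BASE_PAIRS = 3.1e9


def genome_bits(base_pairs, bits_per_base_pair):
    """Total bits at a given information rate per base pair."""
    return base_pairs * bits_per_base_pair


def bits_to_megabytes(bits):
    """Convert bits to megabytes (1 MB = 10^6 bytes)."""
    return bits / 8 / 1e6


capacity_mb = bits_to_megabytes(genome_bits(HAPLOID_BASE_PAIRS, 2.0))
estimate_mb = bits_to_megabytes(genome_bits(HAPLOID_BASE_PAIRS, 1.8))

print(f"raw capacity at 2 bits/bp: {capacity_mb:.0f} MB")
print(f"tentative 1.8 bits/bp estimate: {estimate_mb:.1f} MB")
```

Either way the total stays within the same order of magnitude, so the discount changes the comparison much less than the 2.3 bits/word discount changes the text side.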
Wow! I would not have guessed that the genome would be so incompressible. That's very interesting.
Compression by itself is ALSO a sort of philosophical question! Some algorithms claim a compression ratio of 100x. But those (sort of) just store a difference to a reference genome. That works very well because something like 98.5% of DNA is shared between people. But I think it's "cheating" to not count that reference genome, if that makes sense. Without the reference, nobody can do much better than reducing size by around 10%.
Here's my guess at the human learning strategy:
Observe other humans'/agents' behaviors
Mimic other humans'/agents' behaviors and observe effects
Remember observations and effects of behaviors
Reuse behaviors when the environment matches
Experiment/play when the stakes are low
There are more steps than that, but those are the big pieces. The evolved complexity you talk about in the post above is equivalent to the complexity that is embedded in the LLMs. Are we sure what DNA does? Are we sure what a specific weight in linear algebra does? There's probably value in the answer, but I think we will still be able to make progress in developing more intelligent agents without knowing it. The question will transition from "do we have enough high quality data" to "do we have simulated environments with sufficient resolution". I think that as long as you can create environments that reward the above learning strategies, and have sufficient compute to train agents within them, you will see significant growth in automated intelligence.
I always had this same thought. There should be some kind of core mimicry in a variety of environments, then the training and vector weights. I always attributed my misunderstanding to ignorance (usually the case), but your comment makes me feel warm and fuzzy inside to think that I was not a moron perhaps in yet another thought.
Humans have an amazing huge associative memory with a sophisticated (though imperfect) search algorithm used by our innate learning methods. I can ask you what color hair the receptionist in your office has and you have a good chance at recalling it. That's missing from current AI efforts, isn't it?
I actually love to ask random questions of people to obtain odd information. For example, if you ask a man in which grade he first remembers a girl having a nice behind, it is a good estimate that he reached sexual maturity that year. (I find that hair color is something many men will miss, but not an attractive female rear end.)
The genetic code is analogous to the AI model itself, not its training data. It is an extremely compressed and energy efficient model for general survival. DNA encodes complex molecular feedback loops, packing and unpacking genes and RNA to generate hundreds of thousands of 3D molecules (proteins, all RNA types, etc.) interacting to produce complex feedback systems. Reducing the information contained in this system to our four letter code representation completely ignores that the relative positions of these structures in 3D encode a shitload of information. The folding and expression by ribosomes and other structures, what you refer to as the compiler, definitely add orders of magnitude to those numbers.
-- David Saint-Germain
The same argument was made by Eric B. Baum in his 2003 book, "What is Thought?"