Say you have a time machine. You can only use it once, to send a single idea back to 2005. If you wanted to speed up the development of AI, what would you send back? Many people suggest attention or transformers. But I’m convinced that the answer is “brute-force”—to throw as much data at the problem as possible.
AI has recently been improving at a harrowing rate. If trends hold, we are in for quite a show. But some suggest AI progress might falter due to a “data wall”. Current language models are trained on datasets fast approaching “all the text, ever”. What happens when it runs out?
Many argue this data wall won’t be a problem, because humans have excellent language and reasoning despite seeing far less language data. They say that humans must be leveraging visual data and/or using a more data-efficient learning algorithm. Whatever trick humans are using, they say, we can copy it and avoid the data wall.
I am dubious of these arguments. In this post, I will explain how you can be dubious, too.
The math checks out—humans see much less language data
Every day, an average person reads a few thousand words, and hears perhaps 16 to 40 thousand. So a well-educated 40-year-old might have encountered 5×10⁸ words in their lifetime. Recent language models are trained on upwards of 10¹² words, about 2,000 times more. It’s not even close.
Imagine a fast reader who did nothing but read 300 words/minute for 80 years, never pausing to eat or sleep. They’d still see 100 times fewer words than AIs do.
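If you want to check this arithmetic yourself, here is a rough sketch. The daily rate of 34,000 words is my assumption, picked from inside the range above so that the lifetime total lands near 5×10⁸. The function and variable names are purely illustrative.

```python
# Back-of-envelope word counts. Every input here is a rough assumption.
DAYS_PER_YEAR = 365
MODEL_WORDS = 1e12  # training-set size used in the text


def lifetime_words(words_per_day: float, years: float) -> float:
    """Total words encountered at a constant daily rate."""
    return words_per_day * DAYS_PER_YEAR * years


def nonstop_reader_words(words_per_minute: float, years: float) -> float:
    """Words read by someone who reads every minute of every day."""
    minutes_per_year = 60 * 24 * DAYS_PER_YEAR
    return words_per_minute * minutes_per_year * years


# Assumed daily rate: reading plus listening, roughly mid-range.
typical = lifetime_words(words_per_day=34_000, years=40)
nonstop = nonstop_reader_words(words_per_minute=300, years=80)

print(f"typical 40-year-old: {typical:.2e} words")
print(f"model / typical:     {MODEL_WORDS / typical:,.0f}x")
print(f"nonstop reader:      {nonstop:.2e} words")
print(f"model / nonstop:     {MODEL_WORDS / nonstop:,.0f}x")
```

Under these assumptions the model sees roughly 2,000 times more words than the typical adult and roughly 80 times more than the nonstop reader, which is the same order of magnitude as 100.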
Vision is not the key to human intelligence
So then how do humans generalize so well from so little language data? Is “pre-training” on visual data the secret to our success?
No.
Because… blind people? What are we doing here?
Deaf people show that (non-verbal) sound isn’t critical either.
Could it be touch? There is a disease called congenital insensitivity to pain with anhidrosis or CIPA. People with CIPA often have intellectual disabilities, but much of that is surely due to the (horrible) consequences of CIPA or the protein misfolding issue that causes it. And anyway, many people with CIPA have normal intelligence. Miall et al. (2021) describe a person known as “KS” who is not paralyzed but has had no sense of touch at all since birth. They don’t do any IQ tests, but do mention that KS graduated from law school.
It seems unlikely that intelligence would be based on smell or proprioception.
Maybe we need vision or sound? At first, I thought Helen Keller was a counter-example to this. Clearly she was very smart, but she apparently had sight and hearing before losing them to meningitis at the age of 19 months. Other people are deafblind since birth. They often have intellectual disabilities, but Larsen and Dammeyer (2020) report that many don’t if given early access to language through tactile signing. However, they only report the fraction of people with IQs above 70, and I can’t tell if anyone born deafblind went on to have an average IQ.
Now, don’t write off other modalities. It could be that human brains are so adaptable that we just need exposure to language plus some kind of high-resolution sensory data. Or maybe it’s critical that we interact with our environment. We have no examples of paralyzed people with no senses that somehow survive and passively absorb language for decades.
Maybe! Or maybe all that other sensory data is irrelevant. I don’t know. But that’s kind of the point—the example of humans just isn’t very useful for predicting how helpful other modalities might be for AI.
Humans get information from evolution
Many comparisons between humans and AIs seem to be based on the following analogy:
AI systems “learn” from data.
Human babies “learn” from experience.
The issue with this analogy is that humans are born with extremely sophisticated programming, provided by evolution. That programming integrates information from all our ancestors, arguably going back to the origin of life on earth.
When you train an AI, you have to learn lots of stuff that babies get “for free”. Your intelligence is based on “data” from your whole evolutionary history, not just your lifetime.
Now, a skeptic might accept that human babies get some information from evolution, but object that it can’t be much information. After all, a single month of CommonCrawl (used by all current models) is around 200 terabytes. Yet human DNA has around 2.9 billion base pairs, each of which can take 4 values. That adds up to only around 6 billion bits or 690 megabytes. That’s around 300,000 times less.
And DNA has lots of other jobs beyond intelligence, like making ribosomes or making teeth or running an immune system. Is DNA too small to matter?
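Here is the skeptic's calculation as a small sketch. It treats every base as independent and equally likely, so it measures raw storage rather than "useful" information. The constants come from the figures above, and the names are illustrative.

```python
import math

BASE_PAIRS = 2.9e9                # approximate human genome length
VALUES_PER_BASE = 4               # A, C, G, T
COMMONCRAWL_MONTH_BYTES = 200e12  # about 200 terabytes

# Each base pair picks one of 4 values, so it carries log2(4) = 2 bits.
bits_per_base = math.log2(VALUES_PER_BASE)
dna_bits = BASE_PAIRS * bits_per_base
dna_bytes = dna_bits / 8
dna_mebibytes = dna_bytes / 2**20  # "megabytes" in the binary sense

ratio = COMMONCRAWL_MONTH_BYTES / dna_bytes

print(f"DNA: {dna_bits:.2e} bits, about {dna_mebibytes:.0f} MiB")
print(f"One month of CommonCrawl is about {ratio:,.0f}x larger")
```

This gives about 5.8 billion bits, or roughly 690 binary megabytes, and a ratio in the neighborhood of 300,000.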
DNA is not that small
Your DNA probably contains more information than all the words you’ll encounter in your whole life.
Claude Shannon, the father of information theory, famously estimated that English language text on average has 11.8 bits/word. But recent language models suggest that it’s only 2.3 bits/word or perhaps even less. So the 5×10⁸ words a person might have heard probably contain only about 1.2 billion bits of actual information, less than the 6 billion bits in DNA.
(People used to think that most DNA was “junk” because it didn’t directly code for proteins. But research increasingly suggests it plays other important roles, like determining how DNA folds or regulating the expression of other genes. This is an active research area with credible people on both sides. I’ll stick with 6 billion bits for simplicity.)
Now a modern AI dataset of 10¹² words surely does contain much more information than DNA—perhaps 380x as much. But a huge fraction of that is things like, “to cure all health problems, buy CBD gummies now”, which isn’t super useful for intelligence. So the information in DNA isn’t that far off.
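The comparison in this section is just multiplication, sketched below. It uses the rounded 6 billion bits for DNA and the 2.3 bits/word estimate. The Shannon figure is included only to show how sensitive the result is to that choice. The names are illustrative.

```python
DNA_BITS = 6e9                 # rounded figure used in the text
LIFETIME_WORDS = 5e8           # words a person might encounter
DATASET_WORDS = 1e12           # modern AI training set
BITS_PER_WORD_MODERN = 2.3     # language-model-based estimate
BITS_PER_WORD_SHANNON = 11.8   # Shannon's older estimate


def text_bits(words: float, bits_per_word: float) -> float:
    """Information content of text under a fixed bits-per-word estimate."""
    return words * bits_per_word


lifetime_bits = text_bits(LIFETIME_WORDS, BITS_PER_WORD_MODERN)
lifetime_bits_shannon = text_bits(LIFETIME_WORDS, BITS_PER_WORD_SHANNON)
dataset_bits = text_bits(DATASET_WORDS, BITS_PER_WORD_MODERN)

print(f"lifetime words:    {lifetime_bits:.2e} bits "
      f"({lifetime_bits / DNA_BITS:.2f}x DNA)")
print(f"lifetime, Shannon: {lifetime_bits_shannon:.2e} bits "
      f"({lifetime_bits_shannon / DNA_BITS:.2f}x DNA)")
print(f"AI dataset:        {dataset_bits:.2e} bits "
      f"({dataset_bits / DNA_BITS:.0f}x DNA)")
```

With the modern estimate, a lifetime of words comes to about a fifth of the bits in DNA. With Shannon's estimate the two are about equal, so the conclusion depends on which bits/word figure you trust.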
It’s not just DNA
Evolution provides lots of other information beyond just what’s written down in the DNA. And I’m not just talking about epigenetics—I mean all the information embodied in the physical structure of cells.
(I can feel everyone squinting right now.)
DNA is a “blueprint” for a cell. But information is needed to interpret that blueprint. Imagine a machine that could take in a DNA sequence and build a human cell. How many bits would be needed to describe that machine? A lot, right?
Of course, there’s a recursive “chicken and the egg” issue here: The machines that actually make human cells from DNA are… other human cells. But you need some information to get the loop started!
Here’s an analogy for programmers: Say I invent a new programming language called Dynoscript. (“Strongly typed immutable arrays and existential angst.”) And then I write a Dynoscript compiler in Dynoscript. Can I now run programs written in Dynoscript? No, because I have no way of running the compiler.
Where in DNA does it say that DNA is supposed to have a double-helix structure? Where does it say that “A” means C₅H₅N₅? That information is represented in the physical configuration of the atoms in the cell, and is physically propagated when cells divide. (I’m sure there are better examples, but biologists get very stressed when pressured to think this way.) I have no idea how to quantify the amount of “embodied information” like this, but I’m sure it’s substantial.
But if you still think DNA is too small and you don’t believe in embodied information, that’s fine, because…
“Learning” is just one execution of the inner-loop
Here’s a cartoon showing how I think about the different contributions to human and artificial intelligence:
All of my test readers said that cartoon was confusing and futilely begged me to delete it. So probably I should explain.
We have been optimized by evolution. Partly evolution tuned our base instincts, like “food good” or “hypothermia bad”. But it also tuned the algorithm that we use to learn from our life experiences.
Human intelligence is the product of a “double-loop” optimization. In the outer loop, evolution tinkers with different learning strategies. In the inner loop, we are born as babies, we grow into adults following the strategy evolution gave us, we reproduce (or not) and we die. Then evolution picks the strategies that led to more offspring and uses them as the basis for further tinkering.
AI models are the product of a double-loop optimization, too. In the outer loop, human engineers tinker with different machine learning tricks. In the inner loop those algorithms are loaded into a giant cluster and run against data. The engineers pick out the strategies that work and repeat.
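For the programmers, here is a toy version of a double-loop optimization. It is a cartoon, not a model of evolution or of any real training pipeline. The "strategy" is a single learning rate, the "lifetime" is gradient descent on a one-dimensional quadratic, and all the names and constants are made up for illustration.

```python
import random


def inner_loop(learning_rate: float, steps: int = 20) -> float:
    """One 'lifetime': minimize f(x) = (x - 3)^2 by gradient descent.

    Returns the final loss. Lower means the strategy learned better.
    """
    x = 0.0
    for _ in range(steps):
        grad = 2 * (x - 3.0)
        x -= learning_rate * grad
    return (x - 3.0) ** 2


def outer_loop(generations: int = 30, population: int = 8, seed: int = 0) -> float:
    """Tinker with the learning strategy and keep whatever works best."""
    rng = random.Random(seed)
    best_lr = 0.001  # deliberately poor starting strategy
    for _ in range(generations):
        # Propose random variations on the current best strategy.
        candidates = [best_lr] + [
            best_lr * rng.uniform(0.5, 2.0) for _ in range(population)
        ]
        # Run one inner loop per candidate and keep the winner.
        best_lr = min(candidates, key=inner_loop)
    return best_lr


initial_loss = inner_loop(0.001)
tuned_lr = outer_loop()
tuned_loss = inner_loop(tuned_lr)

print(f"loss with untuned strategy: {initial_loss:.3f}")
print(f"tuned learning rate:        {tuned_lr:.3f}")
print(f"loss with tuned strategy:   {tuned_loss:.3g}")
```

The inner loop never changes. All of the improvement comes from the outer loop finding a better way to run it, which is the sense in which a "learning algorithm" can itself carry a lot of optimization.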
The most salient difference between these is who is in charge of the outer loop. But don’t get distracted by that.
The most important difference between these is that the evolution outer loop has executed many, many more times.
Also important is that each iteration of the evolution outer loop runs on “fresh data”. Imagine that GPUs became billions of times cheaper, and you hired billions of engineers so you could test billions of machine learning tricks in parallel. Problem solved? Not necessarily, because you’ll eventually “overfit” to whatever data you hold out to test generalization. There’s only so much you can squeeze out of a finite dataset.
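The overfitting worry can be simulated directly. In this sketch, every "trick" is secretly a coin flip with 50% true accuracy, and the only thing that varies is luck on a finite holdout set. The sizes and names are arbitrary assumptions. Treating each trick's holdout score as an independent draw is a simplification.

```python
import random


def measured_accuracy(rng: random.Random, n_examples: int, true_skill: float) -> float:
    """Accuracy observed on n_examples for a method with the given true accuracy."""
    hits = sum(rng.random() < true_skill for _ in range(n_examples))
    return hits / n_examples


def select_best_on_holdout(n_tricks: int, holdout_size: int,
                           fresh_size: int, seed: int = 0):
    """Pick the best of many useless tricks, then re-test it on fresh data."""
    rng = random.Random(seed)
    true_skill = 0.5  # no trick is actually better than chance
    holdout_scores = [
        measured_accuracy(rng, holdout_size, true_skill) for _ in range(n_tricks)
    ]
    best_holdout = max(holdout_scores)
    # The winner's real skill is still 0.5, so fresh data reveals that.
    fresh_score = measured_accuracy(rng, fresh_size, true_skill)
    return best_holdout, fresh_score


best_holdout, fresh = select_best_on_holdout(
    n_tricks=2000, holdout_size=100, fresh_size=100_000
)

print(f"best holdout accuracy among 2000 tricks: {best_holdout:.2f}")
print(f"same trick on fresh data:                {fresh:.3f}")
```

The winning trick looks clearly better than chance on the holdout set and falls back to about 50% on fresh data. Testing more tricks against the same finite holdout only makes the illusion stronger.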
Each bit from evolution integrates experience from millions of years of life and so may have a “multiplicative” effect on how effective in-lifetime learning is.
The human learning strategy might be vast and inscrutable
Even though it comes from evolution, humans are still using some learning algorithm. Just by definition, isn’t it possible to build an AI using the same tricks?
In principle, yes! But it might be very hard in practice. The key question is to what degree the human learning strategy “makes sense”. If it’s something simple, then probably we’ll eventually copy it. If it’s a collection of millions of unintelligible interacting “hacks” tuned to statistical properties of the environment, then maybe not.
Just because humans learn efficiently doesn’t necessarily mean their strategy will be easy to copy.
Caution on these cautions
Now hear me. I am not arguing that we will hit a data wall. Just because humans don’t need to pre-train on visual data doesn’t mean that visual data won’t be useful for AI. And just because human learning strategies integrate vast amounts of information from our evolutionary history doesn’t mean that algorithmic progress is impossible. I’m just saying that if you want to argue for visual data or algorithmic progress, a direct argument is more convincing than gesturing at some human babies.
P.S. I’ve been feeling grumpy about algorithms deciding what we read, so I’ve decided to experiment with links to writing that I think deserves more attention. Today’s link is to Philosophical Multicore on Outlive: A Critical Review.
Well joke’s on you, I already exercised today, and now I’m back to over-analyze saturated fat. My assessment:
Saturated fat is unhealthy in expectation: likely true (credence: 85%).
It’s a good idea for most people to reduce their SFA intake: possible (credence: 50%).
It’s a good idea for people with high cholesterol to reduce their SFA intake: likely true (credence: 70%).
The data are unclear: unclear. (Yes, it’s unclear whether the data are unclear. It depends on how much clarity you want.)
In the same spirit, if you’ve written a blogpost-sized response to this—or any—post let me know and I’ll strongly consider sending it out at the bottom of a future post. (Unless it’s really bad, in which case I won’t.) And in the spirit of that spirit, if you liked this post, consider sending it to a friend.
I expect that AI will open the door to philosophical conversations considered vital for most of recorded history, but widely considered “meaningless” for the last century or so. Such as: what does “good” mean?
Obviously there’s some relationship to evolution here, because before we learned the scientific method, the idea of “good” was tied to the idea of vitality. People thought there was a right way for an individual to live and a society to operate. They used a predictive model that said, when persons or societies deviated from that right way of living, Bad Things happened. This looks to me like a conceptual model of long term evolutionary fitness.
I’m looking forward to seeing people try to train models with explicit value systems, using those same value systems to do things like, selectively pay attention to some data and not others. This is something human beings _can_ do, and I think that’s essential for our performance.
Thanks for writing this. It seems obvious to me that evolution accounts for 95+% of human learning and 100% of human intelligence. A better way to think of it is how many bits are needed to store the algorithm for learning and/or intelligence, and I think the answer is that it's probably pretty low. The problem is finding it, which clearly takes enormous optimization pressure and large amounts of data and computation, which is what evolution has been doing for the past billion years.
Another way to phrase the same thing is to consider the size of current models. Clearly most of those parameter values are spent storing very impressive amounts of encyclopedic knowledge that no human comes close to matching. I'm confident the parameters of a future very intelligent model without as much world knowledge can fit on a thumb drive, but actually getting to that specific set of descriptive bits will require a fire hose of data and computation, just as evolution has needed.