How much information is in DNA?

One potential issue with your definition of "makes a recognizable you" is that it discards all of the information further down the evolutionary line. If, as you say, much of the "redundant" parts are error-correction for reproduction, then they're not truly redundant; removing them might lead to the first person being the same, but any children they have would be different. This is semi-pedantic, but seems less so than the "your cells are lighter" sort of thing, since it would lead to very noticeable and important-to-humans changes after maybe just a generation or two.

Absolutely, if the "you" with the most compressed possible genome was created, they'd basically be destroyed by mutations when they tried to reproduce. Personally, I think that's OK and possibly a feature, depending on what you want to do with your measure of information in DNA. I'm most interested in that number for the implications it has for data scaling rates for AI, in which case "information used to fight mutations" doesn't seem relevant. But it all depends.

An interesting question would be: Is there a precise definition of information that does include all the anti-mutation machinery but still somehow excludes meaningless stuff like retrotransposons (when meaningless)? Perhaps you could just change the definition to "most compressed genome while still able to reproduce without suffering higher mutational load than current humans"?

Isaac King

Great article.

I think it's worth noting, by the way, that some of the information that makes you "you" may not be held in your DNA at all. Any time a cell reproduces, the child cell needs to be built by something; it can't bootstrap solely from DNA. These various ancillary mechanisms could be conveying their own information into the child cell, separate from the DNA. (Epigenetic inheritance.) In particular, they describe how the DNA is interpreted, which is itself information. Arguably DNA is less analogous to code and more analogous to data, with the cell's mechanisms that interact with it being the code that reads the data and decides what to do with it.

Perhaps another measure of complexity that takes this into account would be to take a page out of the Hutter Prize's book and use Kolmogorov complexity but including both the "code" and the "data"; that is, look for the smallest number of atoms that could, on their own, grow into a recognizable human. But this is unfortunately still subjective, of course: because an egg needs a pleasant environment to grow in, one could argue that it isn't truly standalone. (Same issue with the Hutter Prize: the code needs a known computing platform to run on. If you ported it to an alien's computer that had been developed totally separately, with a different processor architecture, different programming languages, etc., it wouldn't run.)

Totally agree with your first point: https://dynomight.net/data-wall/#its-not-just-dna

Regarding Kolmogorov complexity, note that technically speaking, Kolmogorov complexity is only defined up to a constant that depends on the "computer" that would run the code. I think that as long as we restrict ourselves to "reasonable" computers (that don't, say, include a copy of a reference genome!), the distinction between "code" and "data" wouldn't really exist and this would line up pretty well with what you're suggesting.
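
For reference, the "defined up to a constant" statement is the standard invariance theorem. The symbols below are notation introduced here for illustration, not taken from the article: $U$ and $V$ are any two universal computers, $K_U(x)$ is the length of the shortest program that prints $x$ on $U$, and $c_{U,V}$ is a constant that depends on the two machines but not on $x$.

```latex
\[
  \bigl| K_U(x) - K_V(x) \bigr| \;\le\; c_{U,V}
  \qquad \text{for every string } x .
\]
```

The constant is roughly the size of an interpreter for one machine written for the other, which is why a "computer" with a reference genome built in would make the measure meaningless: the constant would swallow the thing being measured.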

Adam Quirk

The bigger question for me is “specified complexity” (borrowed from Dembski) and the idea that functional information can’t arise by undirected processes. Or at least it has never done so to our knowledge and has not been proven possible either theoretically or experimentally.

Step 1 – Put an upper bound on the probabilistic resources of the universe.

Seth Lloyd’s 10¹²⁰ figure is the largest number of discrete yes/no operations the universe could possibly have executed since the Big Bang.

Step 2 – Compare it with the combinatorial search-space.

For a modest protein of 150 amino acids Orgel & Joyce quote 10¹⁶⁴ possibilities—already 44 orders of magnitude bigger than the cosmic limit.

For a 200-base self-replicating RNA (the minimum length most chemists think plausible) the space is 10¹²⁰, right at the same threshold.
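
As a rough check on the arithmetic in Step 2, here is a minimal sketch that counts sequences the naive way (alphabet size raised to the sequence length) and compares against the 10¹²⁰ bound from Step 1. The function and constant names are made up for illustration, and the alphabet sizes (20 amino acids, 4 bases) are assumptions of this sketch. Note that the naive protein count comes out near 10¹⁹⁵ rather than the quoted 10¹⁶⁴, so the quoted figure presumably rests on a counting that isn't spelled out here.

```python
import math


def log10_sequences(alphabet_size: int, length: int) -> float:
    """Return log10 of the number of sequences of `length` over the alphabet."""
    return length * math.log10(alphabet_size)


# The 10^120 operations bound quoted in Step 1, as an exponent.
LLOYD_LOG10_OPS = 120

# Assumption: 20 standard amino acids, 150 residues.
protein = log10_sequences(20, 150)
# Assumption: 4 bases, 200 nucleotides.
rna = log10_sequences(4, 200)

print(f"protein: 10^{protein:.1f} (excess over bound: {protein - LLOYD_LOG10_OPS:.1f} orders)")
print(f"RNA:     10^{rna:.1f} (excess over bound: {rna - LLOYD_LOG10_OPS:.1f} orders)")
```

Working in log10 avoids printing enormous integers and makes the "orders of magnitude" comparison a simple subtraction.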

This is why I think multiverse theory sneaks into the abiogenesis discussion.

Jeff Geraghty

How much information is in a molecule of water? How can such a simple molecule express so much meaning at different levels? (I’m thinking of quenched thirst, flowing rivers, poetry, and the symbols and metaphors and meanings we can convey with water)

"Information is the bridge between chaos and clarity, a whisper that shapes the world."

Jeff Geraghty

How many bits, though?

Thanks for another fascinating article.

Now that we have a tentative answer to "How much information is in DNA?", my follow-up question is: how much additional information is in a human organism -- or if you want to simplify, a human embryo or zygote?

Consider the following thought experiment: We discover an alien civilization whose biology has nothing in common with ours, and we want to instruct them on how to make humans who can live on their planet. (Disregard for now practical problems such as what these humans would eat.) Our technology does not allow a personal visit, so we can only communicate by radio (with high latency). We send them a chemical description of DNA and a human DNA sequence. What is the minimum amount of additional information we must send in order for them to construct a viable human from scratch?

I have no idea how to attack this problem from the perspective of information theory. Can you think of a way to approach it?

I totally agree with the point you're making. There's lots of extra information embodied in the physical structure of the cells outside the DNA. (I've discussed this before: https://dynomight.net/data-wall/#its-not-just-dna)

I like the way you're thinking about it: Define the extra information as the (sort of) Kolmogorov complexity of the instructions to build a machine that could take a DNA sequence and build a cell from it. Unfortunately, I don't have the slightest idea how to estimate this.

Throw Fence 🔶

I wonder if all that information isn't actually in the DNA after all? I guess this question boils down to this: if you send this alien species only the information in the DNA itself, and if they were sufficiently intelligent, would that not be enough (together with the knowledge of the rules of our shared physical universe) to reverse engineer how it needs to work? At least if they assume, or we tell them with additional bits of information, what it's supposed to be (i.e. the description of self replicating molecules). It seems like this just boils down to a question akin to cracking a weak form of cryptography.

To extend the analogy to Dynoscript: would a sufficiently intelligent system (AGI or alien or human) be able to reverse engineer how the Dynoscript language is supposed to work if all they find is the source code of the Dynoscript compiler written in Dynoscript, labeled "Dynoscript compiler"?

I think maybe, yeah? I feel like _I_ would be able to do this, if Dynoscript is sufficiently close to the semantics of languages I know, so it feels like this could be extended to arbitrarily foreign languages and semantics, it would just take (probably exponentially) more work to decipher (and maybe more intelligence and patience than I have, but it seems fundamentally doable). Surely the information is _in_ there, if you know it's a self hosted compiler? Like a recursive rosetta stone of programming languages.

Very interesting argument. That seems likely to me, at least for Dynoscript. But I'm unsure if that might be a weakness of the analogy. Could it really be that, given human DNA, the physical configuration of a human cell is the only "interpreter" that makes sense? I'm highly unsure, but it does seem possible. So you've made me update towards giving a higher chance that the embodied information could actually be pretty small. Thank you!

Throw Fence 🔶

It's like a quine, right? So it's not just that it's the only interpreter that can work on the DNA, it's the only interpreter that is also described by that process.

I think the analogy is pretty good. Consider this. To bootstrap the process, the aliens would create a machine that looks nothing like a human cell, but which nevertheless reads the DNA and does the correct thing, which would produce a human cell (since this is what the information is encoding), and then you're off. Similarly, I would write an interpreter in a language that is nothing like Dynoscript, but then I could compile the self-hosted Dyno compiler, and there you go -- the output of that process really would be a Dyno compiler written in Dyno that could compile itself.

Maybe a fixed point is indeed how you (or the aliens) find the solution? If it's recursively a fixed point, you know you have a solution.

Whether there conceivably could be multiple fixed points seems like an open question, I would admit in the general case; but thinking specifically about human cells: could there really be a totally different looking machine which, when reading the DNA, would produce a copy of itself, that is not in fact a human cell? And considering Dynolang, could there really be a different artifact which when run on the Dyno compiler source code, would produce itself? That seems implausible, if not downright paradoxical. How else could it faithfully be a Dyno compiler, if not to produce _the_ Dyno compiler specified by the source code of the compiler?

Well, to strengthen your argument a bit—if there are multiple fixed points, then doesn't the embodied information all ultimately boil down to telling you which fixed point is right?

Throw Fence 🔶

Yes I would agree with that! Very interesting, so that actually gives you a very concrete way to tell how many bits of information it is (log2 number-of-fixed-points)!

And I think it's not entirely impossible for that to be the case, like a very obvious idea is that maybe a human cell could be entirely mirrored? I don't know enough physics or biology, but I'm sure there's enough symmetry in our universe to allow at least one mirroring of literally every component, and then there is no way for the aliens to tell which ones we are (unless we tell them, using one extra bit of information). (Are double helixes (heli?) directional?) (Would a mirrored human cell need other molecules to run?)
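
To make the log2 idea concrete, here is a minimal sketch of the bit count. It assumes the candidate fixed points are all equally plausible to the receiver (otherwise the count would be an entropy, not a plain log2), and the function name is hypothetical.

```python
import math


def bits_to_pick_one(num_fixed_points: int) -> float:
    """Bits needed to single out one of `num_fixed_points` equally likely candidates."""
    if num_fixed_points < 1:
        raise ValueError("need at least one fixed point")
    return math.log2(num_fixed_points)


print(bits_to_pick_one(1))  # 0.0: a unique fixed point needs no extra information
print(bits_to_pick_one(2))  # 1.0: e.g. original vs. mirrored cell
```

With a single fixed point the embodied information is zero bits, and the mirrored-cell case corresponds to exactly one bit.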

Unfortunately I have realized a flaw in the analogy: a simple quine in any language, say python or js, could ignore its input (the Dyno source code) and produce itself, another js or python quine that ignores its input. (Or maybe the input-output here should be two artifacts, not source codes?) I'm confusing myself I think. Anyway, the point is that such a solution is kind of "trivial" in a sense, and by virtue of being supposedly "recursive" in some sense, it should really contain its own semantics or information in some way, like how there is a literal and physical copy of the DNA in a cell after replication. But maybe we can just postulate that such a "zero" fixed point doesn't count.
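
For anyone who hasn't seen one, here is a minimal sketch of that "trivial" fixed point: a classic two-line Python quine, run through a small helper that hands it an input it never looks at. The helper `run` and its `ignored_input` parameter are hypothetical names for illustration, and the stand-in input string is obviously not real Dyno source.

```python
import contextlib
import io

# Build the text of a classic Python quine from a template.
template = 's = %r\nprint(s %% s)'
quine_source = template % template


def run(source: str, ignored_input: str) -> str:
    """Execute `source` and return what it prints; the input is never read."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()


# Stand-in for "the Dyno compiler source code" (hypothetical placeholder text).
output = run(quine_source, ignored_input="pretend this is the Dyno compiler source")

# The program reproduces its own text regardless of what it was given.
print(output.rstrip("\n") == quine_source)
```

The program is a fixed point of "run this on the input", but it carries no information about the input at all, which is the sense in which it shouldn't count.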

Comment deleted

But the electrical wiring of the heart _does_ emerge from cells reading and interpreting the DNA and building out based only on that. So surely that information is in there somehow, even if implicitly?

Comment deleted

Yeah I think that makes sense. But I also think this information really is there, it's not magic. And I think making a close-enough-to-perfect simulation and watching it play out is indistinguishable (in the limit) from "thinking about it", just you need to be really really smart to do the sufficient thinking. So I still maintain that the information is encoded in the genome, but it's obviously encoded in such a way that it emerges when the laws of physics and the corresponding cell (also encoded in the genome, recursively confusing us) do their thing -- which is to say, when the cell reads the genome to know what to do.

What do you make of the analogy to the hypothetical Dynoscript in the parallel series of comments between me and dynomight here?

In this case it seems like when defining this sort of "phenotypic" complexity there is some sort of assumed conditional complexity built in. For example, if we were communicating with the aliens here, we may also be assuming they have the ability to match the physical conditions required to build a human cell. Building that machine might add more complexity to the program. But if they already have some cell machinery, or even another eukaryote like yeast, the relative complexity is reduced to something else (still not quantifiable currently (?)).

In the article, I would say the phenotypic Kolmogorov complexity is conditional on having a way to develop the human. The human developmental program requires a lot of specific environmental cues. These were also presumably optimized under mutational assault and are fundamentally stochastic - not sure if there's a way to obtain any sort of absolute phenotypic complexity in these cases.

May 8, 2025 (edited)

There's some related discussion over at Asimov: https://www.asimov.press/p/dna-information/comment/115590276

Basically, I agree that the phenotypic Kolmogorov complexity wouldn't capture all the information needed to develop a human. But technically, that's not my goal; my goal is just to measure how much information is contributed by DNA. For that purpose, I think it's OK.

At one point, this article actually had a long rant about the reasons for studying this question. (Removed during editing.) To me, it came up in the context of another post trying to think about how much of human intelligence comes from "in-life" learning as opposed to evolution: https://dynomight.net/data-wall/#its-not-just-dna

Warren

Thanks for writing this! Though I feel like this article is talking about different ways to compress strings, and it’s only related to dna because we can give names to nucleotides and think of that sequence of characters as a string. But like. Changing some bases in some spots can give you blue eyes. Why! Why that, instead of a billion other possible phenotypic changes? How expressible is dna really? It can code for every living thing on earth so it’s expressive but how expressive really? I’d love to read more about that.

The bit in this post about mutations for generating lightning is the only thing I’ve found in that direction I think: https://www.lesswrong.com/posts/4Bwr6s9dofvqPWakn/science-as-attire

> different ways to compress strings

I object! :) My main takeaway is that compressing strings doesn't capture what we mean by information, and what we should be doing is compressing *phenotypes*.
