39 Comments
Never Supervised

DNA doesn't exist in isolation. It makes assumptions about the behavior of proteins with which it co-evolved. Even if you take the Kolmogorov approach, there's all of chemistry and physics that those genes somehow "learned".

Shivadatta Padhi

Splendid and lucid explanation…

Isaac King

One potential issue with your definition of "makes a recognizable you" is that it discards all of the information further down the evolutionary line. If, as you say, much of the "redundant" parts are error-correction for reproduction, then they're not truly redundant; removing them might lead to the first person being the same, but any children they have would be different. This is semi-pedantic, but seems less so than the "your cells are lighter" sort of thing, since it would lead to very noticeable and important-to-humans changes after maybe just a generation or two.

dynomight

Absolutely. If the "you" with the most compressed possible genome were created, they'd basically be destroyed by mutations when they tried to reproduce. I think that's OK and possibly a feature, depending on what you want to do with your measure of information in DNA. Personally, I'm most interested in that number for the implications it has for data scaling rates for AI, in which case "information used to fight mutations" doesn't seem relevant. But it all depends.

An interesting question would be: Is there a precise definition of information that does include all the anti-mutation machinery but still somehow excludes meaningless stuff like retrotransposons (when meaningless)? Perhaps you could just change the definition to "most compressed genome while still able to reproduce without suffering higher mutational load than current humans"?

Isaac King

Great article.

I think it's worth noting by the way that some of the information that makes you "you" may not be held in your DNA at all. Any time a cell reproduces, the child cell needs to be built by something, it can't bootstrap from solely DNA. These various ancillary mechanisms could be conveying their own information into the child cell, separate from the DNA. (Epigenetic inheritance.) In particular, they describe how the DNA is interpreted, which is itself information. Arguably DNA is less analogous to code and more analogous to data, with the cell's mechanisms that interact with it being the code that reads the data and decides what to do with it.

Perhaps another measure of complexity that takes this into account would be to take a page out of the Hutter prize's book and use Kolmogorov complexity but including both the "code" and the "data"; that is, look for the smallest number of atoms that could, on their own, grow into a recognizable human. But this is unfortunately still subjective of course: because an egg needs a pleasant environment to grow in, one could argue that it isn't truly standalone. (Same issue with the Hutter prize: the code needs a known computing platform to run on. If you ported it to an alien's computer that had been developed totally separately, with a different processor architecture, different programming languages, etc., it wouldn't run.)

dynomight

Totally agree with your first point: https://dynomight.net/data-wall/#its-not-just-dna

Regarding Kolmogorov complexity, note that technically speaking, Kolmogorov complexity is only defined up to a constant that depends on the "computer" that would run the code. I think that as long as we restrict ourselves to "reasonable" computers (that don't, say, include a copy of a reference genome!) then the distinction between "code" and "data" wouldn't really exist and this would line up pretty well with what you're suggesting.
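
The "up to a constant" caveat is usually written as the invariance theorem. A rough statement in standard notation follows; the symbols $U$, $V$, and $c_{U,V}$ are labels introduced here for illustration, not taken from the post. For any two universal machines $U$ and $V$ there is a constant that does not depend on the string $x$ such that:

```latex
\[
  \lvert K_U(x) - K_V(x) \rvert \;\le\; c_{U,V}
  \qquad \text{for all strings } x .
\]
```

The constant is roughly the length of a program that lets one machine emulate the other. A "computer" with a reference genome built in would still satisfy the bound, but its constant would be about as large as the genome itself, which is why such machines have to be ruled out as unreasonable.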

Adam Quirk

The bigger question for me is “specified complexity” (borrowed from Dembski) and the idea that functional information can’t arise by undirected processes. Or at least it has never done so to our knowledge and has not been proven possible either theoretically or experimentally.

Step 1 – Put an upper bound on the probabilistic resources of the universe.

Seth Lloyd’s 10¹²⁰ figure is the largest number of discrete yes/no operations the universe could possibly have executed since the Big Bang.

Step 2 – Compare it with the combinatorial search-space.

For a modest protein of 150 amino acids Orgel & Joyce quote 10¹⁶⁴ possibilities—already 44 orders of magnitude bigger than the cosmic limit.

For a 200-base self-replicating RNA (the minimum length most chemists think plausible) the space is 10¹²⁰, right at the same threshold.

This is why I think multiverse theory sneaks into the abiogenesis discussion.
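
The search-space sizes in Step 2 can be checked with a few lines of arithmetic. This is only a sketch of the raw sequence counts, assuming 20 standard amino acids and 4 RNA bases; the function name is made up for illustration. Under those assumptions the 200-base RNA figure comes out near 10^120 as stated, while the raw count for a 150-residue protein comes out near 10^195, so the quoted 10^164 figure presumably rests on a different calculation than simple enumeration.

```python
import math


def log10_sequence_space(alphabet_size: int, length: int) -> float:
    """Return log10 of the number of distinct sequences of the given length."""
    return length * math.log10(alphabet_size)


# Assumption: 20 standard amino acids, 150 residues.
protein_exponent = log10_sequence_space(20, 150)

# Assumption: 4 RNA bases, 200 nucleotides.
rna_exponent = log10_sequence_space(4, 200)

# Exponent of the 10^120 operations bound quoted in Step 1.
OPERATIONS_BOUND_EXPONENT = 120

print(f"150-residue protein space: about 10^{protein_exponent:.0f}")
print(f"200-base RNA space:        about 10^{rna_exponent:.0f}")
print(f"protein space minus bound: {protein_exponent - OPERATIONS_BOUND_EXPONENT:.0f} orders of magnitude")
```

Note that these are counts of all possible sequences, not of functional ones, so they say nothing by themselves about how hard a search would be.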

John

DNA is not enough to replicate a human. For example, if my mom drank a lot during pregnancy, it would have changed me. So you need at least as much information to lead to a person like me as is encoded in my DNA. The DNA sibling cannot be proven to act the same under all circumstances. (Simple counter example: he does a DNA test and acts on the information.)

dynomight

True, DNA is not enough to replicate a human. But that's OK—we're not trying to measure the total information needed to create a fully developed human, just how much information is contributed *by DNA*.

John

I am sorry, I mixed up my thoughts. I was thinking about this paragraph:

"Limit yourself to changes that still lead to a "person" that would still look like you and have all the same capabilities you do. Now, compress each of those representations. The smallest compressed representation is the "information" in your DNA."

I am still a bit confused. What does “same capabilities” really mean? For me it meant acting the same in the same situation. So also being immune to the same virus that targets a specific random sequence in the genome.

dynomight

Aha, I basically think you're pointing out a valid ambiguity in the definition. I tried to acknowledge this in the post, but it's hidden in a footnote:

> This definition isn’t totally precise, because I’m not saying how precisely the phenotype needs to match. Even if there’s some completely useless section of DNA and we remove it, that would make all your cells a tiny bit lighter. We need to tolerate *some* level of approximation. The idea is that it should be *very* close, but it’s hard to make this precise.

I think ultimately the answer depends on what you want to use the number for. I was thinking (but didn't clearly state) that you should be equally capable of enduring viruses during your lifetime, but you didn't have to worry about retrotransposons and mutations during reproduction. But you might want a different definition depending on your purpose. See also this discussion: https://dynomight.substack.com/p/dna/comment/115852163

Jeff Geraghty

How much information is in a molecule of water? How can such a simple molecule express so much meaning at different levels? (I’m thinking of quenched thirst, flowing rivers, poetry, and the symbols and metaphors and meanings we can convey with water)

dynomight

"Information is the bridge between chaos and clarity, a whisper that shapes the world."

Jeff Geraghty

How many bits, though?

DH

Thanks for another fascinating article.

Now that we have a tentative answer to "How much information is in DNA?", my follow-up question is: how much additional information is in a human organism -- or if you want to simplify, a human embryo or zygote?

Consider the following thought experiment: We discover an alien civilization whose biology has nothing in common with ours, and we want to instruct them on how to make humans who can live on their planet. (Disregard for now practical problems such as what these humans would eat.) Our technology does not allow a personal visit, so we can only communicate by radio (with high latency). We send them a chemical description of DNA and a human DNA sequence. What is the minimum amount of additional information we must send in order for them to construct a viable human from scratch?

I have no idea how to attack this problem from the perspective of information theory. Can you think of a way to approach it?

dynomight

I totally agree with the point you're making. There's lots of extra information embodied in the physical structure of the cells outside the DNA. (I've discussed this before: https://dynomight.net/data-wall/#its-not-just-dna)

I like the way you're thinking about it: Define the extra information as the (sort of) Kolmogorov complexity of the instructions to build a machine that could take a DNA sequence and build a cell from it. Unfortunately I don't have the slightest idea how to estimate this.

Throw Fence

I wonder if all that information isn't actually in the DNA after all? I guess this question boils down to this: if you send this alien species only the information in the DNA itself, and if they were sufficiently intelligent, would that not be enough (together with the knowledge of the rules of our shared physical universe) to reverse engineer how it needs to work? At least if they assume, or we tell them with additional bits of information, what it's supposed to be (i.e. the description of self replicating molecules). It seems like this just boils down to a question akin to cracking a weak form of cryptography.

To extend the analogy to Dynoscript: would a sufficiently intelligent system (AGI or alien or human) be able to reverse engineer how the Dynoscript language is supposed to work if all they find is the source code of the Dynoscript compiler written in Dynoscript, labeled "Dynoscript compiler"?

I think maybe, yeah? I feel like _I_ would be able to do this, if Dynoscript is sufficiently close to the semantics of languages I know, so it feels like this could be extended to arbitrarily foreign languages and semantics, it would just take (probably exponentially) more work to decipher (and maybe more intelligence and patience than I have, but it seems fundamentally doable). Surely the information is _in_ there, if you know it's a self hosted compiler? Like a recursive rosetta stone of programming languages.

Brozhanksy

I do not believe they would be able to reverse engineer the developmental program from the genome alone. The embryo has behaviour that is not reducible to genes - the same way there are no instructions in the genome about the electrical wiring of the heart or the immune response to an infection.

Even the proper processing of a single protein is not 'in the genome' - they probably wouldn't be able to go from genetic sequence to properly functioning protein in many cases. Splicing, folding, chaperones, glycosylation etc etc etc.

We would need to send instructions on how to build an egg cell. They would then have to synthesize the proteins and other macromolecules, manually arrange sub-cellular structures, kickstart metabolism? Maybe some structures would self-assemble but not all

We would basically have to send them a complete simulation

Throw Fence

But the electrical wiring of the heart _does_ emerge from cells reading and interpreting the DNA and building out based only on that. So surely that information is in there somehow, even if implicitly?

Brozhanksy

the information is there - somehow. but there is no explicit set of instructions in the genome to be "read" that outlines how to build a heart.

The heart emerges from a trillion (?) reciprocal interactions that happen in sequence in the cells/tissues/genomes of the embryo. "transcription factor A causes Cell 3b to emit morphogen X that stimulates cells in the anterior pole to express cell surface receptor Y that predisposes them to respond to such and such cardiovascular-specific transcription factor 7 cycles later; at that time, diffusion of TF7 causes those primed cells to express a structural protein variant which causes them to exert mechanical tension which makes this shape which causes this which causes that; meanwhile a neural migration signal is being expressed by a subset of primordial kidney cells 4mm away...this causes the cardiac muscle tissues of the heart to wrinkle in such and such manner and to express calcium channels 1, 17, 44 but not calcium channels 2,5,8 because they were epigenetically silenced by...."

I don't think it's possible to deduce this behaviour from the genetic sequence. Somehow it is embodied both in genome *and* those interactions that can't be reduced to genome. I don't know how to describe it other than 'behaviour'. Maybe it's analogous to any other physiological process.. the stomach responds to carbohydrates not at a genetic level but because that's what this structure with this particular set of cell types, these epigenetic states, these tissue structures, this innervation, etc etc etc - does in response to that signal

the total information content of the developmental program - if we were to write out every step required to go from zygote to adult - exceeds the theoretical information content of the genome the way we try to estimate it - "how much information in bytes is in 3B quaternary digits". imo the only way to go from sequence to development in silico would be to develop a perfect simulation of the fertilized egg and watch the program run
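
For reference, the naive "3B quaternary digits" figure works out as below. This is a minimal sketch that assumes exactly 3 billion bases, four equally likely symbols, and no compression; the variable names are made up for illustration.

```python
import math

# Assumption: a round 3 billion bases, as in the comment above.
NUM_BASES = 3_000_000_000

# Four possible symbols (A, C, G, T) means log2(4) = 2 bits per base.
bits_per_base = math.log2(4)

total_bits = NUM_BASES * bits_per_base
total_megabytes = total_bits / 8 / 1_000_000

print(f"{bits_per_base:.0f} bits per base")
print(f"{total_bits:.1e} bits in total")
print(f"{total_megabytes:.0f} MB uncompressed")
```

This is only an upper bound on the sequence as a raw string. It is not an estimate of the length of a written-out developmental program, which is the quantity the comment argues is larger.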

Throw Fence

Yeah I think that makes sense. But I also think this information really is there, it's not magic. And I think making a close-enough-to-perfect simulation and watching it play out is indistinguishable (in the limit) from "thinking about it", just you need to be really really smart to do the sufficient thinking. So I still maintain that the information is encoded in the genome, but it's obviously encoded in such a way that it emerges when the laws of physics and the corresponding cell (also encoded in the genome, recursively confusing us) do their thing -- which is to say, read the genome to know what to do.

What do you make of the analogy to the hypothetical Dynoscript in the parallel series of comments between me and dynomight here?

dynomight

Very interesting argument. That seems likely to me, at least for Dynoscript. But I'm unsure if that might be a weakness of the analogy. Could it really be that, given human DNA, the physical configuration of a human cell is the only "interpreter" that makes sense? I'm highly unsure, but it does seem possible. So you've made me update towards giving a higher chance that the embodied information could actually be pretty small. Thank you!

Throw Fence

It's like a quine, right? So it's not just that it's the only interpreter that can work on the DNA, it's the only interpreter that is also described by that process.

I think the analogy is pretty good. Consider this. To bootstrap the process, the aliens would create a machine that looks nothing like a human cell, but which nevertheless reads the DNA and does the correct thing, which would produce a human cell (since this is what the information is encoding), and then you're off. Similarly, I would write an interpreter in a language that is nothing like Dynoscript, but then I could compile the self hosted Dyno compiler, and there you go -- the output of that process, really would be a Dyno compiler written in Dyno that could compile itself.

Maybe a fixed point is indeed how you (or the aliens) find the solution? If it's recursively a fixed point, you know you have a solution.

Whether there conceivably could be multiple fixed points seems like an open question, I would admit in the general case; but thinking specifically about human cells: could there really be a totally different looking machine which, when reading the DNA, would produce a copy of itself, that is not in fact a human cell? And considering Dynolang, could there really be a different artifact which when run on the Dyno compiler source code, would produce itself? That seems implausible, if not downright paradoxical. How else could it faithfully be a Dyno compiler, if not to produce _the_ Dyno compiler specified by the source code of the compiler?
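
To make "quine" and "fixed point" concrete, here is a small sketch in Python rather than the hypothetical Dynoscript. It builds a classic two-line quine and checks that running it prints exactly its own source. The helper names `run_and_capture` and `is_fixed_point` are made up for illustration, and the example says nothing about cells or compilers beyond showing what "produces a copy of itself" means for a program.

```python
import contextlib
import io

# Template for a classic two-line Python quine.
TEMPLATE = 's = %r\nprint(s %% s)'

# The full program text: the template formatted with its own repr.
QUINE_SOURCE = TEMPLATE % TEMPLATE


def run_and_capture(program: str) -> str:
    """Execute program text in a fresh namespace and return what it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(program, {})
    return buffer.getvalue()


def is_fixed_point(program: str) -> bool:
    """True if running the program prints exactly its own source text."""
    return run_and_capture(program).rstrip("\n") == program


print(is_fixed_point(QUINE_SOURCE))      # the quine reproduces itself
print(is_fixed_point("print('hello')"))  # an ordinary program does not
```

This quine takes no input at all, so it is a fixed point of "run the program" only. The self-hosted compiler case is stricter: the artifact has to reproduce itself by actually processing the given source code.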

dynomight

Well, to strengthen your argument a bit—if there are multiple fixed points, then doesn't the embodied information all ultimately boil down to telling you which fixed point is right?

Throw Fence

Yes I would agree with that! Very interesting, so that actually gives you a very concrete way to tell how many bits of information it is (log2 number-of-fixed-points)!

And I think it's not entirely impossible for that to be the case, like a very obvious idea is that maybe a human cell could be entirely mirrored? I don't know enough physics or biology, but I'm sure there's enough symmetry in our universe to allow at least one mirroring of literally every component, and then there is no way for the aliens to tell which ones we are (unless we tell them, using one extra bit of information). (Are double helixes (heli?) directional?) (Would a mirrored human cell need other molecules to run?)
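
The log2 count can be sketched as follows, assuming every fixed point is equally plausible to the receiver beforehand. The function name is made up for illustration.

```python
import math


def bits_to_select(num_fixed_points: int) -> float:
    """Bits needed to single out one of several equally likely fixed points."""
    if num_fixed_points < 1:
        raise ValueError("there must be at least one fixed point")
    return math.log2(num_fixed_points)


print(bits_to_select(1))  # a unique fixed point needs no extra information
print(bits_to_select(2))  # e.g. original vs. mirrored cell: one bit
print(bits_to_select(1024))
```

If the receiver could already rule out some candidates as implausible, the count would be lower than this, so it is best read as an upper bound.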

Unfortunately I have realized a flaw in the analogy: a simple quine in any language, say python or js, could ignore its input (the Dyno source code) and produce itself, another js or python quine that ignores its output. (Or maybe the input-output here should be two artifacts, not source codes?) I'm confusing myself I think. Anyway, the point is that such a solution is kind of "trivial" in a sense, and by virtue of being supposedly "recursive" in some sense, it should really contain its own semantics or information in some way, like how there is a literal and physical copy of the DNA in a cell after replication. But maybe we can just postulate that such a "zero" fixed point doesn't count.

Karson

In this case it seems like when defining this sort of "phenotypic" complexity there is some sort of assumed conditional complexity built in. For example, if we were communicating with the aliens here, we may also be assuming they have the ability to match the physical conditions required to build a human cell. Building that machine might add more complexity to the program. But, if they already have some cell machinery, or even another eukaryote like yeast, the relative complexity is reduced to something else (still not quantifiable currently (?))

In the article, I would say the phenotypic Kolmogorov complexity is conditional on having a way to develop the human. The human developmental program requires a lot of specific environmental cues. These were also presumably optimized under mutational assault and are fundamentally stochastic - not sure if there's a way to obtain any sort of absolute phenotypic complexity in these cases.

dynomight

There's some related discussion over at Asimov: https://www.asimov.press/p/dna-information/comment/115590276

Basically, I agree that the phenotypic Kolmogorov complexity wouldn't capture all the information needed to develop a human. But technically, that's not my goal, just to measure how much information is contributed by DNA. For that purpose, I think it's OK.

At one point, this article actually had a long rant about the reasons for studying this question. (Removed during editing.) To me, it came up in the context of another post trying to think about how much of human intelligence comes from "in-life" learning as opposed to evolution: https://dynomight.net/data-wall/#its-not-just-dna

Warren

Thanks for writing this! Though I feel like this article is talking about different ways to compress strings, and it’s only related to dna because we can give names to nucleotides and think of that sequence of characters as a string. But like. Changing some bases in some spots can give you blue eyes. Why! Why that, instead of a billion other possible phenotypic changes? How expressible is dna really? It can code for every living thing on earth so it’s expressive but how expressive really? I’d love to read more about that.

The bit in this post about mutations for generating lightning is the only thing I’ve found in that direction I think: https://www.lesswrong.com/posts/4Bwr6s9dofvqPWakn/science-as-attire

dynomight

> different ways to compress strings

I object! :) My main takeaway is that compressing strings doesn't capture what we mean by information, and what we should be doing is compressing *phenotypes*.
