Biological data storage

Just a quick note on some probably unoriginal ideas I had about categorizing biological data storage methods.

While it’s important to note that these are listed in roughly the order of their evolution and respective capacity, there’s nothing particularly deterministic about their evolution. Some of them are more-or-less necessary, invoking the anthropic principle, to write this blog, but plenty of microbes get on fine with the supposedly less sophisticated subset of them.

This blog is inspired in part by the astonishing Pfizer/Moderna mRNA vaccines for COVID, which I just cannot wait to get shot into my immune system. It’s so damn cyberpunk!

DNA/RNA. The last universal ancestor used DNA to store information, most notably the instructions to make copies of itself. Its species could adapt to its environment through evolution and natural selection, which represents a very distributed and slow mechanism for reading and writing data into storage. DNA is a digital storage mechanism. Humans have about 3 billion base pairs of DNA, representing 6 gigabits of raw data. There is, however, a lot of redundancy and non-coding DNA included, and all humans share about 99.9% of their DNA. So the individual information entropy separating even unrelated individuals may be as little as 6 megabits. By comparison, a novel might have about 100 kbits of entropy.
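As a back-of-the-envelope check on those numbers, here is a minimal Python sketch. It assumes 2 bits per base pair (four possible bases) and treats the 0.1% of DNA that differs between people as if it were plain incompressible data, which is a simplification. The function names are only for illustration.

```python
import math

BASES = 4  # A, C, G, T

def raw_genome_bits(base_pairs: int) -> float:
    """Raw capacity, assuming each base pair carries log2(4) = 2 bits."""
    return base_pairs * math.log2(BASES)

def individual_bits(base_pairs: int, shared_fraction: float) -> float:
    """Bits in the portion of the genome not shared between individuals (crude estimate)."""
    return raw_genome_bits(base_pairs) * (1.0 - shared_fraction)

genome = raw_genome_bits(3_000_000_000)
unique = individual_bits(3_000_000_000, 0.999)
print(f"raw genome: {genome / 1e9:.1f} gigabits")
print(f"individual difference: {unique / 1e6:.1f} megabits")
```

Changing `shared_fraction` shows how sensitive the "6 megabits" figure is to the 99.9% estimate.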

Immune System. Phylogenetic bracketing indicates that the adaptive part of our immune system evolved during the Cambrian explosion, 525 million years ago, and is shared by all vertebrates. It uses lymphocytes to recognize, attack, and remember foreign pathogens in the human body. It’s super duper complicated and I don’t understand it, but here I’m interested in its data capacity. Humans are exposed to millions of pathogens every day, and T cell receptors can “spell” at least hundreds of thousands of different shapes. Over time memory may fade, while autoimmune problems are caused by errors in the delete function. I point this out only to reinforce that, like DNA, data stored in hundreds of millions of immune lymphocytes is not necessarily data in a fully general sense, but it’s safe to estimate its overall capacity at around a gigabit of information. Unlike DNA (although employing a similar selection mechanism in miniature) an individual organism can update their immune system through infection or vaccination in as little as a couple of days, conferring greatly increased immunity for the remainder of their life. Parts of this immune system can even be transmitted to offspring through, e.g., milk. A biological data storage mechanism that can react effectively within days, compared to the countless deaths over hundreds of generations required for natural selection, is a very valuable adaptation.

Bee dances. While not a part of our collective human evolutionary history, bee waggle dances are so damn cool I am including them anyway. Strictly speaking the waggle dance doesn’t form any kind of persistent collective cultural memory, but it does provide some insight into the navigation and learning capacity of superficially simple insects. The waggle dance encodes information about the quality, distance, and solar bearing of flowers and does so by transferring perhaps 15-20 bits of information in a repeating action that takes a few seconds to communicate. The ability of spectator bees to successfully navigate to the pollen source varies substantially among members of the hive, with some preferring to forage at random.

Brains. While the earliest vertebrates had nervous systems, they weren’t much more sophisticated than worms and probably didn’t have particularly deep thoughts. Yet many fish (including their mammalian descendants) and even a few invertebrates (particularly cephalopods) went one step further and evolved brains to integrate sensory information, coordinate motion, plan, and learn from the environment. Despite common myths, even goldfish can remember things for much longer than four seconds! Again, I’m far from an expert on brains but I will point out that birds make do with much smaller brains than many mammals, possibly due to more efficient neural structure.

Brains are much more like conventional computers when it comes to data capacity and read/write capability, compared to the evolution of DNA or the adaptations of the immune system. It’s difficult to quantify the capacity of the human brain but some estimates put it at around 2.5 petabytes. Certainly I’ve never experienced running out of space, although it seems that perhaps retrieval is more troublesome than storage. There are known cases of patients in brain surgery responding to a stimulus by reciting, verbatim, books read half a century prior and almost certainly (except in rare cases) completely “forgotten” as measured by conventional means. Human memory is contextual, constructed, and often unreliable. And yet its storage capacity is effectively infinite.

As for reading and writing, I often feel that these are major bottlenecks. Within certain contexts, humans can absorb vast quantities of information very quickly, such as an afternoon spent learning a new skill or reading a good book. Other sorts of data, less anchored to the familiar, are much harder to retain and yet, with practice, can be remembered with astonishing precision. Experts in music or chess can often recite symphonies or games move by move after only a single viewing.

As for writing, a human can speak or type intelligibly at around 100 words per minute. With the assistance of intuition, context, and other people the data rate from people in positions of leadership can be increased incrementally. But 100 bits/second isn’t a whole lot, especially when most people would struggle to maintain this rate for more than a few hours at a time. On occasion I write relatively prolifically, but seldom more than 10,000 words in a day, representing a limit of about 10 kbits/day, given that the information entropy of text is about one bit per word.
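To make the arithmetic explicit, here is a rough sketch. The bits-per-word figure is the big assumption: one bit per word is the estimate used above, while counting raw 8-bit characters at roughly six characters per word (including the space) gives something much larger, which is closer to where a figure like 100 bits/second comes from. Both are parameters here, not measurements.

```python
ENTROPY_BITS_PER_WORD = 1.0  # assumption used in the text
RAW_BITS_PER_WORD = 6 * 8    # assumption: ~6 characters per word, 8 bits each

def bits_per_second(words_per_minute: float, bits_per_word: float) -> float:
    """Convert a typing or speaking rate into a data rate."""
    return words_per_minute * bits_per_word / 60.0

def bits_per_day(words_per_day: float, bits_per_word: float) -> float:
    """Convert a daily word count into a daily data volume."""
    return words_per_day * bits_per_word

print(bits_per_second(100, RAW_BITS_PER_WORD))      # raw, uncompressed text
print(bits_per_second(100, ENTROPY_BITS_PER_WORD))  # entropy-based estimate
print(bits_per_day(10_000, ENTROPY_BITS_PER_WORD))  # a prolific writing day
```

Whichever assumption you prefer, the output side of the brain comes out many orders of magnitude slower than its storage.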

Speech and oral history. Jumping more fully into human prehistory, modern social structures suggest that ancient kin groups or tribes employed communication to improve their odds of survival. Here the question is how much data can be reliably transmitted from generation to generation using purely oral history. As a data storage mechanism it’s relatively flexible and able to accommodate the addition of new information, but its capacity is limited by the weakest link in the chain. Who knows how many times various facts of nature were discovered by neolithic scientists only to be lost due to cultural discontinuities?

There are about 180,000 words in each of the Iliad and Odyssey, two Homeric Greek epic poems passed down orally for generations prior to being written down. We know of numerous other epic poems from this era that have not survived to the present day, though many of them were written down during the classical period and lost more recently. For comparison, the Bible has about 780,000 words.

There are Australian Aboriginal stories describing events such as volcanic eruptions that have been dated to about 13,000 years ago, so we know that oral history is able to transmit information over much longer time periods than the comprehensible coherence interval of a language. That is, after a thousand years a language typically changes enough that it is only about 80% similar to its earlier self. After 13,000 years, barely 5% remains. Even though it was written only a little over 600 years ago, Chaucer’s “Canterbury Tales” in Middle English is nearly incomprehensible to naive reading. Shakespeare, by comparison, wrote in Early Modern English.
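The 80% and 5% figures fit together if language change is treated as simple exponential decay, which is a simplifying assumption (real languages do not drift at a constant rate). A minimal sketch, with the 80%-per-millennium retention figure taken from the paragraph above:

```python
def fraction_retained(years: float, retention_per_millennium: float = 0.8) -> float:
    """Fraction of a language still recognizable after `years`,
    assuming a constant retention rate per thousand years."""
    return retention_per_millennium ** (years / 1000.0)

for years in (1_000, 5_000, 13_000):
    print(f"{years:>6} years: {fraction_retained(years):.1%} retained")
```

Under this model the stories outlive the language they were first told in by a wide margin, so they must have been retold and effectively retranslated many times along the way.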

At typical rates of recitation, it takes about 2 hours to speak one of the 24 books of the Odyssey. If a prehistoric society tells stories on a yearly cycle, then about 2.5 million words can be spoken. Of course, some breaks might be taken, and some stories would only be told to certain people or at certain times. The quantity of information that can be reliably preserved, however, is not much more than half a dozen or so epics of similar scale to the Odyssey.
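Here is one way to reconstruct the 2.5 million figure. The storytelling schedule is an assumption that the text doesn't state: roughly one two-hour session a day, all year. The words-per-book number comes from dividing 180,000 words by 24 books.

```python
WORDS_PER_EPIC = 180_000
BOOKS_PER_EPIC = 24
HOURS_PER_BOOK = 2.0

def words_per_year(hours_per_day: float, days_per_year: int = 365) -> float:
    """Words spoken in a year of storytelling at epic-recitation pace."""
    words_per_hour = WORDS_PER_EPIC / BOOKS_PER_EPIC / HOURS_PER_BOOK
    return words_per_hour * hours_per_day * days_per_year

# Assumption: one two-hour session per day (not stated in the text).
spoken = words_per_year(hours_per_day=2.0)
print(f"words spoken per year: {spoken:,.0f}")
print(f"equivalent epics per year: {spoken / WORDS_PER_EPIC:.1f}")
```

With a few breaks this lands near 2.5 million words. The spoken volume is larger than the half dozen epics that can be reliably preserved, which is consistent with much of what gets told being repetition.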

Writing. Symbolic representation of words as text in a persistent medium arose independently at least four times in Mesopotamia, Egypt, China, and Mesoamerica. Writing is useful not only because it has essentially unlimited capacity and endurance (almost), but it also enables all kinds of bureaucratic technology that is necessary for large scale organization of civilizations. As previously discussed, the read/write speeds of writing are about 100 bits/second, but because reading requires relatively little effort, a motivated scholar can read all day (and night, if they have artificial light) for years, absorbing the thoughts and ideas of hundreds of other people, even if those people are no longer alive, never spoke the same language, lived in a different culture, or were entirely imaginary.

The ancient Library of Alexandria stored between 40,000 and 400,000 scrolls at its height, in around 200 BC. While undoubtedly some of them were not very interesting, as a store of information it may have contained the equivalent of as much as 100,000 books, or 10 gigabits of information. For comparison the Library of Congress has 170 million catalogued items, including 39 million books. The text-only English-language Wikipedia is about 20 gigabytes, or 160 gigabits (uncompressed).
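The conversions here are simple enough to sketch. The per-book figure of 100 kbits is the novel estimate from the DNA section (about 100,000 words at about one bit per word), so the library totals are entropy-style estimates, while the Wikipedia number is raw uncompressed bytes. The two are not strictly comparable.

```python
BITS_PER_BOOK = 100_000  # assumption: ~100k words at ~1 bit per word

def library_bits(books: int, bits_per_book: int = BITS_PER_BOOK) -> int:
    """Estimated information content of a collection of books."""
    return books * bits_per_book

def gigabytes_to_gigabits(gigabytes: float) -> float:
    """8 bits per byte."""
    return gigabytes * 8

alexandria = library_bits(100_000)
congress_books = library_bits(39_000_000)
print(f"Alexandria: {alexandria / 1e9:.0f} gigabits")
print(f"Library of Congress (books only): {congress_books / 1e12:.1f} terabits")
print(f"Wikipedia text: {gigabytes_to_gigabits(20):.0f} gigabits")
```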

The Printing Press. While a dedicated scholar could probably have read the entire collection at Alexandria in a lifetime, mass consumption and dissemination of knowledge, to say nothing of preservation of classical texts, was not really possible when all books were written by hand. Ada Palmer has a great anecdote about pre-Gutenberg books in medieval Florence costing more than a house, because creating a single one took a scribe more than a year of labor. Then came the printing press in the 1400s, whose modern form, the linotype, was phased out as recently as my childhood in favor of digital methods. Within a generation of the press’s arrival, the cost of a book was determined not by a year of skilled labor, but by the cost of paper and ink, an eventual improvement of five orders of magnitude. The printing press enabled the expansion of medieval university libraries from thousands of volumes to millions and freed most scholars from the day-to-day drudgery of copying text after text. Books have been written about the social changes enabled by widespread literacy. While the read/write speed for a well-resourced individual was much the same, the overall flux of knowledge through the world’s people increased by many orders of magnitude.

Computers and The Internet. While books represent a transportable, durable, and accessible form of information, they still lack several intuitive aspects of knowledge that we take for granted in thought or conversation. Beginning with the Mother of All Demos, we began to see the emergence of a new knowledge medium: digital information. Mutable, storable, infinitely replicable, transportable, flexible, and enabling hyperlinks, the internet represents a major improvement in the capacity of evolved life to manipulate data.

While earlier forms of rapid communication, including the postal service, telegraph, and radio, offered steady improvements in both speed and data capacity, none of these media were also durable stores of information in the way that the internet is.

In 2020, read/write speeds, download/upload speeds, and processing speeds are all measured in gigabits per second. Podcasters and streamers can exploit data abundance to avoid spending cycles precompressing data for the written medium. Bleeding edge software engineers automate as much as possible to avoid bottlenecking their algorithms with input from humans who, after all, can type barely 100 words per minute and must sleep 8 hours a day. In many ways, the data has taken on a life of its own, connected to our own minds by a few tenuous strings that we are cutting as quickly as we can. In this world, I can text the entire data contents of my DNA, immune system, oral history, and the Library of Alexandria to my friend in seconds. I’m sure that’s just what they want.
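For the curious, here is the "in seconds" claim as a quick sketch, using the capacities estimated earlier in this post. The oral history entry is an assumption on my part (half a dozen epics of 180,000 words at about one bit per word, so roughly a megabit), and the 1 gigabit/second link speed is just a round number.

```python
# Capacities in gigabits, taken from the estimates earlier in the post.
STORES_GIGABITS = {
    "DNA (raw)": 6.0,
    "immune system": 1.0,
    "oral history": 0.001,  # assumption: ~6 epics x 180k words x 1 bit/word
    "Library of Alexandria": 10.0,
}

def transfer_seconds(gigabits: float, link_gigabits_per_second: float) -> float:
    """Time to send a payload over a link of the given speed."""
    return gigabits / link_gigabits_per_second

total = sum(STORES_GIGABITS.values())
print(f"total payload: {total:.3f} gigabits")
print(f"at 1 Gbit/s: {transfer_seconds(total, 1.0):.1f} seconds")
```

Typing the same payload by hand at 100 bits/second would take several years of nonstop work.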

Neuralink? At this point I think it’s fair to say that the capabilities of the medium have grown beyond any meaningful constraints, and the only remaining bottleneck is factory-standard human i/o capability. What evolved on the African savanna to be adequate for sensory input is not really equal to the task of enabling large scale communication and coordination, needed to build our grand future global cybernetic collective. Indeed, how much effort is expended in human organizations routing around the damage and limitations of our imperfectly evolved bootstrapped communication capabilities? A lot!

The idea is that brains are capable of impressive feats of internal computation, and so are our computers of one form or another, but the eyes, ears, and typing fingers connecting them are due for an upgrade. Neuralink is the most obvious candidate here, about which Wait But Why has a comprehensive explainer. The most recent update shows promise along the path of generating a generic, high bandwidth neural interface. I think it’s fair to say that a revolutionary increase in brain-computer data rate is a matter of time, whether years or decades, and that it will be on par with the printing press for changing the ways humans interact with their knowledge environment.

Conclusion. I think it’s tempting to review the history of data transmission as an exercise in inevitability, when in fact there’s no guarantee that any of this would occur, given the laws of physics. However, if we see ourselves as some kind of evolved entity that derives utility from manipulating data we can understand why each of these major increases in capacity did, in fact, occur, as well as chart a course to a more integrated, connected future.

10 thoughts on “Biological data storage”

  1. While in Australia some years ago, I learned that our Eurocentric word “elder” is an overly-simplified approximation of the Aboriginal term for “story teller”. In fact, contrary to the implied English meaning, some Aboriginal story tellers might be young girls. Cultures without a formal written language need to have mechanisms to filter out unwanted “noise” in the information being passed along over time. Apparently, members of the community hearing a story repeated would collectively assess how well it matched their recollection of previous iterations of the story — and the story teller with the most accurate re-tellings would be recognized as the “elder”.

    I saw references to this in several places, but the one that sticks out was at the Naracoorte Caves interpretation centre in SA, where visitors can see evidence of mega-fauna remains from 30,000 years ago that fit surprisingly well to Aboriginal oral history.

    Some form of error correction is needed for any type of “data transmission” to work effectively. Since many cultures with an oral tradition have been around for a long time, I would think that the methods they use should have useful parallels to our culture’s strong (and relatively recent) dependency on technologies of literacy.

    1. Great point. I think error correction (and innovation) is probably easier with contextual information, e.g. “uses of this plant”, but information must still have been lost all the time.

      1. Yes; prescriptive information would typically have more immediate usefulness, so it would have a higher requirement to be “right” for an oral culture. And, as the information became less relevant, it would tend to disappear (at least locally in the case of a plant that doesn’t grow in a new area for example).

        But is it possible that the lengthy & complex stories so common in cultures over the ages include hooks about information only carried in the fringes of the collective memories? The full details wouldn’t need to be stored by everyone if the hooks triggered enough recalls to pass along. Distributed storage? Perhaps as members of a culture that has become so reliant on written information, we may have suppressed an innate ability to detect the links to other content contained within the stories.

  2. Human eyes are apparently 8.75 megabits/s (https://www.newscientist.com/article/dn9633-calculating-the-speed-of-sight/).

    I don’t think Neuralink is necessarily the obvious candidate, or that a revolutionary increase in brain-computer data rate (for the non-medical purposes you’re describing here) will come in years or decades. Certainly not years, maybe decades, possibly centuries.

    For applications like prosthetics or fixing spines, developing denser, safer, more precisely-placed electrodes could be the main challenge, and we could see advances there over the next few decades. The progress will likely happen slower than people would like. Researchers have been working on this for a while; Neuralink may lead if it’s the most highly-funded, but I don’t think it has any obvious standout advantages in talent or approach.

    For non-medical uses (which I think this article is focused on), I don’t think electrode technology is the only challenge. A consumer BCI has to be able to transmit thoughts, feelings, knowledge, skills, etc to be worthwhile. If it’s just a combined cell phone/keyboard/oculus/music player that requires brain surgery, there’s no point, except maybe for hardcore immersive gamers.

    That means we would need to have a complete understanding of how the brain learns, thinks and remembers, so we can interface to those systems. That’s hundreds of open scientific questions, worth probably dozens of Nobel prizes. I don’t think that will happen on anybody’s schedule. Maybe I’ll look dumb a couple of decades from now, but I think the Singularity will take a lot longer than that.

  3. The ability of oral tradition to record and remember events is massively overstated. Here (NZ) the Maori oral tradition had completely forgotten about the moa only a couple of hundred years after they had been a big “industry”, leaving substantial archeological remains.

    The early Europeans found moa bones and showed them to the Maori, who were surprised and had no traditional knowledge of these somewhat obvious birds.

    That said, I believe that you massively understate the advantages of oral transmission in things that directly affect the “tribe”. Being able to tell people how to do things was IMHO a massive step forwards.

  4. A little off-topic but… given the deep influence of computer science on modern biotech, I expect Elon Musk’s next venture (besides Neuralink) will be in biology/medicine.
    His background in CompSci, his interest in HPC (high-performance computing, e.g. Tesla Dojo), and his Twitter feed over the past year are strong indicators.
    He replied to his employee at Tesla (Andrej Karpathy) that mRNA turned bio into a software problem. https://twitter.com/elonmusk/status/1343002225916841985

    and he demonstrated deep interest in the field a couple of times before that twitter.com/elonmusk/status/1328421992144355328
    twitter.com/elonmusk/status/1343105835958857730

    Maybe it just becomes another branch of neuralink, who knows…

  5. We store large amounts of information biologically in (at least) a couple ways not directly mentioned above.

    One is epigenetics. DNA is modified by methylation and acetylation, it’s packed into chromatin and unpacked again, and various proteins bind to it. This information accounts for most of the long-lasting difference between one cell type and another.

    The other is protein phosphorylation. The amino acids serine and tyrosine have hydroxyl groups where a phosphate can be added (if the serine or tyrosine is exposed on the surface of the protein, and there’s an enzyme to carry out the reaction). This stores a lot of the short-term information about the state of every cell.

    Then there are the mechanisms that were mentioned implicitly, because they underlie the function of the nervous system: ion flow across membranes; transfer of trans-membrane proteins between the cell surface, the membranes of vesicles, and the membranes of other organelles; and the physical arrangements of cell parts (notably the axons, dendrites, and synaptic spines of neurons). These mechanisms are not unique to the nervous system, but process large amounts of information in determining the structure and function of other tissues too.
