What does a complete model of biology look like?

by Phylo
20th Apr 2025
13 min read


Why am I posting this on LessWrong?

Biological models inhabit the intersection between biology, philosophy, and computer science. I have not yet found good answers to my question, despite asking professional biologists, philosophers, and others. Perhaps the community has some ideas.

What does a complete model of biology look like?

Which genes increase my chances of Alzheimer’s, by what mechanisms, and what can we do about them?

How do eyes work?

How do I make flying monkeys?

These are the questions that people look to biology to answer. Are the mountains of data we biologists are generating efficiently bringing us closer to answering them? How explanatory and predictive are the models we build upon this data?

How can we unify these models into one?

History unequivocally illustrates again and again how central this last question is. Maxwell’s unification of electromagnetism reconciled a patchwork of disparate observations into a coherent theory and made it far more powerful than the sum of its parts. Not only did hitherto disorganized and confusing experiments suddenly make sense, their incorporation into a whole immediately provided deep insights, such as reducing light to an electromagnetic wave.

Maxwell’s is not the only great unification. Newton’s Principia and Mendeleev’s Periodic Table are other examples.

Image 1: For millennia people have wondered what everything is made of. The Periodic Table is a sublimely succinct summary of our world. [1]

In biology, we have had analogous unifications, such as the Modern Synthesis of evolution and genetics. However, biology is still a cacophony far from an orchestra.

Most experiments today are generating only patchwork explanations and insights

Biology is generating enormous amounts of data. This data is extremely multimodal in its experimental sources and therefore its encoding: everything from DNA and RNA sequencing, to mass spectrometry, to fluorescent imaging, to cryo-electron microscopy, and more; each of these categories is in turn subdivided into a plethora of techniques and therefore data types.

I have tried, without success, to find decent estimates of the total amount of biological data generated annually. As a floor, considering only genomics research by itself, it is at the very, very least somewhere between 2 petabytes and 4 exabytes annually [2]. This is an obvious underestimate. For example, a single publication a few weeks ago yielded a multimodal dataset on one cubic millimeter of brain tissue in a single mouse, featuring co-registered structural and functional data [3].

This paper’s data is open for everyone to analyze. So, we have all this data: great, now what? How do we make sense of it? Are we using it efficiently, or is most of the signal wasted? Suppose tomorrow someone else publishes results from a similar, but not identical, experiment. Will the two datasets be mutually informative and integrable into a single coherent model? Or will they remain islands fit to model only their particular experiment? Is there a common (multidimensional) space in which all their data points are apples-to-apples?

As a bioinformatician, I believe that our inability to encode biology is the primary philosophical barrier to creating a complete model—and therefore understanding—of biology. We should not simply outsource this question to the philosophy department: it is a practical barrier for our science.

Other sciences have already traveled this road. Let’s see what we can learn from them.

What does it mean for a theory to be complete? Physics as archetype.

We will use physics as the archetype of a complete theory.

Image 2: A t-shirt encoding most of physics [4]

I can buy a t-shirt (Image 2) with all of physics on it. It describes all the quantum fields, and I believe only gravity is missing from it, but that can be amended without difficulty using analogous formulas. Even if tomorrow the Large Hadron Collider discovers a new particle, all we would need to do is add another variable or two, maybe some constants, and we’re done. We know how to reductionistically encode all of physics, and how to apply this encoding to describe real phenomena.

Let’s zoom out a little to chemistry. All of chemistry is also encodable. We have the table of the elements (Image 1), electron pushing heuristics, and, if we’re desperate, we can reduce it to quantum mechanical simulations on a computer. There’s really not too much mystery about how to describe any reaction in chemistry.

I want to emphasize that I am not saying we know everything about physics or chemistry. There is plenty left to be discovered. Perhaps we will have another revolution or two, as with quantum mechanics and relativity. I am also not saying that these encodings are sufficient to automagically calculate and predict all possible scenarios: there certainly are systems whose modeling is beyond extant computational capacity. What I am saying is that as of right now, we know how to comprehensively encode physics and chemistry, reduced to their moving parts, yet with enormous explanatory and predictive power.

Not so with biology.

One important thing we have to address is a terminological problem: what do we mean by “physics” and “biology”? One of the implicit artifacts of this question is the socially defined boundaries of these fields. After all, all of the sciences are really just parts of our single science that studies the world, and their labels are primarily for organizing our college departments and library shelves. These labels are in large part historical and social constructs. The word “physics” colloquially refers to the study of the fundamental constituents of matter and energy, forces and fields; it does not include explaining galaxy formation unless you prefix it, as in “astrophysics”. The cows are spherical.

Physics is one of the older sciences, and is archetypical. It is highly susceptible to reduction. What are the questions we consider “most fundamental” in physics? It’s the particles & fields questions. Things like fluid dynamics are classified as downstream of fundamental physics. When people talk about “a complete theory of physics”, what they mean is that all phenomena can be reduced to some unified model of the most elementary building blocks of our universe. Today, that is the Standard Model + General Relativity. Solving Navier-Stokes? Details.

Physicist YouTuber Angela Collier had an interesting video a while back [5] where she addresses complaints about how physics in the popular imagination has been stuck for over half a century because, while physicists have made lasers and all kinds of other crazy stuff, they haven’t yet combined the Standard Model and General (!) Relativity into a Theory of Everything. Our curiosity is predisposed to perceive this as physics' most important question. Reductive theories are not only the most satisfying, they are the “ultimate truth”.

Reduction is not without merit; it is extremely powerful. In physics, it allows experiments to inform and cross-validate each other rather directly. For example, whether I measure the gravitational constant with Cavendish’s torsion balance or with atom interferometry, my measurements should cross-corroborate each other. Contrast this with the smorgasbord of one-off data sets in biology.

So, is reductionism the answer to biology as well?

Before you go all xkcd and claim that all of biology is just chemistry, and all of chemistry is just physics, and therefore voilà: you’re missing the point. We cannot simply simulate an entire embryo at the atomic level. Aside from computational capacity, we do not know all of the initial conditions: atoms are memoryless, biological molecules are not. Not only that: brute-force calculations aren’t insight. How do we encode phenotype? What's next, Señor Borges, “La biblioteca de Babel”?

In biology, the most important “ultimate” questions that we really really want to know the answer to are not “how does this protein fold” (although of course that’s important): it’s questions about much more complicated systems, like the questions I opened this essay with. Historically, the word “biology” colloquially encompasses a broader range of inquiry than “physics”, and depending on your definition can include everything from single-molecule biochemistry up to multi-species communities and even (for some people) ecosystems. Here I’m going to constrain it to the scale of individual multicellular organisms. No population dynamics or whatever for now.

Thus the questions that we consider “most important” in biology are rather different from those of physics. And they are less susceptible to reductionism. (Reductionism in biology has its own extensive literature, and I won’t address all aspects of it here.) Cows aren't spherical!

So while a physicist is drilling down into “the true nature of reality” by asking what the most fundamental particles are and how they behave, the analogous questions that biologists consider to be central are more about, for example, “What is the relationship between genotype and phenotype?” This question is basically asking: if I observe an organism that has certain traits, how can I explain these traits in terms of the DNA encoding it? Note that this is not a question about what sorts of “particles” compose the organism. We already know that more or less: we know pretty much all of the molecules that exist in an organism, whether they are DNA, RNA, proteins, lipids, other organic molecules, metabolites, etc. The questions we are concerned about are how these molecules interact to produce the organismal properties and behavior we observe. Our lives would be much simpler if we were satisfied with simply understanding the basic molecules of biology. But we’re not.

Furthermore, there are the practical applications. Using fundamental physics to engineer something is perhaps difficult but routine. If you know Maxwell’s equations, you can design a radio without too much insanity. However, if you want to do something non-trivial to an organism, that’s way, way more difficult. Traditional drug development is designing small molecules that have macro effects on our bodies. This is hard!! There are so many interactions. Billions go into researching them. Modern medicine, including immunotherapy, RNA vaccines, and so forth, can be even more complicated. And this is before even getting into CRISPR and genetic engineering, where predicting all the effects of all but the most trivial edits is beyond our current capacity. So if you want your biological models to be not only explanatory, but also predictive of whatever edits you make, we have to do much better.

And, to be honest, biological models that cannot give us recipes to edit organisms are weaksauce. Can you imagine if the theory of electromagnetism was too lame to tell us how to build a radio? Boring!

Injecting an embryo with some DNA is not particularly technically challenging. I’ve done it multiple times. The injection is sequence agnostic: other than in cases of very long DNA, one sequence isn’t any harder or easier to inject than another. The real difficulty is knowing what to inject.

Very few traits are the consequence of one single gene. For example, sickle cell disease is caused by mutated β-globin genes. Editing both copies of this one gene to a healthy variant is sufficient to cure the disease. However, the vast majority of traits are highly polygenic: many, many factors influence their outcome. And we are only beginning to understand them.
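To make the monogenic-versus-polygenic contrast concrete, here is a toy sketch of the standard additive model behind polygenic scores. The variant names and effect sizes below are entirely made up for illustration; real traits involve thousands of variants plus interactions and environment.

```python
# Toy additive polygenic model: a trait value is the sum of many small
# per-variant contributions. All names and numbers here are invented.

# Hypothetical per-allele effect sizes for four variants
effect_sizes = {"rs_a": 0.30, "rs_b": -0.12, "rs_c": 0.05, "rs_d": 0.02}

def polygenic_score(genotype):
    """genotype maps variant id -> allele count (0, 1, or 2)."""
    return sum(effect_sizes[v] * count for v, count in genotype.items())

# Editing one variant only nudges the total; contrast with sickle cell,
# where editing one gene determines the outcome outright.
before = polygenic_score({"rs_a": 1, "rs_b": 2, "rs_c": 1, "rs_d": 0})
after = polygenic_score({"rs_a": 0, "rs_b": 2, "rs_c": 1, "rs_d": 0})
print(round(before, 2), round(after, 2))
```

In the monogenic case a single edit flips the phenotype; here, any one edit merely shifts a sum of many small contributions, which is one reason “knowing what to inject” is so hard.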

Here is the problem statement that a complete model of biology should aim for: given an embryo, and a list of desired traits, what are the modifications that will confer them? Difficulty in generating this recipe emerges not only from our ignorance, but from the shortcomings of how we encode biological knowledge itself. Therefore, an outstanding question that biologists have not yet answered is: what does a complete biology look like?

Current examples of bioinformatical encodings

Image 3: Metabolic pathways [6]

Image 4: Plant protein complexes [7]

We do have some nascent encodings in biology. Image 3 shows some metabolic pathways, and Image 4 some protein complexes. If you are curious, the topic to google is “systems biology”. Obviously, systems biology relies heavily on computationally modeling enormous amounts of data. Any comprehensive model of biology would have to be a computational encoding. Not even John von Neumann’s memory is sufficient.

However, even these maps are but very small slices of the whole. There are many of them across the literature. It’s unclear how you would unify them into a single model. What would be the common space into which we could transform all these maps together? We do not yet know how to systematically structure biological knowledge in a unified, comprehensive paradigm. Even in principle.
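To give a flavor of what such encodings look like computationally, here is a minimal sketch of a metabolic map represented as a directed graph, using a simplified fragment of glycolysis. The data layout is my own invention for illustration, not a standard from the systems biology literature:

```python
# A minimal sketch of a metabolic map as a directed graph: nodes are
# metabolites, edges are enzyme-catalyzed reactions. The entries below
# are a simplified fragment of glycolysis.
from collections import defaultdict

reactions = [
    # (substrate, product, enzyme)
    ("glucose", "glucose-6-phosphate", "hexokinase"),
    ("glucose-6-phosphate", "fructose-6-phosphate", "phosphoglucose isomerase"),
    ("fructose-6-phosphate", "fructose-1,6-bisphosphate", "phosphofructokinase"),
]

graph = defaultdict(list)
for substrate, product, enzyme in reactions:
    graph[substrate].append((product, enzyme))

def downstream(metabolite):
    """Walk the graph to list every metabolite reachable from a start node."""
    reached, stack = [], [metabolite]
    while stack:
        node = stack.pop()
        for product, _ in graph.get(node, []):
            reached.append(product)
            stack.append(product)
    return reached

print(downstream("glucose"))
```

Encoding one such map this way is easy; the hard, open question is what common space would let the many disparate maps scattered across the literature be merged into a single model.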

A Turing Test analog for biology

Now, there are other fields where we do not yet know the form a model or answer would take. Artificial intelligence is an example. We don’t necessarily know what form an artificially intelligent agent—and here I mean artificial general intelligence (AGI) that matches or surpasses human intelligence—takes. Perhaps the convolutional neural networks and large language models exploding in popularity are sufficient. Perhaps not. There may be entirely different models that prove superior. However, despite not knowing the form the “answer” will take, computer scientists have long been brainstorming criteria for deeming something AGI. The most famous criterion is the Turing Test, originally called “the imitation game”, proposed in 1950.

Image 5: The Turing Test [8]

The imitation game is played as follows (Image 5). A human interrogator (C) sits in a room with two chats open: one with another human (B), the other with a machine (A), and the interrogator does not know which is which. The machine passes the Turing Test if the interrogator cannot distinguish which chat is the machine. There are of course various philosophical criticisms of the Turing Test, but they are beside the point here.

Can we, with the Turing Test as inspiration, brainstorm analogous criteria for a complete system of biology? For example, one criterion could be: can I describe an organism’s phenotype to a computer and have it generate a recipe for how to create this organism? The computer would have to understand what my description in human language means. This recipe would have to be interactive: I as the user would need to be able to interrogate it in sufficient detail to understand every part. If this organism is not creatable, the computer would need to provide an interactive explanation of what the snags are.

Teleonomy as central to human understanding of biology, and therefore in human models and encodings of biology

Naturally, the computer's recipes and explanations will often be teleonomic. What is teleonomy? Let’s start with an example. Suppose you ask “What does myosin do?” Answer: myosin is a motor protein that plays a central role in muscle cell contractions, for example in cardiomyocytes. You continue, “And what do cardiomyocytes do?” Answer: cardiomyocytes are the contractile cells that make the heart pump. “And what does the heart do?” Answer: it pumps blood, which delivers oxygen and nutrients throughout the body. So now you understand what myosin does.

It is true that it is possible to explain myosin matter-of-factly and purely mechanistically. I could restrict myself to only saying things like, “Here is the structure of the myosin protein”, and “here is its motion along actin filaments”. However, simply reeling off a list of facts will not be nearly as enlightening without also conveying what myosin’s “function” or “purpose” is. Here’s a challenge: try to explain to yourself what some body part does without using the language of “purpose”. It’s very challenging!! Even with simple statements like “the pancreas secretes insulin”, you are implicitly ascribing to the pancreas the “purpose” of secreting insulin. (The pancreas is doing a whole lot other than just secreting insulin! Think about it…) And, of course, “the pancreas secretes insulin” makes sense so succinctly in a broader context because we know insulin has the “purpose” of regulating blood sugar, and so on and so on.

I remember a wonderful lecture from Roderick MacKinnon, who won the Nobel Prize for researching ion channels. His last slide was the full, high-resolution structure of an ion channel protein complex, revealed by cryo-electron microscopy about twenty years or so (?) after he started researching the channel. Before this structure, he was a blind man feeling the leg of an elephant. Nevertheless, he did learn a whole lot about the function of each amino acid in that channel by doing things like mutating them. So when he finally had the full structure, all of his observations made perfect sense, and achieved the gold standard of understanding how structure leads to function. Imagine how much less insightful that structure would be if we only had the coordinates of its atoms, without the teleonomic assignment of functions to them!

Imagine you are Zeus and omnisciently know the location of every single molecule in your body. How would you describe what each of those molecules do? To the human mind, the clearest explanations would be those that would describe these molecules in the broader context of the system, in terms of their “function” or “purpose” as part of the whole. Lactase digests lactose. Polymerase copies DNA. Biologists use teleonomy regularly. If you were to look up myosin on the biological database UniProt, you would see annotations about what it does, for example muscle contraction. It is by teleonomically propagating “purpose” from the heart, to its cells, to their proteins that you have a clear understanding of what’s happening.
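The myosin → cardiomyocyte → heart chain above can be sketched as “purpose” annotations propagated up a containment hierarchy. The data layout and wording here are invented for illustration (real databases like UniProt use far richer annotation schemas):

```python
# Sketch of teleonomic explanation as a walk up a containment hierarchy:
# each part's "purpose" is stated relative to the whole it belongs to.
# Part names follow the essay's myosin -> cardiomyocyte -> heart chain;
# the structure and phrasing are invented for illustration.

parts = {
    # part: (parent it belongs to, its "purpose" within that parent)
    "myosin": ("cardiomyocyte", "generates contractile force"),
    "cardiomyocyte": ("heart", "contracts to make the heart pump"),
    "heart": (None, "pumps blood to deliver oxygen and nutrients"),
}

def explain(part):
    """Chain 'purpose' annotations from a part up to the whole organism."""
    chain = []
    while part is not None:
        parent, role = parts[part]
        chain.append(f"{part}: {role}")
        part = parent
    return chain

for line in explain("myosin"):
    print(line)
```

The point of the sketch is that the explanation of any one part bottoms out only by climbing the hierarchy; strip away the parent links and the “roles”, and all that remains is an uninformative parts list.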

Notice I consistently use quotes around the word “purpose”. This is to emphasize that this “purpose” does not objectively exist in the real world, and that it is simply an artifact of how our minds think. Teleonomy’s predecessor, teleology, would not place “purpose” in quotes. Teleology claims that these purposes indeed exist—how would you make sense of living organisms otherwise!—and that our body parts have intentions—for example that our eyes are for seeing—and that usually the ultimate teleological purposes are ascribed to a creator god. In fact, teleology has been used as an argument for the existence of god.

This is, of course, a fallacy. Just because our brains think in terms of “purposes” does not make those purposes real.

Anyways, you can see that the computer’s model of biology would not only have to include knowledge of biology itself, but also knowledge about how human minds interpret information.

This, of course, is an insanely ambitious criterion. However, I consider anything less than that incomplete. And any machine that can do this would likely exceed our own intelligence to begin with.

Perhaps our intellectual quests, whether physics or chemistry or biology, are soon going to exceed the capacity of unaugmented humans to solve. How many more Newton- and Einstein-like leaps do we have left in us? Within how many generations will we approach a threshold where scientific models exceed even our geniuses’ minds?

What will we be satisfied with?

Citations

  1. https://en.wikipedia.org/wiki/Periodic_table
  2. https://www.genome.gov/about-genomics/fact-sheets/Genomic-Data-Science
  3. https://www.nature.com/articles/s41586-025-08790-w
  4. https://www.redbubble.com/i/t-shirt/the-pillars-of-physics-quantum-mechanics-and-general-relativity-by-NoetherSym/132782659.FB110
  5. https://www.youtube.com/watch?v=d_o4k0eLoMI
  6. https://bsahely.com/2019/01/03/metabolic-pathways-sigma-aldrich-com/
  7. https://pubmed.ncbi.nlm.nih.gov/32191846/
  8. https://en.wikipedia.org/wiki/Turing_test