Outliers Over Averages
Rethinking AI for biology
Spend enough time working with new technologies, and you will inevitably see huge sums of money being thrown at incomplete solutions. It happens all the time, lurking in the spaces between the headlines. Once you notice it, it’s like an itch you can’t scratch. You know that with just a little more attention in the right places, the chasm between what is and what could be can be bridged.
This is happening across the sector that is automating science, and AI for biology is no exception[1]. I look around at the various AI tools positioned for creating new things with biology, specifically tools that aid the R&D process in synthetic biology. At first blush, there are a bunch of different tools deploying different strategies. But when I look more closely…how different are they really?
To generalize, the common strategy is to deploy vast datasets with as much molecular data as possible. If this were market modeling, we would call it the bottom-up approach. The most headline-grabbing models have achieved their status by predicting structures and interactions from sequence. This is of course a great leap in scalability over traditional (time- and resource-intensive) techniques like X-ray crystallography. But is it enough? To go back to market modeling, best practice would have you also build a top-down model, then compare the results of the two as a sort of “sanity check.”
As I see it, the molecular-based approach has three major limitations: it doesn’t scale well, existing data is nowhere near sufficient to capture the complexity of whole organisms, and by its very structure it is fundamentally limited in its ability to predict or develop novel biological characteristics.
What we have are tools that emerged from the perspective of iterating on or improving existing biology. To unlock the potential of bio-innovation, we need complementary approaches geared toward creating new biological characteristics.
The Limitations of the Molecular-Minutiae Approach
Current AI architectures excel at interpolation and struggle with extrapolation. They’re great at finding patterns within the bounds of what has been previously observed. But just as a person who has only ever walked on continuous ground struggles to imagine the vertigo of a cliff’s edge, an AI model cannot reliably extrapolate beyond the bounds of its training data.
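A toy illustration makes the interpolation/extrapolation gap concrete. The sketch below is not any real biology model: it uses a simple nearest-neighbor averager as a stand-in for a pattern-learning system, and the target function (y = x²), the training range (0 to 10), and the two query points are all assumptions chosen purely for illustration.

```python
def knn_predict(train, x, k=3):
    """Predict y at x by averaging the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def true_value(x):
    """The 'ground truth' the toy model tries to learn (assumed: y = x^2)."""
    return x ** 2


# The model only ever observes x between 0 and 10.
train = [(i / 10, true_value(i / 10)) for i in range(101)]

# Error at a point inside the observed range vs. one far outside it.
inside_error = abs(knn_predict(train, 5.05) - true_value(5.05))
outside_error = abs(knn_predict(train, 20.0) - true_value(20.0))

print(f"error inside observed range:  {inside_error:.2f}")
print(f"error outside observed range: {outside_error:.2f}")
```

Inside the observed range the prediction lands close to the truth. Outside it, the model can only echo the edge of what it has seen, so the error grows by orders of magnitude. More sophisticated architectures fail less crudely, but the underlying constraint is the same.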
The pharmaceutical industry serves as a cautionary tale. Despite decades of focus on molecular understanding, billions in R&D investment, and increasingly sophisticated computational tools, clinical trial failures persist at staggering rates. If the molecular approach can’t reliably predict how a single molecule will behave in the human body, how can we expect it to predict what will emerge when we combine biological systems in novel ways?
One self-imposed limitation is that we are trying to answer this question using data that came from reductionist methodologies, experiments in which we try to understand pieces by breaking them into little parts instead of understanding them in the context of the whole. We’re studying the notes of the instruments in excruciating detail, hoping to understand the orchestra. In order to fully unlock the potential of synthetic biology to create new things, we need to go beyond “how does this known protein fold?” and instead ask “what is most likely to create a new characteristic?”
The computational power required to extend molecular design modeling to the scale of entire organisms is mind-numbing. Check out this great description from Cradle CEO Stef van Grieken about their model for protein design. In his words, with regard to resolving a single small protein: “This negotiation takes time and significant compute.” Now expand that to encompass entire pathways, crosstalk between pathways, intracellular compartments, cell-cell interactions, tissue-tissue interactions, and organ systems. Each of these layers increases the complexity, and with it future computing demands, by orders of magnitude.
That means that even if these tools can eventually be developed for creating new biology, they will only represent a shift in resources from the wet lab to computation. The costs will remain too high to enable our ambitions for bio-innovation.
Learning from How Nature Innovates
Science (today) is full of reductionism. Nature is not.
Reductionism is baked into the scientific method we’ve used since the Enlightenment, and in evolutionary biology it’s been reinforced by more than methodology. Charles Darwin’s observations of niche speciation fundamentally shaped our evolutionary framework. His focus on gradual change, selective pressure, and divergent adaptation became the lens through which we view all biological innovation.
I suspect that our agricultural experience only reinforced this perspective. Cross-breeding animals or plants often fails or produces sterile offspring. Meanwhile, selective breeding has given us a wider variety of nutritious vegetables and easier-to-manage animals. It’s a perfect recipe for a bias toward divergent evolution, which is indeed the more common mechanism.

But while divergence is the most common form of evolution, we know that convergence, in the sense of separate lineages coming together, has played a very important role in the evolution of life on Earth. Unfortunately, we’ve fallen into the trap of describing these convergent events as isolated “exceptions.”
In our bias, we’ve largely overlooked biology’s merger and acquisition events: hybrid speciation and symbiogenesis, respectively. These processes create genuinely novel characteristics through combination and integration rather than gradual modification. They represent biological innovation through synthesis, exactly what synthetic biology seeks to achieve.
Current datasets, analytical frameworks, and predictive models are all built on the assumption that biological innovation happens gradually, through small incremental changes. We’re trying to study biological creativity using tools designed only for optimization.
The Power of Biological Acquisitions
Two of the most commonly recognized examples of symbiogenesis resulted in mitochondria and chloroplasts, and I liken these to acquisition events. Billions of years ago, there were very small cells that were capable of using oxygen to more efficiently extract energy. This was a great advantage, as the process using oxygen is up to 20x more efficient than the non-oxygen using process.
At this time, there were also other cells, much larger and more complicated, which did not yet have the advantages of being able to use oxygen. In business, this would be viewed as a prime opportunity for an acquisition. And indeed, the larger cells took in the smaller cells, and over time they streamlined their processes until they were a single organism with broader capabilities than either had alone.
The success of mitochondria is so great that all complex (eukaryotic) life on Earth has evolved from these mitochondria-containing cells. This includes a large variety of single-celled organisms such as protists, as well as all animals, plants, and fungi.

Chloroplasts have a similar origin story. I often think that chloroplasts are the most under-appreciated feature of biology, as they make sugar from sunlight, water, and CO2. When you talk about something materializing out of thin air, chloroplasts actually do that.
As with mitochondria, smaller cells capable of photosynthesis were taken up into larger, more complex cells. Over time the processes were streamlined, and the result was so successful that it led to the downstream evolution of algae and all types of plants.
But the story of chloroplasts and algae goes further. While all plants contain chloroplasts from a common ancestor, different algae have acquired different chloroplasts at different points in time. There are even documented cases where it appears that the process of streamlining genes is still underway[2], offering both a unique opportunity to observe the process in action, and serving as a reminder that this has not only occurred multiple times but has happened across time as well.
Potatoes: The Merger That Feeds the World
Similar to how mitochondria and chloroplasts are examples of acquisition-like events, there are also examples of merger-like events in biology. Case in point: potatoes.
Incredibly, after decades of jokes about how to pronounce potato and tomato, it turns out that they are in fact related. More specifically, the potato evolved from the coming together of an ancestor of the tomato plant and another type of plant called etuberosum[3]. Despite the fact that you can find the word “tuber” in etuberosum, that plant does not in fact produce tubers. Tuber production arose uniquely in potatoes, as a result of the combined genetics of both the tomato and etuberosum lineages.

This is a very cool example of a new characteristic emerging in biology. Potatoes might seem ordinary but they represent a novel way for plants to store energy, and a lot of it. And it’s an incredible development considering that neither tomatoes nor etuberosum have this ability. This ability to store energy in tubers was so helpful to the survival of potatoes that it supported the subsequent development of over 4,000 types of potatoes. Originating in South America, potatoes now thrive as part of global agriculture.
The potato example is not just an exception; it is an elegant example of a broader pattern: the biological mergers that we know of are often followed by explosive diversification. While hybrid speciation events may be relatively rare, they also represent a massive opportunity space for fundamentally new and differentiating biological traits, characteristics so different that they enable new environmental niches to be filled.
Patterns in Hybrid Speciation
As of yet, we have no systematic framework for understanding these events, much less a comprehensive dataset documenting the emergence of novel characteristics through biological combination. This precludes the development of predictive models that can anticipate when a novel characteristic might emerge.
We don’t yet know whether there are detectable patterns, but a few key characteristics give us a resource-efficient pathway to finding out.
Known hybrid speciation events are associated with a subsequent radiation of species (similar to how more than 4,000 types of potatoes later evolved from the first potato). But of course, not all species radiations result from hybrid speciation; species radiation is generally enabled by the emergence of a differentiating characteristic (which could arise through mutation or other means) or a change in ecological opportunity.
Still, species radiation gives us a starting point of evolutionary events that have a higher-than-average likelihood of having come from hybrid speciation. Genetics databases can be screened for species radiation, then further evaluated for characteristics that might indicate hybrid speciation[4].
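The first screening step could be sketched roughly as follows. This is only one possible way to operationalize “screen for species radiation,” not an established pipeline: it uses the simple pure-birth estimate of net diversification rate (r = ln(N) / t, with N species and stem age t, ignoring extinction) and flags clades whose rate is well above the median. The `Clade` record, the `fold_over_median` threshold, and every clade name and number below are hypothetical placeholders, not real measurements.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class Clade:
    name: str
    species_count: int    # number of living species in the clade
    stem_age_myr: float   # stem age in millions of years


def net_diversification_rate(clade):
    """Pure-birth estimate r = ln(N) / t (stem age, extinction ignored)."""
    return math.log(clade.species_count) / clade.stem_age_myr


def flag_radiations(clades, fold_over_median=2.0):
    """Return names of clades diversifying much faster than the median clade.

    The threshold is an arbitrary illustrative choice, not a standard cutoff.
    """
    rates = {c.name: net_diversification_rate(c) for c in clades}
    cutoff = fold_over_median * statistics.median(rates.values())
    return [name for name, rate in rates.items() if rate >= cutoff]


# Placeholder values for illustration only; these are NOT real clades.
clades = [
    Clade("clade_A", species_count=12, stem_age_myr=10.0),
    Clade("clade_B", species_count=150, stem_age_myr=5.0),
    Clade("clade_C", species_count=8, stem_age_myr=12.0),
    Clade("clade_D", species_count=20, stem_age_myr=9.0),
    Clade("clade_E", species_count=30, stem_age_myr=15.0),
]

candidates = flag_radiations(clades)
print(candidates)
```

Only the flagged candidates would move on to the more expensive second step of checking for signs of hybrid ancestry, which is what keeps the approach resource-efficient.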
I’m not claiming this is easy; I’m proposing it’s worth trying. This strategy spares us from needing to explore in depth the evolution of every individual species. Even a case that turns out not to have been a hybrid speciation event was still an evolutionary event that resulted in a robust, differentiating characteristic. That is highly valuable for our ultimate goal of understanding emergent characteristics in biology, which is critical for all strategies of innovating with biology.
The infrastructure exists to test this today. Genetic databases are mature, computational resources are available, and the analytical techniques for identifying radiation events are well-established. What’s missing is the coordinated effort to approach these resources with hybrid speciation as the organizing principle.
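As one example of an existing technique that could be pointed at a flagged radiation, Patterson’s D (the “ABBA-BABA” test) looks for an excess of shared derived variants between lineages, a signal of past gene flow or hybrid ancestry. The sketch below is a minimal version under simplifying assumptions: one aligned sequence per lineage, the outgroup allele treated as ancestral, and no significance testing (in practice this is usually assessed with a block jackknife). The function name and the short sequences are made up for illustration.

```python
def abba_baba_d(p1, p2, p3, outgroup):
    """Patterson's D = (ABBA - BABA) / (ABBA + BABA).

    p1 and p2 are sister lineages, p3 is the potential donor lineage, and
    the outgroup defines the ancestral allele 'A' at each site.
    Returns None when no informative sites are found.
    """
    if not (len(p1) == len(p2) == len(p3) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")

    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if any(base not in "ACGT" for base in (a, b, c, o)):
            continue              # skip gaps and ambiguous bases
        if len({a, b, c, o}) != 2 or c == o:
            continue              # need a biallelic site with p3 derived
        if a == o and b == c:
            abba += 1             # p2 shares the derived allele with p3
        elif b == o and a == c:
            baba += 1             # p1 shares the derived allele with p3

    if abba + baba == 0:
        return None
    return (abba - baba) / (abba + baba)


# Made-up toy alignment, for illustration only.
outgroup = "AAAAAAAA"
p3 = "GGGGGGAA"
p2 = "GGGGAAAA"
p1 = "AGAAGAAA"

d = abba_baba_d(p1, p2, p3, outgroup)
print(d)
```

A D value near zero is what you would expect if shared variants come from chance sorting of ancestral variation alone. A value far from zero suggests that one of the sister lineages shares more with the third lineage than a simple branching tree predicts, which is a form of phylogenetic incongruence.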
Benefits to Bio-Innovation
A pattern-based understanding of emergent characteristics through hybrid speciation represents the potential for a more scalable path forward. Rather than trying to predict emergent properties from first principles through hyper-specific molecular interactions, we could learn to recognize the signatures of biological combinations likely to produce novel characteristics.
This approach could complement the molecular modeling approach by addressing its fundamental pitfalls. Instead of requiring computational power that scales exponentially with biological complexity, we’d be looking for patterns that could guide targeted experimentation. Instead of relying on complete molecular understanding before making predictions, we’d be working with the kinds of emergent properties that actually drive biological innovation.
Most importantly, this approach would shift our focus from improving existing biological functions to creating entirely new ones. Novel biological characteristics would be pursued systematically instead of being discovered by accident.
Modern tooling is driving our ability to do science differently[5]. Laboratory automation, high-throughput experimentation, and AI-assisted analysis are making it possible to shift from reductionist approaches toward constructive approaches where we build and test integrated systems.
But to harness this potential, we need to shift biological thinking away from hyper-specific contexts toward pattern recognition. Our understanding of biology remains trapped in the assumption that deep knowledge of individual components will eventually reveal how complex systems work.
Biology has been running merger and acquisition experiments for billions of years, creating innovations that continue to astound us. It’s time we study those lessons systematically, rather than treating each discovery as an isolated surprise.
We’re looking for outliers; it’s time to stop staring at the averages.
[1] In The Clickbait of AI for Biology I talked about the gaps between the promise and reality of AI for biology. Today I’m discussing a specific example of this and proposing a remedy.
[4] These other characteristics could include genomic mosaicism, relative heterozygosity, or phylogenetic incongruence, each of which would vary depending on how long ago the speciation event occurred.
[5] In Beyond Binary Causality I discussed the opportunities to develop a new scientific method, based on the abilities of modern tools.

