Introduction
Biology may be the first science humans developed: a way to grow more food, understand disease, and discover medicines. As our planet's climate continues to change, we will need bioengineering to produce more from less, while rebalancing Earth's chemistry at scale. We can engineer foods that grow at higher densities in conditions as harsh as those on Mars. We can engineer fast-growing trees to convert carbon dioxide into building materials while restoring degraded lands. We can extend lifespan, and empower individuals to control their metabolic, reproductive, and cognitive systems. While this may be hard to believe, remember the once-unimaginable technologies we've already discovered: antibiotics, vaccines, immunotherapy, programmable medicines, and more. Nature's diversity provides an abundance of technologies for us to repurpose. Machine learning will dramatically accelerate biological discovery, offering a path to increase material abundance and human wellbeing while reversing climate change!
Bottlenecks to progress
The current research bottleneck is performing experiments in living systems, a painfully slow and expensive step. Companies estimate an average of 10 years and over $2 billion in research costs per drug before running a clinical trial. Despite this immense effort, 90% of human trials fail, in large part because testing drugs in cells or mice does not accurately reflect human biology. How can machine learning possibly solve this problem? AlphaFold and similar atomic models have succeeded at understanding molecules, but there aren't yet comparable models for cells. Major efforts are currently underway to simulate biology, from cells to whole animals.
Defining the problem
What are the challenges in modeling a cell? First, there is scale: a human cell contains trillions of atoms, many of which belong to complex macromolecules like RNA and protein. Running atomic simulations at this scale is unlikely to succeed at a pace and budget competitive with real-world experiments. Second, there is the problem of predicting macromolecule structure and function. AlphaFold has famously solved the protein structure prediction problem, but predicting protein function (e.g. catalytic rate) and RNA structure remains a challenge. Finally, there is the problem of understanding how molecules interact with each other: their social networks. Proteins may interact with each other to form a complex, or deposit modifications that alter function. They also control the production of other biochemicals, including RNA and proteins. Inferring this causal network of interactions is a grand challenge for virtual biology.
Understanding regulatory networks
How can we model complex molecular networks and their dynamics? Recall that DNA is first copied into RNA; RNA is then translated into protein molecules. Cell states are driven by different RNA and protein abundances, which create different functionalities. For technical reasons, scientists primarily measure RNA to quantify cell state: a vector of length N holding the RNA abundance of each of N genes. A virtual cell model should take as input a cell state and a perturbation vector impacting one or more genes; the output should be the future cell state vector after perturbation. This may sound simple, but remember that every molecule in the network may regulate other molecules, triggering secondary effects that cascade through the network until a new equilibrium is reached. Any model that accurately solves the RNA prediction task must develop a near-perfect understanding of the full molecular network and its dynamics.
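The input/output contract described above can be sketched with a toy example. This is a minimal illustration, not a real virtual cell model: it assumes a hypothetical three-gene network with invented linear regulatory weights (`W`), invented basal production levels (`BASAL`), and a simple fixed-point iteration (`predict`) standing in for the cascade toward a new equilibrium.

```python
from typing import List, Set

# Hypothetical regulatory weights, invented for illustration:
# W[i][j] is the effect of gene j's RNA abundance on gene i's production.
W: List[List[float]] = [
    [0.0, 0.0, 0.0],   # gene 0 is not regulated by anything
    [0.5, 0.0, 0.0],   # gene 1 is activated by gene 0
    [0.0, -0.4, 0.0],  # gene 2 is repressed by gene 1
]
BASAL: List[float] = [1.0, 0.5, 2.0]  # production with no regulation (assumed)


def step(state: List[float], knocked_out: Set[int]) -> List[float]:
    """Advance the cell state by one update of the toy linear network."""
    new_state = []
    for i in range(len(state)):
        if i in knocked_out:
            new_state.append(0.0)  # an inactivated gene produces no RNA
            continue
        level = BASAL[i] + sum(W[i][j] * state[j] for j in range(len(state)))
        new_state.append(max(level, 0.0))  # abundance cannot be negative
    return new_state


def predict(state: List[float], knocked_out: Set[int],
            max_iter: int = 100, tol: float = 1e-9) -> List[float]:
    """Iterate until the state stops changing (a new equilibrium)."""
    for _ in range(max_iter):
        nxt = step(state, knocked_out)
        if max(abs(a - b) for a, b in zip(nxt, state)) < tol:
            return nxt
        state = nxt
    return state


# Unperturbed equilibrium of this toy network, then a knockout of gene 0.
baseline = predict([1.0, 1.0, 1.0], knocked_out=set())
after_ko = predict(baseline, knocked_out={0})
```

In this toy network, knocking out gene 0 lowers gene 1 directly (a primary effect), and the drop in gene 1 then relieves repression of gene 2 (a secondary effect). A model that only sees the final state has to untangle both.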
Data generation
Scientists are collecting millions of data points to train the virtual cell models described above. The method is simple and direct: inactivate one gene per cell, and measure the resulting changes in RNA abundance. This process is repeated for all genes in as many cells as possible. Millions of cells are profiled to overcome sampling error and capture the diversity of cell types and responses. While powerful, these datasets have not yet produced models with significant predictive value.
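As a rough sketch of how such a dataset might be summarized, the snippet below averages RNA profiles across cells that share a perturbation and subtracts the control average. The data, the group names, and the helper functions are all hypothetical; real analyses typically involve normalization and statistical testing that are omitted here.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical toy dataset: each cell is a vector of RNA abundances for 3 genes,
# grouped by which gene was inactivated ("control" means no perturbation).
cells: Dict[str, List[List[float]]] = {
    "control": [[10.0, 5.0, 8.0], [12.0, 5.0, 8.0]],
    "gene0_ko": [[0.0, 2.0, 8.0], [0.0, 3.0, 10.0]],
}


def mean_profile(profiles: List[List[float]]) -> List[float]:
    """Average abundance of each gene across a group of cells."""
    return [mean(column) for column in zip(*profiles)]


def perturbation_effect(perturbed: List[List[float]],
                        control: List[List[float]]) -> List[float]:
    """Per-gene change in mean abundance relative to control cells."""
    p, c = mean_profile(perturbed), mean_profile(control)
    return [a - b for a, b in zip(p, c)]


effect = perturbation_effect(cells["gene0_ko"], cells["control"])
```

Averaging over many cells per perturbation is what reduces sampling error; with only two cells per group, as in this toy, the estimate would be very noisy.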
A major limitation of conventional techniques is poor time resolution. Current methods cannot detect changes within 4 hours due to the slow RNA decay rate in human cells (average 10-hour half-life). As a result, the causal order of events in faster cellular processes is obscured. This significantly complicates causal inference, as models fail to distinguish primary from secondary effects.
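A back-of-the-envelope calculation shows why total RNA responds so slowly. The sketch below assumes simple first-order (exponential) decay, assumes that synthesis of a gene's RNA stops completely at time zero, and uses the 10-hour average half-life quoted above; the function name is illustrative.

```python
HALF_LIFE_HOURS = 10.0  # average RNA half-life quoted in the text


def fraction_remaining(hours: float, half_life: float = HALF_LIFE_HOURS) -> float:
    """Fraction of pre-existing RNA left after synthesis stops (first-order decay)."""
    return 0.5 ** (hours / half_life)


for label, hours in [("10 minutes", 10 / 60), ("4 hours", 4.0), ("10 hours", 10.0)]:
    drop = 1.0 - fraction_remaining(hours)
    print(f"{label}: total RNA down {drop:.1%}")
```

Under these assumptions, total RNA falls by only about 1% in the first 10 minutes and by roughly a quarter after 4 hours, even though synthesis changed immediately. Small early changes like these are easily lost in measurement noise.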
Right now, most companies are hoping that larger datasets and better models can solve this problem. At Hypergraph Bio, we generate datasets capable of quantifying RNA changes within 10 minutes, allowing us to distinguish primary from secondary effects by their timing. This dramatic improvement is achieved by tracking RNA synthesis rather than total RNA. See our manuscript.
Inferring molecule networks from sequence
Theoretically, we know that genomes contain all the information necessary to clone an individual. So why can't we learn how to create digital clones, similar to how AlphaFold can predict protein structure from sequence? First, we have to understand how to "decompress" the genome into a data structure we can understand and compare across species. There are encouraging signs that this may, in fact, be possible: AlphaFold was recently used to predict all protein interactions. If similar efforts can resolve DNA and RNA interactions, we will finally have a complete first-draft map of all possible molecular interactions inferred directly from sequence. The final challenge will be parsing this map into specific cellular programs: not all proteins are in use at any given time, and different combinations can rewire interaction partners. From this perspective, it will be a key milestone to identify all molecular interactions from genome sequence on the path to building virtual cells.
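The last step, narrowing a map of all possible interactions down to those available in a given cell, can be sketched as a simple filter. The protein names and interaction pairs below are hypothetical placeholders, and the rule used (an interaction is possible only when both partners are present) is a deliberate simplification that ignores rewiring by different protein combinations.

```python
from typing import FrozenSet, Set

# Hypothetical draft map: each entry is an unordered pair of proteins
# predicted to interact. Names are placeholders, not real proteins.
ALL_INTERACTIONS: Set[FrozenSet[str]] = {
    frozenset({"A", "B"}),
    frozenset({"B", "C"}),
    frozenset({"C", "D"}),
}


def active_interactions(interactions: Set[FrozenSet[str]],
                        expressed: Set[str]) -> Set[FrozenSet[str]]:
    """Keep only interactions whose partners are both present in this cell."""
    return {pair for pair in interactions if pair <= expressed}


# A cell that expresses A, B and C but not D.
program = active_interactions(ALL_INTERACTIONS, {"A", "B", "C"})
```

Here the C-D interaction drops out because D is absent, so the same genome-wide map yields a different working network in each cell type.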
Conclusion
Bioengineering is on the cusp of a golden age. Machine learning can dramatically accelerate progress by accurately simulating cells, organs, or whole animals. Recent approaches map the social networks of molecules to understand cause and effect without full atomic simulation. Once complete, these maps can be used to predict cell behavior, dramatically reducing the expense and labor required for biological research. Virtual cell models may soon supercharge bioengineering as radically as the development of personal computers empowered software engineering.