What's different? Part four: Software
Software as a scientific tool, the unreasonable effectiveness of data, platforms for design and discovery
Welcome to part four of the What’s Different? series. You can find the earlier posts here:
You can subscribe to this newsletter to have future posts delivered to your inbox for free.
The more work you can do in parallel, the faster you can go. When this happens, the parts of your work that you can’t parallelize become more important. In the world of computer architecture, there is actually a name for this. According to Amdahl’s Law, "the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used."
The graph above illustrates this principle. The speedup from parallelization is a function of how much work can actually be parallelized in the first place. The remaining serial portion becomes the bottleneck for making progress.
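Amdahl's Law is simple enough to state as a formula: if a fraction p of the work can be parallelized across n workers, the best possible speedup is 1 / ((1 - p) + p / n). A quick sketch of what this implies:

```python
def amdahl_speedup(p, n):
    """Best-case speedup when a fraction p of the work runs on
    n parallel workers and the remaining (1 - p) stays serial."""
    return 1 / ((1 - p) + p / n)

# Even with a million workers, a 5% serial portion caps the speedup near 20x.
print(amdahl_speedup(0.95, 1_000_000))  # approaches 1 / 0.05 = 20
```

No matter how many sequencers or robots you add, the serial 5% dominates. That is the sense in which parallelization shifts, rather than eliminates, the bottleneck.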
What does this have to do with biotechnology?
We have managed to massively parallelize DNA Sequencing and Synthesis. In a matter of decades, genome Sequencing transitioned from a government-backed moonshot to a grad student’s afternoon project. Synthetic DNA is now available for next-day delivery like an Amazon order. In concert with new molecular tools like CRISPR, this has enabled a totally new Scale of biological science and engineering. We are in the midst of a technological revolution moving at a pace that makes the digital revolution seem slow in comparison.
Just like in computing, massive parallelization has created new bottlenecks for biotech to deal with. From the very start of the genomic revolution, progress was impossible without massive innovation in Software. High-throughput DNA sequencing created totally new types of data that required an entirely new set of tools to process.
This is only more true today. Software is now a foundational part of modern biotechnology. In science, new algorithms and analytical tools are of equal importance to the measurement technologies that they interface with. For complex engineering problems, models and computation are now used at the beginning of the design process, rather than as an afterthought.
I’m going to cover three important aspects of the interface between bits and atoms:
Software as a scientific tool
The unreasonable effectiveness of data
Platforms for design and discovery
Software as a scientific tool
It’s important to understand that software programs can be scientific tools in the most literal sense. One of my favorite examples of this is genome assembly. The incredible parallelized DNA sequencers that I’ve written about churn out sequencing reads: short strings of DNA bases like ATTAGGCCTTA.
It’s a really tough algorithmic problem to stitch these reads together into a complete genome. Hackers were as instrumental as molecular biologists in the completion of the Human Genome Project. One of the first assembly programs was actually written in a passionate race against private competitors who wanted to patent segments of the human genome sequence.
A graduate student, Jim Kent, hacked day and night to create the GigAssembler and piece the genome together. His advisor David Haussler said that “Jim in four weeks created the GigAssembler by working night and day. He had to ice his wrists at night because of the fury with which he created this extraordinarily complex piece of code.”
Without this kind of software, genome sequencing and assembly isn’t possible. Modern DNA sequencers are like iPhones or Tesla cars—they are a deeply integrated hardware-software technology. Software is a core part of the measurement stack itself.
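To get a feel for why assembly is an algorithmic problem, here is a toy greedy assembler. This is a deliberately simplified sketch, nothing like GigAssembler's actual approach: it just repeatedly merges the two reads with the longest suffix-prefix overlap.

```python
def overlap(a, b):
    """Length of the longest suffix of a that matches a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, 0, 1
        for i in range(len(reads)):
            for j in range(len(reads)):
                if i != j:
                    n = overlap(reads[i], reads[j])
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return "".join(reads)  # no overlaps left; just concatenate
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads[0]

print(greedy_assemble(["ATTAGG", "AGGCCT", "CCTTA"]))  # "ATTAGGCCTTA"
```

Real assemblers must cope with sequencing errors, repeats, and billions of reads, which is why production tools use far more sophisticated data structures such as de Bruijn graphs. But even this toy version shows that turning reads into a genome is fundamentally a software problem.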
One layer up from the raw data, progress can’t happen without new analysis tools. Bioinformatics is an entire scientific and engineering discipline focused on building them. Thousands of tools are developed and published every year, and practically every PhD student in the life sciences is learning at least enough coding skills to use them to analyze their data.
It’s worth noting that these tools and their creators are often unsung heroes. I’ve written about some of the challenges that have emerged because of the way that this type of work is structured and funded. This is actually one of the main reasons that I decided to spend time working with a VC fund. I think that there are potentially ways to use markets to create high-quality tools in a similar fashion to the way that lab instruments have been developed and commoditized.
The unreasonable effectiveness of data
In 1960, the Nobel laureate physicist Eugene Wigner wrote a thought-provoking article entitled The Unreasonable Effectiveness of Mathematics in the Natural Sciences. He wrote about the incredibly surprising fact that by the middle of the 20th century, physicists had been able to encapsulate an incredible number of natural laws into elegant mathematical equations.
He expressed his awe at the way that these simple equations could capture so much of physics—especially when the universe did not have any requirement to be set up this way. As he put it: “The miracle of the appropriateness of the language of mathematics for the formulation of the laws of physics is a wonderful gift which we neither understand nor deserve.”
It turns out that for scientists hoping to explain more complex emergent phenomena, we have been a bit less fortunate. Thankfully, we have found some new tools that help us make progress. In 2009, three Google scientists wrote an essay entitled The Unreasonable Effectiveness of Data. Their core argument is that in fields such as economics and natural language processing, we may be “doomed to complex theories that will never have the elegance of physics equations” but we do have a new superpower: data.
They reflected on the fact that the inability to use math and first-principles thinking to derive elegant and general laws of natural language didn’t prevent them from using Web-scale data and machine learning to create a highly accurate translation system. This was an early realization of a new paradigm: leveraging enormous amounts of computation and data to solve problems that evaded the types of beautiful closed-form solutions possible in physics.
As this paradigm has evolved and continued to solve previously intractable problems, people have developed new ways of talking about it. One of my favorites is the Software 2.0 concept by Andrej Karpathy, who is the Director of AI at Tesla. His argument is that neural networks and modern machine learning represent a fundamentally new approach to creating computer programs.
In Software 2.0, instead of attempting to explicitly write down the rules of a program, we focus on learning the rules of the program by collecting large volumes of data that represent the solution. Creating a program to accurately classify all possible images of cats is really hard. Collecting massive sets of labeled images and learning the underlying program is actually easier.
As Karpathy writes, “It turns out that a large portion of real-world problems have the property that it is significantly easier to collect the data (or more generally, identify a desirable behavior) than to explicitly write the program.” I would argue that it is fairly clear that we inhabit this portion of program space in biology.
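A deliberately tiny illustration of the contrast, not from Karpathy's essay: below, the same decision is made two ways, once as a hand-written rule (Software 1.0) and once as a rule learned from labeled examples with a simple perceptron (Software 2.0). The temperature task is made up purely for the sketch.

```python
# Software 1.0: write the rule by hand.
def is_hot_rule(temp_c):
    return temp_c > 25.0

# Software 2.0: learn the rule from labeled examples instead.
# Each example is (temperature relative to 25 C, label), label 1 = "hot".
examples = [(-15, 0), (-7, 0), (-1, 0), (1, 1), (5, 1), (10, 1)]

w, b = 0.0, 0.0
for _ in range(100):                  # perceptron training loop
    for x, y in examples:
        pred = 1 if w * x + b > 0 else 0
        w += (y - pred) * x           # nudge the weight toward the label
        b += (y - pred)

def is_hot_learned(temp_c):
    return w * (temp_c - 25.0) + b > 0
```

For a one-line rule, learning is overkill. The point is that the learned version scales to problems like cat classification, where no human can write the rule down, but collecting labeled examples is straightforward.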
We can generate enormous sets of measurements with our sci-fi level DNA sequencers and automated lab instruments, but simple physical equations aren’t waiting for us—complex biological systems are a different beast compared to particles. An exciting recent trend is that software 2.0 is beginning to solve real biological problems.
One of the most notable examples is how machine learning was used to effectively solve the protein structure prediction problem. This is a problem much like natural language translation where there is no simple underlying equation, but it is possible to learn how strings of amino acids fold into 3D machines using massive amounts of data and enormous models.
This model was developed at DeepMind, and represents one of the first major scientific advances based on AI. They have now spun out a new company called Isomorphic Labs to focus on how this type of approach can be applied to drug discovery. In the announcement, DeepMind CEO Demis Hassabis wrote: “Biology is likely far too complex and messy to ever be encapsulated as a simple set of neat mathematical equations. But just as mathematics turned out to be the right description language for physics, biology may turn out to be the perfect type of regime for the application of AI.”
To be clear, biology will require a lot more software 1.0—where brilliant engineers design algorithms and hammer out efficient C++ code to assemble genomes and create new scientific tools. Not all scientific problems are solvable by software 2.0, or even capable of being represented in this way. However, as these models produce state-of-the-art results across the central dogma—from DNA sequence data, to RNA, to proteins, and all the way up to cells—I’m buckling my seatbelt. I wouldn’t be surprised to see models of AlphaFold’s magnitude in biology every 5-10 years from here on out.
I’ve also covered the emergence of a new scientific strategy based on these modeling successes. These models can be used as incredible tools for discovery. People are developing interpretation methods to effectively reverse engineer these models using virtual experiments to better understand the underlying processes that these models have learned to approximate.
Platforms for design and discovery
Hopefully I’ve convinced you over the course of this series that biology is changing. We have a new stack of complex and high-throughput measurement tools, and now rely on a lot of complex software 1.0 and software 2.0. It is a major open question as to how we should best unify software and experiments to develop platforms for biological design and discovery.
One approach has been to develop pure software platforms for scientists. In genomics, genome browsers have become the de facto representation of genetic information. Many browsers, databases, and tools on the Web have been used by scientists around the world to analyze, download, and process their data.
A big benefit of shared pure software platforms is that they help eliminate waste and inefficiency. With common infrastructure, individual groups and companies don’t have to reinvent the wheel and build the same tools that everybody else is using. I personally got a lot of joy out of working on the Generic Model Organism Database project, which develops tools to let individual communities set up their own platforms using their shared data and annotations.
Sometimes pure software doesn’t cut it—at the end of the day nothing is real in biology without an experiment. As I’ve mentioned, DNA sequencers represent an example of a deeply integrated software-hardware product. An exciting trend is that more of these types of instruments are emerging.
The Opentrons robot that I mentioned last week is another example of a product that directly blends hardware and software in a single system. Another cool example is the Onyx system being developed by Inscripta. The goal is to pack a lot of chemistry and bioinformatics for genome engineering into a single bench-top box. These types of modular platforms can be integrated into existing lab workflows.
Some people have dreamed of an even more extreme integration of experiments and software. In academic groups like the Murphy Lab at Carnegie Mellon, scientists have explored developing automated closed-loop systems where AI drives the entire experimentation process. Active learning algorithms select experiments based on data generated using liquid handling robots and microscopes.
Companies like Insitro and Recursion are driving these types of ideas forward. Entire labs are being designed and engineered from the ground up to generate optimal training data for machine learning. The goal is to push ML models in biology to the peak of possible performance—with the hope of using them to accelerate discovery.
Software is equally important for design. A core focus of the GP-write—which I described in the Synthesis post—is to develop a Computer-Aided Design (CAD) system for genome engineering. Like in other engineering disciplines, the idea would be to specify designs using software, and then synthesize and test them. One startup actively pursuing this dream is Asimov in Boston.
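In spirit, a genome CAD workflow treats a construct as structured data long before any DNA is synthesized. A minimal sketch of the idea, where the part names and sequences are made-up placeholders rather than real registry parts:

```python
# Hypothetical part library; the sequences are made-up placeholders.
parts = {
    "promoter_A":   "TTGACAGCTAGCTCAGTCCT",
    "rbs_A":        "AAAGAGGAGAAA",
    "cds_reporter": "ATGGCTAGCAAAGGAGAATAA",
    "terminator_A": "CCAGGCATCAAATAAAACGA",
}

# A design is just an ordered list of parts.
design = ["promoter_A", "rbs_A", "cds_reporter", "terminator_A"]

# "Compile" the design into the sequence that would be sent for synthesis.
construct = "".join(parts[name] for name in design)

# Simple design-rule checks before ordering the DNA.
assert set(construct) <= set("ACGT"), "invalid bases in design"
assert parts["cds_reporter"].startswith("ATG"), "coding sequence must start with ATG"
```

A real genome CAD system layers much more on top of this (simulation, rule libraries, version control for designs), but the core move is the same: specify in software, then synthesize and test.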
The biology revolution will require new platforms spanning the entire spectrum from pure software all the way to model-driven labs. It is a beautiful opportunity for talented engineers to use their skills to create an abundant future in the physical world.
Using the model
Over the course of this series, I’ve proposed a general thesis that the combination of Sequencing, Synthesis, Scale, and Software represents a new foundational technology stack for biology. I have found that focusing on these four components has broad explanatory power. This mental model can help to provide context for some of the most exciting science and technology development currently happening.
I don’t want you to just take my word for it! Next time, I’ll talk about how I’ve used this model to contextualize my own work and the science I’ve written about so far. I’ll also highlight some of the companies building on this stack that I’m the most excited about. If you have examples of work that you think fits into this model—or doesn’t fit—I’d love to hear from you.
You can subscribe for free to make sure you don’t miss the next post:
Until next time! 🧬