Today, DeepMind announced that it has seemingly solved one of biology's outstanding problems: how the string of amino acids in a protein folds up into a three-dimensional shape that enables its complex functions. It's a computational challenge that has resisted the efforts of many very smart biologists for decades, despite the application of supercomputer-level hardware to the calculations. Instead, DeepMind trained its system using 128 specialized processors for a couple of weeks; it now returns potential structures within a couple of days.
The system's limitations aren't yet clear: DeepMind says it's currently planning a peer-reviewed paper and has so far provided only a blog post and some press releases. But the system clearly performs better than anything that's come before it, having roughly doubled the performance of the best prior systems in just four years. Even if it isn't useful in every circumstance, the advance likely means that the structures of many proteins can now be predicted from nothing more than the DNA sequence of the gene that encodes them, which would represent a major change for biology.
Between the folds
To make proteins, our cells (and the cells of every other living thing) chemically link amino acids to form a chain. This works because every amino acid shares a backbone that can be chemically connected to form a polymer. But each of the twenty amino acids that life uses has a distinct set of atoms attached to that backbone. These can be charged or neutral, acidic or basic, and so on, and these properties determine how each amino acid interacts with its neighbors and with the environment.
The interactions among these amino acids determine the three-dimensional structure that the chain adopts after it's produced. Hydrophobic amino acids end up in the interior of the structure to avoid the aqueous environment. Positively and negatively charged amino acids attract each other. Hydrogen bonds drive the formation of regular spirals or parallel sheets. Collectively, these forces take what would otherwise be a disordered chain and cause it to fold into an organized structure. And that ordered structure in turn determines the protein's behavior, allowing it to act as a catalyst, bind to DNA, or drive the contraction of muscles.
Figuring out the order of amino acids in a protein's chain is relatively easy, since it's defined by the order of the DNA bases in the gene that encodes the protein. And since we've gotten very good at sequencing whole genomes, we have a huge surplus of gene sequences, and thus protein sequences, available to us. For many of them, though, we have no idea what the folded protein looks like, which makes it difficult to determine how they function.
Given that a protein's backbone is very flexible, nearly any two amino acids in the chain could potentially interact with each other. So figuring out which ones actually do interact in the folded protein, and how those interactions minimize the free energy of the final configuration, becomes an intractable computational challenge once the number of amino acids gets too large. Essentially, whenever any amino acid could occupy any potential coordinates in 3D space, figuring out what to put where becomes difficult.
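A quick back-of-envelope calculation shows how fast this search space blows up. The sketch below uses the classic textbook simplification (sometimes called Levinthal's paradox) that each residue's backbone can adopt roughly three distinct conformations; that number is an assumption for illustration, not a measured value:

```python
# Illustration of why brute-force search over foldings fails.
# Assumes ~3 backbone conformations per residue (a common textbook
# simplification used to state Levinthal's paradox).

def conformation_count(n_residues, states_per_residue=3):
    """Rough number of backbone conformations for a chain of n_residues."""
    return states_per_residue ** (n_residues - 1)

small = conformation_count(10)     # 3^9 = 19,683 -- still enumerable
typical = conformation_count(150)  # 3^149 -- astronomically large

print(f"10 residues:  {small:,} conformations")
print(f"150 residues: ~10^{len(str(typical)) - 1} conformations")
```

Even a modest 150-residue protein yields on the order of 10^71 conformations, which is why exhaustive enumeration was never an option.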
Despite the difficulties, there has been some progress, including through distributed computing and the gamification of folding. But a biennial event called the Critical Assessment of protein Structure Prediction (CASP) has seen fairly uneven progress throughout its existence. And in the absence of a successful algorithm, people are left with the laborious task of purifying a protein and then using X-ray diffraction or cryo-electron microscopy to figure out the structure of the purified form, efforts that can often take years.
DeepMind enters the fray
DeepMind is an AI company that was acquired by Google in 2014. Since then, it has made a number of splashes, developing game-playing systems that have successfully taken on humans at Go, chess, and even StarCraft. In many of its most notable successes, the system was trained simply by being given the rules of the game before being set loose to play against itself.
That approach is incredibly powerful, but it wasn't clear it would work for protein folding. For one thing, there's no obvious external standard for a "win": if you obtain a structure with a very low free energy, there's no guarantee that something with an even lower energy isn't out there. There's also not much in the way of rules. Yes, amino acids with opposite charges will lower the free energy if they're next to each other. But that won't happen if it comes at the cost of dozens of hydrogen bonds or of hydrophobic amino acids being exposed to water.
So how do you adapt AI to work under these conditions? For its new algorithm, called AlphaFold, the DeepMind team treated the protein as a spatial network graph, with each amino acid as a node and the connections between them mediated by how close the amino acids sit in the folded protein. The AI itself was then trained on the task of inferring the existence and strength of these connections by being fed the previously determined structures of more than 170,000 proteins obtained from a public database.
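To make the graph idea concrete, here is a minimal sketch (not DeepMind's actual code, which hasn't been published) of representing a folded chain as a spatial graph: residues are nodes, and an edge connects any two residues whose alpha-carbon atoms sit within a cutoff distance. The coordinates are invented, and the 8 Å cutoff is an assumption, chosen because it's a commonly used residue-contact threshold:

```python
import math

# Sketch of a residue contact graph: nodes are residues, and an edge
# links residues whose alpha-carbons lie within a cutoff distance.
# Coordinates below are fabricated; 8 angstroms is an assumed cutoff.

CONTACT_CUTOFF = 8.0  # angstroms

def contact_graph(ca_coords, cutoff=CONTACT_CUTOFF):
    """Return edges (i, j, distance) for residue pairs within cutoff."""
    edges = []
    for i in range(len(ca_coords)):
        for j in range(i + 1, len(ca_coords)):
            d = math.dist(ca_coords[i], ca_coords[j])
            if d <= cutoff:
                edges.append((i, j, round(d, 2)))
    return edges

# Toy 5-residue "structure" (made-up coordinates, in angstroms)
coords = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (7.6, 3.8, 0), (3.8, 3.8, 0)]
for i, j, d in contact_graph(coords):
    print(f"residue {i} -- residue {j}: {d} A")
```

Training a model to predict which of these edges exist, and how strong they are, is the inverse of the usual use of such a graph: the network starts from the sequence and must infer the distances.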
When given a new protein to analyze, AlphaFold searches for any proteins with related sequences and aligns the related portions of those sequences. It also searches for proteins with known structures that have similar regions. These approaches are typically great at optimizing local features of a structure but not so good at predicting the protein's overall structure; stitching together a collection of highly optimized pieces doesn't necessarily produce an optimal whole. And this is where an attention-based deep-learning portion of the algorithm was used to make sure the overall structure was coherent.
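The simplest version of that sequence-comparison step is scoring how identical a query is to sequences whose structures are already known. Real pipelines build multiple sequence alignments with dedicated tools; the toy sketch below just computes percent identity between pre-aligned, equal-length sequences (the sequences and names are invented for illustration):

```python
# Toy illustration of sequence comparison: rank known-structure
# sequences by percent identity to a query. Real systems use full
# multiple-sequence-alignment tools; sequences here are invented,
# with '-' marking an alignment gap.

def percent_identity(a, b):
    """Fraction of aligned positions carrying the same amino acid."""
    assert len(a) == len(b), "sequences must be pre-aligned"
    matches = sum(1 for x, y in zip(a, b) if x == y and x != '-')
    return matches / len(a)

known_structures = {   # hypothetical proteins with solved structures
    "protA": "MKTAYIAKQR",
    "protB": "MKSA-IAKQW",
    "protC": "GGGGGGGGGG",
}

query = "MKTAYIAKQW"
ranked = sorted(known_structures.items(),
                key=lambda kv: -percent_identity(query, kv[1]))
for name, seq in ranked:
    print(f"{name}: {percent_identity(query, seq):.0%} identical")
```

High-identity hits like protA are the ones whose known local structure can be borrowed; it's the low-similarity remainder where the deep-learning portion has to do the heavy lifting.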
Clear success, but with limits
For CASP, AlphaFold and the algorithms from this year's other entrants were set loose on a series of proteins whose structures were either not yet solved (and were solved as the challenge went on) or had been solved but not yet published. So there was no way for the algorithms' creators to prime their systems with real-world information, and the algorithms' output could be compared against the best real-world data as part of the challenge.
AlphaFold did well; much better, in fact, than any other entry. For about two-thirds of the proteins it predicted a structure for, the prediction was within the experimental error you'd get if you tried to replicate the structural studies in a lab. Overall, on an accuracy scale running from zero to 100, the system averaged a score of 92, which is, again, the sort of range you'd see if you tried to obtain the structure twice under two different conditions.
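That zero-to-100 scale is CASP's Global Distance Test score (GDT_TS): the percentage of residues whose predicted position falls within a distance cutoff of the experimental structure, averaged over several cutoffs. The sketch below is a simplified version; real GDT scoring also searches for the best superposition of the two structures, which we skip here by assuming they're already aligned, and the coordinates are made up:

```python
import math

# Simplified GDT_TS-style score: for each cutoff, count the fraction of
# residues whose predicted alpha-carbon lies within that cutoff of the
# experimental position, then average across cutoffs. Assumes the two
# structures are already optimally superimposed (real GDT finds this).

def gdt_ts(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Score from 0 to 100; higher means closer to the experiment."""
    n = len(predicted)
    fractions = []
    for cutoff in cutoffs:
        within = sum(1 for p, e in zip(predicted, experimental)
                     if math.dist(p, e) <= cutoff)
        fractions.append(100.0 * within / n)
    return sum(fractions) / len(fractions)

# Toy 4-residue example with per-residue errors of 0.5, 1.5, 3.0, 9.0 A
exp = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (12, 0, 0)]
pred = [(0.5, 0, 0), (5.5, 0, 0), (11, 0, 0), (21, 0, 0)]
print(f"GDT_TS = {gdt_ts(pred, exp):.2f}")
```

A perfect prediction scores 100; a score in the low 90s, as AlphaFold averaged, means nearly every residue sits within a small fraction of an atom's width of where the experiment puts it.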
By any reasonable standard, the computational challenge of determining a protein's structure has been solved.
Unfortunately, there are a lot of unruly proteins out there. Some get stuck in the membrane as soon as they're made. Others quickly pick up chemical modifications. Still others require extensive interactions with specialized enzymes that burn energy to force other proteins to refold. In all likelihood, AlphaFold will not be able to handle all of these edge cases, and with no academic paper describing the system yet, it will take some time, and some real-world use, to figure out its limitations. That shouldn't detract from an amazing achievement; it's just a caution against unrealistic expectations.
The big question now is how quickly the system will be made available to the biology research community, so that its limits can be determined and we can start using it in cases where it's likely to work well and be of great value, such as the structures of proteins from pathogens or the mutated forms found in cancerous cells.