ESM Metagenomic Atlas: The first view of the ‘dark matter’ of the protein universe

November 1, 2022

  • Meta AI has created the first database that reveals the structures of the metagenomic world at the scale of hundreds of millions of proteins. These proteins – which are found in microbes in the soil, deep in the ocean, and even inside our bodies – vastly outnumber those that make up animal and plant life. But they are the least understood proteins on earth.

  • Decoding metagenomic structures can help us solve long-standing mysteries of evolutionary history and discover proteins that may help cure diseases, clean up the environment, and produce cleaner energy.

  • Making structure predictions at this scale required a breakthrough in the speed of protein folding. We trained a large language model to learn evolutionary patterns and generate accurate structure predictions end to end directly from the sequence of a protein. Predictions are up to 60x faster than the current state of the art while maintaining accuracy, making our approach scalable to far larger databases.

  • We are now sharing our models, research paper, and a database of more than 600 million metagenomic structures, as well as an API that allows scientists to easily retrieve specific protein structures relevant to their work.

  • Explore the ESM Metagenomic Atlas here.

Proteins are complex and dynamic molecules, encoded by our genes, that are responsible for many of the varied and fundamental processes of life. They have an astounding range of roles in biology. The light-sensing molecules in the rods and cones of our eyes that make it possible for us to see, the molecular sensors that underlie hearing and our sense of touch, the complex molecular machines that convert sunlight into chemical energy in plants, the motors that drive motion in microbes and our muscles, enzymes that break down plastic, antibodies that protect us from disease, and molecular circuits that cause disease when they fail are all proteins.

Metagenomics, one of the new frontiers in the natural sciences, uses gene sequencing to discover proteins in samples from environments across the earth: microbes living in the soil, deep in the ocean, in extreme environments like hydrothermal vents, and even in our guts and on our skin. The natural world contains a vast number of proteins beyond the ones that have been cataloged and annotated in well-studied organisms. Metagenomics is starting to reveal the incredible breadth and diversity of these proteins, uncovering billions of protein sequences that are new to science. These sequences are cataloged for the first time in large databases compiled by public initiatives such as the NCBI, European Bioinformatics Institute, and Joint Genome Institute, incorporating studies from a worldwide community of researchers.

Meta AI has developed a new protein-folding approach that harnesses large language models to create the first comprehensive view of the structures of proteins in a metagenomics database at the scale of hundreds of millions of proteins. Our research team found that language models can predict an atomic-level three-dimensional structure up to 60x faster than existing state-of-the-art protein structure prediction approaches. This advance will help to accelerate a new era of structural understanding in which it could be possible, for the first time, to understand the structure of the billions of proteins that gene-sequencing technology is cataloging.

Today, we are releasing the 600+ million protein ESM Metagenomic Atlas, with predictions for nearly the entire MGnify90 database, a public resource cataloging metagenomic sequences. To our knowledge, this is the largest database of high-resolution predicted structures, 3x larger than any existing protein structure database, and the first to cover metagenomic proteins comprehensively and at scale. These structures provide an unprecedented view into the breadth and diversity of nature. They have the potential to yield new scientific insights and to accelerate the discovery of proteins for practical applications in fields such as medicine, green chemistry, the environment, and renewable energy.

In addition, we are releasing the fast protein folding model used to create the database and an API that allows researchers to use it for scientific discovery. With 15 billion parameters, our new language model is the largest language model of proteins to date.

Unlocking a hidden natural world: the first comprehensive view of metagenomic structural space

Advancements in gene sequencing have made it possible to catalog billions of metagenomic protein sequences. Although we know that these proteins exist, because we have discovered their sequences, understanding their biology is a staggering challenge. Determining the three-dimensional structures of hundreds of millions of proteins experimentally is far beyond the reach of time-intensive laboratory techniques such as X-ray crystallography, which can take weeks to years for a single protein. Computational approaches can give us insight into metagenomic proteins that isn’t possible with experimental techniques.

The ESM Metagenomic Atlas will enable scientists to search and analyze the structures of metagenomic proteins at the scale of hundreds of millions of proteins. This can help researchers to identify structures that have not been characterized before, search for distant evolutionary relationships, and discover new proteins that can be useful in medicine and other applications.

A map of tens of thousands of high-confidence predictions showing similarity to proteins whose structure is currently known. The image shows large regions of completely unknown structural space revealed for the first time.

Learning to read the language of nature

The ESM-2 language model is trained to predict amino acids that have been masked out of sequences across evolution. We discovered that, as a result of this training, information about the protein’s structure emerges in the internal states of the model. This is surprising because the model has been trained only on sequences.

Like the text of an essay or letter, proteins can be written as sequences of characters. Each character corresponds to one of 20 standard amino acids, the chemical building blocks of proteins, each with different properties. These building blocks can be combined in an astronomical number of different ways. For a protein made of 200 amino acids, for example, there are 20^200 possible sequences, more than the number of atoms in the visible universe. Every sequence folds into a three-dimensional shape (though not all will fold into coherent structures; many sequences fold into disordered forms), and it is this shape that largely determines the biological function of the protein.
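
The size of this sequence space can be checked directly with Python's arbitrary-precision integers. The snippet below is only a back-of-the-envelope check; the figure of roughly 10^80 atoms in the observable universe is a commonly cited order-of-magnitude estimate and is an assumption here, not a number from this article.

```python
# Count the distinct sequences of length 200 over a 20-letter amino acid alphabet.
ALPHABET_SIZE = 20
SEQUENCE_LENGTH = 200

num_sequences = ALPHABET_SIZE ** SEQUENCE_LENGTH  # exact Python integer

# Assumption: about 10^80 atoms in the observable universe (rough estimate).
ATOMS_IN_UNIVERSE = 10 ** 80

# The number of decimal digits gives the order of magnitude.
num_digits = len(str(num_sequences))

print(f"20^200 has {num_digits} digits (about 10^{num_digits - 1})")
print(f"Exceeds the atom estimate: {num_sequences > ATOMS_IN_UNIVERSE}")
```

Even a protein of modest length therefore has far more possible sequences than could ever be enumerated, which is why the patterns have to be learned from the sequences that nature has actually produced.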

Learning to read this language of biology poses extraordinary challenges. While a protein sequence and a passage of text can both be written down as characters, there are deep and fundamental differences between them. A protein sequence describes the chemical structure of a molecule, which folds into a complex three-dimensional shape according to the laws of physics.

Protein sequences contain statistical patterns that convey information about the folded structure of the protein. For example, if two positions in a protein coevolve with each other — in other words, if at one of the positions a certain amino acid appears, which is usually paired with a certain amino acid at the other position — this could be a signal that those two positions are interacting with each other in the folded structure. Similar to two pieces of a puzzle fitting together, evolution must choose amino acids that fit together in the folded structure. This means we can often infer something about the structure of a protein by looking at patterns in protein sequences.
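
One classical way to quantify this kind of coevolution is the mutual information between two columns of a sequence alignment. The sketch below illustrates that statistic only; it is not how the ESM models work, since the language model learns such patterns implicitly during training. The toy alignment and the function name `mutual_information` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(column_a, column_b):
    """Mutual information (in bits) between two aligned columns."""
    n = len(column_a)
    counts_a = Counter(column_a)
    counts_b = Counter(column_b)
    counts_ab = Counter(zip(column_a, column_b))

    mi = 0.0
    for (a, b), count in counts_ab.items():
        p_ab = count / n
        p_a = counts_a[a] / n
        p_b = counts_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: one row per related sequence.
# Positions 1 and 3 covary (K pairs with E, R pairs with D);
# position 2 varies independently of position 1.
alignment = [
    "GKAE",
    "GKLE",
    "GRAD",
    "GRLD",
]
columns = list(zip(*alignment))

coupled = mutual_information(columns[1], columns[3])
uncoupled = mutual_information(columns[1], columns[2])

print(f"Positions 1 and 3: {coupled:.2f} bits")
print(f"Positions 1 and 2: {uncoupled:.2f} bits")
```

A high score between two positions suggests they may be in contact in the folded structure, while a score near zero suggests they vary independently. Real alignments are far larger and noisier than this toy case.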

Evolutionary scale modeling (ESM) uses AI to learn to read these patterns. In 2019, we presented evidence that language models learn the properties of proteins, such as their structure and function. Using a form of self-supervised learning known as masked language modeling, we trained a language model on the sequences of millions of natural proteins. Applied to text, this approach requires the model to correctly fill in the blanks in a passage such as “To __ or not to __, that is the ________.” We trained a language model to fill in the blanks in a protein sequence, like “GL_KKE_AHY_G”, across millions of diverse proteins. We found that information about the structure and function of proteins emerges from this training. In 2020, we released ESM1b, a state-of-the-art protein language model, which is being used for a variety of applications, including helping scientists predict the evolution of COVID-19 and discover genetic causes of disease.
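
The fill-in-the-blanks setup can be sketched in a few lines. This is a minimal illustration of how masked training examples might be constructed, not the actual ESM training code. The example sequence is made up, and the 15% masking rate and the helper name `mask_sequence` are assumptions chosen for illustration.

```python
import random

MASK_TOKEN = "_"


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues and return (masked sequence, targets).

    `targets` maps each masked position to the amino acid that was hidden,
    which is what a model would be trained to predict.
    """
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), num_to_mask))

    masked = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK_TOKEN
    return "".join(masked), targets


# Hypothetical protein fragment, written with one-letter amino acid codes.
sequence = "GLSKKEAAHYLG"
masked, targets = mask_sequence(sequence)

print(masked)   # the input a model would see
print(targets)  # the hidden residues it would be asked to recover
```

Because the hidden residues come from the sequence itself, no labels or structures are needed, which is what makes the training self-supervised.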

We have now scaled up this approach to create a next-generation protein language model, ESM-2, which at 15B parameters is the largest language model of proteins to date. We found that as the model is scaled up from 8M to 15B parameters, information emerges in the internal representations that enables 3D structure prediction at atomic resolution.

Accelerating protein folding by an order of magnitude

High-resolution protein structure emerges as the model scales up. As the model scales, new details emerge in the atomic resolution image of the structure.

With current state-of-the-art computational tools, predicting structures for hundreds of millions of protein sequences in a practical time frame could take years, even using the resources of a major research institution. To make predictions at the scale of metagenomics, a breakthrough in prediction speed is critical.

We found that using a language model of protein sequences greatly accelerates the speed of structure prediction (up to 60x). This is fast enough to make predictions for an entire metagenomics database in just weeks and will be scalable to databases much larger than the one we are releasing today. In fact, this new structure prediction capability enabled us to predict structures for the more than 600 million metagenomic proteins in the atlas in just two weeks on a cluster of approximately 2,000 GPUs.
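
To get a feel for what these numbers imply, the rough arithmetic is shown below. It uses the rounded figures quoted above (600 million structures, 2,000 GPUs, 14 days) and assumes every GPU was busy for the whole period, so the result is an order-of-magnitude estimate rather than a measured benchmark.

```python
# Rounded figures from the text; all three are approximate.
NUM_STRUCTURES = 600_000_000
NUM_GPUS = 2_000
DAYS = 14

wall_clock_seconds = DAYS * 24 * 3600

# Aggregate throughput across the whole cluster.
structures_per_second = NUM_STRUCTURES / wall_clock_seconds

# Average GPU time spent per structure, assuming full utilization.
gpu_seconds_per_structure = NUM_GPUS * wall_clock_seconds / NUM_STRUCTURES

print(f"Cluster throughput: {structures_per_second:.0f} structures per second")
print(f"Average cost: {gpu_seconds_per_structure:.2f} GPU-seconds per structure")
```

Under these assumptions, each prediction costs only a few GPU-seconds on average, which is what makes folding an entire metagenomics database practical.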

Current state-of-the-art structure prediction methods need to search through large protein databases to identify related sequences. These approaches need a whole group of evolutionarily related sequences as input so that they can extract the patterns that are linked to structure. The language model instead learns these evolutionary patterns during its training on protein sequences, enabling a high-resolution prediction of the three-dimensional structure directly from the sequence of the protein, without the database search.

Protein folding with a language model. Arrows show the information flow in the network from the language model to the folding trunk to the structure module, which outputs 3D coordinates and confidences.

Where do we go from here?

Several billion years ago, evolution invented a language by which complex and dynamic molecular machines can be formed out of simple building blocks. This language is the basis of life. Learning to read the language of proteins is an important step in our understanding of the natural world.

ESMFold shows how AI can give us new tools to understand the natural world, much like the microscope, which enabled us to see into the world at an infinitesimal scale and opened up a whole new understanding of life. AI can help us understand the immense scope of natural diversity, and see biology in a new way. Much of AI research has focused on helping computers understand the world in a way similar to how humans do. The language of proteins is one that is beyond human comprehension and has eluded even the most powerful computational tools. AI has the potential to open up this language to our understanding. Studying AI in new domains such as biology can also give insight into artificial intelligence more broadly. Our work reveals connections across domains: large language models that are behind advances in machine translation, natural language understanding, speech recognition, and image generation are also able to learn deep information about biology.

Metagenomics provides a view into the millions of diverse molecular machines that nature has invented. To extend this work even further, we’re studying how language models can be used to design new proteins and contribute to solving challenges in health, disease, and the environment. This work extends across many disciplines, from AI to chemistry to biology, so it is important to work openly, share our data and learnings, and build upon others’ insights. We hope that the release of this large-scale structure atlas and fast protein folding models will fuel further scientific progress and better our understanding of the world around us.

Explore the ESM Metagenomic Atlas
Read the research paper
View code and models on GitHub