AI researchers at Meta say they have built the largest protein-folding model created to date, and have used it to predict the structures of more than 600 million proteins.
On Tuesday the team released the 15-billion-parameter transformer-based model, dubbed ESM-2, along with the ESM Metagenomic Atlas, a database of its predicted protein structures. The database includes protein shapes that scientists have never observed before.
Proteins are intricate biological molecules, built from chains of up to 20 different types of amino acid, that perform all sorts of roles in living things. Crucially, they fold up into complex 3D structures, and that shape is critical to how they function. Once scientists understand how a protein works, they can try to replicate, alter, or block that activity.
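The rest of this piece treats proteins as text, so it may help to see the representation: a protein is conventionally written as a string over a 20-letter amino acid alphabet, one letter per residue. Here's a minimal sketch in Python (the sequences are made-up examples):

```python
# A protein sequence represented as a string over the standard
# 20-letter amino acid alphabet, one letter per residue.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid_protein(seq: str) -> bool:
    """Return True if every residue is one of the 20 standard amino acids."""
    return len(seq) > 0 and all(residue in AMINO_ACIDS for residue in seq)

print(is_valid_protein("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))  # True
print(is_valid_protein("MKTX"))  # False: X (unknown residue) is not standard
```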
Unfortunately, the amino acid sequence alone won't tell you the final structure right away. You can work it out by running simulations or experiments, but this takes time. These days, if you feed the chemical make-up of a protein to properly trained machine-learning software, the model will predict the structure quickly and fairly reliably.
In fact, DeepMind proved this with its AlphaFold model, which took first place in the 2020 edition of CASP, the biennial international protein-structure-prediction competition. AlphaFold and other machine-learning programmes can produce the corresponding three-dimensional structure from an input string of amino acids.
Since then, DeepMind's researchers in London have used their software to predict the structures of more than 200 million proteins known to science. Meta's latest ESM system, trained on millions of protein sequences, has gone even further, making hundreds of millions more predictions.
A preprint paper by Lin et al of the Meta team, outlining the design of ESM-2, can be found here. Interestingly, the system is actually a sizable language model, designed to "learn evolutionary patterns and provide correct structure predictions end to end directly from the sequence of a protein," according to the researchers. AlphaFold, by contrast, does not employ a language model and takes a different tack.
Large language models can be used for much more than just handling human languages, the researchers argue in their paper: "Modern language models containing tens to hundreds of billions of parameters develop abilities such as few-shot language translation, commonsense reasoning, and mathematical problem solving, all without explicit supervision. These findings suggest that language models trained on protein sequences can experience an analogous kind of emergence."
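For context on what training such a model involves: ESM-2 learns with a masked language modelling objective, in which residues in a sequence are hidden and the network learns to fill them in from the surrounding context. Below is a rough sketch of scoring a masked residue, assuming the open-source fair-esm package and one of the smaller public ESM-2 checkpoints; it's an illustration, not Meta's training code.

```python
# Illustrative sketch: masked-residue prediction with a small ESM-2 checkpoint.
# Assumes the fair-esm package (pip install fair-esm); not Meta's training code.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # made-up example sequence
_, _, tokens = batch_converter([("protein1", sequence)])

mask_pos = 10  # hide one residue (positions are offset by the BOS token at index 0)
tokens[0, mask_pos] = alphabet.mask_idx

with torch.no_grad():
    logits = model(tokens)["logits"]

# The model's most likely amino acid for the masked position
pred_idx = logits[0, mask_pos].argmax().item()
print("predicted residue:", alphabet.get_tok(pred_idx))
```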
The outcome is ESM-2, a language model trained so that it can predict a protein's physical structure from a text string encoding its amino acids.
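To give a flavour of what that looks like in practice, here's a minimal sketch using ESMFold, the structure-prediction model Meta released alongside ESM-2 through its open-source fair-esm package; the input sequence is an arbitrary example, and the exact API may differ between releases.

```python
# Illustrative sketch: end-to-end sequence-to-structure prediction with ESMFold
# (pip install "fair-esm[esmfold]"); the exact API may vary between releases.
import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval()  # add .cuda() here if a GPU is available

# An arbitrary example sequence; any string over the 20-letter alphabet works
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"

with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # atomic coordinates in PDB format

with open("prediction.pdb", "w") as handle:
    handle.write(pdb_string)
print("wrote prediction.pdb")
```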
According to Meta, ESM-2 is the largest model of its kind, and it predicts structures up to 60x faster than earlier state-of-the-art systems such as AlphaFold and Rosetta, which can take over ten minutes to produce an output.
Running on roughly 2,000 GPUs for just two weeks, the model produced the ESM Metagenomic Atlas by predicting approximately 600 million protein structures from the MGnify90 protein database. A protein of 384 amino acids can be folded in about 14.2 seconds on a single Nvidia V100 GPU. Meta conceded its system mostly, though not entirely, matches AlphaFold's accuracy; speed is the point, the biz argued, since it allows structures to be predicted for far more proteins.
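Those throughput figures hang together, as a back-of-the-envelope check shows (assuming, purely for illustration, that all 2,000 GPUs ran continuously for the full fortnight):

```python
# Rough consistency check of the figures quoted above (assumes all
# 2,000 GPUs ran flat out for the full two weeks).
gpus = 2_000
two_weeks_s = 14 * 24 * 3600          # 1,209,600 seconds
structures = 600_000_000

gpu_seconds_each = gpus * two_weeks_s / structures
print(f"~{gpu_seconds_each:.1f} GPU-seconds per structure")  # ~4.0
```

Roughly four GPU-seconds per structure is plausible alongside the 14.2-second figure for a 384-residue protein, given that many metagenomic sequences are considerably shorter.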
"Even with the resources of a major research institution, it may take years to predict structures for hundreds of millions of protein sequences using current state-of-the-art computational technologies. A breakthrough in prediction speed is essential to make predictions at the metagenomics scale," the researchers stated.
Meta hopes the ESM Metagenomic Atlas and ESM-2 will aid researchers studying evolutionary history or fighting disease and climate change. "To take our work even further, we're researching how language models may be used to design novel proteins and help with problems in health, illness, and the environment," the business added.