AI vs. Nature: The Protein Game

Image Courtesy of Wikimedia Commons.

Beyond Amazon Alexa and self-driving cars, artificial intelligence (AI) has transformed modern biology. For over fifty years, scientists have been trying to determine a protein’s structure from just its amino acid sequence. AI has delivered a major breakthrough: Google’s DeepMind developed a program called AlphaFold that predicts folded protein shapes with greater accuracy than ever before. Following in this vein, scientists at Salesforce Research recently developed an AI tool called ProGen, which generates functional artificial enzymes from scratch.

ProGen relies on a large language model. Language models produce human-like text by predicting the most probable next word given what has already been written. These models have gained a lot of attention recently. Take, for instance, ChatGPT, which answers questions as if a real person were writing back to you. Inspired by the success of language models, Nikhil Naik, director of Salesforce AI Research, saw that the same technique might work for proteins. Because proteins can be represented as sequences of letters drawn from a shared alphabet of twenty amino acids, he reasoned that a language model might also be able to generate new protein sequences.
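To make the idea of "predicting the next letter" concrete, here is a minimal sketch in Python. It is a toy that only looks at the single previous amino acid, which is far simpler than a large language model and is not ProGen's actual method. The training sequences and the function names are made up for illustration.

```python
import random
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the standard twenty-letter alphabet

# Made-up training sequences, for illustration only (not real proteins).
TRAINING = ["MKVLAAGIV", "MKVLSAGLV", "MKALAAGIV"]

def train_bigram(sequences):
    """Count how often each amino acid follows each other amino acid."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def next_letter_probs(counts, prev):
    """Turn the counts observed after `prev` into probabilities."""
    total = sum(counts[prev].values())
    return {aa: n / total for aa, n in counts[prev].items()}

def generate(counts, start="M", length=9, seed=0):
    """Build a sequence one letter at a time by sampling the next letter."""
    rng = random.Random(seed)
    seq = start
    # Stop early if the last letter was never followed by anything in training.
    while len(seq) < length and counts[seq[-1]]:
        probs = next_letter_probs(counts, seq[-1])
        letters, weights = zip(*probs.items())
        seq += rng.choices(letters, weights=weights)[0]
    return seq

counts = train_bigram(TRAINING)
print(next_letter_probs(counts, "M"))  # {'K': 1.0}: "M" is always followed by "K" here
print(generate(counts))
```

A real language model conditions on everything written so far rather than on one letter, but the loop is the same: estimate what is likely to come next, pick a letter, and repeat.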

For a language model to be successful, it must learn by example. ProGen was trained on 281 million protein sequences covering almost 19,000 protein families in nature. Aside from the sequences, ProGen’s training data includes control tags, which are labels that specify the properties of a particular protein. Control tags give ProGen additional information to learn from, and they also let users steer what it generates: when prompted with a property, ProGen can build a protein with that property.
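One common way to condition a language model on a label is to place the label in front of the sequence as an extra token. The sketch below assumes that approach; the article does not spell out how ProGen encodes its tags, and the tag names and helper functions here are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical control tags; the real tag vocabulary is not given in the article.
CONTROL_TAGS = ["<lysozyme>", "<kinase>"]

# One shared vocabulary: tags first, then the twenty amino acid letters.
VOCAB = {tok: i for i, tok in enumerate(CONTROL_TAGS + list(AMINO_ACIDS))}

def encode(tag, sequence):
    """Training example: the tag token comes first, then the protein letters."""
    return [VOCAB[tag]] + [VOCAB[aa] for aa in sequence]

def make_prompt(tag):
    """Generation prompt: only the tag; the model would continue from it."""
    return [VOCAB[tag]]

print(encode("<lysozyme>", "MKV"))  # [0, 12, 10, 19]
print(make_prompt("<kinase>"))      # [1]
```

Because the tag is always the first thing the model sees during training, it can learn which sequence patterns tend to follow which label, and a prompt containing only the tag then asks for a protein of that kind.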

Naik then tested whether the novel proteins produced by ProGen could work in the real world. His team focused its efforts on lysozymes, a kind of bacteria-killing enzyme found in animals. To increase accuracy, they first fine-tuned the language model on lysozyme sequences. James Fraser, a professor in the Department of Bioengineering at the University of California, San Francisco, measured the activity of the new enzymes. Remarkably, an artificial lysozyme created by ProGen could kill bacteria even though it shared at most 31.4 percent of its sequence with any known protein. “The model has learned substitution and co-occurrence patterns that are hard for humans to intuit, but that can [be learned] from data,” Naik said. In this way, ProGen uses the grammar of protein language to build proteins that differ substantially from anything we have seen before.
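A figure like "at most 31.4 percent" comes from comparing the new protein against known proteins and reporting the closest match. The sketch below shows that arithmetic under a simplifying assumption: the sequences are already aligned and have the same length. Real comparisons first run an alignment step, which is omitted here. The sequences and function names are made up for illustration.

```python
def percent_identity(a, b):
    """Percent of positions where two pre-aligned, equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def max_identity(candidate, known_proteins):
    """Identity to the most similar known protein."""
    return max(percent_identity(candidate, k) for k in known_proteins)

# Made-up sequences, not real lysozymes.
candidate = "MKVLAAGIVG"
known = ["MKALSTGIVG", "MRTLQEGHWG"]

print(max_identity(candidate, known))  # 70.0 (7 of 10 positions match the first entry)
```

The lower this maximum, the further the new protein sits from everything already known.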

These findings serve as a proof of concept for a new paradigm of protein engineering. Scientists have previously designed new proteins by imitating natural selection in the laboratory. Through a technique called directed evolution, biologists repeatedly induce mutations in a protein and keep the variants that perform best. Unlike directed evolution, however, ProGen does not need nature’s help and designs proteins much faster. Although Naik feels AI technology is still far from creating any protein that we can imagine, ProGen shows that this dream might be within reach. AI tools like ProGen could accelerate the discovery of new drugs and environmentally friendly enzymes. But with this newfound power comes responsibility: as this technology develops, we must consider the ethical implications of having nature’s code at our fingertips.