Learning the Language of a Virus

Viral escape, the strategy a virus adopts to evade the human immune system by mutating just enough to avoid recognition and destruction by host antibodies, is one of the biggest challenges virologists face while developing effective vaccines. It is why a vaccine for HIV and a universal vaccine for influenza have yet to exist. Furthermore, it is why current vaccines approved for emergency use against SARS-CoV-2 may ultimately prove ineffective against new strains of the virus such as the U.K. and South African variants. 

In an effort to predict which viral mutations could result in successful escape, a team of MIT researchers made use of a machine learning technique originally intended for natural language processing to construct computational models of three different surface proteins: influenza A hemagglutinin, HIV-1 envelope glycoprotein, and SARS-CoV-2 spike glycoprotein.  

In a recent article published in Science, Brian Hie, an electrical engineering and computer science graduate student at MIT, and his senior advisors Bryan Bryson, an MIT assistant professor of biological engineering, and Bonnie Berger, head of computation and biology at MIT’s Computer Science and AI Lab, explore how natural language concepts such as grammaticality (syntax) and semantics (meaning) can be used to better understand viral evolution.

So, why a language model? To begin, techniques for studying viral escape fall into two main categories: experimental and computational. One high-throughput experimental technique known as a Deep Mutational Scan (DMS) makes every possible single amino acid change to a protein and then measures the effect of each mutation by analyzing some property of that protein, such as cellular binding or infectivity. While a DMS is effective for analyzing single amino acid mutations, it becomes impractical, and quite expensive, for analyzing the escape potential of combinatorial mutations. To put it into perspective, proteins are polypeptide chains of roughly fifty to two thousand amino acid residues, each of which can be one of twenty standard amino acids. Given this complexity, testing every possible combination of mutations in a laboratory setting would be infeasible.
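
To make the scale concrete, the short Python sketch below counts how many variants differ from a starting sequence at exactly k positions, and lists the single-substitution variants a DMS would cover. The function names and the toy sequence are illustrative assumptions, not part of the researchers' work, and the count considers substitutions only (no insertions or deletions).

```python
from math import comb

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_variants(length, k):
    """Count sequences that differ from the original at exactly k positions.

    Choose which k positions change, then pick one of the 19 other
    amino acids at each of those positions.
    """
    return comb(length, k) * (len(AMINO_ACIDS) - 1) ** k


def single_mutants(sequence):
    """Yield every sequence that differs from `sequence` by one substitution."""
    for position, original in enumerate(sequence):
        for amino_acid in AMINO_ACIDS:
            if amino_acid != original:
                yield sequence[:position] + amino_acid + sequence[position + 1:]


if __name__ == "__main__":
    # A hypothetical 2,000-residue protein, the upper end of the range above.
    for k in (1, 2, 3):
        print(f"{k} simultaneous substitutions: {count_variants(2000, k):,} variants")
```

Under these assumptions, a 2,000-residue protein has 38,000 single-substitution variants, but more than 700 million double-substitution variants, which is why exhaustive laboratory testing stops being practical beyond single mutations.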

Alternatively, machine learning models can use statistics and algorithms to draw patterns from large collections of data without being explicitly told what patterns to learn. “In natural language, that corresponds to completing sentences and modeling grammar and semantic similarity or semantic change,” Hie said. For viral escape, semantic change is analogous to antigenic change, where the virus mutates its surface proteins, and grammaticality relates to adhering to biological rules in order to survive and replicate.

Training the algorithm to model viral escape rather than human language involves feeding it sequences of viral amino acid data instead of English sentences. While machine learning language models of proteins previously existed, none of them looked at both protein fitness and function simultaneously and, therefore, could not predict escape nearly as well as the MIT model, which captures both fitness and function through the language components of grammaticality and semantic change.
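
As a rough illustration of what "feeding it sequences" can mean, the sketch below converts a protein sequence into a list of integer tokens, one per amino acid, much as words in a sentence are converted to token IDs. The article does not describe the researchers' actual preprocessing, so the alphabet ordering, the function names, and the toy sequence here are assumptions.

```python
# One-letter codes for the 20 standard amino acids; the ordering is arbitrary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_ID = {amino_acid: index for index, amino_acid in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Map each amino acid letter to an integer token ID."""
    return [TOKEN_ID[amino_acid] for amino_acid in sequence.upper()]


def decode(token_ids):
    """Map integer token IDs back to a string of amino acid letters."""
    return "".join(AMINO_ACIDS[token_id] for token_id in token_ids)


if __name__ == "__main__":
    toy_sequence = "MKTAYIAK"  # hypothetical fragment, not a real viral protein
    tokens = encode(toy_sequence)
    print(tokens)
    print(decode(tokens))
```

Here each residue plays the role of a word and the full protein plays the role of a sentence.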

Viral fitness, more specifically replicative fitness, refers to a virus’s ability to bind to a host cell, infect it, and produce infectious offspring inside the host cell. In the language model, viral fitness corresponds to grammaticality while protein function is captured by semantic change, or the ability of the virus to alter its surface proteins enough to evade neutralizing antibodies. Mutating viruses must change their proteins enough to avoid triggering an immune response, but not so much that the proteins can no longer fold into the correct conformation and therefore lose function. A virus that strikes this balance is no longer recognized by the host immune system as a foreign invader, yet it can still enter and infect host cells.

The model was given the task of identifying viral mutations with high grammaticality and high semantic change, which are characteristics of high escape potential. Operating on amino acid data alone and without human instruction, the model was able to execute this task, known as constrained semantic change search (CSCS), by ranking mutations based on fitness and function. Mutations with higher scores corresponded to viruses that were both grammatical (able to preserve fitness by following biological rules) and high in semantic change (antigenically different from the original wild-type sequence). The model’s rankings were then validated by comparing them to the results of a DMS.
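
The ranking idea can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: it assumes each candidate mutation already has a grammaticality score and a semantic change score from a trained model, and it combines them by adding their ranks, with a hypothetical weight `beta` on grammaticality. The mutation labels and scores are made up for the example.

```python
def ranks(values):
    """Rank values from 1 (lowest) to n (highest); ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda index: values[index])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def cscs_order(candidates, beta=1.0):
    """Order candidates from highest to lowest predicted escape potential.

    candidates: list of (label, grammaticality, semantic_change) tuples.
    beta: assumed weight on the grammaticality rank.
    """
    grammaticality_ranks = ranks([c[1] for c in candidates])
    semantic_ranks = ranks([c[2] for c in candidates])
    scored = [
        (semantic_rank + beta * grammaticality_rank, label)
        for (label, _, _), grammaticality_rank, semantic_rank
        in zip(candidates, grammaticality_ranks, semantic_ranks)
    ]
    # A high combined rank means both grammatical and semantically changed.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored]


# Hypothetical scores for four candidate mutations.
CANDIDATES = [
    ("mut_A", 0.90, 0.10),  # fit, but antigenically unchanged
    ("mut_B", 0.80, 0.90),  # fit and antigenically different
    ("mut_C", 0.10, 0.95),  # antigenically different, but unfit
    ("mut_D", 0.20, 0.20),  # neither
]

if __name__ == "__main__":
    print(cscs_order(CANDIDATES))
```

In this toy example the candidate that scores well on both axes comes out on top, while candidates that are strong on only one axis fall behind it. Combining ranks rather than raw scores keeps either measure from dominating simply because of its numeric scale.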

“We started [this project] in response to the pandemic and out of curiosity of how we can better understand viral evolution,” Hie said. While Bryson and Hie usually focus their work on tuberculosis research in Bryson’s lab, they transitioned to studying viral escape because “when you’re in a pandemic, you learn about the pathogen that is wreaking havoc,” said Bryson. 

Initially, the researchers trained their model using influenza A and HIV data. “Once we validated the model on influenza A and HIV, by then the data had been released for SARS-CoV-2 and we were able to run it… The timing was perfect,” Hie said. 

In addition to scoring mutations based on grammaticality and semantic change, the researchers also created visual representations of each protein structure that showed escape potential in different regions of the protein. Sections of each protein were color-coded according to whether they showed high escape potential or high escape depletion. Visualizing and quantifying escape potential helps identify which areas of a protein should be targeted by drugs. “Our whole idea is that we look for areas that are depleted by our predictions for escape, and we’re suggesting that [vaccine developers] target those areas,” Berger said.

For example, areas such as the receptor binding domain (RBD), a region of a viral surface protein that allows the virus to attach to and enter host cells, have high escape potential. This means that targeting RBDs may be less effective, because they are likely to mutate and evade immune defenses. “For Covid, we found this subunit domain, the S2 domain, that is low on depletion, whereas the N-terminal domain and receptor binding domain have high escape potential,” Berger said. This finding suggests that because the S2 domain is less likely to mutate, it may be a better antibody target than the receptor binding domain.

This idea of identifying areas of escape depletion raises a question many immunologists are trying to answer: “How do you design immunogens for regions of a protein instead of a whole protein or a whole inactivated virus?” Bryson said. Immunogen design is something immunologists must keep in mind while developing vaccines. The Pfizer-BioNTech and Moderna vaccines currently being distributed in the United States target the entire SARS-CoV-2 spike protein rather than particular subunits. Because new variants carry mutations in their spike proteins, current vaccines may not be as effective against them, as areas of the protein may be unrecognizable to neutralizing antibodies.

Given that the language model was successful in learning viral dynamics from sequence data alone, the researchers can now search for possible mutations on top of the new variants of the SARS-CoV-2 Wuhan strain. “This can tell us what are the best experiments to go test to anticipate potential even further escape,” Bryson said. 

As new data for SARS-CoV-2 is generated in real time, the researchers continually retrain the model and publish the results on their GitHub repository. Considering the model’s successful performance, the researchers hope that the Centers for Disease Control and Prevention will adopt their model as a tool for understanding viral epidemics. If this happens, as new strains of SARS-CoV-2 surface, the model could predict further variants on top of the current mutations. That would give scientists a narrow set of candidates against which to test the efficacy of current vaccines, and would allow the vaccines to be modified as needed.


Hie, B., Zhong, E. D., Berger, B., & Bryson, B. (2021). Learning the language of viral evolution and escape. Science, 371(6526), 284-288.