One reason it is so difficult to make effective vaccines against certain viruses, including influenza and HIV, is that these viruses mutate very quickly. This allows them to evade the antibodies generated by a particular vaccine through a process called “viral escape”.

In a new study, researchers from the Massachusetts Institute of Technology (MIT) built a computational model of viral escape using models originally developed to analyze language. The model can predict which parts of a viral surface protein are more likely to mutate in ways that enable escape, and it can also identify the parts that are less likely to mutate, which makes them good targets for new vaccines.

“Viral escape is a big problem,” said Bonnie Berger, head of the Computational Biology Group at MIT’s Computer Science & Artificial Intelligence Laboratory. “Escape by the surface protein of influenza and the envelope surface protein of HIV is the reason we do not have a universal influenza vaccine or an HIV vaccine, and both diseases kill hundreds of thousands of people every year.”

In this study, Berger and her colleagues identified potential targets for developing vaccines against influenza viruses, HIV, and SARS-CoV-2. The researchers also applied their model to new variants of SARS-CoV-2 that recently emerged in the UK and South Africa. The analysis, which has not yet been peer-reviewed, suggests that the variants’ genetic sequences should be investigated further for their potential to escape existing vaccines.

Protein Language

Different types of viruses have different rates of genetic mutation. HIV and influenza viruses are among the fastest mutating. For a mutation to promote viral escape, it must change the shape of the surface protein so that antibodies can no longer bind to it. However, the change cannot cause the protein to lose its function.

The researchers decided to model these criteria using a type of computational language model from the field of natural language processing (NLP). Such models were originally designed to analyze patterns in language, especially how often certain words occur together. They can then predict which words could complete a sentence such as “Sally eats eggs for.” The chosen word must be both grammatically correct and have the right meaning. In this case, an NLP model may predict “breakfast” or “lunch”.
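The word-prediction idea can be illustrated with a very small sketch. The code below is not the model used in the study; it is a toy bigram counter, trained on an invented five-sentence corpus, that looks only at the last word of the prompt. Real language models condition on the whole context, but the principle of learning from co-occurrence counts is the same.

```python
from collections import Counter, defaultdict

# Toy corpus, invented purely for illustration.
corpus = [
    "sally eats eggs for breakfast",
    "tom eats eggs for breakfast",
    "sally eats soup for lunch",
    "tom eats rice for dinner",
    "sally eats eggs for lunch",
]

def train_bigram_counts(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prompt, k=2):
    """Return the k most frequent followers of the prompt's last word,
    each paired with its relative frequency."""
    last = prompt.lower().split()[-1]
    followers = counts[last]
    total = sum(followers.values())
    return [(word, n / total) for word, n in followers.most_common(k)]

counts = train_bigram_counts(corpus)
predictions = predict_next(counts, "Sally eats eggs for")
for word, prob in predictions:
    print(f"{word}: {prob:.2f}")
```

With this corpus, “breakfast” and “lunch” each follow “for” in two of five sentences, so they come out as the two top candidates.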

The key insight is that this kind of model can also be applied to biological information, such as genetic sequences. Here, grammar is analogous to the rules that determine whether the protein encoded by a particular sequence is functional, while semantics is analogous to whether the protein can take on a new shape that helps it evade antibodies. A mutation that enables viral escape must therefore keep the sequence grammatical while changing the protein’s structure in a way that is useful to the virus.
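One way to picture the two requirements together is to rank candidate mutations on both and combine the ranks. The sketch below is an assumption-laden illustration rather than the study’s implementation: the mutation names and both score lists are made up, `beta` is a hypothetical weighting parameter, and in practice the grammaticality and semantic-change scores would come from a trained language model.

```python
def rank_ascending(values):
    """Give each value a rank, 1 = smallest. Ties keep order of appearance."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def escape_priority(mutations, grammaticality, semantic_change, beta=1.0):
    """Sort mutations so that those scoring high on BOTH criteria come first.

    grammaticality:  higher = mutant still looks like a viable protein
    semantic_change: higher = mutant looks more different to the model
    beta:            hypothetical weight on the grammaticality rank
    """
    g_rank = rank_ascending(grammaticality)
    s_rank = rank_ascending(semantic_change)
    combined = [s + beta * g for s, g in zip(s_rank, g_rank)]
    return sorted(zip(mutations, combined), key=lambda pair: pair[1], reverse=True)

# Hypothetical mutations with made-up scores, for illustration only.
mutations = ["A10V", "K25E", "G40D", "S55N"]
grammaticality = [0.30, 0.02, 0.25, 0.001]
semantic_change = [0.5, 1.0, 4.0, 5.0]

ranked = escape_priority(mutations, grammaticality, semantic_change)
for name, score in ranked:
    print(name, score)
```

In this toy data, “G40D” ranks first because it is both plausible and a large change. “S55N” changes the most but is very implausible, and “A10V” is plausible but barely changes anything, so neither stands out as an escape candidate.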

“If a virus wants to escape the human immune system, it cannot mutate in a way that kills it or stops it from replicating,” said Brian Hie, an MIT graduate student and the study’s lead author. “It has to preserve its fitness while disguising itself well enough that the human immune system cannot detect it.”

To model this process, the researchers trained an NLP model to analyze patterns found in genetic sequences, which allows it to predict new sequences that have new functions but still follow the biological rules of protein structure. An important advantage of this approach is that it needs only sequence information, which is much easier to obtain than protein structures. The model can also be trained on a relatively small amount of data. In this study, 60,000 HIV sequences, 45,000 influenza virus sequences, and 4,000 coronavirus sequences were used.

“The language model is powerful because it can learn this complex distributional structure and gain some insight into function from sequence variation alone,” Hie said. “We have this large corpus of viral sequence data for each amino acid position, and the model learns the properties of amino acid co-occurrence and co-variation across the training data.”
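To make “co-variation” concrete, the sketch below measures how strongly two positions in a set of aligned sequences vary together, using mutual information. This is a simple stand-in chosen for illustration, not the method used in the study, where a language model picks up such statistics implicitly. The six four-residue sequences are invented.

```python
import math
from collections import Counter

def mutual_information(seqs, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(seqs)
    count_i = Counter(s[i] for s in seqs)
    count_j = Counter(s[j] for s in seqs)
    count_ij = Counter((s[i], s[j]) for s in seqs)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Invented aligned sequences: position 1 (K/R) and position 2 (L/M)
# always change together, position 0 never changes.
aligned = ["AKLE", "AKLD", "ARME", "ARMD", "AKLE", "ARMD"]

mi_coupled = mutual_information(aligned, 1, 2)
mi_weak = mutual_information(aligned, 1, 3)
mi_constant = mutual_information(aligned, 0, 1)
print(f"{mi_coupled:.3f} {mi_weak:.3f} {mi_constant:.3f}")
```

Positions that always change together score high, positions that vary mostly independently score low, and a position that never varies carries no information about any other.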

Blocking viral escape

Once the model was trained, the researchers used it to predict which sequences of the coronavirus spike protein, the HIV envelope protein, and the influenza virus hemagglutinin (HA) protein are more or less likely to generate escape mutations.

For influenza viruses, the model shows that the sequences least likely to mutate and produce escape lie in the stalk of the HA protein. This is consistent with recent findings that antibodies targeting the HA stalk, which most people infected with or vaccinated against influenza do not produce, can provide near-universal protection against any influenza strain.

The model’s analysis of coronaviruses suggests that a part of the spike protein called the S2 subunit is the least likely to generate escape mutations. How quickly SARS-CoV-2 mutates is still an open question, so it is unclear how long the COVID-19 vaccines now being deployed will remain effective. Preliminary evidence suggests that the virus does not mutate as fast as influenza viruses or HIV.

In their analysis of HIV, the researchers found that the V1-V2 hypervariable region of the envelope protein has many possible escape mutations, which is consistent with previous studies. They also found some sequences with a low probability of escape.

The researchers are now working with others to use their model to identify potential targets for cancer vaccines, which stimulate the body’s own immune system to eliminate cancer cells. They say it could also be used to design small-molecule drugs that are less likely to provoke resistance, for diseases such as tuberculosis.