(Nanowerk News) Researchers at the National Institute of Standards and Technology (NIST) have developed a new statistical tool that they have used to predict protein function. Not only can it help with the difficult job of altering proteins in practically useful ways, but it also works by methods that are fully interpretable – an advantage over the conventional artificial intelligence (AI) approaches that have assisted with protein engineering in the past.
The new tool, called LANTERN, could prove useful in work ranging from producing biofuels to improving crops to developing new disease treatments. Proteins, as basic building blocks of biology, are a key element in all these tasks. But while it is relatively easy to make changes to the strand of DNA that serves as the blueprint for a given protein, it remains a challenge to determine which specific base pairs – the rungs on the DNA ladder – are the keys to producing a desired effect. Finding these keys has been the province of AI built on deep neural networks (DNNs), which, though effective, are notoriously opaque to human understanding.
Described in a new paper published in Proceedings of the National Academy of Sciences (“Interpretable modeling of genotype–phenotype landscapes with state-of-the-art predictive power”), LANTERN demonstrates the ability to predict the genetic edits needed to create useful differences in three different proteins.
One is the spike protein from the surface of the SARS-CoV-2 virus, which causes COVID-19; understanding how changes in DNA can alter this spike protein could help epidemiologists predict the future of the pandemic. The other two are well-known laboratory workhorses: the LacI protein from the bacterium E. coli, and green fluorescent protein (GFP), used as a marker in biology experiments.
By choosing these three subjects, the NIST team was able to show not only that their tool works, but also that its results are interpretable – an important characteristic for industry, which needs predictive methods that help with understanding of the underlying system.
“We have an approach that is fully interpretable and that also has no loss in predictive power,” said Peter Tonner, a statistician and computational biologist at NIST and LANTERN’s lead developer. “There is a widespread assumption that if you want one of those things, you cannot have the other. We have shown that sometimes you can have both.”
The problem the NIST team is tackling can be imagined as interacting with a complex machine that sports a large control panel filled with thousands of unlabeled switches: The device is a gene, a strand of DNA that encodes a protein; the switches are base pairs on the strand. The switches all affect the device’s output somehow. If your job is to make the machine work differently in a specific way, which switches should you flip?
Because the answer might require changes to several base pairs, researchers have to flip some combination of them, measure the result, then choose a new combination and measure again. The number of permutations is daunting.
“The number of potential combinations can be greater than the number of atoms in the universe,” Tonner said. “You could never measure all the possibilities. That’s a ridiculously large number.”
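The scale of that search space is easy to check with back-of-the-envelope arithmetic. In the sketch below, the 300-residue protein length and the 10^80 atom count are illustrative assumptions, not figures from the NIST study:

```python
# Back-of-the-envelope arithmetic for the size of a protein's variant space.
# The 300-residue length and 10^80 atom count are illustrative assumptions.

L = 300           # residues in a hypothetical, typically sized protein
AMINO_ACIDS = 20  # standard amino acids possible at each position

total_variants = AMINO_ACIDS ** L  # every possible sequence of this length
atoms_in_universe = 10 ** 80       # common order-of-magnitude estimate

# 20^300 is roughly 10^390 -- vastly more than the atoms in the universe.
print(total_variants > atoms_in_universe)  # True
```

Even measuring one variant per nanosecond since the Big Bang would not dent a space of this size, which is why researchers turn to predictive models instead of exhaustive search.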
Because of the staggering amounts of data involved, DNNs have been enlisted to sort through a sampling of data and predict which base pairs need to be flipped. So far they have proved successful – as long as you don’t ask for an explanation of how they arrive at their answers. They are often described as “black boxes” because their inner workings are inscrutable.
“It’s really hard to understand how DNNs make their predictions,” said NIST physicist David Ross, one of the paper’s co-authors. “And it’s a big problem if you want to use those predictions to construct something new.”
LANTERN, on the other hand, is explicitly designed to be understandable. Part of its explainability stems from its use of interpretable parameters to represent the data it analyzes. Rather than allowing these parameters to grow extraordinarily numerous and often inscrutable, as is the case with DNNs, each parameter in LANTERN’s calculations has a purpose that is meant to be intuitive, helping users understand what the parameters mean and how they influence LANTERN’s predictions.
The LANTERN model represents protein mutations using vectors, mathematical tools often depicted visually as arrows. Each arrow has two properties: its direction indicates the effect of the mutation, and its length represents how strong that effect is. When two proteins have vectors that point in the same direction, LANTERN indicates that the proteins have similar functions.
These vectors’ directions often map onto biological mechanisms. For example, LANTERN discovered a direction associated with protein folding in all three of the data sets the team studied. (Because folding plays a critical role in how a protein functions, identifying this factor across data sets was an indication that the model performs as intended.) When making predictions, LANTERN simply adds these vectors together – a method that users can trace when examining its predictions.
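The additive-vector idea can be sketched in a few lines of code. The mutation labels, the two-dimensional latent space, and the numeric values below are all invented for illustration; LANTERN learns its vectors from experimental data:

```python
import math

# Toy sketch of additive mutation vectors in the style LANTERN uses.
# The mutation labels and values are invented; a 2-D space is assumed.
effects = {
    "A42G": (0.9, 0.1),    # hypothetical strong effect in one direction
    "T87S": (0.45, 0.05),  # same direction, weaker effect
    "L10P": (-0.1, 0.8),   # a different underlying mechanism
}

def combined_effect(mutations):
    """Predict a variant's overall effect by summing its mutation vectors."""
    x = y = 0.0
    for m in mutations:
        dx, dy = effects[m]
        x, y = x + dx, y + dy
    return (x, y)

def cosine(u, v):
    """Vectors pointing the same way (cosine near 1) imply similar function."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(*u) * math.hypot(*v))

print(cosine(effects["A42G"], effects["T87S"]))  # ~1.0: parallel vectors
print(combined_effect(["A42G", "L10P"]))         # sum of the two arrows
```

Because a prediction is just a sum of labeled arrows, a user can inspect exactly which mutation contributed what – the traceability the article describes.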
Other labs had already used DNNs to make predictions about which switch-flips would make useful changes to the three subject proteins, so the NIST team decided to pit LANTERN against those DNNs’ results. The new approach was not merely adequate; according to the team, it achieves a new state of the art in predictive accuracy for this type of problem.
“LANTERN equaled or outperformed nearly all alternative approaches with respect to prediction accuracy,” Tonner said. “It outperforms all other approaches in predicting changes to LacI, and it has comparable predictive accuracy for GFP for all except one. For SARS-CoV-2, it has higher predictive accuracy than all alternatives other than one type of DNN, which matched LANTERN’s accuracy but didn’t beat it.”
LANTERN figures out which sets of switches have the largest effect on a given attribute of the protein – its folding stability, for example – and summarizes how the user can adjust that attribute to achieve a desired effect. In a way, LANTERN transmutes the many switches on our machine’s panel into a few simple knobs.
“It reduces thousands of switches to maybe five little knobs you can turn,” Ross said. “It tells you the first knob will have a big effect, the second will have a different effect but a smaller one, the third even smaller, and so on. So as an engineer it tells me I can focus on the first and second knobs to get the outcome I need. LANTERN lays all this out for me, and it’s incredibly helpful.”
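The few-knobs picture is, at heart, low-dimensional structure hiding inside high-dimensional measurements. As a rough analogy only – LANTERN itself is a Bayesian latent-variable model, not a plain singular value decomposition, and the data below is synthetic – an SVD shows how a handful of directions can dominate a thousand switches:

```python
import numpy as np

# Rough analogy for the "few knobs" idea: synthetic data with 1,000
# "switches" per variant but only 3 hidden directions of real variation.
rng = np.random.default_rng(0)

latent = rng.normal(size=(200, 3))          # 3 hidden "knobs" per variant
loadings = rng.normal(size=(1000, 3))       # how the knobs drive switches
noise = 0.1 * rng.normal(size=(200, 1000))  # measurement noise
X = latent @ loadings.T + noise             # 200 variants x 1000 switches

_, s, _ = np.linalg.svd(X, full_matrices=False)
# The first three singular values dwarf the rest: a few knobs, in
# decreasing order of importance, explain almost all of the variation.
print(s[:5] / s[0])
```

The ordered fall-off of singular values mirrors Ross’s description: the first knob matters most, the second less, and everything past the hidden dimensionality is essentially noise.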
Rajmonda Caceres, a researcher at MIT’s Lincoln Laboratory who is familiar with the method behind LANTERN, said she values the tool’s interpretability.
“There are not many AI methods applied to biology applications where they explicitly design for interpretability,” says Caceres, who is not affiliated with the NIST study. “When biologists see the results, they can see which mutation is contributing to the change in the protein. This level of interpretation enables more interdisciplinary research, as biologists can understand how the algorithm learns and they can generate additional insights into the biological system being studied.”
Tonner said that although he is pleased with the results, LANTERN is not a panacea for AI’s explainability problem. Exploring alternatives to DNNs more broadly would benefit the entire effort to create explainable, trustworthy AI, he said.
“In predicting genetic effects on protein function, LANTERN is the first example of something that competes with DNNs in predictive power while still being fully interpretable,” Tonner said. “It provides a specific solution to a specific problem. We hope that it can apply to others and that this work inspires the development of new interpretable approaches. We do not want predictive AI to remain a black box.”