Sampling the protein conformational space using neural networks

In recent years, many new methods that use neural networks to sample the protein conformational space have been presented.

In this article we explore and evaluate these different neural network approaches.

What is the protein conformational space?

The folded 3D structure of a protein is called its conformation. Since a protein can fold in many different ways, for example by rotating around the backbone dihedral angles at each alpha carbon, many conformations exist.

The number of possible conformations also increases with the number of residues. The protein conformational space refers to all possible conformations of the protein.
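To make the growth of the conformational space concrete, here is a Levinthal-style back-of-envelope estimate. The assumption of exactly three accessible backbone states per residue is an illustrative simplification, not a measured value:

```python
# Back-of-envelope estimate of conformational-space size (Levinthal-style).
# Assumption (illustrative): each residue can adopt a small fixed number of
# backbone states, here 3; the true number of accessible states varies.
def num_conformations(n_residues: int, states_per_residue: int = 3) -> int:
    return states_per_residue ** n_residues

# Even a modest 100-residue protein yields an astronomically large space:
total = num_conformations(100)  # 3**100, roughly 5e47 conformations
```

The exponential growth is the whole point: exhaustive enumeration is hopeless, which is why sampling strategies are needed at all.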

The stability of each protein conformation varies and is quantified by its free energy. For high stability, it is important to have good stereochemistry (no steric clashes), buried charged atoms should be paired, and the interior of the protein should be densely packed to provide thermodynamic stability [1].

Exploring the conformational space of proteins helps us understand many of their biochemical functions, and supports the development of effective drugs for curing and managing disease [2].

Methods for sampling the conformational space

Many different methods have previously been used to understand the structure-function relationship. Two common sampling techniques are Monte Carlo and molecular dynamics (MD). In molecular dynamics, the physical movements of atoms and molecules are simulated.

The forces between the particles and their energies are calculated using interatomic potentials, while the acceleration resulting from these forces is determined by Newton’s second law of motion.
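The integration step described above can be sketched in a toy one-dimensional system. This is a minimal velocity Verlet integrator for a single particle in a harmonic potential; a real MD engine would use interatomic potentials (force fields) and many coupled degrees of freedom instead of this stand-in force:

```python
import numpy as np

# Toy 1D molecular-dynamics sketch: velocity Verlet integration of a particle
# in a harmonic potential U(x) = 0.5*k*x**2 (an assumed stand-in for a real
# interatomic potential).
def force(x, k=1.0):
    return -k * x  # F = -dU/dx

def velocity_verlet(x, v, dt=0.01, n_steps=1000, m=1.0):
    traj = [x]
    f = force(x)
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * (f / m) * dt ** 2      # position update
        f_new = force(x)
        v = v + 0.5 * (f + f_new) / m * dt            # a = F/m (Newton's 2nd law)
        f = f_new
        traj.append(x)
    return np.array(traj), v

traj, v_final = velocity_verlet(x=1.0, v=0.0)
# Total energy 0.5*v**2 + 0.5*x**2 stays close to its initial value of 0.5,
# which is the property that makes velocity Verlet a standard MD integrator.
```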

Over the years, MD simulations have been used to understand the structure-function relationship, but today MD simulations are seen as unfeasible for complex systems, since they are computationally expensive.

Furthermore, many of the algorithms used to explore the conformational space are designed to reject configurations where the energy increases, which means that the algorithms often get stuck in the nearest valleys, or local minima [1]. For these reasons, new approaches for sampling the protein conformational space are needed.
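The local-minimum problem, and the classical way around it, can be illustrated with a Metropolis Monte Carlo sketch on a toy double-well energy surface. Unlike a greedy minimizer, a Metropolis sampler accepts energy-increasing moves with probability exp(-dE/kT), which lets it cross barriers between minima. The potential and parameters here are illustrative choices:

```python
import math
import random

# Metropolis Monte Carlo on a 1D double-well surface (illustrative potential).
def energy(x):
    return (x ** 2 - 1.0) ** 2  # minima at x = -1 and x = +1, barrier at x = 0

def metropolis(x0=-1.0, kT=0.3, step=0.2, n_steps=20000, seed=0):
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)
        dE = energy(x_new) - energy(x)
        # Accept downhill moves always, uphill moves with prob. exp(-dE/kT);
        # the uphill acceptances are what allow barrier crossing.
        if dE <= 0 or rng.random() < math.exp(-dE / kT):
            x = x_new
        samples.append(x)
    return samples

samples = metropolis()
# Starting in the left well (x = -1), the walker also visits the right well.
```

A greedy variant (accepting only dE <= 0) would stay in the left well forever, which is exactly the trapping behavior described above.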

Neural network approaches

Neural networks are a machine-learning concept inspired by a simplification of neurons in the brain. An artificial neural network is an interconnected group of neurons; the connections between neurons are called edges, and each edge has a weight associated with it.

Multiple neural network approaches have been suggested that address the task of sampling the protein conformational space. These approaches are described in the sections below.

Reweighted autoencoded variational Bayes

Ribeiro et al. [3] developed an algorithm called RAVE, which uses a state-of-the-art deep learning model called a variational autoencoder (VAE) to sample the protein conformational space. A variational autoencoder is a type of autoencoder that regularizes the latent space, avoiding the irregularities an ordinary autoencoder can produce there.

A variational autoencoder returns a distribution over the latent space instead of a single point. In machine learning, the latent space can be described as a simplified representation of the data, where only the most important features are kept and similar data points lie close to each other.
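The "distribution instead of a point" idea is implemented through the VAE's reparameterization trick, sketched below. The encoder and decoder networks are omitted; the array shapes and the 8-dimensional latent space are assumptions for illustration, not RAVE's actual architecture:

```python
import numpy as np

# Sketch of a VAE's sampling step (the reparameterization trick).
# An encoder (omitted) would map a conformation to a mean and log-variance
# per latent dimension; a decoder (omitted) would map latent samples back.
rng = np.random.default_rng(0)

def sample_latent(mu, log_var):
    # z = mu + sigma * eps, with eps ~ N(0, I). Writing the sample this way
    # keeps it differentiable w.r.t. mu and log_var, which is what makes the
    # VAE trainable by gradient descent.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu = np.zeros(8)       # assumed encoder output: mean of an 8-dim latent dist.
log_var = np.zeros(8)  # assumed encoder output: log-variance (unit variance)
z = sample_latent(mu, log_var)  # one point drawn from the latent distribution
```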

The RAVE algorithm determines an optimal reaction coordinate and its probability distribution, which can be used for the next biased simulation. This process continues until the system converges to a desirable thermodynamic state. In the paper, the algorithm was tested on three illustrative examples: the Szabo-Berezhkovskii potential, a three-state potential, and a hydrophobic ligand-cavity system in explicit water.

The RAVE algorithm was compared with umbrella sampling and metadynamics, two methods used to estimate free energies, and was shown to be at least 20 times faster than both. However, this estimate does not take the VAE into account, which would add a small computational overhead.

Boltzmann generators

Noé et al. [4] combined deep machine learning and statistical mechanics to develop Boltzmann generators, with the purpose of generating independent samples of low-energy structures.

The system was trained using an invertible neural network that learned the coordinate transformation from the system's configurations to a latent space representation. Since the network is invertible, a latent sample can then be transformed back into a system configuration with high Boltzmann probability.

The goal of a Boltzmann generator is to minimize the difference between the Boltzmann distribution and the distribution of generated samples. Training starts from a simple distribution (e.g. a Gaussian) in latent space, and the invertible neural network is then trained to transform this simple distribution into the desired Boltzmann distribution; in each iteration the network weights are adjusted toward the desired energy distribution for the system.

This was done by computing thermodynamic quantities and adjusting the weights of the network based on the free energy differences. In the paper, the Boltzmann generator was illustrated on the double-well potential and the Müller potential, both of which have metastable states separated by high energy barriers. Despite the high barriers, the Boltzmann generator was able to find the unbiased equilibrium.
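The invertibility that Boltzmann generators rely on can be sketched with a single RealNVP-style affine coupling layer: half the coordinates pass through unchanged and parameterize a scale and shift of the other half, so the inverse exists in closed form. A real Boltzmann generator stacks many such layers and learns s() and t() as neural networks; here they are fixed toy functions, chosen only to demonstrate exact invertibility:

```python
import numpy as np

# One affine coupling layer (RealNVP-style), the building block of the
# invertible networks used by Boltzmann generators. s() and t() are toy
# stand-ins for the learned scale and translation networks.
def s(x):
    return np.tanh(x)

def t(x):
    return 0.5 * x

def forward(z):
    z1, z2 = z[: len(z) // 2], z[len(z) // 2 :]
    x2 = z2 * np.exp(s(z1)) + t(z1)   # transform second half, keep first half
    return np.concatenate([z1, x2])

def inverse(x):
    x1, x2 = x[: len(x) // 2], x[len(x) // 2 :]
    z2 = (x2 - t(x1)) * np.exp(-s(x1))  # exact closed-form inverse
    return np.concatenate([x1, z2])

z = np.array([0.3, -1.2, 0.7, 2.0])   # a latent-space sample
x = forward(z)                         # mapped to "configuration" space
z_back = inverse(x)                    # recovered exactly (up to float error)
```

Because both directions are cheap and exact, samples drawn in latent space can be mapped to configurations, and configurations mapped back, which is what lets the generator be trained against the Boltzmann distribution.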

A disadvantage of Boltzmann generators mentioned in the paper is that, depending on the training method, they may not be ergodic. In essence, this means that they may not reach all possible configurations.

Reinforcement learning based adaptive sampling

Shamsi et al. [5] presented the REAP algorithm for sampling the protein conformational space. Reinforcement learning is a machine-learning concept based on Pavlovian conditioning and control theory.

In reinforcement learning, feedback from the environment, such as rewards, is used to train the algorithm. The REAP algorithm can briefly be described by the following steps. Note that this is a simplified representation of the algorithm.

  1. Run a series of short molecular dynamics simulations, from a collection of initial structures.
  2. Based on the order parameters of interest, cluster the proteins.
  3. Pick structures from these clusters based on a reward function, and begin new simulations, repeating until the sampling is sufficient.

The reward function depends on weights that indicate the importance of the different order parameters. After each step, the weights are tuned to maximize the cumulative reward.
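A weighted reward of this kind can be sketched as follows. The functional form below, deviation of each order parameter from the already-sampled mean, normalized by its spread and weighted, is a simplified illustration in the spirit of REAP, not the paper's exact expression; the numbers are made up:

```python
import numpy as np

# Sketch of a REAP-style reward over order parameters (illustrative form).
def reward(structure_ops, sampled_mean, sampled_std, weights):
    # structure_ops: order-parameter values of a candidate structure
    # sampled_mean/std: statistics of the region explored so far
    # weights: learned importance of each order parameter
    deviation = np.abs(structure_ops - sampled_mean) / sampled_std
    return float(np.dot(weights, deviation))

# Two hypothetical order parameters (e.g. a residue-pair distance and a
# solvent-accessible area). Candidate B sits further from the explored
# region, so it scores higher and would seed the next simulation round.
mean, std = np.array([1.0, 5.0]), np.array([0.2, 1.0])
w = np.array([0.5, 0.5])
r_a = reward(np.array([1.1, 5.5]), mean, std, w)  # close to explored region
r_b = reward(np.array([2.0, 8.0]), mean, std, w)  # far from explored region
```

Tuning the weights w after each round is what lets the algorithm learn which order parameters are worth pushing on.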

The performance of the REAP algorithm was evaluated by comparing it to conventional single long trajectories (SL) and least counts sampling (LC). To test the algorithm they used two idealized potentials, an L-shaped and a circular landscape (idealized potentials are often used for demonstration purposes). In the paper they found that the REAP algorithm outperformed SL and LC for both of these landscapes.

Furthermore, the REAP algorithm was then tested on two molecular systems, alanine dipeptide and Src kinase. For both of these systems the REAP algorithm outperformed SL and LC, and was efficient at exploring the landscape.

The algorithm also proved effective with a large number of order parameters, by reducing the weights of the unfavorable ones. Since REAP is based on reinforcement learning, a benefit of using it is that it does not require any prior information about which variable should be maximized or minimized, such as a residue pair distance, solvent-accessible area, etc.

Discussion and conclusion

These new neural network approaches have shown promising results for sampling the conformational space and finding low-energy structures. Compared to previous methods, they are also less computationally expensive, which makes them beneficial for complex systems, such as large protein structures.

The methods presented by Noé et al. [4] and Ribeiro et al. [3] take advantage of the latent space, a simplified representation of the data. Because of this, I would assume that these methods could be less computationally expensive than the reinforcement learning approach presented by Shamsi et al. [5].

However, a reinforcement learning approach has other benefits. Since REAP uses reinforcement learning, the only input needed is a list of the possible order parameters. It does not require any prior knowledge about which order parameter should be minimized or maximized; the algorithm learns this by itself.

One drawback of using neural networks and deep learning for sampling the protein conformational space is the 'black box' problem. In AI and machine learning, 'black box' refers to the fact that we cannot easily explain how a model arrives at its results.

We can see the input and the output of the system, but it is difficult to say why the model has given a specific result. Furthermore, deep learning models can still be trapped in local minima, and may not necessarily reach all possible configurations [4].

Since a neural network has no built-in knowledge of the physical laws of nature, it could generate conformations that are unfavorable or not found in nature.

However, this disadvantage can be mitigated by penalizing conformations with high free energy. Nonetheless, high-energy conformations cannot be penalized too strongly, since that could prevent us from finding the global minimum; sometimes it is beneficial to explore the higher-energy regions in order to find protein conformations with lower energy.
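One soft way to realize such a penalty is Boltzmann weighting: the factor exp(-E/kT) suppresses high-energy conformations smoothly rather than forbidding them, so a sampler can still pass through them on the way to a deeper minimum. The energy values and kT below are arbitrary illustrative units:

```python
import math

# Soft penalty on high-free-energy conformations via Boltzmann weighting.
# exp(-E/kT) down-weights high-energy states without rejecting them outright.
def boltzmann_weight(energy, kT=1.0):
    return math.exp(-energy / kT)

low = boltzmann_weight(0.5)      # a low-energy conformation
barrier = boltzmann_weight(5.0)  # a high-energy barrier conformation
ratio = low / barrier            # barrier state is down-weighted ~90x,
                                 # but its weight never reaches zero
```

Raising kT softens the penalty (more exploration of high-energy regions); lowering it sharpens the penalty toward greedy minimization, which is exactly the trade-off discussed above.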

References

[1] Jane S. Richardson and David C. Richardson. Principles and Patterns of Protein Conformation. Ed. by Gerald D. Fasman. Boston, MA: Springer US, 1989. isbn: 978-1-4613-1571-1. doi: 10.1007/978-1-4613-1571-1_1. url: https://doi.org/10.1007/978-1-4613-1571-1_1.

[2] Emmanuel Oluwatobi Salawu. “Enhanced Sampling of Nucleic Acids’ Structures Using Deep-Learning-Derived Biasing Forces”. In: 2020 IEEE Symposium Series on Computational Intelligence (SSCI). 2020, pp. 1648–1654. doi: 10.1109/SSCI47803.2020.9308559.

[3] Joao Marcelo Lamim Ribeiro et al. “Reweighted autoencoded variational Bayes for enhanced sampling (RAVE)”. In: The Journal of Chemical Physics 149.7 (2018), p. 072301. doi: 10.1063/1.5025487. url: https://doi.org/10.1063/1.5025487.

[4] Frank Noé et al. “Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning”. In: Science 365.6457 (2019), eaaw1147. doi: 10.1126/science.aaw1147. url: https://www.science.org/doi/abs/10.1126/science.aaw1147.

[5] Zahra Shamsi, Kevin J. Cheng, and Diwakar Shukla. “Reinforcement Learning Based Adaptive Sampling: REAPing Rewards by Exploring Protein Conformational Landscapes”. In: The Journal of Physical Chemistry B 122.35 (2018), pp. 8386–8395. doi: 10.1021/acs.jpcb.8b06521. url: https://doi.org/10.1021/acs.jpcb.8b06521.