Motivation

Lead optimization is a significant challenge in the design of new pharmaceuticals. Even once a compound or class of compounds with activity against a desired biological target has been discovered, a vast number of local variations of the molecule might still affect its activity. The possibilities are too numerous to test by brute force, even with computational screening methods. It is therefore desirable to develop machine-learning heuristics that identify the features of a molecule that determine its activity against a specific target, and that use those features to construct new molecules with maximal predicted bioactivity.

Major Questions

How do we group bonds and atoms into local chemical features?

How do we group local features into higher-order features that determine bioactivity?

How do we generate new candidate molecules based on these features?

Strategy

Restricted Boltzmann machines (RBMs) are a class of stochastic neural network that can be trained as generative models of a dataset. They can be trained with an algorithm closely related to the wake-sleep algorithm: the system is alternately exposed to training data (the "waking" phase) and allowed to generate states spontaneously (the "sleep" phase), and the connection weights between nodes are adjusted to minimize the difference between the probability distribution of the training data and that of the states the network spontaneously "dreams" of.
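
The wake/sleep alternation described above can be sketched in NumPy. The exact update rule used in this project is not stated here, so the sketch assumes one-step contrastive divergence (CD-1), a common way to train RBMs. The function name cd1_step, the learning rate, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v_data, W, b_vis, b_hid, rng, lr=0.1):
    """One CD-1 update on a batch of binary rows (assumed rule, not the author's code).

    W has shape (n_visible, n_hidden).
    """
    # "Waking" phase: hidden units are driven by the training data.
    p_h_data = sigmoid(v_data @ W + b_hid)
    h_sample = (rng.random(p_h_data.shape) < p_h_data).astype(float)

    # "Sleep" phase: one Gibbs step in which the network generates its own state.
    p_v_model = sigmoid(h_sample @ W.T + b_vis)
    v_model = (rng.random(p_v_model.shape) < p_v_model).astype(float)
    p_h_model = sigmoid(v_model @ W + b_hid)

    # Move parameters toward the data statistics and away from the dreamed ones.
    n = v_data.shape[0]
    W = W + lr * (v_data.T @ p_h_data - v_model.T @ p_h_model) / n
    b_vis = b_vis + lr * (v_data - v_model).mean(axis=0)
    b_hid = b_hid + lr * (p_h_data - p_h_model).mean(axis=0)
    return W, b_vis, b_hid

# Toy usage: two repeated binary patterns, four visible and two hidden units.
rng = np.random.default_rng(0)
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 8, dtype=float)
W = 0.01 * rng.standard_normal((4, 2))
b_vis = np.zeros(4)
b_hid = np.zeros(2)
for _ in range(200):
    W, b_vis, b_hid = cd1_step(data, W, b_vis, b_hid, rng)
```

Running more Gibbs steps in the sleep phase brings the dreamed statistics closer to the model's true distribution, at a higher cost per update.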

The hidden nodes of an RBM each represent a "feature" that occurs frequently in the dataset to which it has been exposed. Beginning with a list of strings in SMILES notation representing chemical compounds with a shared bioactivity, I first trained a small RBM to recognize local features of the molecules. I then used the output from this layer as the training input for a second layer, where the hidden nodes would represent higher-order features shared by the class of training examples.
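
How the SMILES strings were turned into binary input for the first layer is not described here. One plausible option, shown below purely as an assumption, is a per-character one-hot encoding padded to a fixed length. The helper names (build_alphabet, encode, decode) and the three toy molecules are hypothetical and are not the neuraminidase training set. Note that this simple scheme treats two-letter atoms such as Cl or Br as two separate characters.

```python
import numpy as np

def build_alphabet(smiles_list, pad=" "):
    """Collect every character seen in the strings; index 0 is the padding symbol."""
    chars = sorted(set("".join(smiles_list)) - {pad})
    return [pad] + chars

def encode(smiles, alphabet, max_len):
    """Return a flat binary vector with one 'on' unit per string position."""
    if len(smiles) > max_len:
        raise ValueError("string longer than max_len")
    index = {ch: i for i, ch in enumerate(alphabet)}
    onehot = np.zeros((max_len, len(alphabet)))
    for pos, ch in enumerate(smiles.ljust(max_len, alphabet[0])):
        onehot[pos, index[ch]] = 1.0
    return onehot.ravel()

def decode(vector, alphabet, max_len):
    """Pick the most active character at each position and strip the padding."""
    grid = np.asarray(vector).reshape(max_len, len(alphabet))
    text = "".join(alphabet[i] for i in grid.argmax(axis=1))
    return text.rstrip(alphabet[0])

# Toy inputs only: ethanol, acetic acid, benzene.
toy = ["CCO", "CC(=O)O", "c1ccccc1"]
alphabet = build_alphabet(toy)
max_len = max(len(s) for s in toy)
X = np.stack([encode(s, alphabet, max_len) for s in toy])
```

Each row of X could serve as one visible-layer training example. Because decode takes an argmax per position, it also accepts the real-valued activation probabilities a network produces, not just binary vectors.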

There are two ways for the network to "dream" in order to produce new candidate molecules:

The first is for the network to "daydream", fluctuating about its equilibrium distribution without any external input. Samples drawn from its "dreamed" states will approximate the probability distribution from which the training data are drawn. If the model is trained effectively, these samples will also be probable inhibitors of the target (in this project, neuraminidase).

The second is for the network to produce a "fantasy" of an optimal molecule that possesses all the features that are common to the training set of bioactive molecules (or as many as possible). This is achieved by turning the nodes that represent the features thought to contribute to bioactivity permanently on and inferring the most probable allowed SMILES string under this condition.
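
The two dreaming modes can be sketched for a single RBM layer as follows. This is a simplified illustration, not the project's implementation: it assumes a weight matrix W of shape (n_visible, n_hidden), sets every unclamped hidden unit to off in the "fantasy" mode, and ignores the constraint that the result must be an allowed SMILES string. The function names daydream and fantasy are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    """Draw a binary state where each unit is on with probability p."""
    return (rng.random(p.shape) < p).astype(float)

def daydream(W, b_vis, b_hid, rng, n_steps=100):
    """Free-running Gibbs chain with no external input; returns a visible sample."""
    v = sample(np.full(b_vis.shape, 0.5), rng)
    for _ in range(n_steps):
        h = sample(sigmoid(v @ W + b_hid), rng)
        v = sample(sigmoid(h @ W.T + b_vis), rng)
    return v

def fantasy(W, b_vis, clamped_on):
    """Clamp the chosen hidden units on and return the most probable visible state.

    Given a fixed hidden state, the visible units of an RBM are conditionally
    independent, so thresholding each probability at 0.5 gives the mode.
    """
    h = np.zeros(W.shape[1])
    h[list(clamped_on)] = 1.0
    p_v = sigmoid(h @ W.T + b_vis)
    return (p_v > 0.5).astype(float)

# Toy usage with hand-picked weights: hidden unit 0 favours visible unit 0.
rng = np.random.default_rng(0)
W = np.array([[5.0, -5.0], [-5.0, 5.0]])
b_vis = np.zeros(2)
b_hid = np.zeros(2)
dreamed = daydream(W, b_vis, b_hid, rng)
ideal = fantasy(W, b_vis, clamped_on=[0])
```

In a stacked network the same idea applies layer by layer, with the clamped top-level state propagated down to the visible units.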

As a proof of concept, I chose to train the network on a set of SMILES strings representing molecules that inhibit neuraminidase, an enzyme found on the surface of the influenza virus that is required for its proliferation.

Construction

Existing Python libraries can run Boltzmann machines but are not suitable for implementing the "dreaming" procedure required for the network to design new molecules. I therefore wrote my own implementation of a training algorithm for deep Boltzmann machines (DBMs) using only the Python standard library and NumPy.

Results

I am still working on obtaining good convergence from the algorithm and tuning the learning parameters. I am also still improving my set of rules for ensuring that the output string is a valid SMILES string. However, the network runs and produces a mostly correct SMILES string as output, although the output does not yet resemble the input structures as it should.
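
The rule set used in this project is not listed here. As an illustration of the kind of check involved, the sketch below tests a few necessary (but far from sufficient) structural conditions: balanced parentheses, balanced square brackets, and paired ring-closure digits. The function name basic_smiles_checks is hypothetical, and the check ignores valence, aromaticity, and many other SMILES rules that a full cheminformatics toolkit would enforce.

```python
def basic_smiles_checks(s):
    """Return True if s passes a few structural sanity checks for SMILES."""
    if not s:
        return False
    depth = 0            # current nesting depth of branch parentheses
    in_bracket = False   # True while inside a [...] atom specification
    open_rings = set()   # ring-closure digits seen an odd number of times
    for ch in s:
        if in_bracket:
            if ch == "]":
                in_bracket = False
            elif ch == "[":
                return False  # nested brackets are not allowed
            continue  # digits inside brackets are not ring closures
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            return False  # closing bracket without an opening one
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # branch closed before it was opened
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first use opens a ring, second closes it
    return depth == 0 and not in_bracket and not open_rings

# Toy usage: filter candidate strings before any further inspection.
candidates = ["c1ccccc1", "CC(=O)O", "CC(=O", "C1CC", "[NH4+]"]
passed = [s for s in candidates if basic_smiles_checks(s)]
```

A filter like this can discard obviously malformed output cheaply, leaving the stricter chemical checks for the strings that survive.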

Further Development

Training RBMs separately and then stacking them to form a DBM is an imperfect procedure, and the resulting stacked network can be further optimized by backpropagation. The approach I took in this project might be well complemented by genetic algorithms for feature selection, or by a set of manually designed features. If the network does eventually yield promising candidates, these could be investigated further with molecular dynamics simulations to determine whether they bind neuraminidase effectively.
