This app accepts a file with SMILES strings and target values, featurizes the molecules, and uses a
neural network to fit the features to the target values.
Input should be a .csv file that has a column of SMILES strings and a column of potential target values. The file can have any number of
columns as long as those two are included.
The app will read the CSV and allow you to choose a column to be the target for the model. You can also choose to apply a natural log
transformation to the target values if they span a large range (for example, IC50 values from 0.1 nM to 1000 nM).
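As a minimal sketch of this step in Python (the file and column names are placeholders; in the app you pick the target column interactively):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("molecules.csv")        # extra columns are fine; they are ignored
smiles = df["SMILES"].tolist()           # assumed name of the SMILES column
target = df["IC50_nM"].to_numpy(float)   # assumed name of the target column

apply_log = True                         # helpful when values span several decades
if apply_log:
    target = np.log(target)              # natural log transformation
```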
You may view the molecules in your dataset on the right side of the screen after you load the file.
Molecules are featurized from the SMILES strings using RDKit descriptors: up to 43 descriptors are computed, including
all of the usual ADME/Lipinski values. Features are scaled before training.
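A sketch of the featurization and scaling, assuming RDKit and scikit-learn; the app computes up to 43 descriptors, while the handful below are illustrative stand-ins (the sketch also assumes every SMILES string parses):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.preprocessing import StandardScaler

def featurize(smiles_list):
    """Compute a small descriptor vector for each SMILES string."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        rows.append([
            Descriptors.MolWt(mol),          # molecular weight
            Descriptors.MolLogP(mol),        # logP (lipophilicity)
            Descriptors.NumHDonors(mol),     # H-bond donors
            Descriptors.NumHAcceptors(mol),  # H-bond acceptors
            Descriptors.TPSA(mol),           # topological polar surface area
        ])
    return np.array(rows)

X = featurize(smiles)                    # smiles from the reading sketch above
X = StandardScaler().fit_transform(X)    # zero mean, unit variance per feature
```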
The dataset is divided into training (80%) and validation (20%) sets.
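Sketched with scikit-learn (the random seed is an assumption):

```python
from sklearn.model_selection import train_test_split

# 80% of rows for training, 20% held out for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, target, test_size=0.2, random_state=0)
```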
You may save the model you train to local storage in your browser. Use the app in the menu above to read in a saved model for predictions.
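The app stores trained models in your browser's local storage, which has no exact Python equivalent; as a rough Keras analogue of the save-and-reload round trip (the file name is hypothetical, and the model is one like those sketched below):

```python
model.save("smiles_dnn.keras")                          # hypothetical file name
model = tf.keras.models.load_model("smiles_dnn.keras")  # reload later for predictions
```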
The featurization and model training processes may take a few seconds to a minute or two depending on your computer, so please be patient.
Click here to view or hide hyper-parameter descriptions.
The number of layers: each layer is a set of "neurons." A layer receives data, transforms that data
(see below), and then passes it on to the next layer. Each DNN has an input layer which reads in the data,
an output layer which produces the final answer, and n hidden layers, which
do the "learning." If there is 1 hidden layer, the model is called a universal approximator, but if it has
more than 1 hidden layer, it is a deep learning model. The more layers, the more the model can learn.
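A minimal sketch of such a stack in Keras; the layer count and sizes below are assumptions, not the app's defaults:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(X.shape[1],)),           # input layer reads the features
    tf.keras.layers.Dense(32, activation="relu"),  # hidden layer 1
    tf.keras.layers.Dense(32, activation="relu"),  # hidden layer 2
    tf.keras.layers.Dense(1),                      # output layer: one predicted value
])
```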
The number of units: each layer h has a certain number of units, which are akin to neurons. Each unit takes a piece
of input data, x_i, and first applies a linear transformation to it: y_h,i = M_h x_i + b_h.
This y_h,i is then passed through an activation function (see below) before being passed to the next layer.
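In NumPy terms, one layer's computation looks like this (the shapes, 5 inputs feeding 3 units, are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)         # input vector x_i
M = rng.normal(size=(3, 5))    # weight matrix M_h (3 units x 5 inputs)
b = rng.normal(size=3)         # bias vector b_h

y = M @ x + b                  # linear transformation: y_h = M_h x + b_h
out = np.tanh(y)               # activation applied before the next layer
```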
The width: the model can have a single stack of layers as described above, or it can have more than one stack of layers which come together
at the output layer. This gives the model more flexibility in learning.
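Sketched with the Keras functional API (branch sizes and activations assumed), two stacks that come together at the output might look like:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(X.shape[1],))
a = tf.keras.layers.Dense(32, activation="relu")(inputs)   # stack 1, layer 1
a = tf.keras.layers.Dense(32, activation="relu")(a)        # stack 1, layer 2
b = tf.keras.layers.Dense(32, activation="tanh")(inputs)   # stack 2, layer 1
merged = tf.keras.layers.Concatenate()([a, b])             # stacks come together
outputs = tf.keras.layers.Dense(1)(merged)                 # shared output layer
wide_model = tf.keras.Model(inputs, outputs)
```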
Skip connections: these allow a copy of the input data to "skip" the layers and go directly to the output layer. This provides another
path for learning and preserves any information in the input data.
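A sketch of a skip connection in the same style, where the raw input is concatenated back in just before the output layer:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(X.shape[1],))
hidden = tf.keras.layers.Dense(32, activation="relu")(inputs)
hidden = tf.keras.layers.Dense(32, activation="relu")(hidden)
joined = tf.keras.layers.Concatenate()([hidden, inputs])   # input skips straight here
outputs = tf.keras.layers.Dense(1)(joined)
skip_model = tf.keras.Model(inputs, outputs)
```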
The activation: the transformed data y_h,i is passed through a non-linear activation function so that we don't just
end up with linear regression. The non-linear functions include the hyperbolic tangent, the sigmoid function, and a special transformation called the
rectified linear unit (ReLU). Each activation is appropriate for different situations.
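The three activations mentioned above, evaluated on a small grid of values:

```python
import numpy as np

y = np.linspace(-2.0, 2.0, 5)           # sample transformed values
tanh = np.tanh(y)                       # hyperbolic tangent: range (-1, 1)
sigmoid = 1.0 / (1.0 + np.exp(-y))      # sigmoid: range (0, 1)
relu = np.maximum(0.0, y)               # rectified linear unit: 0 for y < 0
```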
The optimizer: the sets of M_h and b_h values for each unit of each layer are called the weights and biases
of the model, or often just the weights. In order to learn accurately, the best possible set of weights is needed.
These optimal weights are found iteratively: the error of each successive approximation of the training data is differentiated
with respect to the weights, and the weights are adjusted in the direction that makes the answer more accurate.
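In Keras the optimizer, and the loss it minimizes, are chosen at compile time; Adam and the learning rate below are assumptions, not the app's fixed choices:

```python
# Continues the model sketched above.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="mean_squared_error")   # error measured on the training data
```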
The epochs: the optimization is a method of successive approximations, which we call training; each epoch is one full pass through the training data, so the more
epochs you run, the closer you get to the answer. Some problems need more or less training than others.
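Training for a chosen number of epochs then looks like this (the epoch count is an assumption):

```python
# Continues the compiled model and the train/validation split from above.
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=200,            # assumed; each epoch is one pass
                    verbose=0)
```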
The training process is compute-intensive, so it will take a while (usually a minute or so). The epoch update will show your progress.
This plot shows the target values in red, the machine-learning prediction for the training set in green, and the machine-learning prediction for the validation set in blue.
Molecules from the dataset. Click here to show the next one.
This is the work of Dr. Mauricio Cafiero and may be used widely, though attribution is appreciated.