This app accepts a file with SMILES strings and target values, featurizes the molecules, and uses a
Random Forest to fit the features to the target values.
Input should be a .csv file containing a column of SMILES strings and a column of potential target values. The file can have any number of
columns as long as those two are included.
The app will read the CSV and allow you to choose a column to be the target for the model. You can also choose to apply a natural log
transformation to the target values if they span a large range (for example, IC50 values from 0.1 nM to 1000 nM).
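The log transformation compresses a wide dynamic range so that the largest values do not dominate the fit. A minimal sketch in Python with made-up IC50 values (the app itself runs in the browser, so this is illustrative only):

```python
import numpy as np

# Hypothetical IC50 values (nM) spanning four orders of magnitude
ic50 = np.array([0.1, 1.0, 10.0, 100.0, 1000.0])

# Natural log compresses the spread from 0.1..1000 to roughly -2.3..6.9
y = np.log(ic50)

print(y.min(), y.max())
```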
You may view the molecules in your dataset on the right of the screen after you load the file.
Molecules are featurized from their SMILES strings using up to 43 RDKit descriptors, including
all of the usual ADME/Lipinski values. Features are scaled before training.
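In Python, RDKit's Descriptors module would compute these values from each SMILES string; the scaling step can be sketched with scikit-learn's StandardScaler (zero-mean, unit-variance scaling is an assumption here, since the app does not name its scaling method, and the descriptor values below are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy descriptor matrix: rows = molecules, columns = descriptors
# (e.g. MolWt, LogP, TPSA); real values would come from RDKit
X = np.array([
    [180.2, 1.2, 63.6],
    [250.3, 2.8, 40.5],
    [320.4, 4.1, 90.9],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each descriptor column has mean ~0 and std ~1,
# so no single descriptor dominates by virtue of its units
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```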
The dataset is divided into training (80%) and validation sets (20%).
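An 80/20 split like this is one line with scikit-learn's train_test_split. A sketch with random placeholder data standing in for the featurized molecules:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 43)   # 100 molecules, 43 descriptors
y = np.random.rand(100)       # target values

# 80% training, 20% validation; a fixed random_state makes the split repeatable
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_val))  # 80 20
```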
You may save the model you train to local storage in your browser. Use the app in the menu above to read in a saved model for predictions.
The featurization and model training processes may take a few seconds to a minute or two depending on your computer, so please be patient.
Click here to view or hide hyper-parameter descriptions.
A Random Forest is built from Decision Trees. A Decision Tree is a model that examines all the input features, determines which one
correlates most strongly with the target, and splits the data on that feature, creating the first level of branches. It then chooses the
next feature for each branch and splits again, repeating until the series of branches reaches each target value.
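The splitting process above can be seen with a single scikit-learn DecisionTreeRegressor (illustrative only; the app trains its own trees in the browser). Here the target depends almost entirely on one feature, and the tree's feature importances show that it discovers and splits on that feature:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((50, 3))                    # 50 samples, 3 features
y = 2.0 * X[:, 1] + 0.1 * rng.random(50)  # target driven mostly by feature 1

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Feature 1 should carry almost all of the importance,
# because the tree's splits concentrate on it
print(tree.feature_importances_)
```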
A Random Forest is made up of many decision trees, each independent of the others. Each tree draws a random sample from the full dataset, with or without
replacement ("with replacement" means the same datapoint can appear in a tree's sample more than once). Each tree then gets a random subset of all the features
of the dataset, and makes its splits/branches with only those features. The result is a collection of trees which, taken together, is resistant to overfitting.
Seed: a random number that controls the model's random choices. Use the same seed each time for replicable results.
Maximum Features: a fraction between 0 and 1.0 indicating how many of the features each separate tree in the forest uses.
Replacement? True/False: whether to sample datapoints for each tree with replacement.
n-Estimators: how many trees in the forest.
Maximum depth: how many layers of branches (tests applied) can be made.
Minimum split: the minimum number of datapoints a node must contain to justify a new split.
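If these hyper-parameters were fed to scikit-learn's RandomForestRegressor, they would map roughly as shown below. The correspondence is an assumption (the app's internal model is not scikit-learn), and the data is a random placeholder; note that in scikit-learn, max_features is applied per split rather than per tree:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 43)
y = np.random.rand(100)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(
    random_state=42,      # Seed
    max_features=0.5,     # Maximum Features (fraction of features per split)
    bootstrap=True,       # Replacement?
    n_estimators=100,     # n-Estimators (number of trees)
    max_depth=10,         # Maximum depth
    min_samples_split=2,  # Minimum split
)
model.fit(X_train, y_train)

# One prediction per validation molecule
preds = model.predict(X_val)
print(preds.shape)  # (20,)
```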
Molecules from dataset. Click here to show the next one.
This plot shows the target values in
red,
the machine-learning prediction for the training set in green,
and the machine-learning prediction for the validation set in blue.
This is the work of Dr. Mauricio Cafiero and may be used widely, though attribution is appreciated.