
Use Random Forest to fit a molecular chemistry dataset.


This app accepts a file with SMILES strings and target values, featurizes the molecules, and uses a Random Forest to fit the features to the target values.
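As a rough illustration of the featurization step, the sketch below reads SMILES strings with target values and turns each molecule into a small numeric feature vector. It assumes RDKit is installed, a CSV layout with the hypothetical column names `smiles` and `target`, and a handful of RDKit descriptors as features. The featurizer this app actually uses is not stated here, so all of these are stand-ins, and the target values are made-up placeholders.

```python
import csv
import io

from rdkit import Chem
from rdkit.Chem import Descriptors

# Hypothetical input: a CSV with "smiles" and "target" columns (placeholder values).
CSV_TEXT = """smiles,target
CCO,0.5
c1ccccc1,1.2
CC(=O)O,0.1
"""


def featurize(smiles):
    """Return a list of RDKit descriptors, or None if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return [
        Descriptors.MolWt(mol),          # molecular weight
        Descriptors.MolLogP(mol),        # estimated logP
        Descriptors.TPSA(mol),           # topological polar surface area
        Descriptors.NumHDonors(mol),     # hydrogen-bond donors
        Descriptors.NumHAcceptors(mol),  # hydrogen-bond acceptors
    ]


def load_dataset(text):
    """Parse the CSV text into parallel lists of feature vectors and targets."""
    features, targets = [], []
    for row in csv.DictReader(io.StringIO(text)):
        vec = featurize(row["smiles"])
        if vec is None:
            continue  # skip molecules RDKit cannot parse
        features.append(vec)
        targets.append(float(row["target"]))
    return features, targets


features, targets = load_dataset(CSV_TEXT)
print(len(features), "molecules,", len(features[0]), "features each")
```

The resulting `features` and `targets` lists are in the shape a Random Forest regressor expects: one row of numbers per molecule and one target value per row.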
Hyper-parameter descriptions:
  • A Random Forest is based on Decision Trees. A Decision Tree is a model that examines all the input features, determines which one best separates the data with respect to the target, and splits the data according to that feature, creating the first level of branches. It then repeats the process on each resulting subset, choosing the next feature to split on, and so on, until every branch ends in a leaf that gives a predicted target value.
  • A Random Forest is made of many decision trees, each independent of the others. Each tree is trained on datapoints chosen at random from the full dataset, with or without replacement. With replacement means that the same datapoint can be chosen more than once for a given tree. Each tree then gets a random subset of all the features of the dataset and makes its splits/branches with only those features. The result is a collection of trees whose combined prediction is more resistant to overfitting than that of any single tree.
  • Seed: an integer that initializes the model's random number generator. Choose the same one each time for reproducible results.
  • Maximum Features: a fraction between 0 and 1.0 that sets what share of the features is used in each separate tree in the forest.
  • Replacement?: true/false; whether the datapoints for each tree are drawn with replacement.
  • n-Estimators: the number of trees in the forest.
  • Maximum depth: the maximum number of layers of branches (successive tests applied) in each tree.
  • Minimum split: the minimum number of datapoints a branch must contain to justify a new split.
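The hyper-parameters above can be sketched with scikit-learn's `RandomForestRegressor`. That this app is built on scikit-learn is an assumption, as is the mapping of each option to a keyword argument; the data below is synthetic and merely stands in for featurized molecules.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for featurized molecules: 200 datapoints, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hold out 20% of the data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumed mapping from the options listed above to scikit-learn arguments.
model = RandomForestRegressor(
    random_state=42,      # Seed
    max_features=0.5,     # Maximum Features (fraction of the features)
    bootstrap=True,       # Replacement?
    n_estimators=100,     # n-Estimators
    max_depth=8,          # Maximum depth
    min_samples_split=4,  # Minimum split
)
model.fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # R^2 on the training set
val_r2 = model.score(X_val, y_val)        # R^2 on the validation set
print(f"train R^2 = {train_r2:.3f}, validation R^2 = {val_r2:.3f}")
```

Two details of scikit-learn differ slightly from the descriptions above: `max_features` selects a random subset of features at every split rather than once per tree, and `bootstrap=False` trains every tree on the whole dataset rather than sampling without replacement.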




Molecules from dataset.
This plot shows the target values in red, the machine-learning prediction for the training set in green, and the machine-learning prediction for the validation set in blue.
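A plot with this colour scheme could be produced along the following lines with matplotlib. This is only a sketch: the function name `plot_fit`, the choice of a scatter plot, and the use of the datapoint index as the x-axis are all assumptions, since the layout of the app's plot is not described here.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt


def plot_fit(y_train, pred_train, y_val, pred_val):
    """Plot targets (red), training predictions (green), validation predictions (blue)."""
    n_train = len(y_train)
    idx_train = list(range(n_train))
    idx_val = list(range(n_train, n_train + len(y_val)))

    fig, ax = plt.subplots()
    # Targets for both sets, in red.
    ax.scatter(idx_train + idx_val, list(y_train) + list(y_val),
               color="red", label="target")
    # Model predictions, coloured by which set they belong to.
    ax.scatter(idx_train, list(pred_train),
               color="green", label="prediction (training)")
    ax.scatter(idx_val, list(pred_val),
               color="blue", label="prediction (validation)")
    ax.set_xlabel("datapoint index")
    ax.set_ylabel("target value")
    ax.legend()
    return fig


# Placeholder numbers, only to show the call.
fig = plot_fit([1.0, 2.0, 3.0], [1.1, 1.9, 3.2], [4.0, 5.0], [3.5, 5.5])
fig.savefig("fit.png")
```

Green points that track the red ones much more closely than the blue points do would suggest the model fits the training set better than it generalizes to the validation set.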





This is the work of Dr. Mauricio Cafiero and may be used widely, though attribution is appreciated.