Overview

In biochemical research and drug discovery, we often want to know to what extent a molecule will dissolve in a particular solution. Given the chemical structure of a molecule, can we train a machine learning model to predict how soluble that molecule is in aqueous solution? Here I explore this task of predicting molecular solubility, following an excellent tutorial from the DeepChem project, which aims to open-source deep learning for science. Below are the names and chemical structures of a few compounds from the training data.

Data Setup: Featurize Chemical Structure

The original dataset [1] of 1128 chemical compounds maps each compound's chemical structure to a list of molecular properties, including the target metric: ESOL predicted log solubility in mols/liter. Plotting a histogram of the data below shows a fairly normal distribution of solubility, which our model will try to capture.
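
A minimal sketch of loading and inspecting this data, assuming DeepChem's MolNet loader and matplotlib for plotting (the original text names neither), looks like the following. Note that the loader normalizes the labels by default, so the plotted values are standardized:

```python
# Load the Delaney (ESOL) dataset and plot the distribution of
# log solubility labels. DeepChem applies a normalization transform
# to y by default, so these values are standardized.
import deepchem as dc
import matplotlib.pyplot as plt

tasks, datasets, transformers = dc.molnet.load_delaney(featurizer="ECFP")
train, valid, test = datasets

plt.hist(train.y.flatten(), bins=30)
plt.xlabel("normalized log solubility")
plt.ylabel("count")
plt.title("ESOL solubility distribution (train)")
plt.show()
```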

There are several preprocessing steps to extract feature vectors from these chemical structures, which are represented as strings in "SMILES" ("simplified molecular-input line-entry system") format, e.g. "c2ccc1scnc1c2".
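
As a rough sketch of this featurization, assuming DeepChem's built-in featurizers (the tutorial's exact pipeline is not reproduced in this text):

```python
# Turn SMILES strings into feature vectors. Recent DeepChem versions
# accept SMILES strings directly in featurize().
import deepchem as dc

smiles = ["c2ccc1scnc1c2", "CCO", "c1ccccc1"]

# Extended-connectivity (circular) fingerprints: fixed-length bit vectors.
ecfp = dc.feat.CircularFingerprint(size=1024)
features = ecfp.featurize(smiles)
print(features.shape)  # (3, 1024)

# Graph featurization, for graph-convolutional models.
graphs = dc.feat.ConvMolFeaturizer().featurize(smiles)
```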

The data split puts more common scaffolds into "train" and rarer scaffolds into "validation"; see if you can spot a visual distinction between the two in the panels below.
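
A hedged sketch of such a scaffold-based split, assuming DeepChem's ScaffoldSplitter; the SMILES strings and placeholder labels here are purely illustrative:

```python
# Split a toy dataset by scaffold: compounds sharing common scaffolds
# land in train, rarer scaffolds in valid/test.
import deepchem as dc
import numpy as np

smiles = ["c2ccc1scnc1c2", "CCO", "CC(=O)O", "c1ccccc1", "CCN", "CCCCO"]
y = np.random.rand(len(smiles), 1)  # placeholder labels
X = dc.feat.CircularFingerprint(size=1024).featurize(smiles)
dataset = dc.data.NumpyDataset(X=X, y=y, ids=smiles)

splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
    dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1
)
```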

[1] John S. Delaney. ESOL: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3):1000–1005, 2004.

Fitting with a Random Forest
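
A minimal sketch of this baseline, assuming scikit-learn's RandomForestRegressor wrapped in DeepChem's SklearnModel (the hyperparameters are illustrative, not the values behind the panels):

```python
# Fit a random forest on fingerprint features and score it with R^2.
import deepchem as dc
from sklearn.ensemble import RandomForestRegressor

tasks, (train, valid, test), transformers = dc.molnet.load_delaney(
    featurizer="ECFP"
)

model = dc.models.SklearnModel(RandomForestRegressor(n_estimators=100))
model.fit(train)

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print("train:", model.evaluate(train, [metric], transformers))
print("valid:", model.evaluate(valid, [metric], transformers))
```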

Fitting with a simple deep net
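
A minimal sketch of such a network, assuming DeepChem's MultitaskRegressor (the layer size, dropout, and learning rate below are placeholders, not tuned values):

```python
# Fit a simple fully-connected regressor on the fingerprint features.
import deepchem as dc

tasks, (train, valid, test), transformers = dc.molnet.load_delaney(
    featurizer="ECFP"
)

model = dc.models.MultitaskRegressor(
    n_tasks=len(tasks),
    n_features=1024,     # matches the default ECFP size
    layer_sizes=[1000],  # a single hidden layer of 1000 neurons
    dropouts=0.25,
    learning_rate=0.001,
)
model.fit(train, nb_epoch=50)

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print("valid:", model.evaluate(valid, [metric], transformers))
```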

Hyperparameter exploration for simple deep net
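
One way to run such an exploration is a plain grid loop that logs each run's validation R^2. The sketch below assumes Weights & Biases for logging (consistent with the parallel coordinates plots in this report); the particular grid is illustrative:

```python
# Sweep layer configurations and learning rates, logging validation
# R^2 for each run so the results can be compared across runs.
import deepchem as dc
import wandb

tasks, (train, valid, test), transformers = dc.molnet.load_delaney(
    featurizer="ECFP"
)
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)

for layer_sizes in [[1000], [200, 100], [300, 150, 75]]:
    for lr in [1e-3, 1e-4]:
        run = wandb.init(
            project="solubility",
            config={"layers": layer_sizes, "lr": lr},
            reinit=True,
        )
        model = dc.models.MultitaskRegressor(
            n_tasks=len(tasks),
            n_features=1024,
            layer_sizes=layer_sizes,
            learning_rate=lr,
        )
        model.fit(train, nb_epoch=30)
        scores = model.evaluate(valid, [metric], transformers)
        wandb.log({"r2": scores["pearson_r2_score"]})
        run.finish()
```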

Diving deeper on layer configuration

In the section above, the parallel coordinates plot is not obviously informative about layer configuration: the proportion of better (more yellow) versus worse (more blue) runs is about equal in each node. Grouping the runs by layer configuration gives me more visibility. Below, each line is a set of runs with the same layer configuration: for example, layers: "200 100" means a first fully-connected layer of 200 neurons followed by a second of 100, while layers: "1000" means a single fully-connected layer of 1000 neurons. You can see the total number of runs in each group to the right of the group name, and the average R^2 score for that group in the "r2" column.
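
To make the mapping concrete, a hypothetical helper like this (not from the original tutorial) could translate a configuration string into layer sizes:

```python
# Parse a layer-configuration string such as "200 100" into a list of
# fully-connected layer widths, e.g. [200, 100].
import deepchem as dc

def layers_from_config(config):
    return [int(width) for width in config.split()]

model = dc.models.MultitaskRegressor(
    n_tasks=1,
    n_features=1024,
    layer_sizes=layers_from_config("200 100"),
)
```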

Free variables: the more, the better

Observations from this plot: layer configurations with more free parameters (more layers, and larger first layers) generally achieve higher R^2 scores.

Next step: run a more precise sweep on the promising layer combinations (more layers, larger first layers).
