Property prediction task

Task

The task is to predict a property of a compound from its molecular structure.

Property predictors can be used to screen large collections of molecules in silico and identify candidates with high predicted activity, which can then be validated in the lab. For COVID-19, the predictor will be applied to compounds known to be safe (e.g., FDA-approved drugs) to screen for antiviral activity against SARS-CoV-2. The top-ranked molecules will be tested in the lab.
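The screening step described above can be sketched as a simple rank-and-select loop. This is a minimal illustration, not the project's actual pipeline: `rank_candidates` and `toy_predictor` are hypothetical names, and the toy scoring rule is a stand-in for a trained model's predicted activity.

```python
from typing import Callable, List, Tuple


def rank_candidates(
    smiles_list: List[str],
    predict: Callable[[str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every molecule and return the top_k highest-scoring ones."""
    scored = [(smiles, predict(smiles)) for smiles in smiles_list]
    # Highest predicted activity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_predictor(smiles: str) -> float:
    """Hypothetical stand-in for a trained model (NOT a real activity score).

    It returns the fraction of characters that are 'N', only so that the
    example produces distinct scores.
    """
    return smiles.count("N") / max(len(smiles), 1)


# A tiny illustrative library of SMILES strings.
library = ["CCO", "CC(=O)Nc1ccc(O)cc1", "NCCN", "c1ccccc1"]
top = rank_candidates(library, toy_predictor, top_k=2)

for smiles, score in top:
    print(f"{smiles}\t{score:.3f}")
```

In practice, `predict` would wrap a trained model, and the top-ranked molecules would be passed on for lab testing.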

Data

In every dataset, a training pair consists of a molecular structure (as a SMILES string) and an activity measurement.

  1. E. coli: This dataset consists of 2,335 pairs, each with a binary activity measurement indicating whether the molecule inhibits E. coli growth. Of these molecules, 120 are inhibitors. The size, quality, and distributional properties of this set make it a good proxy for the SARS-CoV-2 screening data that will eventually be available (data).

  2. SARS-CoV-1 3CLpro: This dataset consists of 290,726 pairs obtained from an assay that measures activity against the SARS-CoV-1 3CLpro target, which is highly homologous to the corresponding protease in SARS-CoV-2. Of these molecules, 405 are active against the 3CLpro target (raw data, processed data).
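The counts above imply that both datasets are heavily imbalanced: roughly 5.1% of the E. coli molecules and about 0.14% of the SARS-CoV-1 3CLpro molecules are active. The sketch below shows that arithmetic, along with one way to read (SMILES, activity) pairs. The CSV layout and the column names `smiles` and `activity` are assumptions made for illustration; the actual data files may be organized differently.

```python
import csv
import io
from typing import List, Tuple

# Hypothetical three-row sample; the column names are assumed, not confirmed.
SAMPLE_CSV = """smiles,activity
CCO,0
NCCN,1
c1ccccc1,0
"""


def load_pairs(text: str) -> List[Tuple[str, int]]:
    """Parse CSV text into (SMILES, binary activity) training pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["smiles"], int(row["activity"])) for row in reader]


def positive_rate(n_active: int, n_total: int) -> float:
    """Fraction of molecules labeled active."""
    return n_active / n_total


pairs = load_pairs(SAMPLE_CSV)
sample_rate = positive_rate(sum(label for _, label in pairs), len(pairs))

# Counts taken from the dataset descriptions above.
ecoli_rate = positive_rate(120, 2335)
sars_rate = positive_rate(405, 290_726)

print(f"E. coli active fraction: {ecoli_rate:.2%}")
print(f"SARS-CoV-1 3CLpro active fraction: {sars_rate:.2%}")
```

With so few positives, plain accuracy is a poor guide to model quality. A ranking-oriented metric is usually more informative for this kind of data.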

Model

Chemprop is a message passing neural network (MPNN). MPNNs are designed to operate on graph-structured objects such as molecules, where each atom is represented by a node and each bond by an edge. An MPNN for molecules first creates a feature vector for each atom and bond from simple properties such as atom type (carbon, oxygen, etc.) and bond type (single, double, etc.). It then performs a series of “message passing” steps in which a neural network sends information between neighboring atoms and bonds, thereby encoding local chemical information. After a number of these steps, the local information is aggregated into a single vector representing the entire molecule, and a feed-forward neural network processes that vector to make the final property prediction. Optionally, the molecule vector produced by the MPNN can be augmented with additional chemical information by concatenating it with a chemical fingerprint or descriptor before the combined vector is passed through the feed-forward network.
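The steps in this description (featurize, pass messages, aggregate, predict) can be illustrated with a deliberately simplified sketch in plain Python. It is not Chemprop's implementation. It passes messages between atoms only, uses fixed hand-picked weights instead of learned ones, and folds bond information in as a scalar bond order. All function names, weights, and the extra descriptor value are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
ATOM_TYPES = ["C", "N", "O"]  # toy vocabulary for one-hot atom features


def one_hot_atom(symbol: str) -> Vector:
    return [1.0 if symbol == t else 0.0 for t in ATOM_TYPES]


def matvec(weights: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def message_passing(atoms: Sequence[str], bonds: Sequence[Tuple[int, int, float]],
                    w_self: Matrix, w_msg: Matrix, steps: int) -> List[Vector]:
    """Update each atom's hidden vector from its bonded neighbors."""
    hidden = [one_hot_atom(a) for a in atoms]
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:  # bonds are undirected
        neighbors[i].append((j, order))
        neighbors[j].append((i, order))
    for _ in range(steps):
        updated = []
        for v in range(len(atoms)):
            # Message: sum of neighbor states, weighted by bond order.
            msg = [0.0] * len(ATOM_TYPES)
            for u, order in neighbors[v]:
                msg = [m + order * hu for m, hu in zip(msg, hidden[u])]
            combined = [s + m for s, m in
                        zip(matvec(w_self, hidden[v]), matvec(w_msg, msg))]
            updated.append([max(0.0, c) for c in combined])  # ReLU
        hidden = updated
    return hidden


def readout(hidden: List[Vector]) -> Vector:
    """Aggregate atom vectors into one molecule vector by summation."""
    return [sum(column) for column in zip(*hidden)]


def predict(features: Vector, w_out: Vector, b_out: float) -> float:
    """Single-layer stand-in for the feed-forward network (sigmoid output)."""
    z = sum(w * x for w, x in zip(w_out, features)) + b_out
    return 1.0 / (1.0 + math.exp(-z))


# Ethanol (SMILES "CCO"): C-C-O with two single bonds.
atoms = ["C", "C", "O"]
bonds = [(0, 1, 1.0), (1, 2, 1.0)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
half = [[0.5 * v for v in row] for row in identity]

mol_vec = readout(message_passing(atoms, bonds, identity, half, steps=2))
extra = [3.0]  # hypothetical descriptor (heavy-atom count) to concatenate
score = predict(mol_vec + extra, w_out=[0.1, 0.1, 0.1, -0.2], b_out=0.0)
print(mol_vec, round(score, 3))
```

In a trained model the weight matrices and output layer are learned from the (SMILES, activity) pairs, and the atom and bond features are considerably richer than the one-hot atom type used here.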

Results

See below

References

[1] Chemprop (https://github.com/chemprop/chemprop) - GitHub repo containing code for the message passing neural network.

[2] Yang, Kevin, et al. “Analyzing Learned Molecular Representations for Property Prediction.” Journal of Chemical Information and Modeling. 59.8 (2019): 3370-3388. (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.9b00237) - Paper describing the message passing neural network applied to a range of molecular properties.

[3] Landrum, Greg. “RDKit: Open-source cheminformatics.” (2006): 2012. (https://www.rdkit.org/) - Open-source package for computational chemistry.

[4] Stokes, Jonathan, et al. “A Deep Learning Approach to Antibiotic Discovery.” Cell. 180.4 (2020): 688-702. (https://www.cell.com/cell/fulltext/S0092-8674(20)30102-1) - Paper describing the application of the message passing neural network to E. coli.