Today we are sharing some of the models we have been developing to identify drugs that may have a therapeutic effect against the novel coronavirus. Our models are message passing neural networks trained using chemprop, and they are publicly available at chemprop.csail.mit.edu/predict. Anybody can visit the website and upload molecules to make predictions using our models. Additionally, we applied our models to molecules from the Broad Repurposing Hub, which contains many FDA-approved drugs, and our models’ predictions are available here: https://github.com/yangkevin2/coronavirus_data/tree/master/predictions. Below, we describe the data and the models in more detail.
To identify potential therapeutics for the novel coronavirus, also known as SARS-CoV-2, we would ideally like to train machine learning models on molecular screening data for SARS-CoV-2. Unfortunately, since SARS-CoV-2 is still a new virus, relatively little screening data is available. An alternative is to use data for SARS-CoV, the virus that caused the 2002-2003 SARS outbreak. Since SARS-CoV and SARS-CoV-2 are very similar, we hope that molecules predicted to be active against SARS-CoV may also be active against SARS-CoV-2 and thus may help coronavirus patients.
In particular, we have focused on the SARS-CoV 3CL protease, which is an integral component of viral replication and is >96% homologous with the SARS-CoV-2 3CL protease. With the help of Dr. Robert Malone, we identified the publicly available PubChem assay AID1706, which is a “QFRET-based primary biochemical high throughput screening assay to identify inhibitors of the SARS coronavirus 3C-like Protease (3CLPro).” The AID1706 assay contains 405 active molecules and ~290,000 negative molecules. We preprocessed the data by reformatting it and removing compounds with duplicate SMILES strings, and we have posted the processed data on our public data GitHub repo under the name AID1706_binarized_sars.csv.
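The preprocessing step described above can be sketched roughly as follows. This is an illustrative sketch, not our exact script; it assumes the data has already been reduced to (SMILES, 0/1 activity) pairs, and it drops later rows whose SMILES string exactly matches an earlier one.

```python
def binarize_and_dedupe(rows):
    """Keep the first occurrence of each SMILES string with a 0/1 label.

    `rows` is an iterable of (smiles, label) pairs, where label is the
    string "0" or "1". Rows with a duplicate SMILES string are dropped.
    """
    seen = set()
    out = []
    for smiles, label in rows:
        if smiles in seen:
            continue
        seen.add(smiles)
        out.append((smiles, int(label)))
    return out

# Tiny made-up example: the duplicate "CCO" row is dropped.
raw = [("CCO", "0"), ("c1ccccc1", "1"), ("CCO", "0")]
print(binarize_and_dedupe(raw))
```

A more careful pipeline would also canonicalize the SMILES strings (e.g. with RDKit) before deduplicating, so that different string encodings of the same molecule are caught.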
Given this data, the goal is to build a binary classification model that predicts, for each molecule, whether it is likely to inhibit the SARS-CoV 3CL protease. The input to the model is a SMILES string representing a molecule, and the output is a number between 0 and 1 representing the probability, according to the model, that the molecule inhibits the 3CL protease. It’s worth noting that the values output by the model may not be well-calibrated probabilities, but they do represent the model’s ranking of the molecules and can be used to prioritize molecules with higher prediction values.
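Because the raw outputs may not be calibrated, the most robust way to use them is as a ranking. A minimal sketch, using hypothetical scores:

```python
def rank_molecules(predictions):
    """Sort (smiles, score) pairs by descending predicted probability."""
    return sorted(predictions, key=lambda pair: pair[1], reverse=True)

# Hypothetical model outputs for three molecules.
preds = [("CCO", 0.12), ("CC(=O)O", 0.87), ("c1ccccc1", 0.45)]
for smiles, score in rank_molecules(preds):
    print(f"{score:.2f}  {smiles}")
```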
Message Passing Neural Networks (MPNNs)
The models that we train are message passing neural networks (MPNNs), also known as graph neural networks. We use the chemprop MPNN, which we first described in our paper Analyzing Learned Molecular Representations for Property Prediction and which we applied to antibiotic discovery in our paper A Deep Learning Approach to Antibiotic Discovery.
In short, MPNNs like chemprop are designed to build molecular representations which can be used to predict properties such as 3CL protease inhibition. The model first identifies simple features of the atoms and bonds such as the type of atom (carbon, oxygen, etc.) and the type of bond (single bond, double bond, etc.). The model then performs a number of “message passing” steps, where it uses neural network layers to “pass messages” between adjacent atoms and bonds in the molecule. This message passing aggregates information across the molecule and eventually produces a single, global representation of the molecule. This global representation is then used by another neural network to predict the property of interest. Optionally, additional chemical features can be computed in silico and can augment the global representation learned by the MPNN to enhance the predictive power of the final neural network layers.
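As an illustration only (chemprop’s actual implementation differs, e.g. it passes learned messages along directed bonds rather than summing raw atom features), one round of neighbor-sum message passing on a toy molecular graph might look like:

```python
# Toy message passing: each atom's new feature vector is its own features
# plus the sum of its neighbors' features. A real MPNN applies learned
# neural-network layers at each step; we use an identity update for clarity.

def message_passing_step(features, adjacency):
    """features: {atom: [float, ...]}, adjacency: {atom: [neighbor atoms]}."""
    updated = {}
    for atom, feats in features.items():
        msg = [0.0] * len(feats)
        for nbr in adjacency[atom]:
            for i, v in enumerate(features[nbr]):
                msg[i] += v
        updated[atom] = [f + m for f, m in zip(feats, msg)]
    return updated

def readout(features):
    """Sum atom vectors into a single global molecule representation."""
    dim = len(next(iter(features.values())))
    mol = [0.0] * dim
    for feats in features.values():
        for i, v in enumerate(feats):
            mol[i] += v
    return mol

# A 3-atom chain (e.g. C-C-O) with made-up 2-dimensional atom features.
adj = {0: [1], 1: [0, 2], 2: [1]}
feats = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
for _ in range(2):  # two message passing rounds
    feats = message_passing_step(feats, adj)
print(readout(feats))  # global representation fed to the property predictor
```

After a few rounds, each atom’s vector reflects its neighborhood, and the readout (here a simple sum) collapses the graph into one vector that a feedforward network can map to the final prediction.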
MPNN Training Details
We trained an ensemble of 5 chemprop models on the AID1706 data, each one augmented with 200 chemical features computed with RDKit. Each of the 5 models was trained on a different 90%/10% split of the data, and together they achieve a test AUC of 0.778 +/- 0.075. Training took about 1 hour and 40 minutes per model on an NVIDIA GeForce GTX 1080 GPU. At inference time, the prediction probabilities of the 5 models are averaged to form a single ensemble prediction. Making predictions takes only a matter of seconds. The trained model is available under the name “SARS” at chemprop.csail.mit.edu/predict.
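The ensembling step at inference time is just an element-wise average of the per-model probabilities. A sketch with hypothetical scores from the 5 models for three molecules:

```python
def ensemble_predict(per_model_scores):
    """Average predictions across models.

    per_model_scores: one list of probabilities per model, all in the same
    molecule order. Returns the element-wise mean across models.
    """
    n_models = len(per_model_scores)
    return [sum(scores) / n_models for scores in zip(*per_model_scores)]

# Hypothetical outputs from 5 models for 3 molecules.
scores = [
    [0.10, 0.80, 0.40],
    [0.20, 0.70, 0.50],
    [0.15, 0.90, 0.45],
    [0.05, 0.85, 0.35],
    [0.10, 0.75, 0.30],
]
print(ensemble_predict(scores))
```

Averaging tends to smooth out the idiosyncrasies of any single model’s split, which is the usual motivation for ensembling.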
We noticed that the model trained on AID1706 produces extremely low probabilities for nearly all molecules, even those in the training set that were marked as positive. This is because only 0.1% of training molecules were positive while 99.9% were negative, which skews the model toward always predicting low probabilities. To address this, we implemented a class balancing procedure in which each training batch is required to contain an equal number of positive and negative molecules. Over the course of training, this means the model sees only about 12,000 negative molecules rather than all 290,000. However, this is still sufficient to achieve a test AUC of 0.761 +/- 0.054, on par with the previous model, while training in 6 minutes instead of 1 hour and 40 minutes. Furthermore, the prediction probabilities fall in a much more reasonable range (0.35 on average rather than 0.002). The trained model is available under the name “SARS – balanced” at chemprop.csail.mit.edu/predict.
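One simple way to implement this kind of class balancing is to draw each half-batch from the positives and negatives separately, repeatedly sampling from the much smaller positive set. This is an illustrative sketch, not chemprop’s exact sampler:

```python
import random

def balanced_batches(positives, negatives, batch_size, n_batches, seed=0):
    """Yield batches with equal numbers of positive and negative examples.

    With ~405 positives and ~290,000 negatives, an epoch defined over the
    positives touches only a small sample of the negatives, which is why
    balanced training is much faster than training on the full dataset.
    """
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(n_batches):
        batch = rng.sample(positives, half) + rng.sample(negatives, half)
        rng.shuffle(batch)
        yield batch

# Made-up (smiles, label) pairs standing in for the real data.
pos = [("pos_smiles_%d" % i, 1) for i in range(10)]
neg = [("neg_smiles_%d" % i, 0) for i in range(1000)]
for batch in balanced_batches(pos, neg, batch_size=8, n_batches=2):
    labels = [label for _, label in batch]
    print(sum(labels), "positives out of", len(batch))
```

An alternative with the same effect is to weight the loss function by class frequency; batch balancing has the practical advantage here of also shrinking the effective epoch size.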
Due to the long time scales required for evaluating the efficacy and safety of a new drug (often >10 years), our focus is instead on repurposing existing drugs that are known to be safe for use in humans. To this end, we have used both of our models to make predictions on the Broad Repurposing Hub, a set of about 6,000 drugs, many of which are FDA-approved. The predictions from both models on this dataset are available here: https://github.com/yangkevin2/coronavirus_data/tree/master/predictions.
We are continuing to experiment with other datasets and models for identifying drugs for the novel coronavirus. All of the data we have gathered is publicly available at https://github.com/yangkevin2/coronavirus_data and our chemprop MPNN code is open-source at https://github.com/chemprop/chemprop. We will post updates and upload new models as they are developed.