The task is to learn a model which predicts the probability that a given input molecule is active against the SARS-CoV-1 3CLpro target.
Because the SARS-CoV-1 3CLpro target is highly homologous to the corresponding protease in SARS-CoV-2 (the virus that causes COVID-19), we hope that a model that identifies molecules active against SARS-CoV-1 3CLpro will also be useful for identifying molecules active against SARS-CoV-2 3CLpro.
The training data is derived from this assay for SARS-CoV-1 3CLpro activity: https://pubchem.ncbi.nlm.nih.gov/bioassay/1706. It contains ~400 positives and ~300K negatives. Evaluation is conducted on a combination of two data sources, with molecules that overlap the training set removed: (1) a verified reference list of actives obtained from the creator of the original assay, whose molecules are labeled as active, and (2) the Broad repurposing library (https://clue.io/repurposing), whose molecules are labeled as inactive. The resulting evaluation set has 41 positives and ~6K negatives.
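The construction of the evaluation set can be sketched as follows. This is a minimal illustration, not the script actually used: the function name `build_eval_set` is hypothetical, molecules are assumed to be represented as already-canonicalized SMILES strings (so that string equality means molecule identity), and the choice to let a verified active override an inactive label for a molecule that appears in both sources is an assumption.

```python
def build_eval_set(reference_actives, repurposing_library, training_smiles):
    """Combine two sources into a labeled evaluation set without training overlaps.

    All inputs are iterables of SMILES strings, assumed to be canonicalized
    beforehand (assumption: identical strings mean identical molecules).
    Returns a dict mapping SMILES -> label (1 = active, 0 = inactive).
    """
    seen_in_training = set(training_smiles)
    eval_set = {}

    # Source (2): library compounds are labeled inactive.
    for smiles in repurposing_library:
        if smiles not in seen_in_training:
            eval_set[smiles] = 0

    # Source (1): verified actives are labeled active.
    # Assumption: a verified active overrides an inactive label
    # if the same molecule also appears in the library.
    for smiles in reference_actives:
        if smiles not in seen_in_training:
            eval_set[smiles] = 1

    return eval_set


if __name__ == "__main__":
    # Toy data, for illustration only.
    train = ["CCO", "CCN"]
    actives = ["c1ccccc1", "CCO"]
    library = ["CCN", "CCC", "c1ccccc1"]
    print(build_eval_set(actives, library, train))
```

In practice the canonicalization step matters: the same molecule can be written as many different SMILES strings, so comparing raw strings would miss overlaps.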
Our baseline method is the same as for the antibiotics task: the graph convolutional network Chemprop, augmented with 2D features computed by RDKit. We additionally run a version of Chemprop that randomly samples from the negatives during each training epoch to preserve class balance. Each method is run as a 5-model ensemble. The metric is ROC-AUC on the evaluation set.
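The idea behind the class-balanced variant can be sketched as below. This is an illustration of per-epoch negative sampling in general, not Chemprop's own code: the helper name `balanced_epoch`, the 1:1 positive-to-negative ratio, and sampling without replacement are all assumptions.

```python
import random


def balanced_epoch(positives, negatives, rng):
    """Build one epoch of training examples with balanced classes.

    Keeps every positive and draws an equally sized random subset of the
    negatives (assumption: 1:1 ratio, sampled without replacement).
    Returns a shuffled list of (example, label) pairs.
    """
    n_sampled = min(len(positives), len(negatives))
    sampled_negatives = rng.sample(negatives, n_sampled)
    epoch = [(x, 1) for x in positives] + [(x, 0) for x in sampled_negatives]
    rng.shuffle(epoch)
    return epoch


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins for a heavily imbalanced training set.
    positives = [f"pos_{i}" for i in range(4)]
    negatives = [f"neg_{i}" for i in range(3000)]
    for epoch_idx in range(3):
        examples = balanced_epoch(positives, negatives, rng)
        n_pos = sum(label for _, label in examples)
        print(epoch_idx, len(examples), n_pos)
```

Because a fresh subset of negatives is drawn each epoch, the model still sees many different negatives over the course of training while each individual epoch stays balanced.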
The Chemprop model achieved a test AUC of 0.961.
The Chemprop model with class balance achieved a test AUC of 0.978.
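For readers unfamiliar with the metric, the sketch below shows one way to score an ensemble with ROC-AUC. It assumes that the ensemble prediction is the mean of the individual models' predicted probabilities, which the text above does not state; the function names `ensemble_mean` and `roc_auc` are hypothetical, and the AUC is computed in its pairwise form (the fraction of positive-negative pairs in which the positive receives the higher score, with ties counted as one half).

```python
def ensemble_mean(per_model_scores):
    """Average predicted probabilities across models, molecule by molecule.

    per_model_scores: list of equal-length score lists, one per model.
    """
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels and two toy models, for illustration only.
    labels = [1, 0, 1, 0]
    scores = ensemble_mean([[0.9, 0.2, 0.4, 0.5],
                            [0.7, 0.4, 0.6, 0.3]])
    print(roc_auc(labels, scores))
```

The pairwise loop costs one comparison per positive-negative pair, which is inexpensive for an evaluation set with 41 positives and ~6K negatives.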
Yang, Kevin, et al. "Analyzing Learned Molecular Representations for Property Prediction." Journal of Chemical Information and Modeling 59.8 (2019): 3370-3388. https://pubs.acs.org/doi/abs/10.1021/acs.jcim.9b00237 - Paper describing the message passing neural network applied to a range of molecular properties.
Landrum, Greg. "RDKit: Open-source cheminformatics." (2006): 2012. https://www.rdkit.org/ - Open-source package for computational chemistry.