SARS-CoV-1 3CLpro Task


The task is to train a model that predicts the probability that a given input molecule is active against the SARS-CoV-1 3CLpro target.


Because the SARS-CoV-1 3CLpro target is highly homologous to the corresponding protease in SARS-CoV-2 (the virus that causes COVID-19), we hope that a model which identifies molecules active against SARS-CoV-1 3CLpro will also be useful for identifying molecules active against SARS-CoV-2 3CLpro.


The training data is derived from an assay for SARS-CoV-1 3CLpro activity and contains ~400 positives and ~300K negatives. Evaluation is conducted on a combination of two data sources, with any molecules that overlap the training set removed: (1) a verified reference list of actives obtained from the creator of the original assay, which are labeled as active, and (2) the Broad repurposing library, whose compounds are labeled as inactive. The resulting evaluation set has 41 positives and ~6K negatives.
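The construction of the evaluation set described above can be sketched as follows. This is a minimal illustration, not the code used to build the released data: the function name build_eval_set is hypothetical, molecules are assumed to be canonical SMILES strings (so that string equality means the same molecule), and the rule that a verified active also found in the library keeps the active label is an assumption.

```python
def build_eval_set(reference_actives, repurposing_library, training_smiles):
    """Combine the two sources into a {smiles: label} evaluation set.

    Assumption: all SMILES are already canonicalized, so plain string
    comparison is enough to detect overlap with the training set.
    """
    train = set(training_smiles)
    labels = {}
    # Compounds from the repurposing library are labeled inactive (0).
    for smi in repurposing_library:
        if smi not in train:
            labels[smi] = 0
    # Verified reference actives are labeled active (1).
    # Assumption: an active that also appears in the library stays active.
    for smi in reference_actives:
        if smi not in train:
            labels[smi] = 1
    return labels


# Toy example with made-up molecules.
eval_set = build_eval_set(
    reference_actives=["CCO", "c1ccccc1"],
    repurposing_library=["CCN", "CCO", "CCC"],
    training_smiles=["CCC", "c1ccccc1"],
)
```

In the toy example, "CCC" and "c1ccccc1" are dropped because they appear in the training set, and "CCO" ends up labeled active.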


Training data:

Evaluation data:


Our baseline method is the same as for the antibiotics task: the graph convolutional network Chemprop [1, 2], augmented with 2D features computed by RDKit [3]. We additionally run a version of Chemprop that randomly samples from the negatives during each training epoch to preserve class balance. Each method is run as a 5-model ensemble. The metric is ROC-AUC on the evaluation set.
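The class-balanced variant can be illustrated with a short sketch of per-epoch negative sampling. This is not Chemprop's own implementation: the function name balanced_epoch is hypothetical, and the 1:1 ratio of positives to sampled negatives is an assumption about what "class balance" means here.

```python
import random


def balanced_epoch(positives, negatives, rng):
    """Return one epoch of (example, label) pairs: every positive plus an
    equally sized random subset of the negatives, shuffled together.

    Assumption: "class balance" means a 1:1 positive-to-negative ratio.
    """
    k = min(len(positives), len(negatives))
    sampled_negatives = rng.sample(negatives, k)  # sampled without replacement
    epoch = [(x, 1) for x in positives] + [(x, 0) for x in sampled_negatives]
    rng.shuffle(epoch)
    return epoch


# Toy data: 4 positives against 3000 negatives.
rng = random.Random(0)
positives = [f"pos_{i}" for i in range(4)]
negatives = [f"neg_{i}" for i in range(3000)]
epochs = [balanced_epoch(positives, negatives, rng) for _ in range(3)]
```

Because a fresh subset of negatives is drawn for every epoch, the model still sees many different negatives over the course of training while each individual epoch stays balanced.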


The Chemprop model achieved a ROC-AUC of 0.961 on the evaluation set.

The Chemprop model with class balance achieved a ROC-AUC of 0.978 on the evaluation set.
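For readers unfamiliar with how an ensemble is scored with ROC-AUC, here is a minimal sketch. It assumes that ensemble predictions are the mean of the individual models' predicted probabilities, which the text above does not state explicitly; the function names and toy numbers are illustrative and unrelated to the reported results.

```python
def ensemble_mean(per_model_scores):
    """Average predicted probabilities across models, molecule by molecule.

    Assumption: the ensemble prediction is the simple mean of its members.
    """
    n_models = len(per_model_scores)
    return [sum(col) / n_models for col in zip(*per_model_scores)]


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive has the higher score, counting ties as one half.

    This pairwise form is O(P * N) and is meant for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 2 models, 5 molecules (2 actives, 3 inactives).
labels = [1, 0, 1, 0, 0]
per_model_scores = [
    [0.9, 0.2, 0.4, 0.5, 0.1],
    [0.7, 0.4, 0.4, 0.7, 0.1],
]
scores = ensemble_mean(per_model_scores)
auc = roc_auc(labels, scores)
```

In the toy example one inactive molecule outranks one active, so five of the six active/inactive pairs are ordered correctly and the AUC is 5/6.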


[1] Chemprop - GitHub repo containing code for the message passing neural network.


[2] Yang, Kevin, et al. “Analyzing Learned Molecular Representations for Property Prediction.” Journal of Chemical Information and Modeling 59.8 (2019): 3370-3388. - Paper describing the message passing neural network applied to a range of molecular properties.


[3] Landrum, Greg. “RDKit: Open-source cheminformatics.” (2006). - Open-source package for computational chemistry.