Open Source Data

We are excited to encourage experimentation and collaboration in this space. To that end, we are working with our collaborators to open-source data related to the SARS-CoV-2 effort. All our data may be found here and are summarized below.


  • SARS-CoV data

    • AID1706_binarized_sars.csv - (N = 290726, 405 hits) in-vitro assay that detects inhibition of SARS-CoV 3CL protease via fluorescence.

    • evaluation_set_v2.csv - (N = 5671, 41 hits) An evaluation set for SARS-CoV-1 3CL protease containing 41 experimentally validated hits along with 5630 molecules from the Broad Repurposing Hub which are treated as non-hits. There is no overlap with AID1706_binarized_sars.csv.

    • PLpro.csv - (N = 233891, 697 hits) bioassay that detects activity against SARS-CoV in yeast models via PL protease inhibition. Combines PubChem data from AID 652038 and AID 485353.

  • Data extracted from literature

    • corona_literature_idex.csv - (N = 101) FDA-approved drugs that are mentioned in the general coronavirus literature. The drug-to-SMILES mapping was generated with the PubChem Identifier Exchange (idex) service.

  • Catalogues of drugs that can be screened for repurposing

  • Data for open task: Pseudomonas aeruginosa

    • Dataset details. The dataset consists of 2,335 molecules. The test set contains 238 molecules, while the remaining 2,097 molecules, paired with their activity labels, are used for training. The training set includes 48 active compounds. We provide the 10 cross-validation splits that were used to obtain benchmark results. We also provide the test molecules. Instructions for submitting your predictions will be included.
      Assay description. Pseudomonas aeruginosa PAO1 was grown overnight in 3 ml Luria-Bertani (LB) medium and diluted 1/10,000 into fresh LB. 99 µl of cells was added to each well of a 96-well flat-bottom plate using a multichannel pipette. Next, 1 µl of a 5 mM stock of each molecule from an FDA-approved drug library supplemented with a natural product library (2,560 molecules total) was added, in duplicate, using an Agilent Bravo liquid handler. The final screening concentration was 50 µM. Plates were then incubated in sealed plastic bags at 37°C without shaking for ~16 hr, and subsequently read at 600 nm using a SpectraMax M3 plate reader to quantify cell growth. Plate data were normalized based on the interquartile mean of each plate. Molecules that inhibited growth by more than 80% were labelled as active. The dataset was curated by Jonathan Stokes.

      Contact us to request the data. Once you have finished training your model, submit your predictions and we will send you back the results on the private test set.
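The labelling rule in the assay description (normalize each plate by its interquartile mean, then call a molecule active if it inhibits growth by more than 80%) can be illustrated with a minimal sketch. This is not the curators' actual pipeline: the function names, the simple truncated form of the interquartile mean, and the example readings are all assumptions made for illustration, and the handling of duplicate wells is left out because it is not specified above.

```python
from statistics import mean


def interquartile_mean(values):
    """Mean of the middle half of the values.

    Simplified version (an assumption): drop len // 4 values from each tail.
    """
    ordered = sorted(values)
    k = len(ordered) // 4  # number of values trimmed from each end
    return mean(ordered[k:len(ordered) - k])


def label_plate(od600_by_molecule, threshold=0.8):
    """Return {molecule: (inhibition, is_active)} for a single plate."""
    iqm = interquartile_mean(od600_by_molecule.values())
    labels = {}
    for molecule, od in od600_by_molecule.items():
        normalized_growth = od / iqm        # 1.0 means typical growth on this plate
        inhibition = 1.0 - normalized_growth
        labels[molecule] = (inhibition, inhibition > threshold)
    return labels


# Hypothetical OD600 readings, for illustration only (not real assay data).
plate = {
    "mol_a": 1.00, "mol_b": 0.98, "mol_c": 1.02, "mol_d": 0.10,
    "mol_e": 1.01, "mol_f": 0.99, "mol_g": 0.97, "mol_h": 1.03,
}
labels = label_plate(plate)
actives = [name for name, (_, is_active) in labels.items() if is_active]
```

Normalizing by the interquartile mean rather than the plain mean keeps a few strongly inhibited wells from shifting the plate baseline, which is presumably why it suits a screen where most molecules are inactive.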

Our current focus is to develop drugs that inhibit the 3CL protease. The SARS-CoV 3CL protease gene is >96% conserved in SARS-CoV-2, so we expect models trained on SARS-CoV data to transfer well. In addition, many pharmaceutical companies and external collaborators are concentrating their efforts on the RNA polymerase and the PL protease.

Initial experiments use AID1706 as a training set. Our evaluation dataset can be found here. Additional details can be found in the baseline procedure. Training data splits are also available.
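Before training, it may be worth checking that a downloaded file matches the counts listed above, since hits are rare (405 of 290,726 molecules in AID1706). The sketch below counts rows and hits in a binarized CSV. The column names `smiles` and `activity` are assumptions, not the confirmed file layout, and the in-memory sample stands in for the real file.

```python
import csv
import io


def summarize(csv_file, label_column="activity"):
    """Count rows and hits in a binarized CSV (label column name is assumed)."""
    n_rows = 0
    n_hits = 0
    for row in csv.DictReader(csv_file):
        n_rows += 1
        n_hits += int(float(row[label_column]))  # accepts "1" as well as "1.0"
    return n_rows, n_hits


# Tiny in-memory stand-in with made-up rows; for the real file one would use
# open("AID1706_binarized_sars.csv", newline="") instead.
sample = io.StringIO("smiles,activity\nCCO,0\nc1ccccc1,0\nCC(=O)O,1\nCCN,0\n")
n_rows, n_hits = summarize(sample)
hit_rate = n_hits / n_rows

# Hit rate implied by the counts listed for AID1706 (N = 290726, 405 hits).
listed_rate = 405 / 290726
```

With a hit rate this low, plain accuracy says little about a model, so a ranking-based metric is likely a better fit when comparing against the evaluation set.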