Open Tasks

Fighting Secondary Effects of Covid

COVID-19 presents many health challenges beyond the virus itself. One of them is finding effective antibiotics for secondary infections. Help us improve machine learning models for predicting antibacterial properties. We have released a new screening dataset and report our current results on this benchmark here.

 

ML Capabilities

Multiple data generation initiatives are currently underway, primarily to assess the effectiveness of various FDA-approved drugs or other safe compounds against SARS-CoV-2 targets. Our effort aims to ramp up and advance AI capabilities so that we are ready to extract the most value from these and subsequent screens. While we wait for data specific to SARS-CoV-2, we can use alternative data sources that are similar in style and scale to demonstrate and evaluate the effectiveness of these tools. Researchers, companies, and government agencies can then use these publicly available algorithms and tools, already verified to work on related tasks, to make the most of COVID-19 screens or any subsequent similar data.

 

Property prediction task

Task

The task is to predict a target compound’s property from its molecular structure.

Property predictors can be used to test large collections of molecules in silico to identify candidates with high activity, and these candidates can then be validated in the lab. For COVID-19, the predictor will be applied to safe compounds (e.g. FDA-approved drugs) to screen for antiviral activity against SARS-CoV-2. The top-ranked molecules will be tested in the lab.

Data

For all the datasets, a training pair is represented by a molecular structure (SMILES string) and an activity measurement.

 

  1. E. coli: This dataset consists of 2335 pairs, each with a binary activity measurement indicating E. coli inhibition. Of these, 120 molecules inhibit E. coli growth. The size, quality, and distributional properties of this set make it a good proxy for the SARS-CoV-2 screening data that will eventually be available (data).
     

  2. SARS-CoV 3CLpro: This dataset consists of 290,726 pairs obtained via an assay that measures activity against the SARS-CoV 3CLpro target, which is highly homologous to the corresponding protease in SARS-CoV-2. There are 405 molecules in this dataset which are active against the 3CLpro target (raw data, processed data).

Specific training, validation, and test splits for the above datasets are here.
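
Both datasets are heavily imbalanced, which matters when choosing evaluation metrics, since a model that predicts "inactive" for everything would still be right most of the time. The short calculation below uses only the counts stated above; the dictionary layout and function name are illustrative, not part of any released code.

```python
# Dataset sizes and active counts as stated in the dataset descriptions above.
datasets = {
    "E. coli": {"total": 2335, "active": 120},
    "SARS-CoV 3CLpro": {"total": 290_726, "active": 405},
}


def active_fraction(total: int, active: int) -> float:
    """Fraction of molecules labelled active."""
    return active / total


fractions = {name: active_fraction(**d) for name, d in datasets.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.2%} active")
```

Roughly 5% of the E. coli molecules and about 0.14% of the 3CLpro molecules are active.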

Model

Chemprop is a type of neural network called a message passing neural network (MPNN). MPNNs are designed to operate on graph-structured objects like molecules, where each atom is represented by a node and each bond is represented by an edge. An MPNN for molecules works by first creating feature vectors for each atom and bond based on simple properties like atom type (carbon, oxygen, etc.) and bond type (single, double, etc.). Then it performs a series of “message passing” steps where a neural network sends information between neighboring atoms and bonds, thereby encoding local chemical information. After a number of these steps, the local chemical information is aggregated to form a single vector representing the entire molecule, which is then processed by a feed-forward neural network that makes the final property prediction. Optionally, the molecule vector created by the MPNN can be augmented with additional chemical information by concatenating it with a chemical fingerprint or descriptor before feeding the combined vector through the feed-forward neural network.
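
The steps described above can be sketched in a few lines of NumPy. This is a deliberately simplified illustration and not Chemprop's implementation: it passes messages between atoms only, omits bond features, uses random untrained weights, and replaces the feed-forward network with a single linear layer followed by a sigmoid. All names (`message_passing_readout`, `w_in`, `w_msg`, `w_out`) are hypothetical.

```python
import numpy as np


def message_passing_readout(atom_feats, adjacency, w_in, w_msg, steps=3):
    """Return one vector for the molecule from atom features and bond adjacency."""
    relu = lambda x: np.maximum(x, 0.0)
    h0 = relu(atom_feats @ w_in)           # initial hidden state per atom
    h = h0
    for _ in range(steps):
        messages = adjacency @ h           # sum hidden states of bonded neighbors
        h = relu(h0 + messages @ w_msg)    # update, with a skip connection to h0
    return h.sum(axis=0)                   # aggregate atoms into a molecule vector


# Toy molecule: heavy atoms of ethanol (C-C-O), one-hot atom type [C, O].
atom_feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
hidden = 8
w_in = rng.normal(size=(2, hidden))
w_msg = rng.normal(size=(hidden, hidden)) * 0.1
w_out = rng.normal(size=(hidden,))

mol_vec = message_passing_readout(atom_feats, adjacency, w_in, w_msg)

# Stand-in for the feed-forward network: one linear layer plus a sigmoid.
logit = mol_vec @ w_out
prob_active = 1.0 / (1.0 + np.exp(-logit))
```

Because the readout sums over atoms, the molecule vector does not depend on the order in which atoms are listed, which is one reason graph-based models suit molecules.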

Results

See below

References

[1] Chemprop (https://github.com/chemprop/chemprop) - GitHub repo containing code for the message passing neural network.

 

[2] Yang, Kevin, et al. “Analyzing Learned Molecular Representations for Property Prediction.” Journal of Chemical Information and Modeling. 59.8 (2019): 3370-3388. (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.9b00237) - Paper describing the message passing neural network applied to a range of molecular properties.

[3] Stokes, Jonathan, et al. “A Deep Learning Approach to Antibiotic Discovery.” Cell. 180.4 (2020): 688-702. (https://www.cell.com/cell/fulltext/S0092-8674(20)30102-1) - Paper describing the application of the message passing neural network to E. coli.

[4] Landrum, Greg. "RDKit: Open-source cheminformatics." (2006): 2012. (https://www.rdkit.org/) - Open-source package for computational chemistry.

 

Molecular cocktails

Task

The task is to predict synergistic effects (properties) of drug combinations (cocktails).

Antiviral drugs are typically administered as cocktails, so it is important to model the synergistic effects of drug combinations. Given the combinatorial nature of cocktails, it is not practical to screen all possible combinations empirically, which increases the importance of in-silico modeling.

Data

For all the datasets, a training instance is represented by a drug combination (A, B) and its activity measurement. Each drug is represented by its structure (SMILES string).
 

  1. NCI cancer drug combination dataset: measurements of how well combinations of FDA-approved cancer drugs perform in killing cancer cells.
     

  2. DrugComboDB: database of drug combinations extracted from various sources.

Model

A graph convolutional network with inter-molecule attention is trained on the NCI dataset to predict synergistic effects.
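
One generic way to let two molecules "attend" to each other is sketched below. It assumes each drug has already been encoded into per-atom embeddings (for example by a graph convolutional network) and uses scaled dot-product attention, which is an assumption made for illustration and may differ from the attention used in the referenced model. The function names and embedding sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(h_query, h_other):
    """Each atom in h_query attends over all atoms of the other molecule."""
    scores = h_query @ h_other.T / np.sqrt(h_query.shape[1])
    weights = softmax(scores, axis=1)          # each row sums to 1 over the other drug's atoms
    return weights @ h_other, weights


def pair_representation(h_a, h_b):
    """Order-independent vector for the combination (A, B)."""
    ctx_a, _ = cross_attend(h_a, h_b)
    ctx_b, _ = cross_attend(h_b, h_a)
    pooled_a = (h_a + ctx_a).sum(axis=0)
    pooled_b = (h_b + ctx_b).sum(axis=0)
    return pooled_a + pooled_b                 # symmetric in A and B


rng = np.random.default_rng(0)
h_a = rng.normal(size=(5, 16))   # hypothetical embeddings: 5 atoms in drug A
h_b = rng.normal(size=(7, 16))   # hypothetical embeddings: 7 atoms in drug B

pair_vec = pair_representation(h_a, h_b)
_, attn_ab = cross_attend(h_a, h_b)
```

Summing the two pooled vectors makes the representation of (A, B) identical to that of (B, A), which is a reasonable property when the order of drugs in a cocktail carries no meaning. The pair vector would then be fed to a predictor of the synergy score.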

References

[1] Holbeck, Susan L., et al. "The National Cancer Institute ALMANAC: a comprehensive screening resource for the detection of anticancer drug pairs with enhanced therapeutic activity." Cancer Research 77.13 (2017): 3564-3576. (https://cancerres.aacrjournals.org/content/77/13/3564.long) - Paper describing the NCI cancer drug combination dataset.


[2] Jin, Wengong, et al. "Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network." Advances in Neural Information Processing Systems. 2017. (https://papers.nips.cc/paper/6854-predicting-organic-reaction-outcomes-with-weisfeiler-lehman-network.pdf) - Graph convolution with inter-molecule attention.

[3] Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network: https://github.com/wengong-jin/nips17-rexgen

 

Uncertainty estimation

Task

The goal is to augment regression models to estimate their own uncertainty or expected error, enabling us to identify individual predictions of which we are highly confident.

By focusing on the most confident predictions of ML models, we can cut the time and expense of empirical screening.

Data

For all the datasets, a training pair is represented by a molecular structure (SMILES string) and an activity measurement. In addition to the property prediction, the model has to output an uncertainty estimate.

 

Uncertainty estimation methods are tested on a public dataset that records the lipophilicity of 4201 molecules. It can be accessed in compressed form in the Confidence branch of Chemprop’s GitHub repository.

Model

We distinguish between four primary strategies for uncertainty estimation.

  1. Ensemble-based methods (https://arxiv.org/abs/1612.01474)
     

  2. Distance-based methods (https://www.ncbi.nlm.nih.gov/pubmed/30404432, https://pubs.rsc.org/en/content/articlelanding/2019/sc/c9sc02298h#!divAbstract)
     

  3. Mean variance estimation (https://ieeexplore.ieee.org/document/374138)
     

  4. Union-based methods (https://www.ijcai.org/Proceedings/15/Papers/503.pdf)
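
As an illustration of the first strategy in the list above, the sketch below trains several models and uses the spread of their predictions as the uncertainty estimate. To stay small and self-contained it uses bootstrapped linear regressors on synthetic data in place of neural networks; that substitution, the function names, and the data are all assumptions made for the example.

```python
import numpy as np


def fit_ensemble(x, y, n_models=10, seed=0):
    """Fit n_models linear regressors, each on a bootstrap resample of the data."""
    rng = np.random.default_rng(seed)
    x1 = np.hstack([x, np.ones((len(x), 1))])        # add a bias column
    members = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), size=len(x))   # sample rows with replacement
        coef, *_ = np.linalg.lstsq(x1[idx], y[idx], rcond=None)
        members.append(coef)
    return np.stack(members)                          # shape: (n_models, n_features + 1)


def predict_with_uncertainty(members, x):
    """Return the ensemble mean and the standard deviation across members."""
    x1 = np.hstack([x, np.ones((len(x), 1))])
    preds = x1 @ members.T                            # shape: (n_samples, n_models)
    return preds.mean(axis=1), preds.std(axis=1)


# Synthetic regression data (not a real molecular dataset).
rng = np.random.default_rng(1)
x_train = rng.normal(size=(200, 3))
y_train = x_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)
members = fit_ensemble(x_train, y_train)

x_near = np.zeros((1, 3))         # inside the training distribution
x_far = np.full((1, 3), 10.0)     # far outside it
mean_near, std_near = predict_with_uncertainty(members, x_near)
mean_far, std_far = predict_with_uncertainty(members, x_far)
```

Ensemble members agree closely near the training data and diverge far from it, so the standard deviation is larger for `x_far` than for `x_near`.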

 

We test methods for their ability to rank predictions by error, mimic the distribution of prediction errors, and mimic the magnitude of prediction errors.

Results

For each uncertainty estimation method, RMSE on the lipophilicity dataset is recorded for the 100%, 50%, 25%, 10%, and 5% of test-set predictions with the lowest reported uncertainty. A strong uncertainty estimator should show a steep decline in RMSE as the subset becomes more selective.
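
The evaluation described above can be sketched as follows. The function name and the synthetic data are illustrative assumptions; the data is constructed so that the reported uncertainty equals the true noise level of each prediction, which is the ideal case.

```python
import numpy as np


def rmse_at_cutoffs(y_true, y_pred, uncertainty,
                    fractions=(1.0, 0.5, 0.25, 0.10, 0.05)):
    """RMSE over the most confident fraction of predictions, for each fraction."""
    order = np.argsort(uncertainty)                  # lowest uncertainty first
    sq_err = (y_true - y_pred)[order] ** 2
    results = {}
    for frac in fractions:
        k = max(1, int(round(frac * len(sq_err))))   # number of predictions kept
        results[frac] = float(np.sqrt(sq_err[:k].mean()))
    return results


# Synthetic example: each prediction has its own noise level, and the
# reported uncertainty is exactly that noise level (an ideal estimator).
rng = np.random.default_rng(0)
n = 2000
sigma = rng.uniform(0.1, 2.0, size=n)
y_true = rng.normal(size=n)
y_pred = y_true + rng.normal(size=n) * sigma

results = rmse_at_cutoffs(y_true, y_pred, uncertainty=sigma)
for frac, value in results.items():
    print(f"top {frac:.0%} most confident: RMSE = {value:.3f}")
```

With an informative uncertainty estimate, RMSE falls as the cutoff tightens. With an uninformative one (for example, random values), RMSE would stay roughly flat across cutoffs.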

References

[1] Chemprop (https://github.com/chemprop/chemprop) - GitHub repo containing code for the message passing neural network.

 

[2] Yang, Kevin, et al. “Analyzing Learned Molecular Representations for Property Prediction.” Journal of Chemical Information and Modeling. 59.8 (2019): 3370-3388. (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.9b00237) - Paper describing the message passing neural network applied to a range of molecular properties.

[3] Scalia, Grambow, et al. “Evaluating Scalable Uncertainty Estimation Methods for DNN-Based Molecular Property Prediction.” (2019). (https://arxiv.org/abs/1910.03127) - Paper which outlines standard practices for evaluating uncertainty estimation methods.

[4] Tran, Neiswanger, et al. “Methods for comparing uncertainty quantifications for material property predictions.” (2019). (https://arxiv.org/abs/1912.10066) - Paper which evaluates the performance of many similar uncertainty estimation methods when applied to material property prediction.
