Machine Learning Formulation

Machine learning methods can learn to relate compounds to their properties so long as the molecule-property relationship is illustrated via examples. For instance, data from molecular screens give us a set of compounds that are active (e.g., inhibit a specific protein target) and others that are deemed inactive against this target. From this dataset we can learn a classifier that predicts, for any new compound, whether the molecule has activity against the target. The trained classifier can then be applied across libraries of known compounds in search of candidate drugs against the protein target.

To apply this strategy to find compounds that are effective against COVID-19, we currently have to resort to auxiliary data from related viruses. For example, we can use compounds screened for activity against 3CL-pro/M-pro of SARS-CoV-1 and learn an active/inactive classifier from this training set. Once trained, the classifier can scan a library of known compounds (e.g., the Broad repurposing hub) to see which compounds score high against 3CL-pro/M-pro. Since SARS-CoV-2, the virus that causes COVID-19, bears a close relationship to SARS-CoV-1, the resulting high-scoring candidates could be further examined and tested for activity against SARS-CoV-2.
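
The train-then-scan workflow above can be sketched in a few lines. This is only an illustration under stated assumptions: the four-bit "fingerprints", the compound names, and the plain logistic regression are all invented stand-ins, not the model or data used in this work. A real pipeline would use learned or chemically meaningful molecular representations.

```python
import math

def predict_proba(w, b, x):
    """Probability of 'active' under a logistic model."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def rank_library(w, b, library):
    """Score every library compound and sort best-first."""
    scored = [(name, predict_proba(w, b, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Invented toy screen: 1 = active, 0 = inactive against the target.
train_X = [[1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 0, 1],
           [0, 0, 1, 1], [1, 0, 0, 1], [0, 1, 1, 0]]
train_y = [1, 1, 0, 0, 1, 0]
w, b = train_logistic(train_X, train_y)

# Hypothetical repurposing library (names and fingerprints are made up).
library = {"cmpd_A": [1, 0, 0, 0], "cmpd_B": [0, 1, 1, 1], "cmpd_C": [1, 1, 1, 0]}
ranking = rank_library(w, b, library)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The top of `ranking` is the shortlist that would go on to experimental testing.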

Specific challenges in this approach arise because large-scale screens are often noisy and may miss active compounds, and because it is hard to accurately evaluate the quality of the resulting predictor. For evaluation, we have used a small set of compounds specifically verified for activity against 3CL-pro/M-pro. While this "test set" is small, compounds in it ought to be ranked high if the predictor is reasonable.
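
One way to make "ought to be ranked high" concrete is to look at where the verified actives land in the ranked library. The sketch below assumes the verified compounds are themselves part of the ranked list. The compound names are invented, and the metrics shown (recall at k, enrichment factor) are common choices rather than the ones necessarily used here.

```python
def ranks_of_actives(ranked_names, verified_actives):
    """1-based positions of the verified actives in a best-first ranking."""
    return [i + 1 for i, name in enumerate(ranked_names) if name in verified_actives]

def recall_at_k(ranked_names, verified_actives, k):
    """Fraction of verified actives that appear in the top k."""
    hits = sum(1 for name in ranked_names[:k] if name in verified_actives)
    return hits / len(verified_actives)

def enrichment_factor(ranked_names, verified_actives, k):
    """Hit rate in the top k divided by the hit rate over the whole list."""
    hits = sum(1 for name in ranked_names[:k] if name in verified_actives)
    top_rate = hits / k
    base_rate = len(verified_actives) / len(ranked_names)
    return top_rate / base_rate

# Invented example: ten ranked compounds, three of them verified active.
ranked = ["c3", "c7", "c1", "c9", "c4", "c2", "c8", "c5", "c6", "c10"]
verified = {"c7", "c1", "c5"}

ranks = ranks_of_actives(ranked, verified)
recall3 = recall_at_k(ranked, verified, 3)
ef3 = enrichment_factor(ranked, verified, 3)
print(ranks, round(recall3, 3), round(ef3, 3))
```

With so few verified actives, these numbers are noisy. They are better read as a sanity check than as a precise estimate of predictor quality.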

Orthogonal verification of predictions

A step further is to consider not only a predictor score (e.g., probability of being active) for compounds in the repurposing library but also a stronger measure of confidence in such predictions. It is notoriously challenging to obtain a well-calibrated measure of confidence from neural networks. However, we can use an orthogonal approach to similar effect. For example, we can use the "rationales" described in "Composing Molecules with Multiple Property Constraints", which are extracted from a predictor via Monte Carlo Tree Search (MCTS) by pruning each active compound until the predictor no longer deems the remaining part active. The resulting fragments (multiple fragments can be extracted from a single compound) represent reasons why the predictor thinks the compound is active. If these rationales from training compounds are also present in compounds we predict as active in the repurposing library, we have greater confidence that the predictor is making statistically valid classifications (i.e., relying on substructures present in the training compounds).
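
The check described above can be illustrated with a deliberately simplified sketch. Two assumptions are worth flagging loudly: compounds are represented as sets of named fragments rather than molecular graphs, and the pruning is a greedy one-pass deletion, which is a stand-in for MCTS and not the method from the cited paper. The predictor, fragment names, and compounds are all hypothetical.

```python
def greedy_rationale(fragments, is_active):
    """Drop fragments one at a time while the predictor still says 'active'.

    Greedy stand-in for the MCTS search: returns a single reduced fragment
    set, whereas the real procedure can yield several rationales.
    """
    kept = set(fragments)
    for frag in sorted(fragments):
        trial = kept - {frag}
        if trial and is_active(trial):
            kept = trial
    return frozenset(kept)

def supported_by_rationales(compound_fragments, rationales):
    """Return the training rationales fully contained in a compound."""
    return [r for r in rationales if r <= compound_fragments]

# Hypothetical predictor: active whenever both fragments are present.
def toy_predictor(frags):
    return {"amide", "ring"} <= frags

training_active = {"amide", "ring", "halide", "ester"}
rationale = greedy_rationale(training_active, toy_predictor)

# Hypothetical library compounds that the predictor scored as active.
predicted_actives = {
    "lib_1": {"amide", "ring", "sulfone"},
    "lib_2": {"ring", "ester"},
}
support = {
    name: supported_by_rationales(frags, [rationale])
    for name, frags in predicted_actives.items()
}
print(sorted(rationale), {n: bool(s) for n, s in support.items()})
```

Here `lib_1` contains the training rationale and `lib_2` does not, so a prediction of activity for `lib_1` would be held with more confidence.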

Predicting molecular cocktails

One can also learn to predict molecular cocktails rather than single compounds as potential therapies. Indeed, antiviral therapies often consist of multiple compounds that cooperatively perturb various biological processes essential for viral infectivity and replication. Ideally, we would have large-scale screens of infected cells treated with multiple compounds simultaneously, but we do not have such data. The machine learning question is then: how can we estimate the antiviral properties of molecular cocktails when the available screening data consist overwhelmingly of single-compound assays? Many creative formulations could be developed for this problem. For example, we can take the rationales identified as described above for active training compounds and look for a collection of high-scoring compounds in the larger library that nevertheless exercise different rationale fragments. Such collections have a higher chance of acting on slightly different targets, perhaps with functions related to distinct viral processes.
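
The rationale-diversity idea can be sketched as a greedy selection: among compounds scoring above a threshold, repeatedly pick the one that adds the most rationales not yet covered. This is just one possible formulation. The scores, rationale identifiers, threshold, and cocktail size below are all invented for illustration.

```python
def pick_cocktail(candidates, min_score=0.5, size=3):
    """Greedily choose high-scoring compounds that cover distinct rationales.

    candidates maps a compound name to (predictor score, set of rationale ids).
    """
    pool = {n: (s, set(r)) for n, (s, r) in candidates.items() if s >= min_score}
    chosen, covered = [], set()
    while pool and len(chosen) < size:
        # Prefer the most new rationales; break ties by predictor score.
        best = max(pool, key=lambda n: (len(pool[n][1] - covered), pool[n][0]))
        if not pool[best][1] - covered:
            break  # nothing left adds a new rationale
        chosen.append(best)
        covered |= pool[best][1]
        del pool[best]
    return chosen, covered

# Hypothetical scored library compounds with the rationales they contain.
candidates = {
    "A": (0.90, {"r1", "r2"}),
    "B": (0.85, {"r1"}),
    "C": (0.70, {"r3"}),
    "D": (0.30, {"r4", "r5", "r6"}),  # below threshold, ignored
    "E": (0.60, {"r2", "r3"}),
}
cocktail, covered = pick_cocktail(candidates)
print(cocktail, sorted(covered))
```

Covering different rationales is only a proxy for acting on different targets, so any cocktail proposed this way would still need experimental confirmation.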

More broadly, the problem of predicting molecular cocktails calls for a better integration of known biology with machine learning methods. We should take all known viral processes and identify candidate targets within each. We can then learn to predict compounds with activity against each target and assay for combinations that perturb essential (and ideally synthetic-lethal) viral processes. As stated, this requires screens against multiple targets, and such screens are currently the limiting factor. Additional data would be particularly helpful.

Many more formulations

A number of other formulations and approaches are possible. For example, since the classifier is likely to be applied to a known library of candidate compounds, information about the library itself can be incorporated into the classifier during training. Such transductive methods can potentially yield performance advantages. Moreover, disparate data may be available across different viruses. Thus, compounds known to inhibit close relatives of the target proteins in other viruses could be considered as candidates or included as training examples.

Finding Drug Candidates via Information Retrieval

Some targets, such as genes or proteins, can be attacked not only by antivirals but also by compounds used for other purposes. Machine learning techniques could be used to automatically scan large amounts of literature and assay results in search of drug candidates for a given target. Some of the challenges in this approach are:

  • The same target might have different names (just like synonyms in language).

  • Even if we find that a particular compound can inhibit a target, the same compound may not inhibit that target in the virus, and even if it does, it may not inhibit the virus itself. This makes it important to identify good targets.
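
The synonym problem in the first challenge above can be illustrated with a small alias table and a tolerant text match. The alias list here is a hand-written example and the documents are invented. A real system would draw aliases from a curated resource and would likely need more than regular expressions.

```python
import re

# Hand-written example alias table (an assumption, not a curated resource).
TARGET_ALIASES = {
    "3CLpro": ["3CL-pro", "M-pro", "main protease", "nsp5"],
}

def alias_pattern(aliases):
    """Build one regex that matches each alias with or without hyphen/space."""
    alternatives = []
    for alias in aliases:
        parts = [re.escape(p) for p in re.split(r"[-\s]+", alias)]
        alternatives.append(r"[-\s]?".join(parts))
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)

def documents_mentioning(target, documents):
    """Return ids of documents that mention the target under any alias."""
    pattern = alias_pattern(TARGET_ALIASES[target])
    return [doc_id for doc_id, text in documents.items() if pattern.search(text)]

# Invented snippets standing in for abstracts or assay descriptions.
documents = {
    "d1": "Compound X inhibits the SARS-CoV main protease in vitro.",
    "d2": "We screened the library against Mpro and PLpro.",
    "d3": "Activity against 3CLpro was weak.",
    "d4": "This study considers the polymerase only.",
}
hits = documents_mentioning("3CLpro", documents)
print(hits)
```

Matching a name only addresses the first challenge. Whether inhibiting the named target actually inhibits the virus remains a separate, experimental question.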

Finding Good Targets via Information Retrieval

There are many ways to find target candidates for drugs. Some of these are: a) ask a domain expert; b) find a similarity metric between viruses and use it to identify good targets; c) form the set of all genes in the COVID-19 virus and search for genes in other viruses that have more than x% overlap with them (x might be, say, 99); d) similarly, form the set of proteins in the COVID-19 virus and extend it with related proteins. The challenges here are similar to those noted in the previous section.
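
Option c) needs a concrete notion of "x% overlap". One common reading, assumed here, is percent identity over a pairwise alignment. The sketch takes already-aligned sequences (gaps written as "-") and does not perform the alignment itself. The gene names and sequences are toy values.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore columns where both sequences have a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

def candidate_targets(alignments, threshold):
    """Keep reference genes whose identity to the query exceeds threshold (%)."""
    hits = []
    for name, (aligned_query, aligned_ref) in alignments.items():
        identity = percent_identity(aligned_query, aligned_ref)
        if identity > threshold:
            hits.append((name, identity))
    return sorted(hits, key=lambda t: t[1], reverse=True)

# Toy alignments of one query gene against genes from other viruses.
alignments = {
    "virusX_geneA": ("ACGTACGTAC", "ACGTACGTAC"),
    "virusY_geneA": ("ACGTACGTAC", "ACGTTCGTAC"),
    "virusZ_geneA": ("ACGT-CGTAC", "ACGTTCGAAC"),
}
targets = candidate_targets(alignments, threshold=85)
print(targets)
```

The threshold of 85 is chosen only so that the toy data produces a mix of hits and misses; the text suggests something closer to 99.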

Data Augmentation from Other Viruses

The SARS virus is the one most closely related to the COVID-19 virus. The problem is that we have only a few positive drug candidates for SARS and a very large number of negatives, which makes it hard to train an ML model. We can augment this data with data from other viruses. However, only a few viruses have large amounts of assay data (owing to worldwide interest). For many other viruses, we might have data for only 30-40 compounds. To complicate things further, different viruses can behave quite differently from each other, so what works on one virus may have very little predictive power on another. Even so, if we can compile a dataset of all virus/drug combinations that have been reported, there is some chance that we can improve our model.
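
One simple way to pool such data, sketched below, is to give every assay record a training weight that combines a per-virus relevance factor with an inverse-frequency class weight, so that the rare positives are not swamped. The virus names, relevance factors, and records are invented, and choosing the relevance factors well is itself an open question.

```python
from collections import Counter

def weighted_training_rows(assays, virus_weight):
    """Turn (virus, compound, label) records into (compound, label, weight) rows.

    weight = virus relevance * n_samples / (n_classes * class_count)
    """
    counts = Counter(label for _, _, label in assays)
    n_samples, n_classes = len(assays), len(counts)
    rows = []
    for virus, compound, label in assays:
        class_w = n_samples / (n_classes * counts[label])
        rows.append((compound, label, virus_weight.get(virus, 0.0) * class_w))
    return rows

# Invented records: label 1 = active, 0 = inactive.
assays = [
    ("SARS", "c1", 1), ("SARS", "c2", 0), ("SARS", "c3", 0),
    ("other_virus", "c4", 1), ("other_virus", "c5", 0),
    ("other_virus", "c6", 0), ("other_virus", "c7", 0), ("other_virus", "c8", 0),
]
# Hypothetical relevance factors: trust the closer relative more.
virus_weight = {"SARS": 1.0, "other_virus": 0.5}

rows = weighted_training_rows(assays, virus_weight)
for row in rows:
    print(row)
```

The resulting weights could be passed to any learner that accepts per-sample weights. Whether the down-weighted data helps or hurts would need to be checked on held-out SARS compounds.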