Update: We are releasing our predictions on compounds from the Broad Repurposing Hub. Download predictions.
In this post, we present results from our model trained to predict antiviral activity against COVID-19. The main technical challenge in designing and training such models remains the lack of sufficient data — molecules labeled with associated antiviral activity. Despite multiple ongoing experimental efforts, the data made available in the public domain is very limited. For instance, this paper (Jeon et al., 2020) reports testing of 48 FDA-approved molecules, of which 24 are identified as potent. On its own, this dataset is clearly insufficient for training a neural model. Therefore, we have to find a way to augment it with other relevant sources that can further inform the model.

We identified two such sources. The first consists of molecular fragments that bind to the SARS-CoV-2 main protease, obtained via crystallography screening by the Diamond Consortium. The second comes from readily available screening of a close relative of the virus, SARS-CoV-1. This data is relevant since the SARS-CoV-1 and SARS-CoV-2 proteases are similar (79.5% sequence identity). While both sources are clearly pertinent, they also differ significantly from COVID-19 screens. For instance, the binding fragments from the Diamond dataset are much smaller than the molecules we are trying to classify. This prevents us from using the standard property prediction models that we and others have successfully used in the past for in-silico screening.
The key technological challenge is to train models that can extrapolate beyond their training data, e.g., to different chemical spaces. The ability to extrapolate implies a notion of invariance (being impervious) to the differences between the available training data and the space where predictions are sought. In other words, the differences between chemical spaces can be thought of as "nuisance variation" that the predictor should be explicitly forced to ignore. To this end, we introduce a novel approach that builds on and extends the recently proposed invariant risk minimization framework, adaptively forcing the predictor to avoid nuisance variation. We achieve this by continually exercising and manipulating latent representations of molecules to expose undesirable variation to the predictor. The method is specifically tailored to the rich, combinatorially defined environments typical in molecular contexts. For full details of the method, please see the paper.
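The paper contains the full method; as a rough illustration of the underlying invariance idea, the sketch below implements the original IRMv1-style penalty (Arjovsky et al.), which penalizes the squared gradient of each environment's risk with respect to a dummy classifier scale held at 1. Treating each data source (e.g., the SARS-CoV-1 screen and the Diamond fragment screen) as a separate environment is our framing here; the function names are illustrative, not from the released code, and a squared-error loss is assumed for simplicity.

```python
def mean(xs):
    return sum(xs) / len(xs)

def irm_penalty(preds, labels):
    # Squared gradient of the per-environment squared-error risk with
    # respect to a dummy classifier scale w, evaluated at w = 1:
    #   d/dw mean((w * p - y)^2) |_{w=1} = mean(2 * (p - y) * p)
    grad = mean([2.0 * (p - y) * p for p, y in zip(preds, labels)])
    return grad ** 2

def irm_objective(envs, lam=1.0):
    # envs: list of (predictions, labels) pairs, one per environment.
    # Total objective = sum of per-environment risks plus the weighted
    # invariance penalty summed over environments.
    risk = sum(mean([(p - y) ** 2 for p, y in zip(preds, labels)])
               for preds, labels in envs)
    penalty = sum(irm_penalty(preds, labels) for preds, labels in envs)
    return risk + lam * penalty
```

A predictor that is simultaneously optimal in every environment incurs zero penalty, which is what pushes the representation toward features whose relationship with the label is stable across environments rather than specific to one chemical space.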