Hi,
Could someone direct me to resources for understanding the link between ML and genomics as it relates to COVID-19? Having a background in ML only, I do not fully understand what kind of mathematical problem the genomics community tries to solve when developing a cure or a vaccine from the virus's genome. The group I am working with at MIT has developed sparse methods for regression and classification in settings where the number of samples n is much lower than the number of features p. These methods select a limited number of variables, which improves interpretability, robustness and accuracy vs. existing methods. I understand that the genomics community uses methods such as LASSO and sparse partial least squares. I would be very grateful to anyone who can help with the following questions:
- What is the objective and meaning of the mathematical problems related to this issue? (I understand that genetic similarity to other viruses suggests that similar vaccines/treatments might help.)
- What databases of existing viruses are used? Are they publicly available?
- Is there a genome database for COVID-19 that could be used?
Hello! Please let me know if these answer your questions.
1) There exist BLAST-style algorithms for determining "relatedness" between genetic sequences, e.g. those of coronaviruses: https://blast.ncbi.nlm.nih.gov/Blast.cgi
The similarities and alignments these algorithms produce can be used to construct phylogenies (genetic trees) of SARS-CoV-2 and other viruses. Among the coronaviruses known to infect humans, SARS-CoV-2 is most genetically similar to SARS-CoV and shares high similarity in some key proteins, such as the envelope. However, the two are still "different" enough, and no drug was ever approved for SARS-CoV (its outbreak ended abruptly), so we cannot directly take "results" for SARS-CoV and apply them here.
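To make "relatedness" a bit more concrete for someone coming from ML, here is a minimal sketch of an alignment-free similarity score based on shared k-mers (substrings of length k). This is not what BLAST does internally, which relies on seeded local alignment with statistical scoring; it is only a toy stand-in for the idea of scoring pairs of sequences. The function names and the three short sequences are made up for illustration and are not real viral genomes.

```python
def kmers(seq: str, k: int = 3) -> set[str]:
    """Return the set of all length-k substrings of a nucleotide sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard index of the two k-mer sets: |A & B| / |A | B|."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    # Sequences shorter than k have no k-mers; treat that case as similarity 0.
    return len(ka & kb) / len(union) if union else 0.0


# Hypothetical toy sequences (assumption: invented here, NOT real genomes).
toy = {
    "virus_a": "ATGGCTAGCTAGGCTTACGATCG",
    "virus_b": "ATGGCTAGCTAGGCTTACGTTCG",  # virus_a with one substitution
    "virus_c": "TTTTCCCCGGGGAAAATTTTCCC",  # unrelated composition
}

names = sorted(toy)
pairwise = {
    (x, y): jaccard_similarity(toy[x], toy[y])
    for i, x in enumerate(names)
    for y in names[i + 1:]
}

for pair, score in pairwise.items():
    print(pair, round(score, 3))
```

A matrix of such pairwise scores (or, more realistically, distances derived from proper alignments) is the kind of input that distance-based tree-building methods work from.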
Furthermore, even though coronaviruses are very common (some cause the common cold), few have caused epidemics, so few have been sequenced. Right now, it also makes more sense to fund experiments on SARS-CoV-2 rather than on related viruses. Thus, it's hard to apply transfer learning / low-resource learning when we don't have many examples of "other viruses", so I believe it's hard to find a cure from a purely genomics perspective.
2) Here is a paper that published early sequencing results for SARS-CoV-2, with comparisons to other coronaviruses: https://www.ncbi.nlm.nih.gov/pubmed/32007145
There are also many other papers describing the homology (or lack thereof) between SARS-CoV-2 and other viruses.
3) I would refer you to GenBank's latest data for SARS-CoV-2, where researchers around the world have shared their sequencing results: https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/
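Sequence records from repositories like GenBank can typically be downloaded as FASTA text. As a rough sketch of how such data could be turned into the n-much-smaller-than-p setting mentioned in the question, the snippet below parses FASTA text and builds a k-mer count vector per sequence. The helper names and the two toy records are hypothetical, and k-mer counts are just one possible featurization, not a recommendation from the genomics literature.

```python
import io
from collections import Counter
from itertools import product


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def kmer_counts(seq, k):
    """Count every possible A/C/G/T k-mer; k-mers with other symbols are ignored."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[kmer] for kmer in vocab]


# Hypothetical toy records (assumption: invented here, NOT real GenBank entries).
TOY_FASTA = """>toy_seq_1 hypothetical record
ATGGCTAGCTAG
GCTTACGATCG
>toy_seq_2 hypothetical record
ATGGCTAGCTAGGCTTACGTTCG
"""

records = list(read_fasta(io.StringIO(TOY_FASTA)))
X = [kmer_counts(seq, k=4) for _, seq in records]

# n samples (sequences) by p features (4**k possible k-mers): here p >> n.
print(len(X), "samples x", len(X[0]), "features")
```

With only a handful of sequenced genomes and 4^k candidate features, p quickly dwarfs n, which is the regime where sparse methods such as LASSO are usually considered.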