DMICE Researchers Organizing and Participating in Covid-19 Information Retrieval Challenge

One of the major challenges of the Covid-19 epidemic is managing the rapidly expanding scientific corpus that is published in journals, by health-related organizations, and on preprint servers. A new information retrieval (IR) research challenge aims to identify the best methods for retrieving scientific literature for the current and all future rapidly evolving pandemics. Researchers from the Department of Medical Informatics & Clinical Epidemiology (DMICE) are among the organizers of this new research challenge related to Covid-19. OHSU medical students overseen by DMICE are also annotating the output of systems for relevance to topics in the challenge.

The challenge is called TREC-COVID and aims to develop and evaluate methods to optimize search engines for the current and rapidly expanding number of scientific papers about Covid-19 and related topics. The challenge is being organized by a group of IR researchers from the Allen Institute for Artificial Intelligence (AI2), the National Institute of Standards and Technology (NIST), the National Library of Medicine (NLM), Oregon Health and Science University (OHSU), and the University of Texas Health Science Center at Houston (UTHealth). A press release and official Web site for the project have been posted. DMICE Chair William Hersh, MD, is also maintaining a page about the project.

TREC-COVID applies well-known IR evaluation methods from the NIST Text Retrieval Conference (TREC), an annual challenge evaluation that assesses retrieval methods with data from news sources, Web sites, social media, and biomedical publications. In an IR challenge evaluation, there is typically a collection of documents or other content, a set of topics based on real-world information needs, and relevance assessments to determine which documents are relevant to each topic. Different research teams run the topics over the collection with their own search systems and submit the results, from which metrics derived from recall and precision are calculated using the relevance judgments.
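As a rough illustration of how such metrics are derived, the sketch below scores a single ranked run against a set of relevance judgments. The topic name, document IDs, and function names are hypothetical and chosen only for this example; they are not taken from TREC-COVID, and official TREC scoring uses its own tooling and a wider set of measures.

```python
from typing import Dict, List, Set, Tuple


def precision_recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> Tuple[float, float]:
    """Precision and recall over the top-k documents of one ranked list."""
    top_k = ranked[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical relevance judgments: topic -> set of relevant document IDs.
qrels: Dict[str, Set[str]] = {"topic1": {"d1", "d3", "d7"}}

# Hypothetical run from one search system: topic -> ranked document IDs.
run: Dict[str, List[str]] = {"topic1": ["d3", "d2", "d1", "d5", "d7"]}

for topic, ranked_docs in run.items():
    p, r = precision_recall_at_k(ranked_docs, qrels[topic], k=5)
    ap = average_precision(ranked_docs, qrels[topic])
    print(f"{topic}: P@5={p:.2f} R@5={r:.2f} AP={ap:.3f}")
```

In this toy case, three of the five retrieved documents are relevant and all three relevant documents are found, so precision at 5 is 0.60 and recall is 1.00. Average precision additionally rewards runs that rank relevant documents earlier.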

The document collection for TREC-COVID comes from AI2, which has created the COVID-19 Open Research Dataset (CORD-19), a free resource of scholarly articles about COVID-19 and other coronaviruses. CORD-19 is updated weekly, although fixed versions will be used for each round of TREC-COVID. It includes not only articles published in journals but also those posted on preprint servers, including bioRxiv, medRxiv, and others. A preprint about the dataset and an article describing it also mention OHSU.

Because the dataset (along with the world’s corpus of scientific literature on Covid-19) is being updated frequently, there will be multiple rounds of the challenge, with later ones focused on identifying newly emerging research. There may also be other IR-related tasks, such as question-answering and fact-checking. The search topics for the first round are based on those submitted to a variety of sources and were developed by Dr. Hersh; Kirk Roberts, PhD, of UTHealth; and Dina Demner-Fushman, MD, PhD, of NLM. Relevance judgments are being done by those with medical expertise, such as medical students and NLM indexers. Dr. Hersh is overseeing the initial relevance judging process, which is being carried out by OHSU medical students who are currently sidelined from clinical activities due to the Covid-19 crisis. DMICE faculty member Steven Bedrick, PhD, is helping to organize the technical and logistical aspects of the judging process.