The challenge will spur the creation of innovative strategies in NLP by opening participation to teams and individuals across academia and the private sector. Prizes will be awarded to the top-ranking data science contestants or teams that create NLP systems that accurately capture the information in free text and output it as knowledge graphs.
With an ever-growing number of scientific studies across subject domains, there is a vast landscape of biomedical information that is not easily accessible to the public in open data repositories. Open scientific data repositories can be incomplete, or too vast to be explored to their full potential, without a consolidated linkage map that relates all scientific discoveries. Much of this medical knowledge can be computationally transformed into knowledge graphs for use in an open data repository. Such graphs have the potential to help identify gaps in medical research and to accelerate scientific investigation of unexplored medical domains.
However, open medical data alone is not enough to deliver its full potential for public health. By engaging technologists, members of the scientific and medical community and the public in creating tools with open data repositories, funders can substantially increase the utility and value of those data to help solve pressing national health issues. The LitCoin Natural Language Processing (NLP) Challenge seeks to spur innovation by rewarding the most creative and high-impact uses of free text from biomedical publications to create knowledge graphs. These graphs can link concepts within existing research, allowing researchers to find connections that might otherwise have been difficult to discover. This challenge is part of a broader conceptual initiative at NCATS to change the “currency” of biomedical research. NCATS held a Stakeholder Feedback Workshop in June 2021 to solicit feedback on this concept and its implications for researchers, publishers and the broader scientific community.
This challenge brings together government, medical research communities and data scientists to create data-driven knowledge graphs that consolidate medical scientific data across domains. With an approximately four (4)-month development cycle for the challenge, data scientists will be challenged to develop NLP systems with the ability to identify concepts from a biomedical publication and link them together into relationships to create well-linked and carefully defined knowledge graphs for each publication.
Biomedical researchers need to be able to use open scientific data to generate new research hypotheses that lead to more treatments for more people more quickly. Reading all of the literature that could be relevant to a research topic can be daunting or even impossible, which can lead to gaps in knowledge and duplication of effort. Transforming knowledge from biomedical literature into knowledge graphs can improve researchers’ ability to connect disparate concepts and build new hypotheses, and can help them discover work done by others that might otherwise be difficult to surface.
To advance some of the most promising technology solutions built with knowledge graphs, the National Institutes of Health (NIH) and its collaborators are launching the LitCoin NLP Challenge. This challenge aims to (1) help data scientists better deploy their data-driven technology solutions towards accelerating scientific research in medicine and (2) ensure that data from biomedical publications can be maximally leveraged and reach a wide range of biomedical researchers; together this will drive toward solutions for the critical problems these scientists aim to solve.
NCATS will share with the participants an open repository containing abstracts derived from published scientific research articles and knowledge assertions between concepts within these abstracts. The participants will use this data repository to design and train their NLP systems to generate knowledge assertions from the text of abstracts and other short biomedical publication formats. Other open biomedical data sources may be used to supplement this training data at the participants’ discretion. In addition to creating these assertions, successful participants’ NLP systems should be able to recognize which assertions are novel findings that represent the fundamental reason that the manuscript was published, as opposed to background or ancillary assertions that can be found elsewhere.
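As a rough illustration of the task described above, the output for a single abstract could be represented as a small knowledge graph of concepts, assertions between them and a novelty flag on each assertion. The sketch below is only an assumption about how such output might be structured. The class names, field names, relation labels and placeholder concepts (`GENE_X`, `Disease Y`, `Drug Z`) are hypothetical and are not taken from the challenge's data format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A concept mentioned in an abstract (hypothetical schema)."""
    name: str
    concept_type: str  # illustrative labels such as "gene" or "disease"


@dataclass(frozen=True)
class Assertion:
    """A directed relationship between two concepts."""
    subject: Concept
    relation: str  # illustrative relation label, not an official vocabulary
    obj: Concept
    is_novel: bool  # True for a new finding, False for background knowledge


@dataclass
class KnowledgeGraph:
    """All assertions extracted from one abstract."""
    abstract_id: str
    assertions: list[Assertion] = field(default_factory=list)

    def add(self, assertion: Assertion) -> None:
        self.assertions.append(assertion)

    def concepts(self) -> set[Concept]:
        # Every concept that appears as the subject or object of an assertion.
        found: set[Concept] = set()
        for a in self.assertions:
            found.add(a.subject)
            found.add(a.obj)
        return found

    def novel_assertions(self) -> list[Assertion]:
        # Assertions flagged as the new findings of the publication.
        return [a for a in self.assertions if a.is_novel]


def build_example_graph() -> KnowledgeGraph:
    """Build a toy graph from placeholder concepts (not real findings)."""
    gene = Concept("GENE_X", "gene")
    disease = Concept("Disease Y", "disease")
    drug = Concept("Drug Z", "chemical")
    graph = KnowledgeGraph(abstract_id="example-abstract-1")
    graph.add(Assertion(gene, "associated_with", disease, is_novel=False))
    graph.add(Assertion(drug, "treats", disease, is_novel=True))
    return graph


if __name__ == "__main__":
    graph = build_example_graph()
    for a in graph.novel_assertions():
        print(f"{a.subject.name} --{a.relation}--> {a.obj.name}")
```

Keeping the novelty flag on the assertion rather than on the concept reflects the distinction drawn above: the same concept can appear in both background statements and new findings within one abstract.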
Total Cash Prize Pool
This is a single-phase competition in which a total of up to $100,000 will be awarded by NCATS directly to the top performers, that is, the participants whose NLP systems score highest in the evaluation for accuracy of assertions.
At this stage, NCATS anticipates that cash prizes will be awarded to seven (7) of the top-performing NLP systems as follows:
First prize: $35,000
Second prize: $25,000
Third prize: $20,000
Four runner-up prizes: $5,000 each
If a team, entity or individual who does not qualify to win a cash prize is selected as a prize winner, NCATS will award that winner a recognition-only prize.
NCATS may choose to award different cash prize amounts, or no prize at all, at its discretion.
Cash prizes awarded under this challenge will be paid by electronic funds transfer and may be subject to federal income taxes. The U.S. Department of Health and Human Services and NIH will comply with the Internal Revenue Service withholding and reporting requirements, where applicable.