Open-sourcing a new parser to improve clinical trial participant recruitment

May 7

ADDENDUM, June 5, 2020: Our architecture design references relation extraction (RE), which we have not included in the open-sourced parser implementation. We have, however, demonstrated the feasibility of a basic RE block, which could be built by incorporating a publicly available algorithm for negation detection, such as NegEx. The updated graphic below clarifies this and should be referenced in place of the original.

What it is:

A new clinical trial parser intended to help developers and researchers build improved recruitment tools and to make it easier for people to discover and determine eligibility for clinical trials. Launching on GitHub today, this library of AI models and data helps transform clinical trial eligibility criteria into a machine-readable format. Using this library, trials can be easily searched by their eligibility requirements. To achieve this, we use natural language processing (NLP) to extract information with domain intelligence from trial descriptions and external medical vocabularies.

Our main contributions include a parser architecture that is treatment-area agnostic and a large, diverse named entity recognition (NER) training dataset containing 120K double-reviewed samples drawn from 50K eligibility criteria across 3.3K trials. We believe these resources can support the development of new technologies around clinical trials, so we are open-sourcing all of our context-free grammar (CFG) code, NER model binary, training data, and word embeddings. The parser uses public information from clinical trials and the Medical Subject Headings (MeSH) vocabulary (related license and use terms are in the repository). No Facebook data was used for this project, and we are neither launching nor hosting our own clinical trial tool.

What it does:

The library employs CFG and information extraction (IE) modules to transform eligibility criteria text into structured nominal, ordinal, and numerical relations; about 6K concepts can be extracted from the eligibility criteria. The CFG module uses a lexer to divide criteria into tokens and a modified Cocke–Younger–Kasami (CYK) algorithm to build parse trees from those tokens. An interpreter then prunes the parse trees, removing duplicate trees and redundant subtrees, and evaluates the remaining trees into machine-readable ordinal and numerical relations.
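To make the end product of this pipeline concrete, here is a minimal sketch of turning a simple numerical criterion into a structured relation. This is not the open-sourced implementation (which uses a lexer and a CYK parser over a full grammar); it is a hypothetical regex-based illustration, and the relation schema shown is our own simplification.

```python
import re

# Hypothetical sketch only: extract a (variable, comparison, value, unit)
# relation from a simple numerical criterion such as "Age >= 18 years".
# The real parser builds parse trees with a CFG and a modified CYK algorithm.
PATTERN = re.compile(
    r"(?P<variable>[a-z][a-z0-9 ]*?)\s*"      # variable name, e.g. "age"
    r"(?P<op>>=|<=|>|<|=)\s*"                 # comparison operator
    r"(?P<value>\d+(?:\.\d+)?)\s*"            # numeric threshold
    r"(?P<unit>[a-z%/]+)?",                   # optional unit, e.g. "years"
    re.IGNORECASE,
)

def parse_numerical_criterion(text):
    """Return a structured relation dict for a simple numerical criterion,
    or None when the text contains no recognizable comparison."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return {
        "variable": m.group("variable").strip().lower(),
        "comparison": m.group("op"),
        "value": float(m.group("value")),
        "unit": (m.group("unit") or "").lower(),
    }

print(parse_numerical_criterion("Age >= 18 years"))
```

A grammar-based parser handles far more variation than this (ranges, conjunctions, nested clauses), which is why the library uses CYK over a CFG rather than pattern matching.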

For NER, the IE module uses an attention-based bidirectional long short-term memory (bi-LSTM) network with a conditional random field (CRF) layer to extract medical terms and their classes, e.g., cancer, chronic disease, or clinical variable.
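Sequence taggers like this typically emit per-token BIO labels that are then decoded into entity spans. The sketch below shows only that decoding step, with hypothetical tag names; the actual model architecture and label set live in the released model binary and training data.

```python
# Illustrative sketch: decode BIO tags (as a bi-LSTM + CRF tagger might emit
# them) into (entity text, entity class) pairs. Tag names are hypothetical.
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity text, entity class) pairs."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):              # a new entity begins
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            current.append(token)             # entity continues
        else:                                 # "O" or inconsistent I- tag
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

tokens = ["patients", "with", "chronic", "kidney", "disease", "or", "anemia"]
tags = ["O", "O", "B-chronic_disease", "I-chronic_disease",
        "I-chronic_disease", "O", "B-chronic_disease"]
print(decode_bio(tokens, tags))
# → [('chronic kidney disease', 'chronic_disease'), ('anemia', 'chronic_disease')]
```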

Named entity linking (NEL) directly matches extracted terms to medical concepts. Links between terms and concepts can also be computed by grouping terms into clusters with word embeddings; the clustered terms are then matched to additional medical concepts. This grounding of extracted terms in controlled medical concepts yields medical variables and machine-readable nominal relations. The process currently uses the MeSH vocabulary, augmented with manually created concepts, and could be updated to use another vocabulary for better quality.
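One common way to realize embedding-based linking is nearest-neighbor search by cosine similarity. The sketch below uses toy three-dimensional vectors and made-up MeSH identifiers purely for illustration; a real linker would use the released word embeddings and the full MeSH vocabulary.

```python
import math

# Hedged sketch: link an extracted term to its nearest controlled concept by
# cosine similarity of embeddings. Vectors and concept labels are toy values.
CONCEPT_VECTORS = {
    "D006973 (Hypertension)": [0.9, 0.1, 0.2],
    "D003920 (Diabetes Mellitus)": [0.1, 0.8, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def link_term(term_vector, concepts=CONCEPT_VECTORS):
    """Return the concept whose embedding is most similar to the term's."""
    return max(concepts, key=lambda c: cosine(term_vector, concepts[c]))

print(link_term([0.85, 0.15, 0.25]))  # closest to the hypertension vector
```

In practice a similarity threshold would also be needed so that terms with no close concept remain unlinked rather than being forced onto the nearest neighbor.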

Relation extraction (RE) is used to detect logical negations in a criterion and to convert an exclusion criterion into an equivalent inclusion criterion. It also aggregates relations, removes redundant ones, and validates the structured output.
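The exclusion-to-inclusion step can be pictured as negating the comparison in a structured relation. The sketch below is a hypothetical illustration of that one idea, using a simplified relation schema of our own rather than the library's actual data structures.

```python
# Hypothetical sketch of one RE step: convert an exclusion criterion's
# structured relation into its inclusion equivalent by negating the
# comparison operator. Schema is a simplification for illustration.
NEGATE = {">=": "<", "<=": ">", ">": "<=", "<": ">=", "=": "!=", "!=": "="}

def exclusion_to_inclusion(relation):
    """Flip a structured exclusion relation into its inclusion equivalent."""
    flipped = dict(relation)  # copy so the input relation is untouched
    flipped["comparison"] = NEGATE[relation["comparison"]]
    return flipped

# Exclusion "creatinine > 1.5 mg/dl" becomes inclusion "creatinine <= 1.5 mg/dl"
print(exclusion_to_inclusion(
    {"variable": "creatinine", "comparison": ">", "value": 1.5, "unit": "mg/dl"}
))
```

Real criteria also contain nominal relations and scoped negations ("no history of ..."), which is why the addendum above notes that a fuller RE block could incorporate a negation-detection algorithm such as NegEx.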

The extracted relations can be used to build a search interface for clinical trials. For example, an organization specializing in a certain treatment area could implement a question flow that matches people with clinical trials based on their medical condition and history. The organization could also reach out to potential participants about newly matching trials before those trials start recruiting. These discovery use cases may help address the recruitment challenges clinical trials face. The visualization below demonstrates how the library translates text criteria into structured relations.

The README on GitHub includes instructions for building your own parser as well as sample inputs and outputs.

Why it matters:

We need clinical trials now more than ever. They play a crucial role in public health by advancing science and identifying effective new treatments, but they face a series of challenges. First, many trials are not able to recruit a sufficient number of participants; as many as 86 percent of clinical trials do not reach their recruitment targets within the specified time period. Further, it has been estimated that 19 percent of studies are terminated for insufficient enrollment or are completed with less than 85 percent of target enrollment. Second, trial participation often lacks diversity with an overrepresentation of non-Hispanic whites. The non-Hispanic white population represents approximately 60 percent of the U.S. population but as much as 86 percent of U.S. cancer trial participants, for example.

The difficulty that nonmedical audiences have in understanding trial details and eligibility criteria is a driver of these issues, and it reduces the ability of researchers to engage potentially eligible participants. We’ve developed this library to make it easier for developers and researchers to build tools that determine trial eligibility. We hope this work helps these communities provide better ways for patients from all backgrounds to access clinical trials.

Get it on GitHub:

Clinical trial parser repository

Written By

Markku Salkola

Software Engineer, Machine Learning

Yitong Tseo

Software Engineer

Dr. Freddy Abnousi

Head of Health Technology