Accelerating Clinical Trial Recruitment with Natural Language Processing

Updated on December 28, 2020
Pandemic flu coronavirus or covid 19 outbreak concept. Hand holding blood tube test isolated on background. Vector illustration design.

Photo credit: Depositphotos

By David Talby 

If 2020 has taught us anything, it’s that the need for faster development, testing, and regulatory approval for new medical treatments is critical to public health. With the acceleration of COVID-19 research months into the pandemic, people are hopeful that a vaccine is on the way so we can resume some semblance of normal life. Healthcare providers are equally ancitipcating a light at the end of the tunnel, given the burden the pandemic has on our healthcare system.

The race for the COVID-19 vaccine is one of the fastest in history. For most treatments, the normal time for development is around 15 years. The severity of the disease is irrelevant here: whether it’s cancer, Alzhiemer’s disease, or Parkinson’s disease, the reality is that launching new drugs or medical devices takes a long time. For many suffering serious illnesses, this is the difference between life and death.  

Just consider the numbers. With too few participants, 86 percent of clinical trials are delayed and 50 percent of trial sites fail. For one blockbuster drug, this equates to 993 human lives lost, $2.8 billion in revenue lost, and a 5x increase in drug price once it goes to market. The consequences are enormous, for humanity and financially, and there’s one surprising factor stalling progress in this area— and it isn’t what you would think.

One of the biggest hurdles in treatment development is recruiting patients for clinical trials. In order to find participants, pharmaceutical companies and the clinicians they partner with must get patient data, generate search criteria, find a match, and get patients on board. It sounds simple given the data analysis and artificial intelligence (AI) capabilities we have today, but when you consider the nuances of medical jargon (different terms used by different doctors), the source of data (messy collection of reports and images spread across many systems), and increasingly complex enrollment criteria, a host of complexities are added to the mix. 

Thankfully, advances in natural language processing (NLP) technology are helping address this challenge. With highly accurate pre-trained models available coupled with the ability to easily train and tune domain-specific models, NLP has the power to help organizations sort through complex data sources and accelerate the rate in which patients can be recruited to trials for life-changing treatments.

NLP enables data scientists to disambiguate clinical and human language—think of doctor’s notes written in a specialty-specific jargon. Detecting and refining relevant clinical facts is no easy task when you consider how specific and contextual clinical language is, and then pepper in human error, such as misspellings, improper punctuation, and acronyms that are sometimes used only within one hospital or clinic. Healthcare-specific NLP models are essential here, going beyond generic biomedical NLP to extract specific entities and facts required by each medical specialty. Spark NLP for Healthcare, for example, is backed by 300 pre-trained models and 2,200+ datasets being constantly updated. This means enhancements in deciphering terminologies, outcomes, drugs, providers, devices, genes, clinical guidelines, billing codes, and many others.

Let’s say, for example, that a clinical trial is looking for patients with triple-negative breast cancer (TNBC). TNBC refers to the fact that these specific cancer cells don’t have estrogen or progesterone hormone receptors and fail to make a significant amount of the protein called HER2. As such, the cells test negative on all 3 tests. As you can imagine, there are many ways to denote this in medical records— Er-/pr-/h2-, (er pr her2) negative, Tested negative for the following: er, pr, h2,  Triple negative neoplasm of the upper left breast, etc.—which can result in thousands of search results for even a small pool of patients.

Modern NLP models that are trained specifically on breast cancer data will not only be able to identify variations of these terms, but also when they are in context. The same terms—“er”, “pr”, “her2”, “tn”—can have different meanings in a different specialty, like hematology, cardiology, or genomic research. Acronyms, ER stands for “emergency room,” or spelling mistakes, as in “she missed her2 last appointments,” can also be challenging to detect, but as NLP gets smarter, we can account for even human error. 

Moreover, a clinical NLP pipeline should go beyond recognizing terms to also identify what is being said about them. Since most breast cancer patients will have the terms mentioned in their records, it’s important to tell where an assertion made is positive (“patient is h2-/er-/pr-”), negative (“she’s confirmed not to be her-”), hypothetical (“recommending a test for TNBC given her recent labs”), or about someone else (“her sister had tnbc 8 years ago”).

Traditional text mining algorithms and software could not differentiate between these common variations at reasonable accuracy until very recently. They’re far from perfect today, but accuracy improvements made just over the past two years saves a significant amount of time for finding the right candidates for clinical trials, accelerating the recruitment process.

With more than 5,400 cutting-edge treatments behind the gates of clinical trials, it’s vital for healthcare organizations to start leveraging the power of NLP to aid in clinical trial recruitment. We still expect doctors to read through trial descriptions and make the matches with patients manually, and have long since hit the point where there’s just too much data to enable this to effectively happen.

If we could reduce the average 10.8 month delays in clinical trials, we could take years off of the drug development timeline. This means healthier patients, less expensive drugs, and a huge weight off of the healthcare system. And implementing NLP is the best way to achieve this. 

David Talby is CTO of John Snow Labs.

The Editorial Team at Healthcare Business Today is made up of skilled healthcare writers and experts, led by our managing editor, Daniel Casciato, who has over 25 years of experience in healthcare writing. Since 1998, we have produced compelling and informative content for numerous publications, establishing ourselves as a trusted resource for health and wellness information. We offer readers access to fresh health, medicine, science, and technology developments and the latest in patient news, emphasizing how these developments affect our lives.