By David Dyke & Vishal Kapadia
One challenge facing the research community today: acquiring meaningful data sets on the COVID-19 pandemic. The questions being asked are still fundamental – why do some people test positive yet remain asymptomatic, while others end up in respiratory failure?
The answers to those questions may well lie in the historical medical data for COVID-19 positive patients. More than 1.4 million patients have tested positive for COVID-19 nationwide, and the medical community is accumulating information to further scientific understanding of the disease and the factors influencing its transmission and progression.
Answers require data, and that data has to be accumulated. It’s a tall order. Imagine the scale of collecting a lifetime of medical records for each of the 1.4 million (and growing) Americans who have tested positive for COVID-19. Getting medical records from every doctor’s office, hospital and urgent care facility in the country into a single system is a convoluted process requiring secure technology and a broad and deep network to curate medical data.
Then there are many questions of data quality and completeness. Medical data is famously siloed, narratively dense and not very interoperable. The clinical data rests in various states of structure. While interfaces like HL7, CCD/A and FHIR exist, they are not designed for clinical research use cases and connectivity challenges abound. The lowest common denominator and the fastest path is generally an exported copy of the medical data, typically a PDF.
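To make the interoperability gap concrete, consider what even the "structured" side of the data looks like. The sketch below parses a single HL7 v2 PID (patient identification) segment – the pipe-delimited format mentioned above. The field names and the example segment are illustrative, and real-world messages vary widely by site, which is exactly why such interfaces still demand heavy curation for research use:

```python
# Minimal sketch: parsing one HL7 v2 PID (patient identification) segment.
# Real messages carry many segments and site-specific variations; this toy
# parser only illustrates why structured clinical interfaces still need
# substantial curation before the data is research-grade.

def parse_pid_segment(segment: str) -> dict:
    """Split a pipe-delimited HL7 v2 PID segment into a few named fields."""
    fields = segment.split("|")
    if fields[0] != "PID":
        raise ValueError("not a PID segment")
    # HL7 v2 field positions (1-indexed after the segment name):
    # PID-3 = patient identifier list, PID-5 = patient name, PID-7 = date of birth
    return {
        "patient_id": fields[3].split("^")[0] if len(fields) > 3 else "",
        "name": fields[5].replace("^", " ").strip() if len(fields) > 5 else "",
        "birth_date": fields[7] if len(fields) > 7 else "",
    }

# Hypothetical example segment (not real patient data).
example = "PID|1||12345^^^HOSP^MR||DOE^JANE||19800101|F"
record = parse_pid_segment(example)
```

Even this best-case scenario assumes the sending system populated the fields consistently; the far more common PDF export offers no such delimiters at all.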
So the challenge is building a clinically relevant longitudinal data set for 1.4 million people from narrative clinical notes derived from a PDF. How does a health technology company turn that into research-grade data?
The answer lies in applying Artificial Intelligence techniques to the raw data – including Biomedical Natural Language Processing (bNLP) and Computer Vision to contextualize medical records – paired with Deep Learning, which learns and improves automatically without being explicitly programmed. With machine learning pattern matching, backpropagation and other techniques built on a highly trained biomedical informatics NLP, we discover solutions to problems humans would not see. However, despite their power, many of these models are inherently opaque, meaning their results cannot be explained through logical inference.
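The shape of the extraction task can be illustrated with a deliberately transparent toy. Production bNLP relies on trained deep models, not hand-written rules; the sketch below uses a tiny hypothetical lexicon only to show what "contextualizing" a narrative note means – turning free text into labeled, structured concepts:

```python
import re

# Toy illustration only: real bNLP pipelines use trained deep models.
# This sketch shows the *shape* of the task -- pulling structured
# clinical concepts out of a free-text narrative note.

# Hypothetical mini-lexicon mapping surface forms to concept labels.
LEXICON = {
    "shortness of breath": "SYMPTOM",
    "fever": "SYMPTOM",
    "hypertension": "CONDITION",
    "azithromycin": "MEDICATION",
}

def extract_concepts(note: str) -> list:
    """Return (term, label, character offset) tuples found in a note."""
    hits = []
    lowered = note.lower()
    for term, label in LEXICON.items():
        for match in re.finditer(re.escape(term), lowered):
            hits.append((term, label, match.start()))
    # Present concepts in the order they appear in the narrative.
    return sorted(hits, key=lambda hit: hit[2])

note = ("Patient reports fever and shortness of breath. "
        "History of hypertension; started on azithromycin.")
concepts = extract_concepts(note)
```

A rule-based lookup like this is fully explainable but brittle; the deep models that replace it generalize far better, at the cost of the opacity noted above.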
Machines alone won’t cut it. And a person working alone would spend untold hours reviewing each piece of a patient’s medical record, combining the pieces and assessing their applicability to the clinical questions at hand. The combination of machines and people, however? Efficient. That’s where Ciox Real World Data and the DataFit Platform™ come in.
A machine can take an enormous amount of input data and narrow it down into the more useful aspects—a process that would typically take hours by hand. The machine also provides several predictions and a reasonable interpretation of the data, but because of the machine’s imperfections, the data is still flawed and not completely refined. To finish the process, a clinical expert at Ciox RWD uses their knowledge to clean the data up and curate the machine’s output into quality, research-grade data.
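The machine-then-human handoff described above can be sketched as a simple triage step, under an assumed setup in which each model extraction carries a confidence score: high-confidence items pass through automatically, and the rest are queued for a clinical expert. The threshold, field names and scores here are illustrative, not Ciox's actual pipeline:

```python
# Sketch of a machine-then-human triage workflow. The threshold and the
# record fields are assumptions for illustration, not a real pipeline.

REVIEW_THRESHOLD = 0.90  # assumed cutoff for automatic acceptance

def triage(extractions):
    """Split model output into auto-accepted items and a human review queue."""
    accepted, review_queue = [], []
    for item in extractions:
        if item["confidence"] >= REVIEW_THRESHOLD:
            accepted.append(item)
        else:
            review_queue.append(item)
    # Clinical experts see the lowest-confidence (hardest) cases first.
    review_queue.sort(key=lambda item: item["confidence"])
    return accepted, review_queue

# Hypothetical model output for one patient record.
model_output = [
    {"field": "diagnosis", "value": "COVID-19", "confidence": 0.97},
    {"field": "onset_date", "value": "2020-03-14", "confidence": 0.62},
    {"field": "medication", "value": "remdesivir", "confidence": 0.88},
]
accepted, review_queue = triage(model_output)
```

The division of labor is the point: the machine absorbs the volume, while expert attention is spent only where the model is least certain.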
This process would have previously taken many hours, but the combination of machine learning and human expertise facilitates this process considerably. By leveraging what machines are good at and what people are good at, we reach an endpoint faster, more completely and with maximum quality.
The effort – and it’s worth noting that the data collected and studied for COVID-19 research is wholly private, de-identified and fully HIPAA-compliant – is similar to the mission we’ve undertaken in recent years for various cancers. That is to say, it’s registry building.
In cancer, as in COVID-19, the registries help us better understand correlation and causation. What medicines work, and under what conditions? What therapies work better than others? Why do some treatments work only in certain kinds of patients? With questions like these, in cancer as in COVID-19, the depth, breadth and quality of the data become vitally important.
The sheer quantity of unknowns is precisely why clinicians, nurses, doctors and researchers are working tirelessly to understand the underlying factors in the COVID-19-positive population. Those unknowns are the insights we still need. They’re why the data must be good, and why we need to move at a fast pace.
None of this rapid dataset building would have been possible even five years ago. But today, thanks to the support of the DataFit Platform™, which we are using to build the research-grade datasets used in studies of the coronavirus, we can address the crisis rapidly and at scale.
We can take real-world data and put it in the hands of researchers to derive real-world outcomes. We can unlock the high-value data trapped in unstructured formats and make it available to authorized researchers faster than ever before. Together, with a combination of identification and analytic technologies, and with data curation workflows and data provenance protections, we are giving the right real-world data to researchers to discover solutions for COVID-19 treatment and prevention.