Should Patients Share Their Medical Records to Improve AI Models?

Updated on November 24, 2024

How would you feel about contributing your medical records to an organization dedicated to improving healthcare worldwide if the data were anonymized and used for the “greater good”? To benefit humanity?

“Gen AI could streamline service delivery and operations for frontline public health and community health workers, reducing administrative burdens…and ensuring effective, tailored engagement with diverse communities,” according to a McKinsey report.

These gen AI models could be made more effective with larger healthcare datasets, and individuals might be willing to contribute their health data for such purposes if given assurances that it would not be misused.

A well-managed organization committed to improving the health and wellness of the underserved worldwide could use that data to inform generative AI designed to assist treatment in those communities. So, should we consider sharing our healthcare data for the greater good?

On the surface, the answer would be a resounding “yes.”

Nevertheless, whether to share personal medical information remains a tough question to answer.

Indeed, some of the health information you and I share online with friends or acquaintances could be training large language models (LLMs) as you read this. If the information resides on the public internet, not behind a login screen, it can be scraped from public-facing websites and used to train generative AI.

Our personal and private healthcare information is highly protected. Laws such as HIPAA govern how and when it can be disclosed to third parties, and it generally cannot be used for purposes like AI training without explicit consent.

If patients consent to their data being used, there are many issues to consider, including data security, the type of organization entrusted to care for the information, and exactly how the data would be anonymized. Before allowing an LLM to train on medical histories, a sophisticated and secure consent management process must be created.
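As a rough illustration of what such a process must track, the Python sketch below models a single consent record with a purpose, an expiry, and a revocation flag. All class and field names here are hypothetical assumptions, not a reference to any real consent-management system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import uuid

@dataclass
class ConsentRecord:
    """Hypothetical record of one patient's consent to AI-training use."""
    patient_pseudonym: str                 # stable pseudonym, never the real identity
    purpose: str                           # e.g., "LLM training for community health"
    granted_at: datetime
    expires_at: Optional[datetime] = None  # consent can be time-limited
    revoked_at: Optional[datetime] = None  # set when the patient withdraws
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def is_active(self, now: Optional[datetime] = None) -> bool:
        """Consent is usable only if it has not been revoked or expired."""
        now = now or datetime.now(timezone.utc)
        if self.revoked_at is not None:
            return False
        if self.expires_at is not None and now > self.expires_at:
            return False
        return True

consent = ConsentRecord(
    patient_pseudonym="pseudo-12345",
    purpose="LLM training for community health programs",
    granted_at=datetime.now(timezone.utc),
)
print(consent.is_active())  # True until revoked or expired
```

A real system would add durable storage, identity verification, and audit logging on top of a record like this.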

It’s not only the use of personal medical data that can be restricted. Companies are leveling copyright infringement lawsuits at the makers of generative AI. The New York Times, for example, is suing OpenAI and Microsoft because, the media giant says, ChatGPT and Copilot were trained on its content. (As of this writing, more people and organizations are filing similar suits, claiming AI models are being trained on their data without permission.)

Constructing the LLM

As we know, LLMs require enormous amounts of data to train generative AI models. Depending on the model, training can use anywhere from 780 billion to 3.6 trillion tokens. (A token is a small chunk of text, often a piece of a word, that the model processes.) As patients, we generate an astounding amount of healthcare data during a lifetime; a single hospital, for example, can generate roughly 50 petabytes of data annually.
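To make the token idea concrete, here is a minimal sketch using the open-source tiktoken tokenizer; the sample sentence is invented, and the encoding shown is just one of several tiktoken provides.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is one of the encodings tiktoken ships with.
enc = tiktoken.get_encoding("cl100k_base")

text = "Patient presents with hypertension."
token_ids = enc.encode(text)

print(len(token_ids), "tokens")
# Decode each id individually to see how words split into sub-word pieces.
print([enc.decode([t]) for t in token_ids])
```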

While medical data exists and LLMs consume vast amounts of it, there remain many obstacles—some of which may be insurmountable—to sharing the information with generative AI.

Medical data can be anonymized; however, even when de-identified, it’s still mine. I own it, and you need my explicit consent to use it, just as The New York Times and others are asserting in their copyright infringement lawsuits. In addition, even clustered de-identified data can lead back to a specific person if enough information is available.
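For a concrete picture, HIPAA’s Safe Harbor method de-identifies data by removing 18 categories of direct identifiers (names, contact details, record numbers, and so on). The sketch below is a deliberately simplified assumption of what such a scrub might look like; the field names are hypothetical, and real pipelines must also handle free-text notes, where identifiers hide in prose.

```python
# Hypothetical, simplified Safe Harbor-style scrub; field names are assumptions.
DIRECT_IDENTIFIERS = {
    "name", "address", "phone", "email", "ssn",
    "medical_record_number", "birth_date",
}

def deidentify(record: dict) -> dict:
    """Drop direct identifiers and coarsen quasi-identifiers."""
    clean = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Quasi-identifiers like exact age can still enable re-identification
    # when combined with other data, so Safe Harbor coarsens them.
    if "age" in clean and clean["age"] > 89:
        clean["age"] = 90  # Safe Harbor groups all ages over 89 into one bucket
    return clean

record = {
    "name": "Jane Doe", "ssn": "000-00-0000", "age": 47,
    "diagnosis": "type 2 diabetes",
}
print(deidentify(record))  # {'age': 47, 'diagnosis': 'type 2 diabetes'}
```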

In a more general sense, there are a host of obstacles to overcome and questions to answer before rolling out a generative AI program: “Developing and deploying gen AI comes with several challenges and risks that need to be evaluated and managed as part of any organization’s journey,” according to McKinsey. “These include, but are not limited to, fairness and bias of models; privacy, intellectual property infringement, and regulatory compliance concerns; the interpretability and usability of models; the need for human oversight and accountability to override incorrect recommendations; and performance inaccuracy stemming from misinformation or hallucinations, in which the gen AI model presents an incorrect response based on the highest-probability response.”

More questions than answers?

As with many such questions, the answer isn’t simple once you move beneath the surface.

If sizable numbers of people opt out and decline to share any medical data, the LLM could face serious challenges, including bias, inconsistent responses, and “hallucinations,” all stemming from a deficiency of training data.

Additional concerns include:

  • Removing your anonymized data from the LLM if you decide to revoke consent (see the sketch after this list)
  • Updating your de-identified information as your electronic health record changes
  • Implementing data security during the de-identification process
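The first of these items is the thorniest. Recording a revocation is easy to model, as in the hypothetical sketch below (all names are assumptions); actually removing a patient’s influence from an already-trained model may require retraining or machine-unlearning techniques.

```python
from datetime import datetime, timezone

# Hypothetical in-memory queue; a real system would use durable storage.
revocation_queue: list[dict] = []

def request_revocation(patient_pseudonym: str) -> None:
    """Record that a patient withdrew consent.

    Downstream jobs must then purge the pseudonym's records from the
    training corpus; the model itself may still reflect that data until
    it is retrained.
    """
    revocation_queue.append({
        "patient_pseudonym": patient_pseudonym,
        "requested_at": datetime.now(timezone.utc),
        "status": "pending_purge",
    })

request_revocation("pseudo-12345")
print(revocation_queue)
```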

“(T)he safety-critical nature of the domain necessitates thoughtful development of evaluation frameworks, enabling researchers to meaningfully measure progress and capture and mitigate potential harms,” according to an article published in the journal Nature. “This is especially important for LLMs, since these models may produce text generations…that are misaligned with clinical and societal values. They may, for instance, hallucinate convincing medical misinformation or incorporate biases that could exacerbate health disparities.”

Significant consideration must be given to the potential use of personal medical data, including:

  • Security
  • Anonymity
  • Usage
  • HIPAA compliance

The many potential benefits of generative AI could outweigh the widespread unease about training LLMs on personal medical record data. However, just as the benefits are real, so are the risks. Whatever your position on this topic, it requires additional research and thought before implementation on any scale.

Madan Moudgal
Head of Technology Solutions at Sagility