Healthcare’s Unstructured Data Problem

A patient shows up to a new specialist carrying three decades of history. It is scattered across fax pages, scanned PDFs, discharge summaries, and a few sheets of handwritten notes. The information needed to treat the patient safely exists. It just isn’t in a form any system can read. So, a staff member will retype parts of it into the electronic health record, one field at a time.

That scene repeats across hospitals and clinics every day. It is the visible edge of a deeper problem: most clinical information lives in documents, not databases. Pulling it out reliably has become one of healthcare’s quiet operational crises, and it explains why interest in intelligent document processing and specialized data extraction services has climbed so sharply. The technology promises to turn the document pile into usable data.

The stakes are concrete. Administrative work already consumes a large slice of every healthcare dollar. When the data that drives billing, care coordination, and reporting sits trapped in unstructured files, the cost surfaces as wasted clinician hours, delayed reimbursement, and decisions made without the full picture. Multiply one retyped record by a health system’s daily volume, and lost minutes turn into real money and real risk.

Where the Patient Record Actually Lives

Structured data, the coded diagnoses, lab values, and medication lists that fit neatly into fields, tells only part of the story. The richer clinical narrative sits in free text: the progress note, the radiology impression, the referral letter. A 2025 validation study in the Journal of Medical Internet Research found that free-text notes carry clinical concepts that structured fields miss entirely, confirming what clinicians have long sensed about where the real detail hides. Across a typical chart, the notes are where a clinician reconstructs what actually happened to a patient.

Context is the reason this matters. A coded diagnosis records that a patient has heart failure. The note records that the symptoms began after a missed medication, that the patient lives alone, and that a daughter manages the prescriptions. Treatment often depends on the second kind of information. Miss it, and a readmission-prevention plan can target the wrong problem. Capturing that context at scale is the hard part, and it is where most automation efforts stall, in part because the distinction between structured and unstructured data extraction is far more consequential in clinical settings than in most enterprise environments.

Why Manual Extraction Stopped Scaling

For years, the answer was people. Medical records teams, coders, and clinicians keyed information from one document into another. That held up when volumes were smaller. It buckles under the data growth healthcare now generates. Every new data source, every additional payer portal, every scanned attachment adds to the pile a person has to process by hand.

Interoperability shows the strain clearly. The Office of the National Coordinator reports that by 2023, about 70 percent of U.S. hospitals routinely engaged in all four core exchange activities: finding, sending, receiving, and integrating outside records. Integration is a weak link. A hospital can receive a summary of care electronically and still need a person to read it and re-enter the relevant parts, because the incoming document does not map cleanly into the receiving system.

The price of all that manual handling is steep. McKinsey estimates that broad adoption of AI and automation could produce net savings of 5 to 10 percent of U.S. healthcare spending, much of it by easing the administrative and documentation load on clinicians. A large share of that load is plain data handling: reading, transcribing, and reconciling information that software could extract instead.

The Workflows Where Extraction Pays First

Some processes feel the document burden more than others, and those are usually where extraction earns its keep sooner. Prior authorization is a prime example. A single request can mean pulling diagnoses, prior treatments, and clinical justification out of notes and faxes, then keying them into a payer portal. Referral intake is another. A specialist’s office receives a packet, and someone has to sort it, find the relevant history, and enter it before the patient is even scheduled. Claims and coding round out the list, since accurate reimbursement depends on details buried in operative reports and progress notes.

These workflows share a shape: high volume, repetitive reading, and a direct line to revenue or patient access. Automating the extraction step leaves the clinical judgment untouched. It removes the data-shuffling that surrounds the judgment, which is where most of the hours quietly disappear. A prior-authorization specialist may touch a dozen systems to assemble one request, and trimming the document lookup from that loop frees hours each week without changing a single clinical call.

What Intelligent Document Processing Actually Does

Intelligent document processing, usually shortened to IDP, is the technology stack built to turn messy documents into structured data. It works in layers, and the order matters:

Optical character recognition reads the text off scans, faxes, and images, including the rough ones like a third-generation photocopy.
Machine learning models classify each document: this is a referral, that is a lab report, and the next is an insurance card.
Natural language processing pulls out the meaningful pieces, such as a diagnosis, a dosage, or a date of service.
Validation rules and confidence scoring flag anything the system is unsure about before it reaches the record.
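
The four layers above can be sketched as a chain of small functions. This is a minimal illustration, not a real IDP product: the OCR, classification, and extraction steps are rule-based stand-ins for trained models, and every name, keyword rule, confidence score, and threshold here is an assumption made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # hypothetical score between 0.0 and 1.0


def ocr(raw_page: str) -> str:
    # Stand-in for an OCR engine: assumes the text is already available
    # and only normalizes whitespace.
    return " ".join(raw_page.split())


def classify(text: str) -> str:
    # Stand-in for a trained classifier: simple keyword rules.
    lowered = text.lower()
    if "referral" in lowered:
        return "referral"
    if "reference range" in lowered:
        return "lab_report"
    return "unknown"


def extract(text: str) -> list[Field]:
    # Stand-in for an NLP model: one pattern for a dosage, one for a date.
    # The confidence values are fixed here purely for illustration.
    fields = []
    dose = re.search(r"\b(\d+(?:\.\d+)?)\s*mg\b", text, flags=re.IGNORECASE)
    if dose:
        fields.append(Field("dosage", f"{dose.group(1)} mg", 0.95))
    date = re.search(r"\b(\d{4}-\d{2}-\d{2})\b", text)
    if date:
        fields.append(Field("date_of_service", date.group(1), 0.90))
    return fields


def validate(fields, threshold=0.92):
    # Split fields into those safe to write and those needing review.
    accepted = [f for f in fields if f.confidence >= threshold]
    flagged = [f for f in fields if f.confidence < threshold]
    return accepted, flagged


def process(raw_page: str):
    # The order matters: read, classify, extract, then validate.
    text = ocr(raw_page)
    doc_type = classify(text)
    accepted, flagged = validate(extract(text))
    return doc_type, accepted, flagged


doc_type, accepted, flagged = process(
    "Referral  for cardiology.\nMetoprolol 50 mg daily. Seen 2024-03-02."
)
```

With these made-up scores, the dosage clears the threshold and the date is held back for review. In a real system each stage would be a model or service, but the hand-off between stages would look much the same.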

Generative AI has pushed these capabilities further. IDP systems can now summarize a document in context and perform zero-shot extraction from document types they were never explicitly trained on, a shift documented in both academic literature and industry analysis.

The leap past plain OCR is the understanding step. Older systems could turn a fax into text. They could not tell a discharge date from an admission date, or recognize that “MI” means myocardial infarction in a cardiology note. IDP adds that judgment. That is what makes extraction from clinical documents workable rather than merely digitized. Clinical language is its own dialect, thick with abbreviations, drug names, and shorthand that general models stumble over. Extraction tuned for medicine handles that vocabulary, which is why a tool trained on invoices rarely transfers cleanly to a chart.
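
One way to picture the vocabulary problem is a lookup that depends on the specialty of the note, not just the abbreviation. The sketch below is a toy: the `EXPANSIONS` table and the `expand` function are hypothetical, and a production system would lean on a curated clinical vocabulary and a trained model rather than a hand-written dictionary.

```python
from typing import Optional

# Hypothetical lookup keyed by (abbreviation, specialty).
# "MS" is included to show one abbreviation with two readings.
EXPANSIONS = {
    ("MI", "cardiology"): "myocardial infarction",
    ("MS", "cardiology"): "mitral stenosis",
    ("MS", "neurology"): "multiple sclerosis",
}


def expand(abbreviation: str, specialty: str) -> Optional[str]:
    """Return the expansion for this specialty, or None if it is unknown."""
    return EXPANSIONS.get((abbreviation.upper(), specialty.lower()))


cardiology_reading = expand("MI", "cardiology")
unknown_reading = expand("MS", "oncology")  # no entry, so None
```

Returning `None` for an unknown pairing, instead of guessing, leaves room for the value to be sent to a person.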

Figure 1. The intelligent document processing pipeline, with a validation step that routes uncertain values to human review.

The Documents That Break Most Systems

Healthcare punishes generic extraction tools. The documents are varied, inconsistent, and often decades old. A few categories cause most of the trouble:

Handwritten notes and margin annotations, where legibility ranges from poor to heroic.
Faxed and re-scanned pages that lose resolution with every pass.
Forms that look standardized but vary by facility, payer, and state.
Tables and flowsheets, where a value’s meaning depends on its exact position in a grid.
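
The flowsheet case is worth a small illustration, because a bare number such as 88 means nothing until it is tied to its row and column. The sketch below assumes the grid has already been recovered as rows of cells, which is itself the hard part; the function name and the sample readings are invented for the example.

```python
def read_flowsheet(rows):
    """Map each non-empty cell to its (row label, column header) pair.

    `rows` is a list of lists. The first row holds the column headers,
    and the first cell of every later row holds that row's label.
    """
    header = rows[0][1:]
    readings = {}
    for row in rows[1:]:
        label, cells = row[0], row[1:]
        for column, cell in zip(header, cells):
            if cell != "":
                readings[(label, column)] = cell
    return readings


# Made-up flowsheet: times across the top, measurements down the side.
grid = [
    ["", "08:00", "12:00"],
    ["Heart rate", "72", "88"],
    ["Temp C", "36.8", ""],
]
readings = read_flowsheet(grid)
```

If OCR shifts a single cell one column to the right, every value after it attaches to the wrong time, which is why position errors in grids are so costly.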

This is where general-purpose tools and specialized data extraction services part ways. A provider that has processed millions of clinical documents has seen the edge cases: the smudged dosage, the ambiguous abbreviation, the intake form that changed last quarter. That accumulated exposure, more than any single algorithm, separates reliable extraction from a demo that only looks good on clean inputs. An invoice from a vendor follows a predictable shape. A 1990s cardiology consult does not.

The Question Most Coverage Skips: How Accurate Is It, and Who Checks?

Most writing on healthcare IDP celebrates speed and automation. Less of it asks the question that matters most in a clinical setting: how accurate is the extraction, and what happens when it is wrong? Extraction is probabilistic. A model returns its best guess with a confidence level, not a guarantee. In a marketing database, a wrong field is a nuisance. In a patient record, a misread dosage or a transposed lab value is a safety event.

Peer-reviewed work on reusing extracted clinical data is blunt about this. Regulatory-grade use depends on validation, traceability, and governance, not on extraction accuracy alone. The same logic applies to everyday care and billing. Accuracy at the document level is not the finish line. What counts is whether errors get caught before they touch a decision.

Picture a model reading a handwritten medication order. It sees a dose that could be 5 mg or 50 mg, and it picks one. If the system writes that value with no flag, the error now sits in the record, indistinguishable from a verified entry. A confidence threshold changes the outcome. The uncertain field stops, and a person confirms it against the source page. The gap between those two designs is the gap between automation that is safe in healthcare and automation that is not.
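
The 5 mg versus 50 mg case can be sketched in a few lines. This is an illustration under stated assumptions: the `Candidate` type, the `route` function, the confidence scores, and the 0.90 threshold are all hypothetical, and a real threshold would be set from measured error rates.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    value: str
    confidence: float  # hypothetical model score between 0.0 and 1.0


REVIEW_THRESHOLD = 0.90  # assumed value, for illustration only


def route(field_name, candidates, threshold=REVIEW_THRESHOLD):
    """Accept the top candidate only if it clears the threshold."""
    best = max(candidates, key=lambda c: c.confidence)
    if best.confidence >= threshold:
        return {"field": field_name, "value": best.value, "status": "accepted"}
    # Below the threshold nothing is written. The competing readings are
    # kept so a reviewer can check them against the source page.
    return {
        "field": field_name,
        "value": None,
        "status": "needs_review",
        "candidates": [c.value for c in candidates],
    }


decision = route("dose", [Candidate("5 mg", 0.55), Candidate("50 mg", 0.45)])
```

The point of the design is that the uncertain branch writes no value at all, so an unverified guess can never be mistaken for a confirmed entry.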

A sound extraction program builds the checks in from the start:

Confidence thresholds that route low-certainty fields to a human reviewer instead of writing them silently into the record.
Audit trails that record where each extracted value came from, so a clinician can trace a figure back to its source page.
Sampling and accuracy measurement against a known ground truth, repeated over time as documents and models drift.
Clear ownership of the exceptions queue. The hard 5 percent is where patient safety actually lives.
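
An audit trail of the kind described above amounts to storing provenance alongside every value. The sketch below shows one possible shape for such a record; the field names, the file name, and the model version string are assumptions made for the example, not a standard schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ExtractedValue:
    field_name: str
    value: str
    source_document: str  # hypothetical file identifier
    source_page: int
    model_version: str
    reviewed_by: str = ""  # empty until a person confirms the value
    extracted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def trace(values, field_name):
    """Return (document, page) for every value stored under this field."""
    return [
        (v.source_document, v.source_page)
        for v in values
        if v.field_name == field_name
    ]


values = [
    ExtractedValue("dose", "50 mg", "referral_packet.pdf", 3, "model-v1"),
    ExtractedValue("diagnosis", "heart failure", "referral_packet.pdf", 1, "model-v1"),
]
dose_sources = trace(values, "dose")
```

Making the record immutable (`frozen=True`) is a design choice for this sketch: a correction becomes a new record rather than an overwrite, so the history stays intact.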

Evaluating any extraction approach, whether built in-house or sourced from data extraction companies, should start with these controls. Providers that lead with accuracy rates and explain how they measure them are describing a discipline. The ones that lead with speed alone are describing a risk. Accuracy claims also deserve scrutiny of the inputs behind them. A 99 percent rate on clean, typed forms says little about performance on a faxed, handwritten note.
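
One way to scrutinize an accuracy claim is to break it down by input type instead of reporting a single blended figure. The sketch below assumes a reviewed sample where each extracted value has a known correct answer; the function name, the category labels, and the sample rows are invented for illustration and are not real results.

```python
from collections import defaultdict


def accuracy_by_input_type(samples):
    """Compute exact-match accuracy per input type.

    `samples` is an iterable of (input_type, extracted, ground_truth).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for input_type, extracted, truth in samples:
        total[input_type] += 1
        if extracted == truth:
            correct[input_type] += 1
    return {t: correct[t] / total[t] for t in total}


# Made-up sample for illustration only.
samples = [
    ("typed_form", "50 mg", "50 mg"),
    ("typed_form", "2024-03-02", "2024-03-02"),
    ("handwritten_fax", "5 mg", "50 mg"),
    ("handwritten_fax", "10 mg", "10 mg"),
]
rates = accuracy_by_input_type(samples)
```

In this toy sample the blended rate would be 75 percent, which hides the fact that every error came from the handwritten faxes.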

Build It, Buy It, or Outsource Data Extraction Services

Healthcare IT leaders face a familiar choice. Build extraction capability internally, buy a platform, or outsource data extraction services to a partner that specializes in clinical documents. Each path fits a different situation, and many organizations end up blending them. The right answer usually turns on document volume, the age of the records, and how much in-house data talent already exists.

Building in-house gives the most control and suits organizations with strong data science teams and steady document volume. Buying a platform moves faster and works when internal workflows are already mature. Outsourcing tends to make sense for backlogs, legacy record digitization, and seasonal spikes that would swamp internal staff. A common pattern pairs a platform for steady-state intake with a data extraction company for the messy historical archive no one has had time to touch.

The evaluation criteria stay constant across all three paths: accuracy and how it gets measured, handling of the hard document types, compliance with HIPAA and data-handling rules, and a clear human-in-the-loop process for low-confidence cases. Technology is necessary. The governance wrapped around it is what makes the technology trustworthy. Security deserves particular weight here, since these documents carry protected health information. Where the data gets processed, how long it is retained, and who can access it are not afterthoughts.

What Changes When Extraction Works

Reliable extraction pays off quietly. Coders spend less time hunting through PDFs and more on judgment work. Care teams open a new patient’s history and find it already structured and searchable. Claims move faster because the supporting data is captured cleanly the first time. McKinsey estimates that automating claims and service work could cut administrative costs for payers by 13 to 25 percent. Document extraction sits near the root of those gains. Cleaner inputs also mean fewer claims bounced back for missing documentation.

None of this is dramatic on any single record. Across a health system’s volume, it compounds into reclaimed clinician hours and faster, better-informed care. A coder who once spent a third of the day hunting for documents can spend it on the cases that need real attention. Healthcare generates an enormous and growing share of the world’s data. The organizations that turn that data into something usable will spend less time fighting their records and more time acting on them.

Healthcare’s data problem has less to do with how much information exists and more to do with how much of it stays locked in documents that no system can read. Intelligent document processing offers a credible way out, converting faxes, scans, and free-text notes into structured, usable data. The technology has matured to the point where reliable extraction from clinical documents is realistic, not aspirational. The work ahead is disciplined implementation: measuring accuracy, governing the exceptions, and keeping a human in the loop where the stakes run highest. Health systems that treat extraction as a governed capability, rather than a quick fix, will spend the coming years acting on their data instead of retyping it.

Peter Leo

Senior Consultant at Damco Solutions

Peter Leo is a Senior Consultant at Damco Solutions specializing in strategic partnerships and business growth. With deep expertise in forging high-impact collaborations, he helps organizations drive revenue, expand into new markets, and build lasting value. Known for a data-driven approach and strong relationship management skills, Peter delivers tailored strategies that align with business goals and unlock new opportunities.