2025 has been healthcare’s “all-in” moment for artificial intelligence (AI), with almost every new technology branded “AI-native” or some version of “agentic.” This is true for startups and for the largest companies in healthcare alike, with incumbents like Epic and athenahealth shipping their own AI solutions.
Agentic tools are incredibly promising and, in time, could automate everything from prior authorization to care coordination, tasks that add significant manual overhead and administrative cost to care delivery. Venture capital agrees and continues to pour money into healthcare AI startups at unprecedented valuations. The growth has been significant in both speed and scale, but it raises an important question: Is healthcare making the right trade-offs among safety, trust, and speed?
I’ll leave the question of whether valuations in the healthcare AI space are accurate or frothy to the financial professionals. But I will share my opinion that the messaging hype cycle has spun out of control. The potential is real, but there are many steps we must take before AI and agents can operate successfully in care delivery settings. Unrealistic expectations and the rapid pace of deployment risk building the future on unsteady foundations, compromising patient safety and clinician trust.
The Infrastructure Gap and the Bundling Imperative
At the core of the issue lies a radical disconnect between legacy systems and cutting-edge AI. Traditional Electronic Medical Records (EMRs) weren’t designed as platforms for coordination or integration (whether in the context of AI, customizing workflows, or interoperability of patient data) but for top-down enforcement of fixed standards for documentation and billing. Because of their rigid data structures and closed architectures, traditional EMRs struggle to accommodate the dynamic nature of modern AI tools.
The market is responding, and major EMR vendors are moving from third-party integrations to native AI tools, recognizing that bundling AI into existing clinical workflows is the fastest path to safe and effective deployment. The first principle here is that an agent is only as good as its tools and context. When AI operates within a system of record that provides the right tools and context, human decision-makers maintain visibility, and the probabilistic outputs of AI are available for evaluation and continuous improvement. The system of record also serves as the coordination hub across both human workflows and agents, which makes it possible to break down data silos and architectural blockers. Conversely, point solutions running in browsers are disconnected from core clinical systems, introducing combinatorial complexity, challenges in deploying evaluation frameworks, and ultimately significant new risks to patient safety.
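To make the “tools and context” point concrete, here is a minimal sketch of how a system of record might expose narrow, auditable capabilities to an agent. Everything here is illustrative: the registry pattern, get_problem_list, and draft_prior_auth are assumptions for the sake of the example, not any vendor’s actual API.

```python
from typing import Callable

# Toy registry: the system of record exposes narrow, auditable tools
# to the agent rather than letting it act through a browser.
TOOLS: dict[str, Callable[..., str]] = {}

def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register an EMR capability the agent may call. Because every
    call goes through the registry, it can be logged, evaluated,
    and reviewed by humans."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_problem_list(patient_id: str) -> str:
    # Stub: a real implementation would query the chart.
    return f"active problems for patient {patient_id}"

@tool
def draft_prior_auth(patient_id: str, cpt_code: str) -> str:
    # Drafted for human sign-off, never auto-submitted.
    return f"draft prior auth for {patient_id}, CPT {cpt_code}"
```

The decorator itself is incidental; the point is that the agent’s reach is bounded by capabilities the system of record deliberately exposes, each of which leaves an audit trail for evaluation and oversight.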
There is enormous promise in turning conversations and unstructured healthcare data into structured data that AI can access and act on. A few examples, like updating problem lists, drafting orders, recommending medication dosing, and automating coding, illustrate both the upside (more efficient care delivery) and the downside (real, life-threatening patient safety issues) we are about to confront. If AI merely creates a well-formatted note but never evaluates the output or connects it to workflows that drive downstream action, we will not realize its promise.
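As a rough illustration (not a production pipeline), the sketch below shows structured findings extracted from a visit transcript flowing into staged downstream workflows rather than stopping at a formatted note. The extraction step and the Chart shape are hypothetical stand-ins.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredFinding:
    kind: str     # "problem", "order", or "billing_code"
    code: str     # e.g., an ICD-10 or CPT code
    display: str

@dataclass
class Chart:
    """Stand-in for the system of record's write surface; every
    proposal is staged for clinician review, never auto-applied."""
    problem_updates: list[StructuredFinding] = field(default_factory=list)
    draft_orders: list[StructuredFinding] = field(default_factory=list)
    coding_suggestions: list[StructuredFinding] = field(default_factory=list)

def extract_findings(transcript: str) -> list[StructuredFinding]:
    """Hypothetical model call; returns canned output for illustration."""
    return [StructuredFinding("problem", "E11.9",
                              "Type 2 diabetes mellitus without complications")]

def process_visit(transcript: str, chart: Chart) -> None:
    # The formatted note is not the end state: each structured finding
    # connects to a concrete downstream workflow.
    for f in extract_findings(transcript):
        if f.kind == "problem":
            chart.problem_updates.append(f)
        elif f.kind == "order":
            chart.draft_orders.append(f)
        elif f.kind == "billing_code":
            chart.coding_suggestions.append(f)
```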
Three Pillars for Responsible Healthcare AI Deployment
Moving forward responsibly requires focusing on three key areas before rushing to deploy the next AI innovation:
1. Efficacy: Going Beyond the Demo
These days, building a prototype AI tool for a given healthcare workflow has become table stakes, but that is vastly different from building safe, proven tools ready for production use. Clinicians need to trust these tools, and that trust should be built through continuous evaluation and a continuous deployment process.
We need clear evaluation rubrics and benchmarking systems to gauge how well these tools work on an ongoing basis, tracking metrics such as:
- Accuracy of structured data input to the note
- Coding conformance and claim acceptance rates
- Clinician trust levels
- Reduction in after-hours charting time
Run as side-by-side comparisons on the same visits, these metrics can move us beyond impressive demonstrations to clinical reality; the sketch below shows how two of them might be computed.
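This is a minimal sketch with assumed data shapes (paired AI-drafted and clinician-finalized fields per visit), not any particular vendor’s telemetry.

```python
from dataclasses import dataclass

@dataclass
class VisitOutcome:
    ai_fields: dict[str, str]     # structured fields as the AI drafted them
    final_fields: dict[str, str]  # the same fields after clinician sign-off
    claim_accepted: bool          # whether the resulting claim cleared the payer

def field_accuracy(visits: list[VisitOutcome]) -> float:
    """Share of AI-drafted structured fields that survived clinician
    review unchanged, pooled across visits."""
    total = correct = 0
    for v in visits:
        for key, ai_value in v.ai_fields.items():
            total += 1
            correct += int(v.final_fields.get(key) == ai_value)
    return correct / total if total else 0.0

def claim_acceptance_rate(visits: list[VisitOutcome]) -> float:
    """Fraction of visits whose claims were accepted on first submission."""
    return sum(v.claim_accepted for v in visits) / len(visits) if visits else 0.0
```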
2. Safety: Guardrails for an Autonomous Future
As AI systems scale up and proliferate, the potential for errors grows with them. When multiple AI agents coordinate care, errors can cascade across systems, potentially affecting a patient’s treatment decisions.
As we roll out AI, we collect data that drives continuous improvement of the outputs, and there is significant leverage in engaging human expertise and judgment in that process. Both the output itself and the human judgment of it provide important signals that can take a system from 80-90% accuracy (nice in a lab, not safe in production) to “many 9s,” e.g., 99.99% accuracy, which must be the standard. Anything less, and a human clinician exercising judgment remains the safety protocol, one that has worked for thousands of years.
Robust safety frameworks must leverage human judgment alongside real-time monitoring for AI outputs, mechanisms for human oversight at critical moments, and clear guidelines for when to override AI. When AI operates within established workflows rather than as disconnected tools, maintaining safety guardrails becomes more manageable.
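One way to operationalize this routing (a sketch under assumed risk categories and thresholds, not a prescription):

```python
# Assumed high-risk action types that always require a human,
# regardless of model confidence.
ALWAYS_REVIEW = {"medication_dosing", "new_order"}

def route_action(action_type: str, confidence: float,
                 threshold: float = 0.9999) -> str:
    """Decide whether an AI-proposed action proceeds or goes to a
    clinician. Every outcome should be logged so that human
    judgments feed back into evaluation."""
    if action_type in ALWAYS_REVIEW:
        return "clinician_review"      # oversight at critical moments
    if confidence >= threshold:        # the "many 9s" bar
        return "proceed_and_monitor"   # still watched in real time
    return "clinician_review"          # default to human judgment
```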
3. Integration: The Data Model Dilemma
Data integration is the most notorious problem in healthcare. We can work around it at small scale or in specific point-solution use cases, but any organization operating at scale must have a coherent data architecture with a comprehensive view of the patient. The reality is that most EMRs are quite good at collecting and storing data, which provides essential context for AI, but they make it difficult to get that data out. The key for EMRs is not how to get the data; it is how to make the data available as tools and context that AI can leverage. Most EMRs simply cannot do that.
This goes beyond simple interoperability: we must rethink, from the ground up, the architecture of the EMR to enable automation. If we try to tack point solutions onto the existing legacy architecture, we will fail to realize the promise of AI, full stop. This is a problem unique to healthcare, because the EMR, as a system of record, is a construct all its own. Standardized APIs like FHIR resources and coherent data formats are essential to ensure AI actually supports clinical care without making it more complicated. The trend toward integrated EMR solutions shows that the industry is recognizing this need.
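As a concrete example, a problem-list update expressed as a standard FHIR R4 Condition resource can land in any conforming system as structured data. The endpoint URL and patient reference below are placeholders.

```python
import requests

FHIR_BASE = "https://emr.example.com/fhir"  # placeholder endpoint

# A minimal FHIR R4 Condition: a coherent, standardized shape that
# lets AI output enter the chart as structured, actionable data.
condition = {
    "resourceType": "Condition",
    "clinicalStatus": {
        "coding": [{
            "system": "http://terminology.hl7.org/CodeSystem/condition-clinical",
            "code": "active",
        }]
    },
    "code": {
        "coding": [{
            "system": "http://hl7.org/fhir/sid/icd-10-cm",
            "code": "E11.9",
            "display": "Type 2 diabetes mellitus without complications",
        }]
    },
    "subject": {"reference": "Patient/example"},  # placeholder patient
}

resp = requests.post(f"{FHIR_BASE}/Condition", json=condition)
resp.raise_for_status()  # a real integration would surface errors to a human
```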
Charting the Course
With agentic and ambient AI entering mass adoption, we should focus on how to roll out AI responsibly rather than how quickly we can deploy it. This means establishing clear, evidence-based validation frameworks, implementing strong governance standards, and building the technical infrastructure necessary to support safe, effective AI integration.
In the end, it’s not about choosing between innovation and caution; it’s about achieving sustainable transformation without resorting to risky shortcuts. In an industry where mistakes can cost lives, mastering this balance is nothing short of essential.

Adam Farren
Adam Farren is CEO of Canvas Medical.