Unlocking new opportunities to enhance provider operational efficiency with multimodal gen AI: Interview with AI expert Yury Khokhlov

Updated on January 8, 2025

Artificial intelligence is rapidly reshaping industries, and healthcare is no exception. Generative AI, once limited to text-based tasks, is now a dynamic multimodal force, processing and creating diverse data including audio, images, and video. This leap in capability offers healthcare providers unprecedented opportunities to revolutionize clinical operations. 

Leading technology firms are competing fiercely in this space. Google has released Gemini 2.0, designed for multimodal reasoning across text, images, audio, and video. OpenAI's work extends beyond ChatGPT to models such as CLIP, which links images with text, DALL-E, which generates images from text prompts, and Whisper, which transcribes speech, all hinting at capabilities relevant to healthcare. AWS, meanwhile, offers HealthScribe, a service that transcribes patient-clinician conversations and drafts clinical notes from them. These investments signal a growing recognition of multimodal AI's transformative power and suggest we are at the dawn of its widespread integration.

But how can healthcare providers best leverage these emerging multimodal trends? To shed light on this, we interviewed Yury Khokhlov, an expert in applying Artificial Intelligence (AI) and Generative AI who helps healthcare providers drive operational efficiency and innovation through emerging technology.

Yury, how can multimodal generative AI, beyond text-based applications, transform clinical operations in healthcare?

Large language models (LLMs) that focus on text inputs and outputs have already demonstrated substantial value. Providers use them for chatbots, RAG-powered (Retrieval-Augmented Generation) search, and text summarization. These solutions are undeniably beneficial. However, if we narrow our view to text alone, we miss a much broader opportunity.

The real power of multimodal generative AI emerges when text is integrated with other data types—such as audio, images, and even video. By combining these inputs, AI can accelerate and enhance various clinical operation domains. For instance, audio-based tools can automate documentation and triage, while image and video capabilities can streamline patient education or improve diagnostic accuracy. This multimodal synergy lets providers handle complex tasks more efficiently, ultimately driving better patient outcomes and operational performance.

You mentioned audio-based solutions. How are they shaping clinical operations right now? Can you also provide any examples from your work?

Audio processing is actually the second most common AI application in healthcare, right after text. This momentum is driven in part by advancements in speech-to-text technology and the development of new models that handle audio data with greater accuracy. Here are some key applications I’ve been working on in my projects:

Ambient listening

Ambient listening systems allow physicians to focus more on patients instead of documentation. They record patient-physician conversations, generate transcripts, summarize key clinical information, and automatically update the electronic health record (EHR). This can reduce a physician’s documentation time by up to two hours a day, leading to higher satisfaction for both patients and clinicians. We’re seeing rapid growth and investment here, as evidenced by Commure’s acquisition of Augmedix. Tech leaders like Google, Microsoft, AWS, Oracle, and Epic are also heavily investing, indicating ambient listening is moving into the mainstream.
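
The record, transcribe, summarize, and update steps described above can be pictured as a simple pipeline. The Python sketch below is illustrative only and is not drawn from any vendor's product: the function names, the DraftNote structure, and the keyword-based summarizer are hypothetical stand-ins. A real system would call a speech-to-text model and a clinically validated summarization model, and would write to the EHR through the vendor's own integration interface. The clinician review gate is likewise an assumption about how such a workflow might be safeguarded.

```python
from dataclasses import dataclass, field


@dataclass
class DraftNote:
    """Hypothetical draft note that waits for clinician review."""
    transcript: str
    symptoms: list[str] = field(default_factory=list)
    plan: list[str] = field(default_factory=list)
    reviewed: bool = False


def transcribe(audio_segments: list[str]) -> str:
    # Stand-in for a speech-to-text model: the "audio" here is already text.
    return " ".join(segment.strip() for segment in audio_segments)


def summarize(transcript: str) -> DraftNote:
    # Stand-in for an LLM summarizer: naive keyword matching per sentence.
    note = DraftNote(transcript=transcript)
    for sentence in transcript.split("."):
        text = sentence.strip()
        if not text:
            continue
        lowered = text.lower()
        if "pain" in lowered or "cough" in lowered:
            note.symptoms.append(text)
        if "prescribe" in lowered or "follow up" in lowered:
            note.plan.append(text)
    return note


def update_ehr(note: DraftNote, ehr: dict[str, list[DraftNote]], patient_id: str) -> bool:
    # Assumed safeguard: only clinician-approved notes reach the record.
    if not note.reviewed:
        return False
    ehr.setdefault(patient_id, []).append(note)
    return True


# Made-up conversation fragments standing in for recorded audio.
segments = [
    "Patient reports chest pain since Monday.",
    "No fever.",
    "I will prescribe a short course and follow up in two weeks.",
]
note = summarize(transcribe(segments))

ehr: dict[str, list[DraftNote]] = {}  # in-memory stand-in for an EHR
saved_before_review = update_ehr(note, ehr, "patient-001")
note.reviewed = True  # clinician signs off on the draft
saved_after_review = update_ehr(note, ehr, "patient-001")
```

In this sketch the draft is rejected until a clinician marks it as reviewed, which reflects the point that the physician stays in control of what enters the record while the system handles the typing.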

Emerging audio solutions

Beyond patient interactions, audio solutions are changing other key areas in clinical operations:

  • Nursing operations: Nurses spend a significant portion of their time on documentation tasks such as head-to-toe assessments, admission intakes, and vitals charting, which collectively account for the majority of their administrative workload. By leveraging AI-powered audio solutions, these tasks can be streamlined through automated transcription, real-time data entry, and intelligent summarization, allowing nurses to focus more on direct patient care while ensuring accurate and consistent documentation.
  • Patient transfers: Patient transfers are often complex and time-sensitive, involving the exchange of detailed medical data. An AI-powered solution can improve this by providing accurate transcriptions, structured summaries, and ready-to-use call outputs that can be imported into transfer systems. This significantly reduces manual work for transfer agents and ensures data quality, resulting in faster and more efficient transfers.
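
To make the idea of a structured, import-ready call output more concrete, here is a minimal Python sketch. It assumes the transfer call has already been transcribed, and it uses simple regular expressions where a production system would use a language model. The field names, the patterns, and the sample calls are all hypothetical and not taken from any real transfer system.

```python
import json
import re

# Hypothetical fields a transfer center might require before accepting a case.
REQUIRED_FIELDS = ("sending_facility", "age", "diagnosis", "requested_unit")

# Illustrative patterns only; real transcripts are far less regular.
PATTERNS = {
    "sending_facility": r"calling from ([A-Za-z ]+?)\.",
    "age": r"(\d+)[- ]year[- ]old",
    "diagnosis": r"diagnosis is ([a-z ]+?)\.",
    "requested_unit": r"requesting an? ([A-Za-z ]+?) bed",
}


def extract_transfer_summary(transcript: str) -> dict:
    """Pull known fields from a transcript and flag any that are missing."""
    summary = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, transcript, flags=re.IGNORECASE)
        summary[name] = match.group(1).strip() if match else None
    summary["missing_fields"] = [
        name for name in REQUIRED_FIELDS if summary[name] is None
    ]
    return summary


# Made-up sample calls for illustration.
complete_call = (
    "This is a nurse calling from Example Hospital. "
    "The patient is a 67-year-old male. "
    "Working diagnosis is acute stroke. "
    "We are requesting an ICU bed."
)
incomplete_call = "Calling from Example Hospital. Requesting a telemetry bed."

complete = extract_transfer_summary(complete_call)
incomplete = extract_transfer_summary(incomplete_call)
payload = json.dumps(complete)  # JSON text a transfer system could import
```

Flagging missing fields, rather than silently leaving them blank, is one way such a tool could support data quality: the transfer agent sees at a glance what still needs to be asked on the call.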

Using AI-powered audio solutions has many benefits beyond simply speeding up data capture. It reduces the stress and burnout from manual tasks, boosts overall efficiency, and supports compliance.

That’s compelling. What about integrating images and video? Where do you see the biggest opportunities there?

While audio solutions are gaining traction, the potential for image and video applications in healthcare is still largely untapped. New multimodal AI models can generate and interpret both static and dynamic visual content, opening a range of possibilities:

AI-generated patient education videos

Tailored, high-quality patient education is crucial for better health outcomes. Traditionally, creating such materials is costly and time-consuming. Generative AI changes this by producing personalized videos based on a patient’s age, language, and medical condition. This approach improves patient understanding of diagnoses and treatments and lightens the administrative burden on healthcare staff.

AI-powered video recognition solutions

AI-powered video recognition in inpatient settings can automatically interpret and document nursing actions, such as repositioning patients, changing sheets, or adjusting medications, by analyzing footage from cameras in patient rooms. This eliminates the need for manual checklists and free-text EHR entries, ensuring accurate, real-time records. The solution reduces administrative burden and improves efficiency by understanding and recording actions as they occur.

AI-enhanced diagnostic image visualization

Diagnosing conditions from X-rays or MRIs can be time-consuming and error-prone. Generative AI models can highlight subtle anomalies that the human eye might overlook, improving diagnostic accuracy and speeding up workflow by reducing the need for repeat scans. They can even generate 3D reconstructions to support surgical planning. This level of detail offers a direct path to operational gains, enhancing both clinical outcomes and resource allocation.

These applications sound very interesting. However, what are the major risks and considerations when adopting multimodal AI solutions?

Despite the clear advantages of multimodal AI, significant risks remain. Data privacy and security are paramount, given the sensitivity of patient information. Accuracy is critical, as even minor transcription or diagnostic errors can have severe consequences. Bias in training data can exacerbate healthcare disparities, and integrating these complex systems into existing workflows raises technical and operational challenges. Finally, patient and clinician trust must be cultivated through transparency, robust data governance, and ongoing validation. Addressing these issues is essential for safely harnessing the transformative potential of multimodal generative AI.

Finally, can you sum up the key takeaways for healthcare providers considering multimodal AI?

The combination of generative AI and multimodal data is reshaping healthcare operations. Yes, text-based LLMs—like those used for chatbots—still dominate many use cases. However, audio, image, and video capabilities are rapidly emerging frontiers, offering far more comprehensive support across clinical workflows.

Audio solutions can automate documentation and handovers, improving efficiency while reducing burnout. Meanwhile, images and videos can revolutionize patient education, staff training, and diagnostic accuracy.

Major risks—such as data security, algorithmic bias, and integration challenges—must be carefully managed with robust frameworks and oversight.

As these technologies evolve, the possibilities for innovation grow exponentially. We are moving closer to a future where AI is an integral part of delivering high-quality, personalized care for every patient. Those who invest thoughtfully in these solutions now will be well-positioned to lead in this new era of healthcare.

Conclusion

Multimodal generative AI stands at the forefront of healthcare innovation. Healthcare providers that strategically integrate this technology into their workflows will not only save time and costs but also improve clinical outcomes and patient engagement. As highlighted in the interview, responsible adoption, with an emphasis on data governance and trust, will be key to realizing the full promise of this technology. The era of multimodal AI has begun, and the healthcare organizations that embrace it will shape the future of patient care.

Expert profile: Yury Khokhlov is a leading expert in applying Artificial Intelligence (AI) and Generative AI to drive operational efficiency across diverse industries, and he stays at the forefront of developing innovative solutions.

He holds an MBA from INSEAD and has over six years of experience leading high-impact AI and Generative AI projects at top management consulting firms, serving clients across the US and EMEA regions. As part of his work, he is also focused on leveraging Generative AI to transform healthcare clinical operations, driving advancements in efficiency, patient outcomes, and operational excellence.

Daniel Casciato is a highly accomplished healthcare writer, publisher, and product reviewer with 20 years of experience in the industry. He is the proud owner and publisher of Healthcare Business Today, a leading source for the healthcare industry's latest news, trends, and analysis.

Daniel founded Healthcare Business Today in 2015 to provide healthcare professionals and enthusiasts with timely, well-researched content on the latest healthcare news, trends, and technologies. Since then, he has been at the forefront of healthcare writing, specializing in product reviews and featured stories.

His expertise in the healthcare industry is evident from the numerous publications he has written for, including Cleveland Clinic's Health Essentials, Health Union, EMS World, Pittsburgh Post-Gazette, Providence Journal, and The Tribune-Review. He has also written content for top-notch clients, such as The American Heart Association, Choice Hotels, Crohn's & Colitis Foundation of America, Culver's Restaurants, Google Earth, and Southwest Airlines.

Daniel's work has been instrumental in educating the public and healthcare professionals about the latest industry innovations. In addition, his dedication and passion for healthcare writing have earned him a reputation as a trusted and reliable source of information in the industry.

Through Healthcare Business Today, Daniel is committed to sharing his knowledge and expertise with the world, contributing to the growth and development of the healthcare industry.