Where Generative AI Is Working for Doctors—and Where It’s Falling Short


Artificial intelligence has been touted as a game changer in multiple industries. Health care is no exception.

“It’s certainly the rage in the last year,” Jeffrey Sturman, senior vice president and chief digital information officer at Memorial Healthcare System, told Newsweek. “In my world of networks, if AI doesn’t come up in every conversation, it’s kind of like, ‘Are you sleeping under a rock?’”

The National Bureau of Economic Research estimated that wider adoption of AI within the next five years could save the U.S. health care system up to $360 billion annually. In both hospitals and physician groups, those savings would largely come from improvements to clinical operations.

But health care isn’t the average testing ground for novel technologies. It faces more regulations, cybersecurity concerns and financial constraints than many other industries. More than three-quarters of health system executives label digital and AI transformation a high priority but don’t have the resources to deliver on it, according to a recent survey from McKinsey & Company.

New generative AI models promise to lower physicians’ cognitive load. (Photo: Getty Images)

Despite these challenges, Sturman—who leads a team of approximately 1,000 at the South Florida health system—describes his approach to AI as “pretty bullish.”

The Highs and Lows of Generative AI

AI is still in its infancy at Memorial but has shown promise in several areas, including clinical care, according to Sturman.

The health system has been using AI to assist radiologists in evaluating scans. One platform documents incidental findings, abnormalities that are visible on a scan but unrelated to the primary diagnosis, to ensure they can be followed up on. Previously, these findings might have been lost in the medical record since they weren’t of immediate clinical significance, even though they could cause problems down the road.

Because of this technology, “we can actually get in front of this disease, and honestly, we have saved life already,” Sturman said.

Another platform analyzes the scan and documents the results, so radiologists can simply confirm the automated note.

This type of generative AI is relatively new but is gaining traction in the industry, according to Dr. Thomas Maddox, a cardiologist at St. Louis-based Washington University School of Medicine and vice president of digital products and innovation at BJC HealthCare. Maddox also directs the organizations’ Healthcare Innovation Lab, which has been exploring generative AI since last fall.

Maddox recently deployed a pilot of ambient note-taking software. Thirty physicians tested the AI for 55 days, primarily in primary care appointments. After obtaining patients’ consent, they used a company-provided smartphone to record their conversations. A transcript was then sent to a generative pretrained transformer (a GPT, the same kind of model behind OpenAI’s ChatGPT), which summarized it into a clinical note. Physicians could edit or approve the draft before signing it into the medical record.
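The workflow itself is straightforward, and a rough sketch in Python shows its shape. Everything here is illustrative: the function names, the prompt wording and the stubbed-out model call are assumptions, not the vendor’s actual implementation.

```python
from dataclasses import dataclass

# Stand-in for the hosted GPT the pilot used; in practice this would be
# an API call to a large language model. The prompt text below is an
# illustrative assumption, not the vendor's actual prompt.
def call_llm(prompt: str) -> str:
    return "SUBJECTIVE: ...\nOBJECTIVE: ...\nASSESSMENT: ...\nPLAN: ..."

@dataclass
class DraftNote:
    text: str
    approved: bool = False

def summarize_transcript(transcript: str) -> DraftNote:
    """Turn the transcript of a consented visit recording into a draft note."""
    prompt = (
        "Summarize this doctor-patient conversation into a structured "
        "clinical note:\n\n" + transcript
    )
    return DraftNote(text=call_llm(prompt))

def sign_into_record(note: DraftNote, physician_edits: str | None = None) -> DraftNote:
    """The physician edits or approves the draft before it enters the chart."""
    if physician_edits is not None:
        note.text = physician_edits
    note.approved = True
    return note

# Recording -> transcript -> draft -> physician review -> medical record.
draft = summarize_transcript("Doctor: How are you feeling? Patient: Better, but...")
final_note = sign_into_record(draft)
```

The key design point, reflected in the source’s description, is that the model only ever produces a draft; a human signature is the last step before anything enters the record.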

Doctors’ reviews were mixed, Maddox said.

“Like any new technology, we found that some docs really liked it,” Maddox said. “Other docs said it wasn’t their thing, so they pretty quickly said, ‘We aren’t gonna use this,’ and they moved away from it.”

The AI captured the core of the conversation well, but it struggled to synthesize physical exam findings, Maddox said.

Dr. Matthew Gerling, an internal medicine physician at BJC, was involved in the pilot. He told Newsweek that doctors sometimes need to document specifics they wouldn’t normally say to the patient. Since the AI relies on speech, they must voice their full findings out loud for the software to capture them.

“If I listen to your heart, and I say, ‘Your heart sounds fine,’ it will still just write, ‘Your heart sounds fine,’” Gerling said. “[When using the AI], I do have to consciously be aware of this when I go into the room. I have to say, ‘I listened to your heart and heard no murmurs, rubs or gallops.’”

For some patients, this type of jargon might lead to more questions, concerns or anxieties, Gerling said. But 97 percent of patients who filled out a survey about their experiences praised the pilot, noting that their interactions with physicians felt more personal. Since doctors do not have to type while they talk, they can make better eye contact and engage with patients more conversationally.

The system has launched a pilot of a second note-taking AI, which supports more specialties beyond primary care. Gerling was the first to use it, on June 28, and told Newsweek it was “working great” on the first day.

Memorial also boosted patient and provider satisfaction with a test run of DAX Copilot, the ambient note-taking software from Nuance, a Microsoft company. Doctors used it to transcribe 5,259 clinical encounters between April and July and said it helped lighten their workload.

“Many times, what happens is they’re writing little notes to themselves and then going home at the end of the day, sometimes even the next day, to actually note the encounter and try to remember the conversation they had with the patient,” Sturman said. “If I can cut down on that pajama time—time at home noting an encounter—because I automated some of that function, I make the clinical quality and information sharing better and relevant. I create a better level of efficiency.”

Overall, 62 percent of clinicians who use DAX Copilot say they spend less time on the computer in the exam room, and 57 percent spend less time on clinical documentation, according to data from Microsoft and Nuance. Nearly three-quarters reported a lighter cognitive burden when using the AI.

LLMs Face Growing Pains

GPTs, like the ones being used to craft physician notes, fall under the umbrella of large language models, or LLMs. These models are trained on vast amounts of text from the internet to predict the next word in a sequence, which lets them generate responses that sound human. Ideally, they could help physicians tackle administrative work and make informed decisions about complex patients.
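To make “predict the next word” concrete, consider a toy version of the idea: count which word follows which in a body of text, then always guess the most common continuation. Real LLMs do this with neural networks over trillions of words, but the sketch below (a simple bigram counter over a made-up three-sentence corpus) captures the basic objective.

```python
from collections import Counter, defaultdict

# Toy stand-in for an LLM's training corpus; real models train on
# trillions of words scraped from the internet.
corpus = (
    "the patient reports chest pain . "
    "the patient reports shortness of breath . "
    "the doctor orders an ecg ."
).split()

# Count which word follows which (a bigram model), the simplest
# possible form of next-word prediction.
following: dict[str, Counter] = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word: str) -> str:
    """Return the word most often seen after `word` in the corpus."""
    return following[word].most_common(1)[0][0]

print(predict_next("patient"))  # -> "reports"
print(predict_next("the"))      # -> "patient" (seen twice vs. "doctor" once)
```

The same objective, scaled up enormously, is what lets a GPT continue a clinical transcript in fluent prose; it is also why the model has no built-in notion of whether a continuation is medically true.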

But LLMs aren’t quite where they need to be for clinical use, Dr. Saurabh Gombar told Newsweek. He is an adjunct professor at Stanford School of Medicine and the chief medical officer and co-founder of Atropos Health, a company working to fine-tune its own LLM co-pilot, ChatRWD.

Atropos Health, along with researchers from several medical schools, recently posed 50 questions from doctors to five LLMs. Independent physician reviewers analyzed the responses and found that general-purpose LLMs, including ChatGPT-4 and Gemini Pro 1.5, produced relevant, evidence-based answers less than 10 percent of the time.

LLMs designed specifically for health care performed better: ChatRWD provided evidence-based answers 58 percent of the time, and OpenEvidence did so 24 percent of the time. But that error rate is still uncomfortably high, Gombar said; most physicians he has spoken with remain rightfully skeptical.

One of the issues with general-purpose LLMs is that they have been trained on information that is not relevant to medicine, according to Gombar. If a physician asks ChatGPT about a popular weight loss drug, the model draws on everything it has ingested about that buzzword and might produce an answer based on social media claims rather than a legitimate study. As a result, these models frequently hallucinate citations: physician reviewers identified irrelevant citations in up to 80 percent of their responses.

Although OpenEvidence and ChatRWD did not hallucinate answers, they could not answer all of the physicians’ questions. The unanswered questions often centered on novel treatments or on populations underrepresented in clinical trials, such as transplant patients.

It makes sense that doctors have questions about these topics—there isn’t much literature about them. But that’s a problem for evidence-trained LLMs, too.
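The trade-off is deliberate: a model restricted to a vetted evidence base declines to answer rather than inventing a citation, which is why the specialized tools left some questions unanswered instead of hallucinating. Here is a minimal sketch of that design; the tiny corpus, the word-overlap matching rule and the threshold are all illustrative assumptions, not how OpenEvidence or ChatRWD actually work.

```python
# A tiny stand-in for a curated evidence base; real systems index
# published studies and, in Atropos Health's case, real-world data.
EVIDENCE = {
    "semaglutide weight loss adults": "Meta-analyses report roughly 10-15% body-weight reduction.",
    "statin therapy after heart attack": "Trial evidence supports fewer recurrent cardiovascular events.",
}

def answer_from_evidence(question: str) -> str:
    """Answer only when the evidence base has a relevant entry; otherwise
    decline rather than guess, so citations are never fabricated."""
    q_words = set(question.lower().split())
    best_key, best_overlap = None, 0
    for key in EVIDENCE:
        overlap = len(q_words & set(key.split()))
        if overlap > best_overlap:
            best_key, best_overlap = key, overlap
    if best_key is None or best_overlap < 2:  # crude relevance threshold
        return "No sufficient evidence found to answer this question."
    return EVIDENCE[best_key]

# A well-studied question gets an evidence-backed answer...
print(answer_from_evidence("expected weight loss with semaglutide in adults"))
# ...but an underrepresented population falls through to an honest refusal.
print(answer_from_evidence("semaglutide outcomes in transplant patients"))
```

Refusing is safer than fabricating, but it means the model’s coverage can never exceed the literature’s, which is exactly the gap the study observed.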

Still, the technology has made a substantial amount of progress in a short period of time, Gombar said.

“If you go back just 18 months, all of the LLMs were sort of like the new kids on the block,” he said. “They hadn’t been well-studied, and it wasn’t really clear where they could and could not be useful.”

When OpenEvidence and ChatRWD work together, they can answer about 70 percent of questions with enough reliability that they can be used to modify treatment decisions. That’s promising, Gombar said. In the next year or two, he thinks the models can reach 90 to 95 percent effectiveness.
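The arithmetic behind that combined figure is worth a glance. If the two models miss largely different questions, routing each query to one and falling back to the other covers more ground than either alone. The simulation below treats their failures as independent, an admitted simplification; the routing logic and function names are my own illustration, not a documented feature of either product.

```python
import random

random.seed(0)

# Per-model chance of an evidence-based answer, from the study above:
# ChatRWD 58 percent, OpenEvidence 24 percent.
COVERAGE = {"ChatRWD": 0.58, "OpenEvidence": 0.24}

def answers_well(model: str) -> bool:
    """Simulate whether a model produces an evidence-based answer."""
    return random.random() < COVERAGE[model]

def answer_with_fallback() -> bool:
    """Try ChatRWD first; fall back to OpenEvidence if it comes up empty.
    Assumes the models fail independently, a simplification: in reality
    their failures partly overlap on truly unanswerable questions."""
    return answers_well("ChatRWD") or answers_well("OpenEvidence")

trials = 100_000
hit_rate = sum(answer_with_fallback() for _ in range(trials)) / trials
print(f"Simulated combined coverage: {hit_rate:.0%}")
# Under independence: 1 - (1 - 0.58) * (1 - 0.24) = 0.68, close to the
# roughly 70 percent combined figure reported.
```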

“At that point, I would say the things we cannot answer are probably going to be truly unanswerable questions—either about something so new that there aren’t enough patients that exist that we can answer, or patients that are so specific with their comorbidities that they’re a one-of-a-kind example,” Gombar said.

Is the Hype Justified?

Overall, physicians and executives seem hopeful about generative AI. Sturman expects the technology’s time-saving capabilities to broaden access, which could, in turn, boost revenue.

“We’ve already seen huge gain and huge opportunity,” he said. “[The hype] is certainly warranted. I think it’s going to get better. I think we’re going to get smarter about how to leverage it.”

He’s still approaching new platforms with caution, and has connected with a venture capital firm to help identify the right tools for Memorial. There’s a lot of noise to cut through: “Everyone is saying that they have an AI solution,” he said.

Maddox agrees that there’s a lot of hype around AI right now. As test runs continue, he anticipates it will become another important tool for physicians, like a stethoscope or the electronic health record.

“I think the overall lesson here is that just getting a new technology and just doing exactly what you were doing before the technology was in place, you’re not going to get the full value of the technology,” Maddox said. “You need to think, ‘How does the tech work? How do I work?’ And then how do we adjust both to ensure that we get the maximum value of that combination, of human and technology?”