
From Benchmarks to the Real World: Why Healthcare AI Needs Real-World Data

September 1, 2025


As AI use in healthcare grows, it is being applied to everything, from diagnostic imaging and drug discovery to patient triage and administrative efficiency. These AI systems promise faster decision-making, reduced costs, and the potential to extend quality care to underserved populations. However, realizing these benefits depends not only on building capable models but also on ensuring that their evaluation reflects real-world conditions.

 

The real problem is that most models are not validated against the messy, unexpected questions that define real-world use. On polished, synthetic benchmark datasets, models often report 90%+ accuracy. But when deployed in real-world contexts like Nigeria, they stumble. This isn’t just a theoretical concern. We’ve experienced it firsthand at mDoc.

 

That gap became clear to us at mDoc when we deployed Kem, our 24/7 AI-powered health coach for people in Nigeria. The way people actually seek help is far from benchmark-perfect English — a challenge we anticipated through Kem’s inclusive design, which includes support for Nigerian Pidgin English¹. In practice, users often phrase health concerns in informal, conversational styles, describe symptoms in local idioms, or ask questions that don’t map neatly to clinical language. These patterns reveal the messy but authentic voice of health demand — exactly the kind of complexity that benchmark datasets tend to smooth over.

 

A photo of a woman chatting with Kem about malaria prevention during pregnancy.

 

The Problem With Polished Benchmarks

 

Imagine you’re testing an AI model with the question: “Is it okay for newborns to drink formula and breastmilk?” Most models handle this effortlessly. The question is grammatically correct, medically standard, and predictable.

 

Reality rarely matches these carefully curated questions, especially in a context like Nigeria, where ordinary people, particularly women carrying heavy socioeconomic burdens in their daily lives, are unlikely to frame their questions this way.

 

Here’s an actual question from our dataset:

 

“Please my sis baby of 3 months finds it very difficult to poo to the extent that his anus is red and sore. He was on exclusive b4 but the mum just start work so the baby is on formula Sma but once she is back from work she continues with breast feeding. Please what can she do? Look so worried. Thanks.”

 

This isn’t “benchmark English.” It’s real. It’s urgent. It’s full of abbreviations, typos, and emotion. And it reflects exactly how patients engage with Kem.

 

When we ran this question through multiple large language models (LLMs) as part of our evaluation, many stumbled. Some ignored the context switch between breastmilk and formula. Others misinterpreted the urgency. A few gave vague, almost dismissive answers.
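For readers curious what that looks like in practice, here is a minimal sketch of sending the same unedited member question to a couple of API-hosted models. The client setup, system prompt, and model identifiers are illustrative assumptions, not mDoc’s production configuration.

```python
# Minimal sketch: send one real-world member question to two hosted models.
# System prompt and model IDs are illustrative placeholders, not mDoc's
# production configuration.
import os

from anthropic import Anthropic
from openai import OpenAI

QUESTION = (
    "Please my sis baby of 3 months finds it very difficult to poo to the "
    "extent that his anus is red and sore. ... Please what can she do?"
)  # abbreviated here; the full message is quoted above
SYSTEM = "You are a careful, empathetic health coach. Flag anything urgent."


def ask_openai(model: str) -> str:
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": QUESTION},
        ],
    )
    return resp.choices[0].message.content


def ask_anthropic(model: str) -> str:
    client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    resp = client.messages.create(
        model=model,
        max_tokens=512,
        system=SYSTEM,
        messages=[{"role": "user", "content": QUESTION}],
    )
    return resp.content[0].text


if __name__ == "__main__":
    answers = {
        "gpt-4o": ask_openai("gpt-4o"),
        "claude-3-7-sonnet-latest": ask_anthropic("claude-3-7-sonnet-latest"),
    }
    for name, answer in answers.items():
        print(f"--- {name} ---\n{answer}\n")
```

Laying the raw answers side by side is what makes differences in context handling, urgency, and tone easy to spot.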

 

That’s why real-world datasets like Kem’s matter. Without them, AI in healthcare looks reliable on paper but fails when people need it most.

 

Our Evaluation: Testing Models the Hard Way

 

We compiled a dataset of 70 authentic medical questions asked by our members across frequently discussed areas, spanning maternal health, family planning, neonatal care, STIs, fever, and HIV/AIDS, as shown in Figure 1. Every question was paired with an expert-validated answer, so we could measure how close each model came to safe, actionable guidance.

Figure 1: Dataset breakdown by health topic.

 

We then evaluated leading models, including Claude, GPT models, DeepSeek, Gemini, Qwen, Meditron, Llama, and MedGemma.

 

Figure 2: Comparison of model performance across evaluation metrics (BLEU, ROUGE-L, F1 Score, and Semantic Similarity).
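For context, the metrics in Figure 2 are standard text-overlap and embedding-similarity scores computed against the expert-validated reference answers. The sketch below shows one way to compute them per question with off-the-shelf libraries; the embedding model, smoothing choice, and token-overlap F1 variant are assumptions, not necessarily our exact configuration.

```python
# Rough sketch of scoring one model answer against its expert-validated
# reference. Library and model choices are illustrative, not mDoc's exact
# evaluation pipeline.
from collections import Counter

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

_EMBEDDER = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
_ROUGE = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 (SQuAD-style) between prediction and reference."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def score(prediction: str, reference: str) -> dict:
    """Compute BLEU, ROUGE-L, token F1, and embedding cosine similarity."""
    bleu = sentence_bleu(
        [reference.split()], prediction.split(),
        smoothing_function=SmoothingFunction().method1,
    )
    rouge_l = _ROUGE.score(reference, prediction)["rougeL"].fmeasure
    sim = util.cos_sim(
        _EMBEDDER.encode(prediction), _EMBEDDER.encode(reference)
    ).item()
    return {
        "bleu": bleu,
        "rouge_l": rouge_l,
        "f1": token_f1(prediction, reference),
        "semantic_similarity": sim,
    }


if __name__ == "__main__":
    print(score("model answer text here", "expert-validated reference here"))
```

The per-model numbers reported below would then typically be averages of these per-question scores across the 70-question dataset.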

 

What We Found

 

The results were eye-opening:

 

  • Figure 2 illustrates that MedGemma-27B (fine-tuned on medical data) clearly outperformed the rest, with the highest clinical accuracy (F1 ≈ 0.46) and ROUGE-L ≈ 0.42. From a healthcare standpoint, it was the safest and most reliable — though it scored lower on semantic similarity (≈ 0.66), meaning its answers were clinically correct but not always phrased like a human healthcare worker.

  • Claude-3.7 and GPT-4.1 / GPT-4o formed the next tier, with F1 scores around 0.27–0.30 and semantic similarity above 0.72. These models handled messy, real-world phrasing better than most, but their answers often lacked the depth and precision needed for clinical decisions.

  • Gemini-2.0, DeepSeek-r1, Qwen-2.5, and Llama 4 Maverick scored in the middle, with F1 scores of 0.26–0.28. These models produced fluent, context-aware responses, but their clinical grounding was inconsistent.

  • GPT-5 and GPT-4.1 Mini were not as strong, with F1 scores around 0.22–0.25. Their answers often sounded convincing but missed critical clinical details.
     

But performance came with trade-offs

 

  • MedGemma-27B delivered the most clinically accurate responses, but it requires local hosting, which means higher compute costs, ongoing maintenance, and infrastructure investment. That is a significant potential barrier to using it for self-care health coaching in underserved communities where the majority earn less than $2 a day. It also tended to be less flexible in conversational tone — a reminder that fine-tuning improves domain accuracy but doesn’t always guarantee empathy.

  • Claude-3.7 and the GPT-4 series were more accessible (API-based) and better at handling messy, real-world phrasing, but their answers often lacked depth and clinical precision — underscoring the need for continuous clinician oversight.

  • Other general-purpose or smaller models (DeepSeek, Gemini, Llama, Qwen, GPT-5, GPT-4.1 Mini) were lighter-weight and more conversationally fluent, but they consistently fell short on clinical grounding. These models may be attractive for scalability, but their reliability in healthcare is limited.

It’s important to note that these results don’t tell the full story. Automated metrics like F1 and ROUGE can approximate correctness, but they can’t fully capture clinical safety, empathy, or usability. That’s why real-world clinician scoring — qualitative evaluation by well-trained practicing healthcare professionals — is essential to get the full picture.
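One lightweight way to structure that clinician review, sketched below with illustrative dimensions and a 1–5 scale rather than our actual rubric, is to have each reviewer grade every model answer on a few safety-focused criteria and then aggregate per model.

```python
# Minimal sketch of capturing and aggregating clinician ratings.
# The dimensions and 1-5 scale are illustrative assumptions, not mDoc's rubric.
from dataclasses import dataclass
from statistics import mean


@dataclass
class ClinicianRating:
    question_id: str
    model: str
    reviewer: str
    safety: int             # 1-5: could the answer cause harm if followed?
    clinical_accuracy: int  # 1-5: consistent with clinical guidance?
    empathy: int            # 1-5: tone appropriate for a worried caregiver?
    actionability: int      # 1-5: clear next steps, incl. when to seek care?


def summarize(ratings: list[ClinicianRating]) -> dict[str, dict[str, float]]:
    """Average each dimension per model across reviewers and questions."""
    by_model: dict[str, list[ClinicianRating]] = {}
    for r in ratings:
        by_model.setdefault(r.model, []).append(r)
    return {
        model: {
            "safety": mean(r.safety for r in rs),
            "clinical_accuracy": mean(r.clinical_accuracy for r in rs),
            "empathy": mean(r.empathy for r in rs),
            "actionability": mean(r.actionability for r in rs),
        }
        for model, rs in by_model.items()
    }
```

Disagreement between reviewers, especially on the safety dimension, is itself a useful signal for flagging ambiguous or risky answers for closer review.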

 

The Bigger Picture 

 

  • General-purpose LLMs are not enough. They shine on clean, academic-style prompts, but real-world healthcare requires more contextual finesse.  

 

  • Data quality and representativeness are fundamental. Collecting diverse, real-world healthcare data that reflects the messiness and cultural nuances of actual patient interactions is essential for building trustworthy AI models. 

 

  • A reliable evaluation framework is critical. Proper assessment methodologies are necessary to ensure models meet safety, reliability, and clinical relevance standards before deployment in healthcare settings.

 

  • Resource demands limit scalability, and cost drives model choice. While models like MedGemma demonstrate outstanding performance in healthcare and clinical domains, their reliance on local deployment makes them significantly more expensive to run. In contrast, token-based models scale more efficiently and cost-effectively, which means healthcare providers must carefully weigh performance gains against the operational costs of maintaining infrastructure.
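To make that cost trade-off concrete, here is a back-of-envelope comparison between per-token API pricing and self-hosting a fine-tuned model. Every number is a placeholder assumption for illustration only; actual GPU, token, and operations costs vary widely.

```python
# Back-of-envelope cost comparison: token-based API vs. self-hosting.
# All figures are placeholder assumptions for illustration, not real quotes.
def monthly_api_cost(conversations: int, tokens_per_conv: int,
                     usd_per_million_tokens: float) -> float:
    return conversations * tokens_per_conv * usd_per_million_tokens / 1_000_000


def monthly_hosting_cost(gpu_hourly_usd: float, gpus: int,
                         ops_overhead_usd: float) -> float:
    return gpu_hourly_usd * gpus * 24 * 30 + ops_overhead_usd


if __name__ == "__main__":
    # Hypothetical workload: 50k coaching conversations/month, ~2k tokens each.
    api = monthly_api_cost(50_000, 2_000, usd_per_million_tokens=5.0)
    hosted = monthly_hosting_cost(gpu_hourly_usd=2.5, gpus=2,
                                  ops_overhead_usd=1_000)
    print(f"Token-based API:        ${api:,.0f}/month")
    print(f"Self-hosted deployment: ${hosted:,.0f}/month")
```

Under these placeholder numbers the token-based option is far cheaper, which is exactly the trade-off providers in low-resource settings have to weigh against a self-hosted model’s accuracy advantage.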

 

Closing Thoughts

 

Healthcare doesn’t happen in perfect grammar or follow a clearly defined question set. It happens in unpredictable, emotional, deeply human moments. If AI can’t handle that, it’s not ready.

 

This evaluation isn’t just another benchmark. It’s a window into the deployment gap. Supporting real-world testing means safer care for mothers and infants, trustworthy digital health coaches, and care when patients need it most. 

 

But metrics alone don’t tell the full story. Automated scores like F1 or ROUGE hint at accuracy, but only qualitative scoring by real clinicians can capture safety, empathy, and usability. That’s why investment is critical in dataset collection, expert annotation, clinician-led validation, hosting, and long-term partnerships.

 

Bridging the gap between benchmark-ready and field-ready requires ongoing real-world evaluation, expert validation, and long-term partnerships.

 

References

 

1. Malumi O, Owusu D, Minaye H, et al. Meet Kem: Leveraging LLMs to Improve Health Access in Nigeria [version 1; not peer reviewed]. Gates Open Res 2025, 9:26. https://doi.org/10.21955/gatesopenres.1117197.1

 
