
Expert consensus outlines a standardized framework to evaluate clinical large language models

01.23.26 | Intelligent Medicine



A new expert consensus, made available online on 10 October 2025 and published on 1 November 2025 in Volume 5, Issue 4 of the journal Intelligent Medicine, sets out a structured framework for assessing large language models (LLMs) before they are introduced into clinical workflows. The guidance responds to the rapid uptake of artificial intelligence (AI) tools for diagnostic support, medical documentation, and patient communication, and to the corresponding need for consistent evaluation of safety, effectiveness, and fairness.

The consensus formalizes retrospective evaluation—testing fully trained models on real or simulated clinical data in specific care contexts, without further modifying the models—to verify performance, ethical compliance, and operational readiness prior to deployment.
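As a purely illustrative sketch of what such a retrospective evaluation could look like in practice, the snippet below runs a frozen model over pre-existing (real or simulated) cases and scores its outputs against reference answers, with no model updates. The function names, example cases, and exact-match scorer are hypothetical illustrations and are not taken from the consensus itself.

```python
# Hypothetical sketch of a retrospective evaluation loop: the model is
# treated as a frozen black box and is never updated during evaluation.

def retrospective_evaluation(model, cases, score_fn):
    """Query a fully trained model on historical or simulated clinical
    cases and score each response against its reference answer."""
    results = []
    for case in cases:
        response = model(case["prompt"])           # inference only, no training
        results.append(score_fn(response, case["reference"]))
    return sum(results) / len(results)             # mean task score


# Illustrative use with stand-in components (not from the consensus):
if __name__ == "__main__":
    simulated_cases = [
        {"prompt": "55-year-old with chest pain radiating to the left arm; next step?",
         "reference": "obtain an ECG"},
        {"prompt": "Summarize: patient admitted with community-acquired pneumonia ...",
         "reference": "admission summary mentioning pneumonia"},
    ]
    dummy_model = lambda prompt: "obtain an ECG"   # placeholder for an actual LLM call
    exact_match = lambda out, ref: float(ref.lower() in out.lower())
    print(f"Mean score: {retrospective_evaluation(dummy_model, simulated_cases, exact_match):.2f}")
```

In a real assessment, the simple string-matching scorer would be replaced by domain-appropriate metrics and expert review, applied per clinical scenario as the consensus envisions.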

Developed in line with World Health Organization guideline methods and registered on the Practice Guideline Registration for Transparency (PREPARE) platform (ID: PREPARE-2025CN503), the consensus draws on literature review, Delphi procedures, and multidisciplinary expert deliberation. In the final round, 35 experts achieved agreement on six recommendations.

What does the framework include?

Alongside the recommendations, the consensus defines six key LLM capability domains for assessment: medical knowledge question and answer; complex medical language understanding; diagnosis and treatment recommendation; medical documentation generation; multi-turn dialogue; and multimodal dialogue.
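As one illustrative way to organize test cases around these domains, the sketch below enumerates them in code; the enum names and the coverage helper are assumptions for illustration only and are not part of the published framework.

```python
from enum import Enum

# The six capability domains named in the consensus, expressed as an enum
# so that evaluation cases can be grouped by domain. The identifiers are
# illustrative choices, not terminology fixed by the consensus document.
class CapabilityDomain(Enum):
    MEDICAL_KNOWLEDGE_QA = "medical knowledge question and answer"
    COMPLEX_LANGUAGE_UNDERSTANDING = "complex medical language understanding"
    DIAGNOSIS_AND_TREATMENT = "diagnosis and treatment recommendation"
    DOCUMENTATION_GENERATION = "medical documentation generation"
    MULTI_TURN_DIALOGUE = "multi-turn dialogue"
    MULTIMODAL_DIALOGUE = "multimodal dialogue"


# Example helper: tally how many test cases in an evaluation set cover each domain.
def coverage_by_domain(cases):
    counts = {domain: 0 for domain in CapabilityDomain}
    for case in cases:
        counts[case["domain"]] += 1
    return counts
```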

The authors emphasize essential safeguards for patient data protection, bias mitigation, and clinically explainable AI outputs, positioning the consensus to support the development of safer, more reliable, and ethically governed LLM applications in healthcare systems worldwide.

***

Reference
DOI: 10.1016/j.imed.2025.09.001

About the journal
Intelligent Medicine is a peer-reviewed, open-access journal focusing on the integration of artificial intelligence, data science, and digital technology in clinical medicine and public health. It is published by the Chinese Medical Association in partnership with Elsevier. To learn more about Intelligent Medicine, please visit https://www.sciencedirect.com/journal/intelligent-medicine


Funding information
The authors received no financial support for this research.

Article Information

Journal: Intelligent Medicine
DOI: 10.1016/j.imed.2025.09.001
Method of Research: Literature review
Subject of Research: Not applicable
Article Title: 2025 Expert consensus on retrospective evaluation of large language model applications in clinical scenarios
Article Publication Date: 1-Nov-2025
COI Statement: All authors declare no conflicts of interest.


Contact Information

Editorial Office
Intelligent Medicine
im@cmaph.org


How to Cite This Article

APA:
Intelligent Medicine. (2026, January 23). Expert consensus outlines a standardized framework to evaluate clinical large language models. Brightsurf News. https://www.brightsurf.com/news/LDEM6EX8/expert-consensus-outlines-a-standardized-framework-to-evaluate-clinical-large-language-models.html
MLA:
"Expert consensus outlines a standardized framework to evaluate clinical large language models." Brightsurf News, Jan. 23 2026, https://www.brightsurf.com/news/LDEM6EX8/expert-consensus-outlines-a-standardized-framework-to-evaluate-clinical-large-language-models.html.