
Can a specialized AI model steer doctors toward the right scan?

03.17.26 | Intelligent Medicine



Medical imaging is essential to healthcare, but its overutilization wastes resources and can harm patients. Although guidelines for appropriate use exist, their adoption remains a challenge. Now, a new study in Intelligent Medicine finds that domain-specific adaptation may improve AI-assisted imaging recommendations, pointing to a new direction for value-based clinical decision support.

Every year, up to 30% of medical imaging studies ordered in the United States are considered unnecessary. This wastes resources, strains healthcare systems, and exposes patients to avoidable radiation risk. Despite the existence of evidence-based appropriateness guidelines, translating them consistently into day-to-day clinical decisions remains difficult. A new study published this February in the journal Intelligent Medicine suggests that large language models adapted to specific clinical domains may offer a meaningful path forward.

The research team, based at Beijing Friendship Hospital and collaborating institutions, developed a model called the Appropriate Medical Imaging Recommendations Generative Pre-trained Transformer (AMIR-GPT). Rather than relying on a general-purpose AI system, they asked whether targeted fine-tuning on structured radiology guidance could produce more accurate, guideline-aligned imaging recommendations for common clinical scenarios.

“Overutilization of medical imaging is not just a cost problem. It reflects a gap between the best available evidence and what happens in practice. Our goal was to explore whether a domain-specific AI model could help bridge that gap in a way that supports clinicians, not replaces them,” says Han Lyu, M.D., corresponding author of the study and associate professor at the Department of Radiology, Beijing Friendship Hospital, Capital Medical University.

Building and testing the model

To train AMIR-GPT, the researchers curated 1,036 question-and-answer pairs derived from 26 guidelines in the American College of Radiology Appropriateness Criteria (ACR AC), covering a broad range of common clinical indications, including low back pain, trauma, fractures, abdominal pain, cancer screening and staging, gastrointestinal bleeding, hearing-related complaints, and pediatric fever. Of the 1,036 entries, 932 were used for model training across four iterations, with the remaining 104 reserved for testing.
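The curation step described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' pipeline: the field names, the system prompt, and the shuffling scheme are assumptions, and only the 932/104 split is taken from the study.

```python
import random

# Hypothetical guideline-derived Q&A pair; the schema is illustrative,
# not the study's actual data format.
example_pair = {
    "question": "Subacute low back pain, surgical candidate after 6 weeks "
                "of conservative management: first-line imaging?",
    "answer": "MRI lumbar spine without IV contrast.",
}

def split_dataset(pairs, n_train=932, seed=42):
    """Shuffle the Q&A pairs and split them into training and held-out test sets."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def to_chat_format(pair):
    """Convert one Q&A pair into the message format commonly used for chat fine-tuning."""
    return {
        "messages": [
            {"role": "system", "content": "Recommend guideline-appropriate imaging."},
            {"role": "user", "content": pair["question"]},
            {"role": "assistant", "content": pair["answer"]},
        ]
    }
```

Applied to all 1,036 curated entries, `split_dataset` would yield the 932-example training set and the 104-example test set the study reports.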

AMIR-GPT was benchmarked against GPT-4, GPT-3.5, and Gemini on the same test questions. Responses were scored on a 1-to-5 scale for similarity to the standard answers, both through an automated assessment by GPT-3.5 and by two expert radiologists.
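The automated assessment is an "LLM-as-judge" setup: one model rates another model's answers against the guideline standard. A minimal sketch of such a scoring step follows; the prompt wording and the `Score: N` reply convention are assumptions for illustration, not the authors' actual protocol.

```python
import re

# Prompt template for the judge model; the exact wording is hypothetical.
JUDGE_PROMPT = (
    "Rate how closely the candidate imaging recommendation matches the "
    "standard guideline answer on a 1-5 scale (5 = identical in substance).\n"
    "Standard answer: {standard}\n"
    "Candidate answer: {candidate}\n"
    "Reply in the form 'Score: N'."
)

def build_judge_prompt(standard, candidate):
    """Fill the template with the reference answer and the model's response."""
    return JUDGE_PROMPT.format(standard=standard, candidate=candidate)

def parse_score(judge_reply):
    """Extract the 1-5 similarity score from the judge model's reply."""
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    if not match:
        raise ValueError(f"No score found in reply: {judge_reply!r}")
    return int(match.group(1))
```

Parsing the judge's free-text reply into a fixed 1-5 score is what makes the automated ratings directly comparable with the radiologists' manual scores.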

What the results show

In the most stringent performance category, perfect agreement with standard guideline answers (score 5 out of 5), AMIR-GPT achieved the highest proportion among all models evaluated, at 33.3% of test responses. This compares to 16.7% for GPT-4, 6.2% for GPT-3.5, and 6.2% for Gemini. The overall difference among models was statistically significant (ANOVA: F = 6.49, P = 0.0004). Pairwise testing confirmed a significant advantage for AMIR-GPT over GPT-3.5 (P = 0.018).
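The reported statistic comes from a one-way ANOVA, which asks whether the variance in scores between models is large relative to the variance within each model's own scores. A minimal pure-Python sketch of the F-statistic is below; the toy score lists are made up for illustration and are not the study's per-response data.

```python
def f_statistic(groups):
    """One-way ANOVA F = between-group mean square / within-group mean square."""
    k = len(groups)                         # number of groups (here, models)
    n = sum(len(g) for g in groups)         # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares, with k - 1 degrees of freedom
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group sum of squares, with n - k degrees of freedom
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy example: scores from two hypothetical models on three questions each.
toy_groups = [[1, 2, 3], [4, 5, 6]]
```

A large F (such as the study's 6.49) means the models' average scores differ by more than within-model noise would explain, which is what licenses the follow-up pairwise comparisons.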

However, the picture was more nuanced across other performance bands. When high matches (score 4 out of 5), medium matches (score 3 out of 5), and low matches (score below 3) were considered, the general-purpose models remained competitive with AMIR-GPT. This finding matters for interpreting the study's claims: in medical AI evaluation, model ranking depends on whether the benchmark emphasizes exact guideline adherence or partial alignment. In clinical practice, that distinction is not merely academic. A fluent answer is not the same as a clinically appropriate one.

Qualitative review reinforced this point. In one higher-scoring example, AMIR-GPT correctly identified magnetic resonance imaging (MRI) without intravenous contrast as the appropriate first-line imaging study for a surgical candidate with subacute low back pain after six weeks of conservative management. This is consistent with ACR guidance and clinically meaningful. However, lower-scoring outputs revealed familiar risks in medical AI: omissions and deviations from standard recommendations, and in one case, an incorrect characterization of computed tomography (CT) enterography that failed to account for the potential masking of upper gastrointestinal bleeding by oral contrast agents.

Promising direction, preliminary evidence

The study positions domain-specific fine-tuning as a potentially useful strategy for improving AI performance in specialized clinical tasks. But the authors are careful not to overstate the implications.

The dataset covered only a subset of published ACR criteria, limiting the model's exposure to rarer or more complex clinical scenarios. Outputs that are inaccurate, fabricated, or off-target remain a barrier to unsupervised clinical deployment.

“This is a step toward AI as a collaborative tool in medicine, but responsible integration requires broader datasets, stronger evaluation methods, and validation across diverse real-world settings before these systems can be trusted more widely,” says Dr. Lyu.

Future work will focus on expanding training data to cover a broader range of ACR guidelines and more complex cases, incorporating real-time error correction mechanisms, and exploring applicability in electronic health record analysis and broader clinical decision support.

Importance

The findings add to a growing body of evidence suggesting that high performance in healthcare AI may require more than scaling general-purpose models. Domain-specific adaptation, meaning disciplined alignment with the standards, evidence structures, and reasoning patterns of a particular medical field, may matter as much as model size.

About the authors

Dr. Han Lyu (吕晗) is an Associate Chief Physician and Associate Professor of Radiology at Beijing Friendship Hospital, Capital Medical University. He specializes in advanced neuroimaging, brain structural-functional networks, tinnitus mechanisms, cerebral perfusion, and AI-enhanced medical diagnostics, with notable contributions to brain aging and neurodegeneration research. He is a former visiting scholar at Stanford University. Email: chrislvhan@126.com

Prof. Wang Zhenchang (王振常) is a distinguished medical imaging expert and Academician of the Chinese Academy of Engineering. He is affiliated with the Department of Radiology, Beijing Friendship Hospital, Capital Medical University, and leads pioneering work in ultra-high-resolution CT (world’s first 50 μm bone-specific scanner) for auditory and visual systems, as well as AI integration in medical imaging and diagnostics. Email: cjr.wzhch@vip.163.com

About the journal
Intelligent Medicine is a peer-reviewed, open-access journal focusing on the integration of AI, data science, and digital technology in clinical medicine and public health. It is published by the Chinese Medical Association in partnership with Elsevier. To learn more about Intelligent Medicine, please visit https://www.sciencedirect.com/journal/intelligent-medicine

Funding information
This study was partially supported by the National Natural Science Foundation of China (62171297, 61931013). The funder had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.

Article Information

Journal: Intelligent Medicine
DOI: 10.1016/j.imed.2025.03.005
Method of research: Computational simulation/modeling
Subject of research: Not applicable
Article title: Specific fine-tuned GPT-enhanced medical imaging diagnosis recommendations
Publication date: 26-Feb-2026
Conflict of interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Contact Information

Editorial Office
Intelligent Medicine
im@cmaph.org

How to Cite This Article

APA:
Intelligent Medicine. (2026, March 17). Can a specialized AI model steer doctors toward the right scan? Brightsurf News. https://www.brightsurf.com/news/LKND2ZNL/can-a-specialized-ai-model-steer-doctors-toward-the-right-scan.html
MLA:
"Can a specialized AI model steer doctors toward the right scan?" Brightsurf News, 17 Mar. 2026, https://www.brightsurf.com/news/LKND2ZNL/can-a-specialized-ai-model-steer-doctors-toward-the-right-scan.html.