Background: Although large language models (LLMs) have shown promise on standardized tasks, their multidimensional performance in real-world oncology decision-making has not been systematically evaluated.
This study aims to introduce a framework for evaluating LLM and physician decisions in challenging lung cancer cases.
Methods: We curated 50 challenging lung cancer cases (25 local and 25 published) classified as complex, rare, or refractory. Blinded three-dimensional, five-point Likert evaluations (1–5 for comprehensiveness, specificity, and readability) compared standalone LLMs (DeepSeek R1, Claude 3.5, Gemini 1.5, and GPT-4o), physicians by experience level (junior, intermediate, and senior), and AI-assisted juniors; intergroup differences and augmentation effects were analyzed statistically.
Results: Of 50 challenging cases (18 complex, 17 rare, and 15 refractory) rated by three experts, DeepSeek R1 achieved scores of 3.95±0.33, 3.71±0.53, and 4.26±0.18 for comprehensiveness, specificity, and readability, respectively, positioning it between intermediate (3.68, 3.68, and 3.75) and senior (4.50, 4.64, and 4.53) physicians. GPT-4o and Claude 3.5 reached intermediate physician–level comprehensiveness (3.76±0.39 and 3.60±0.39) but only junior-to-intermediate physician–level specificity (3.39±0.39 and 3.39±0.49). All LLMs outscored intermediate physicians on rare cases but fell below junior physicians in refractory-case specificity. AI-assisted junior physicians showed marked gains in rare cases, with comprehensiveness rising from 2.32 to 4.29 (+84.8%), specificity from 2.24 to 4.26 (+90.8%), and readability from 2.76 to 4.59 (+66.0%), whereas specificity declined by 3.2% (from 3.17 to 3.07) in refractory cases. Error analysis revealed complementary strengths: physicians demonstrated greater reasoning stability, while LLMs excelled in knowledge updating and risk management.
Conclusions: LLM performance in clinical decision-making tasks varied by case type, performing better in rare cases and worse in refractory cases requiring longitudinal reasoning. Complementary strengths between LLMs and physicians support case- and task-tailored human–AI collaboration.
Intelligent Oncology
Observational study
Decision-making performance of large language models vs. human physicians in challenging lung cancer cases: A real-world case-based study
26-Jan-2026
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.