Large language models (LLMs) have emerged as transformative tools in healthcare, offering potential value in oncology for information retrieval, clinical decision support, and patient communication. However, the dynamic nature of oncological knowledge—including evolving treatment guidelines and diagnostic standards—raises questions about how LLMs’ performance holds up over time, especially as these models are relied on for increasingly nuanced clinical tasks.
This study, conducted in adherence to PRISMA guidelines, systematically collected relevant literature through 2025 from PubMed, Google Scholar, and Web of Science databases. The research focused on three prominent LLMs: ChatGPT-3.5, ChatGPT-4, and Gemini. Researchers analyzed 614 oncology questions spanning common malignancies (e.g., lung, breast, colorectal cancer) and rare tumors (e.g., glioma, multiple myeloma), using both original study scoring criteria and a standardized five-point Likert scale to assess response accuracy.
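To make the scoring approach concrete, here is a minimal illustrative sketch (not the study's actual analysis code) of how five-point Likert accuracy ratings might be aggregated per model; the ratings shown are hypothetical, and only the model names come from the study.

```python
# Illustrative sketch only: pooling hypothetical five-point Likert
# accuracy ratings (1 = inaccurate, 5 = fully accurate) per model,
# the kind of summary statistic a meta-analysis standardizes across studies.
from statistics import mean

# Hypothetical ratings for a handful of oncology questions;
# model names follow those compared in the study.
ratings = {
    "ChatGPT-3.5": [3, 4, 2, 3, 4],
    "ChatGPT-4":   [4, 5, 4, 4, 5],
    "Gemini":      [3, 4, 3, 4, 3],
}

# Mean Likert score per model, rounded for reporting.
summary = {model: round(mean(scores), 2) for model, scores in ratings.items()}
print(summary)
```

In practice a meta-analysis would also weight each study's scores by sample size and heterogeneity; this sketch shows only the basic per-model averaging step.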
Key findings reveal clearly divergent temporal trends across the models.
Subjective questions—those requiring complex analysis, integration of clinical context, and nuanced judgment—were far more susceptible to temporal performance degradation than objective, fact-based queries. This disparity highlights the unique challenges LLMs face in applying evolving clinical knowledge to real-world oncology scenarios, where flexibility and alignment with the latest standards are critical.
The study’s results provide vital guidance for the responsible deployment of LLMs in oncology. As healthcare systems increasingly adopt these AI tools to support patient care and clinical decision-making, ongoing performance monitoring, standardized evaluation protocols, and strategies to integrate up-to-date clinical data will be essential to ensure safety and reliability.
Journal: Journal of Translational Medicine
Article type: Meta-analysis
Subject of research: People
Article title: Temporal Evolution of Large Language Models (LLMs) in Oncology
Publication date: 4-Nov-2025
COI statement: The authors declare that they have no competing interests.