PediaBench: a comprehensive Chinese pediatric dataset for benchmarking large language models

04.02.26 | Higher Education Press


Existing datasets for medical QA cannot comprehensively assess the proficiency of LLMs in pediatrics. To address this gap, a research team led by Hui LI and Yanhao WANG published new research on benchmarking LLMs for pediatric QA in Frontiers of Computer Science, a journal co-published by Higher Education Press and Springer Nature.

The team introduced PediaBench, the first Chinese pediatric dataset encompassing 5 question types and 12 disease groups, and devised an integrated scoring scheme to assess each LLM's proficiency across all question types in a unified manner. Finally, they validated the effectiveness of PediaBench with extensive experiments on 20 open-source and commercial LLMs.

In the research, they first describe the construction process of the PediaBench dataset. The questions are collected from various public sources, including the Chinese national medical licensing examination, final exams of medical universities, pediatric disease diagnosis and treatment standards, and clinical guidelines. The questions are classified into five types: true-or-false (ToF), multiple choice (MC), pairing (PA), essay-type short answer (ES), and case analysis (CA). The team uses GLM to classify the questions into disease groups according to the International Classification of Diseases (ICD-11) standard issued by the WHO.

They then devise an integrated scoring criterion to evaluate the performance of each LLM. For ToF and MC questions, accuracy is the basic measure, and each question is assigned a weight based on its difficulty level. For PA questions, each question carries an equal weight of 3, and a partially correct result earns a score of 1. For ES and CA questions, GPT-4o scores each LLM's answers. Finally, a fixed proportion is assigned to each question type, and the integrated score is calculated from the per-type scores.
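The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the press release does not give the difficulty weights, the per-type proportions, or the scale used by the GPT-4o judge, so `DIFFICULTY_WEIGHT`, `TYPE_PROPORTION`, and `max_score` below are hypothetical placeholders. Only the pairing weight of 3 and the partial score of 1 come from the text.

```python
# Hypothetical values: the source does not state the difficulty weights
# or the proportion assigned to each question type.
DIFFICULTY_WEIGHT = {"easy": 1.0, "medium": 2.0, "hard": 3.0}
TYPE_PROPORTION = {"ToF": 0.15, "MC": 0.25, "PA": 0.15, "ES": 0.20, "CA": 0.25}

PA_FULL = 3     # from the text: each pairing question carries weight 3
PA_PARTIAL = 1  # from the text: a partially correct pairing earns 1


def weighted_accuracy(results):
    """Difficulty-weighted accuracy for ToF or MC questions.

    results: list of (is_correct, difficulty) tuples.
    """
    total = sum(DIFFICULTY_WEIGHT[d] for _, d in results)
    earned = sum(DIFFICULTY_WEIGHT[d] for ok, d in results if ok)
    return earned / total if total else 0.0


def pairing_score(outcomes):
    """Normalized score for PA questions.

    outcomes: list of 'full', 'partial' or 'wrong', one per question.
    """
    points = {"full": PA_FULL, "partial": PA_PARTIAL, "wrong": 0}
    total = PA_FULL * len(outcomes)
    return sum(points[o] for o in outcomes) / total if total else 0.0


def judged_score(scores, max_score=10.0):
    """Normalized score for ES or CA answers marked by an LLM judge.

    max_score is an assumed judge scale, not taken from the source.
    """
    return sum(scores) / (max_score * len(scores)) if scores else 0.0


def integrated_score(per_type):
    """Combine per-type scores in [0, 1] into a single score out of 100."""
    return 100.0 * sum(TYPE_PROPORTION[t] * s for t, s in per_type.items())
```

With these placeholder proportions, a model that scores perfectly on every question type reaches 100, and the passing threshold of 60 mentioned in the results applies to the value returned by `integrated_score`.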

The experimental results show that only a few LLMs achieve a passing score of at least 60. Given the high requirement for factuality in medical applications, a significant gap remains before LLMs can be deployed as assistants for pediatricians.

Journal: Frontiers of Computer Science

DOI: 10.1007/s11704-025-41345-w

Method of Research: Experimental study

Subject of Research: Not applicable

Article Title: PediaBench: a comprehensive Chinese pediatric dataset for benchmarking large language models

Article Publication Date: 15-Mar-2026

Article Information

Contact Information

Rong Xie
Higher Education Press
xierong@hep.com.cn

How to Cite This Article

APA:
Higher Education Press. (2026, April 2). PediaBench: a comprehensive Chinese pediatric dataset for benchmarking large language models. Brightsurf News. https://www.brightsurf.com/news/LQ40D568/pediabench-a-comprehensive-chinese-pediatric-dataset-for-benchmarking-large-language-models.html
MLA:
"PediaBench: a comprehensive Chinese pediatric dataset for benchmarking large language models." Brightsurf News, Apr. 2 2026, https://www.brightsurf.com/news/LQ40D568/pediabench-a-comprehensive-chinese-pediatric-dataset-for-benchmarking-large-language-models.html.