ITHACA, N.Y. – Large language models (LLMs) show promise as tools for exploring the vast scientific literature, but are they trustworthy when it comes to providing full and scientifically accurate answers to complex questions in specialized fields?
To find out, Cornell University physicists and Google researchers engaged a panel of 12 human experts to test the ability of six LLM systems – ChatGPT, Claude and others – to understand scientific literature at the level of a specialist, using the field of high-temperature cuprates, a class of superconducting materials, as an example. Some systems performed better than others, they found. The study also revealed gaps in current LLM capabilities and yielded a wish list of improvements for AI developers to build into future models.
“This study is about testing out LLMs’ ability to read the literature the way an expert would read,” said Eun-Ah Kim, professor of physics and corresponding author of the study. “This paper is important now because everyone is very curious about what LLMs can and cannot do, especially in the context of artificial general intelligence (AGI). There are critical gaps in what LLMs can do right now, which is clearly showing that they are not at AGI.”
“Expert Evaluation of LLM World Models: A High-Tc Superconductivity Case Study” was published in the Proceedings of the National Academy of Sciences. The lead author is Haoyu Guo, a Bethe/KIC postdoctoral fellow with Cornell’s Laboratory of Atomic and Solid State Physics.
The researchers created a database of 1,726 scientific papers, curated by human experts, covering the history of the field of high-temperature cuprates, as well as a set of 67 questions, written by a larger group of experts, that probe deep understanding of the literature.
With these assets, they examined four LLMs – ChatGPT-4, Claude 3.5, Perplexity and Gemini Advanced Pro 1.5 – as well as NotebookLM, a Google product that answers a user’s questions based on provided documents. They also added to the mix a custom retrieval-augmented generation (RAG) system capable of retrieving relevant images as well as text from the curated documents.
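The study's actual RAG system is not described in detail here, but the retrieval step it relies on can be illustrated with a minimal sketch. The code below is hypothetical and uses simple bag-of-words cosine similarity over a toy corpus; a real system would use learned embeddings and also index figures, as the researchers' custom system did.

```python
# Hypothetical sketch of RAG-style retrieval: rank curated passages
# by similarity to a question, then (in a full system) feed the top
# hits to an LLM as grounding context.
import math
from collections import Counter

def tokenize(text):
    """Lowercase and strip trailing punctuation (crude, for illustration)."""
    return [w.lower().strip(".,?") for w in text.split()]

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    """Return the k passages most similar to the query."""
    qv = Counter(tokenize(query))
    ranked = sorted(corpus, key=lambda d: cosine(qv, Counter(tokenize(d))),
                    reverse=True)
    return ranked[:k]

# Toy stand-in for the curated document database.
corpus = [
    "The pseudogap phase appears in underdoped cuprates below T-star.",
    "Superconductivity in cuprates emerges upon doping a Mott insulator.",
    "Graphene exhibits a linear Dirac dispersion near the K points.",
]
top = retrieve("Does doping a Mott insulator produce superconductivity?",
               corpus, k=1)
```

Grounding answers in retrieved passages, rather than in whatever the model absorbed during training, is what lets such a system cite the specific curated papers it drew on.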
The systems that drew on curated information – NotebookLM and the custom RAG system – performed best.
“LLMs operating on trusted data sources – papers we collected ourselves, not from the LLM searching the Internet – tend to perform better,” Guo said. “Among these, NotebookLM performs better when I have a set of papers that I want to understand better.”
All the LLMs were surprisingly good at pulling out text-based information, Kim said, but “totally incapable” of engaging with data visualizations.
On the wish list for AI developers, Guo said, are more accurate attribution for LLMs’ claims (they sometimes fabricate references); a better ability to synthesize the many facets of a problem and reflect its complexity; and improved comprehension of plots and figures.
This is the first study out of the Cornell-led National Science Foundation AI-Materials Institute, which Kim directs.
For additional information, see this Cornell Chronicle story.
Cornell University has dedicated television and audio studios available for media interviews.
-30-