Keeping up with the latest research is vital for scientists, but given that millions of scientific papers are published every year, that can prove difficult. Artificial intelligence systems show promise for quickly synthesizing seas of information, but they still tend to make things up, or “hallucinate.”
For instance, when a team led by researchers at the University of Washington and the Allen Institute for AI, or Ai2, studied a recent OpenAI model, GPT-4o, they found it fabricated 78% to 90% of its research citations. And general-purpose AI models like ChatGPT often can’t access papers published after their training data was collected.
So the UW and Ai2 research team built OpenScholar, an open-source AI model designed specifically to synthesize current scientific research. The team also created the first large, multi-domain benchmark for evaluating how well models can synthesize and cite scientific research. In tests, OpenScholar cited sources as accurately as human experts, and a panel of 16 scientists preferred its responses to ones written by subject experts 51% of the time.
The team published its findings Feb. 4 in Nature. The project’s code, data and a demo are publicly available and free to use.
“After we started this work, we put the demo online, and quickly we got a lot of queries, far more than we’d expected,” said senior author Hannaneh Hajishirzi, a UW associate professor in the Paul G. Allen School of Computer Science & Engineering and a senior director at Ai2. “When we started looking through the responses, we realized our colleagues and other scientists were actively using OpenScholar. It really speaks to the need for this sort of open-source, transparent system that can synthesize research.”
The researchers trained the model, then compiled a set of 45 million scientific papers for OpenScholar to pull from, grounding its answers in established research. They coupled this with a technique called “retrieval-augmented generation,” which lets the model search for new sources, incorporate them and cite them even after it has been trained.
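In broad strokes, retrieval-augmented generation works like this: given a question, the system first retrieves the most relevant papers from its datastore, then asks the language model to answer using, and citing, only those papers. The Python sketch below illustrates the pattern with a toy two-paper corpus, a crude keyword-overlap scorer and a made-up prompt format; it is a minimal illustration of the general technique, not OpenScholar’s actual code, which uses a trained retriever over the 45-million-paper datastore.

```python
# Minimal sketch of the retrieval-augmented generation (RAG) pattern --
# NOT OpenScholar's pipeline. The toy corpus, the keyword-overlap scorer
# and the prompt format are illustrative assumptions only.

# Toy stand-in for a paper datastore.
PAPERS = [
    {"id": "paper-1", "title": "Dense retrieval for open-domain QA",
     "abstract": "We study dense passage retrieval for question answering."},
    {"id": "paper-2", "title": "Citation accuracy in language models",
     "abstract": "Language models often fabricate citations when answering."},
]

def score(query: str, paper: dict) -> int:
    """Crude relevance score: count of shared lowercase words."""
    q_words = set(query.lower().split())
    p_words = set((paper["title"] + " " + paper["abstract"]).lower().split())
    return len(q_words & p_words)

def retrieve(query: str, k: int = 2) -> list[dict]:
    """Return the k papers most relevant to the query."""
    return sorted(PAPERS, key=lambda p: score(query, p), reverse=True)[:k]

def build_prompt(query: str, papers: list[dict]) -> str:
    """Ground the answer in retrieved papers and require citations."""
    context = "\n".join(f"[{p['id']}] {p['title']}: {p['abstract']}"
                        for p in papers)
    return (f"Answer the question using ONLY the papers below, citing them "
            f"by id.\n\n{context}\n\nQuestion: {query}\nAnswer:")

query = "Do language models fabricate citations?"
prompt = build_prompt(query, retrieve(query))
print(prompt)  # this grounded prompt would then be sent to a language model
```

Because retrieval happens at query time, swapping in newer papers updates what the model can cite without retraining it, which is how such a system can stay current.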
“Early on, we experimented with using an AI model with Google’s search data, but we found it wasn’t very good on its own,” said lead author Akari Asai, a research scientist at Ai2 who completed this research as a UW doctoral student in the Allen School. “It might cite research papers that weren’t the most relevant, or cite just one paper, or pull from a blog post at random. We realized we needed to ground this in scientific papers. We then made the system flexible so that it could incorporate emerging research through search results.”
To test their system, the team created ScholarQABench, a benchmark for evaluating how well systems handle scientific search. They gathered 3,000 queries and 250 long-form answers written by experts in computer science, physics, biomedicine and neuroscience.
“AI is getting better and better at real-world tasks,” Hajishirzi said. “But the big question ultimately is whether we can trust that its answers are correct.”
The team compared OpenScholar against other state-of-the-art AI models, such as OpenAI’s GPT-4o and two models from Meta. ScholarQABench automatically evaluated the models’ answers on metrics such as accuracy, writing quality and relevance.
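How might a benchmark grade answers automatically? One simple check, sketched below in Python, is citation precision: the fraction of citations in an answer that point at papers that actually exist in the datastore. The bracketed-citation format, the KNOWN_IDS set and the function itself are assumptions for illustration, not ScholarQABench’s implementation, which scores richer qualities such as writing quality and relevance.

```python
# Illustrative sketch of ONE automatic check a benchmark could run --
# citation precision. The bracketed [paper-id] format and KNOWN_IDS are
# assumptions for this example, not ScholarQABench's actual method.
import re

KNOWN_IDS = {"paper-1", "paper-2"}  # ids present in the paper datastore

def citation_precision(answer: str) -> float:
    """Fraction of [paper-id] citations in the answer that actually exist."""
    cited = re.findall(r"\[([\w-]+)\]", answer)
    if not cited:
        return 0.0
    return sum(1 for c in cited if c in KNOWN_IDS) / len(cited)

answer = "Models often fabricate citations [paper-2]; see also [paper-9]."
print(citation_precision(answer))  # 0.5: one of the two citations checks out
```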
OpenScholar outperformed every system it was tested against. The team also had 16 scientists review answers from the models and compare them with human-written responses. The scientists preferred OpenScholar’s answers to the human-written ones 51% of the time. When the team paired OpenScholar’s citation methods and retrieval pipeline with GPT-4o, a much larger model, the scientists preferred the AI-written answers to human answers 70% of the time; they picked answers from GPT-4o on its own only 32% of the time.
“Scientists see so many papers coming out every day that it’s impossible to keep up,” Asai said. “But existing AI systems weren’t designed for scientists’ specific needs. We’ve already seen a lot of scientists using OpenScholar, and because it’s open-source, others are building on this research and already improving on our results. We’re working on a follow-up model, DR Tulu, which builds on OpenScholar’s findings and performs multi-step search and information gathering to produce more comprehensive responses.”
Other co-authors include Jacqueline He, Rulin Shao and Weijia Shi, all UW doctoral students in the Allen School; Dan Weld, a UW professor emeritus in the Allen School and general manager and chief scientist at Ai2; Varsha Kishore, a postdoctoral researcher in the UW Allen School and at Ai2; Luke Zettlemoyer, a UW professor in the Allen School; Pang Wei Koh, a UW assistant professor in the Allen School; Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D’Arcy, David Wadden, Matt Latzke, Jenna Sparks and Jena D. Hwang of Ai2; Wen-tau Yih of Meta; Minyang Tian, Shengyan Liu, Hao Tong and Bohao Wu of the University of Illinois Urbana-Champaign; Pan Ji of the University of North Carolina; Yanyu Xiong of Stanford University; and Graham Neubig of Carnegie Mellon University.
For more information, contact Asai at akaria@allenai.org and Hajishirzi at hannaneh@cs.washington.edu.
Paper: “Synthesizing scientific literature with retrieval-augmented language models,” Nature, Feb. 4, 2026.