When artificial intelligence systems began acing long‑standing academic assessments, researchers realized they had a problem: the tests were too easy. Popular evaluations, such as the Massive Multitask Language Understanding (MMLU) exam, once considered formidable, are no longer challenging enough to meaningfully test advanced AI systems.
To address this gap, a global consortium of nearly 1,000 researchers, including a Texas A&M University professor, created something different — an exam so broad, so challenging and so deeply rooted in expert human knowledge that current AI systems consistently fail it.
“Humanity’s Last Exam” (HLE) introduces a 2,500‑question assessment spanning mathematics, humanities, natural sciences, ancient languages and highly specialized subfields. The team’s work is outlined in a paper published in Nature, with documentation from the project available at lastexam.ai.
Among the long list of contributors is Dr. Tung Nguyen, instructional associate professor in the Department of Computer Science and Engineering at Texas A&M, who participated in authoring and refining questions.
“When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human‑level understanding,” Nguyen said. “But HLE reminds us that intelligence isn’t just about pattern recognition — it’s about depth, context and specialized expertise.”
The point wasn’t to stump humans. It was to reveal, precisely and systematically, what AI cannot do, at least not yet.
Questions for HLE were written and reviewed by experts in their fields from all over the world, who ensured each one had a single, unambiguous, verifiable answer that couldn’t be solved instantly through internet retrieval. The prompts draw on expert‑level academic problems, from translating ancient Palmyrene inscriptions to identifying microanatomical structures in birds to analyzing the intricate features of Biblical Hebrew pronunciation.
Each question was tested against leading AI models. If any system could answer it correctly, the question was removed. The result is an exam deliberately engineered to sit just beyond current AI capability.
And it worked. Early results showed that even the most advanced models struggled. GPT‑4o scored 2.7%; Claude 3.5 Sonnet reached 4.1%; OpenAI’s flagship o1 model achieved only 8%. The most advanced models, including Gemini 3.1 Pro and Claude Opus 4.6, have reached around 40% to 50% accuracy.
The problem with AI outgrowing traditional benchmarks isn’t simply academic, said Nguyen, who contributed 73 of the exam’s 2,500 public questions, the second-highest total of any author, and wrote the most questions in math and computer science.
“Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do,” he said. “Benchmarks provide the foundation for measuring progress and identifying risks.”
As the team’s paper notes, while AI may excel on exams designed for humans, those tests aren’t necessarily measuring “intelligence.” They measure performance on a set of tasks crafted for a very different kind of learner.
Despite its apocalyptic name, Humanity’s Last Exam isn’t meant to suggest the end of human relevance. Instead, it highlights how much knowledge remains uniquely human and how far AI systems still have to go.
“This isn’t a race against AI,” Nguyen said. “It’s a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies. And, importantly, it reminds us why human expertise still matters.”
HLE is intended to serve as a long‑term, transparent benchmark for evaluating advanced AI systems. As part of that mission, the team has made some of the exam publicly available, while keeping most of the test questions hidden so AI models can’t memorize the answers.
“For now, Humanity’s Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence,” Nguyen said, “and despite rapid technological advances, it remains wide.”
Nguyen noted the massive project reflects the importance of interdisciplinary, international research efforts.
“What made this project extraordinary was the scale,” he said. “Experts from nearly every discipline contributed. It wasn’t just computer scientists; it was historians, physicists, linguists, medical researchers. That diversity is exactly what exposes the gaps in today’s AI systems — perhaps ironically, it’s humans working together.”
###
Paper: “A benchmark of expert-level academic questions to assess AI capabilities,” Nature, Jan. 28, 2026.