AI not yet good enough to grade university essays, rewarding ‘style over substance’

05.21.26 | University of Cambridge

Researchers have used top Generative AI models to grade hundreds of undergraduate essays and found that AI only matched human-awarded degree classification around half the time, with AI often failing to accurately assess the best and worst submissions.

A University of Cambridge-led team of psychologists and AI experts tested three “frontier” systems including the latest versions (as of April 2026) of Claude and ChatGPT on over 750 student essays from three UK universities submitted as part of a psychology degree.

While the accuracy of AI in grading the essays, from coursework to exam answers, was “not uniformly high”, say researchers, it did manage to match the broad grading bands – a first, 2:1, 2:2 and so on – given out by human examiners between 35% and 65% of the time.

However, major stumbling blocks for AI include routinely undervaluing work awarded top marks by humans, or overvaluing essays ranked among the lowest.

Unlike human examiners, all the AI systems were “oversensitive to linguistic features”: giving out higher marks based on essay length, vocabulary range, and sentence complexity, regardless of the academic quality of the essay.
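
As a rough illustration of the surface features named above, the sketch below computes simple proxies for essay length, vocabulary range and sentence complexity. The function name `linguistic_features` and the specific proxies (word count, type-token ratio, mean words per sentence) are assumptions made for illustration; the article does not say how the researchers measured these features.

```python
import re

def linguistic_features(text):
    """Return crude surface-level proxies for length, vocabulary range and sentence complexity."""
    # Words: runs of letters and apostrophes, lower-cased so "Memory" and "memory" match.
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Sentences: split on terminal punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(words),
        # Share of distinct words; a common, if blunt, proxy for vocabulary range.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        # Average words per sentence; a blunt proxy for sentence complexity.
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }

# Illustrative text only, not taken from any essay in the study.
sample = "Memory is reconstructive. Memory is not a recording."
features = linguistic_features(sample)
print(features)
```

None of these numbers says anything about the quality of the argument, which is the point the researchers make about marks that track them.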

In the latest report, researchers suggest that AI could be valuable for aspects of student assessment such as error detection and consistency checks – a “second pair of eyes” – as well as triaging feedback for students.

For example, large discrepancies between AI and human marks could help flag assignments requiring further review by a human assessor.
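
A minimal sketch of that kind of flagging might look like the following. The function name `flag_for_review` and the 10-point threshold are hypothetical; the article does not specify what size of gap would count as large.

```python
def flag_for_review(human_marks, ai_marks, threshold=10):
    """Return the positions of essays whose AI and human marks differ by at least `threshold` points."""
    return [
        i
        for i, (h, a) in enumerate(zip(human_marks, ai_marks))
        if abs(h - a) >= threshold  # threshold is an assumed value, not one from the report
    ]

# Illustrative marks only, not data from the study.
human = [72, 58, 45, 66]
ai = [63, 60, 57, 65]
flagged = flag_for_review(human, ai)
print(flagged)
```

In this arrangement the AI mark never replaces the human one; it only decides which scripts get a second look from a human assessor.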

However, the team cautions that AI alone is far too shallow and inconsistent to grade undergraduate work, and a human should always determine the final mark.

“Universities are under huge pressure to reduce staff workload and improve efficiency, all while meeting rising student expectations, and some may start to lean on AI for assessment,” said Dr Deborah Talmi, the Cambridge psychologist who leads the OpRaise project behind the new report.

“AI could perhaps automate some of the labour-intensive aspects of marking, freeing academics up for direct student engagement.”

“We find that leaning heavily on the best current AI models would see student grading that is homogenised, underestimates brilliance, and favours linguistic style over the substance of sound academic judgement,” said Talmi.

“Assessment is not just a system for distributing marks. It is part of how educational meaning is made, so students feel seen, standards are upheld, and trust is maintained. Use of AI in assessment poses a risk to these values.”

The report, ‘AI in University Assessment: Evaluating the Opportunities and Risks of Automated Marking’, is supported by ai@cam, Cambridge University's flagship mission to develop AI for the benefit of society, and the Accelerate Programme for Scientific Discovery, made possible by a donation from Schmidt Sciences. It is launched today at an event with the British Psychological Society.

For the study, AI was also asked to provide student feedback, and it churned out reflections between three and eight times longer than those provided by the original assessors.

However, when AI responses were kept to a word count comparable to those from humans, focus groups of staff and students found it difficult to distinguish between human and AI feedback. Once the identity of the writer was revealed, not everyone appreciated AI-generated insights.

University staff and students who took part in the study told researchers that, while current assessment practices are not perfect, being graded and receiving feedback from humans is fundamental to the “social contract” between academics and students.

“Many students said they would feel cheated if AI marked their work, and staff warned that relying on AI risks weakening trust, motivation, professional judgement, and the human engagement at the heart of higher education,” said Dr Yael Benn, a collaborator on the project from Manchester Metropolitan University.

The study used 761 undergraduate essays in psychology submitted and marked between 2022 and 2025 from a total of 125 students from the universities of Cambridge, Manchester Metropolitan and Nottingham.

The researchers chose to focus on psychology as essays are central to degree results in the subject. “Academic psychology is an ideal testing ground for AI assessment as it values evidence synthesis and critical judgement over single correct answers,” said Talmi.

Researchers tested AI systems with the same essays at different times, and found AI gave the same or similar marks each time. The different AI models were much closer to each other than to humans in their marking.

The AI managed to match the right UK degree classification band of the five available (First, 2:1, 2:2, Third, Fail) some 63% of the time for Cambridge essays, while for Nottingham it was 53% and for Manchester Metropolitan it was 35%.
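
To make the band-matching figure concrete, here is a small sketch of how a match rate of this kind could be computed. It assumes the conventional UK boundaries (70 and above for a First, 60 for a 2:1, 50 for a 2:2, 40 for a Third); the article does not state the exact cut-offs used in the report, and the names `degree_band` and `band_agreement` are illustrative.

```python
def degree_band(mark):
    """Map a 0-100 mark to a UK degree classification band (conventional cut-offs, assumed here)."""
    if mark >= 70:
        return "First"
    if mark >= 60:
        return "2:1"
    if mark >= 50:
        return "2:2"
    if mark >= 40:
        return "Third"
    return "Fail"

def band_agreement(human_marks, ai_marks):
    """Return the fraction of essays where the AI mark falls in the same band as the human mark."""
    matches = sum(degree_band(h) == degree_band(a) for h, a in zip(human_marks, ai_marks))
    return matches / len(human_marks)

# Illustrative marks only, not data from the study.
human = [75, 62, 50, 38]
ai = [68, 61, 55, 45]
rate = band_agreement(human, ai)
print(rate)
```

In this toy example the two mid-range essays match and the two at the extremes do not, so the rate is 0.5 even though no AI mark is more than seven points from the human one.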

Researchers suspect that the difference in AI accuracy across institutions is due to the range of grades, which was narrowest among Cambridge students, whose essays were all written in invigilated exam halls, and widest at Manchester Metropolitan, where all analysed essays were coursework. Nottingham essays were a mixture of both.

This illustrates the heart of the problem when relying on AI to assess students: inconsistent performances across institutions, types of prompting, and work that sits near grading boundaries, say the report’s authors, who describe AI as having a “central tendency bias”.

All papers are scored out of 100, standard practice in higher education. An essay marked 75 – a solid first – by a human is, on average, scored several points lower by every AI system, while an essay marked 50 – a low 2:2 – is scored several points higher.

The range on the marking scale where AI and humans most frequently align across institutions lies in the upper-50s to low-60s, so around a low 2:1, near the centre of the grade distribution.
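
One way to picture this compression toward the middle is to fit a straight line predicting AI marks from human marks: a slope below 1 means high marks are pulled down and low marks pulled up, and the point where the line crosses "AI equals human" is where the two agree. The sketch below uses made-up marks chosen only to show the pattern; it is not the report's analysis, and `fit_line` is a hypothetical helper.

```python
def fit_line(human_marks, ai_marks):
    """Ordinary least-squares fit of ai = slope * human + intercept."""
    n = len(human_marks)
    mean_h = sum(human_marks) / n
    mean_a = sum(ai_marks) / n
    cov = sum((h - mean_h) * (a - mean_a) for h, a in zip(human_marks, ai_marks))
    var = sum((h - mean_h) ** 2 for h in human_marks)
    slope = cov / var
    intercept = mean_a - slope * mean_h
    return slope, intercept

# Illustrative marks only, not data from the study.
human = [40, 50, 60, 70, 80]
ai = [50, 55, 60, 65, 70]

slope, intercept = fit_line(human, ai)
# Mark at which the fitted AI score equals the human score (only defined when slope != 1).
crossover = intercept / (1 - slope)
print(slope, intercept, crossover)
```

With these toy numbers the slope is 0.5 and the crossover sits at 60, so every essay marked above 60 by a human is scored lower by the AI and every essay below 60 is scored higher.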

The researchers point out in the report that academic judgement is based on reasoning, while AI marks are based on statistical predictions.

“Across models, the same pattern emerges,” said co-author Dr Alexandru Marcoci, from Cambridge’s Institute for Technology and Humanity. “The AI assigns middling marks to all submissions, resulting in particularly inaccurate marking of the best and worst essays.”

“The practical consequence of this bias is that the AI is least accurate precisely where assessment decisions matter most, at the boundaries that distinguish Firsts from Upper Seconds, or passes from fails,” he added.

Notes:

Researchers tested the performance of three frontier LLMs: Claude Opus 4.6 (Anthropic), GPT-5.4 (OpenAI), and Gemini 3 Flash (Google).

The dataset: 125 students in 3 UK universities volunteered 761 authentic long-form undergraduate psychology essays (University of Cambridge: 133, University of Nottingham: 172, Manchester Metropolitan University: 456). All essays were submissions to formal assessments between 2022 and 2025.

They spanned 50 modules and 87 distinct assignments across all years of study. Assessments spanned coursework, open book at-home examinations and invigilated examinations. Essay marks, on a 0-100 scale, were moderated formal marks provided by expert human assessors who followed routine institutional processes.

Prompt design: Rather than committing to a single prompt, the team systematically varied the prompt along three dimensions – criteria specificity, calibration intervention, and scoring strategy – to isolate each component's influence on scoring accuracy and identify the best prompt for each model.

At the most basic level, models were prompted by the following statement: “You are an experienced <University name> examiner marking <degree name> undergraduate assignment.”

At the other end, models were given the full marking rubric, information about the expected mark distribution, and asked to justify aspects of the evaluation prior to providing a mark.

Best-performing prompts per model were selected on a 20% calibration subset (n = 153); the same prompt configurations were then applied to the full corpus for the analyses reported here.
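
The prompt design described in these notes can be sketched as a grid of configurations plus a held-out calibration subset. The three dimension names come from the notes; the two levels shown for each are assumptions based on the "most basic" and "other end" descriptions, and the study may have used more. The helper names, the random seed and the use of rounding up are likewise illustrative. How the team actually drew the subset is not described, although rounding 20% of 761 up does give the reported n = 153.

```python
import itertools
import math
import random

# Dimension names are from the notes; the levels are assumed for illustration.
DIMENSIONS = {
    "criteria_specificity": ["role_only", "full_rubric"],
    "calibration_intervention": ["none", "expected_mark_distribution"],
    "scoring_strategy": ["mark_only", "justify_then_mark"],
}

def prompt_configurations(dimensions):
    """Return every combination of levels as a list of dicts, one per prompt configuration."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in itertools.product(*dimensions.values())]

def calibration_split(essay_ids, fraction=0.2, seed=0):
    """Randomly set aside `fraction` of the essays (rounded up) for choosing the best prompt."""
    rng = random.Random(seed)
    n_cal = math.ceil(len(essay_ids) * fraction)
    chosen = set(rng.sample(essay_ids, n_cal))
    calibration = [e for e in essay_ids if e in chosen]
    remainder = [e for e in essay_ids if e not in chosen]
    return calibration, remainder

def select_best(configs, error_fn):
    """Pick the configuration with the lowest error; `error_fn` is supplied by the caller."""
    return min(configs, key=error_fn)

configs = prompt_configurations(DIMENSIONS)
calibration, remainder = calibration_split(list(range(761)))
print(len(configs), len(calibration), len(remainder))
```

In use, `error_fn` would score each configuration on the calibration essays only, for example by mean absolute difference from the human marks, before the winning configuration is applied to the full corpus.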

AI in University Assessment: Evaluating the Opportunities and Risks of Automated Marking

22-May-2026

Article Information

Contact Information

Fred Lewsey
University of Cambridge
fred.lewsey@admin.cam.ac.uk

How to Cite This Article

APA:
University of Cambridge. (2026, May 21). AI not yet good enough to grade university essays, rewarding ‘style over substance’. Brightsurf News. https://www.brightsurf.com/news/LPEZ5XV8/ai-not-yet-good-enough-to-grade-university-essays-rewarding-style-over-substance.html
MLA:
"AI not yet good enough to grade university essays, rewarding ‘style over substance’." Brightsurf News, May. 21 2026, https://www.brightsurf.com/news/LPEZ5XV8/ai-not-yet-good-enough-to-grade-university-essays-rewarding-style-over-substance.html.