Watching the detectors: Researchers probe efficacy – and danger – of AI detection tools

Patrick Traynor, Ph.D., has questions.

When the professor and interim chair of the University of Florida Department of Computer & Information Science & Engineering saw reports in the media positing that scientific literature is increasingly being generated by artificial intelligence, he wondered, “How do they know?”

Traynor knows that the detectors used to flag AI-generated text, known as AIGT, in publications are themselves AI systems. They use the same large language models, or LLMs, that less-than-honest researchers could be using to generate their text.

How good could they be?

Spoiler alert: They’re not very good.

In a paper to be presented at this week’s 2026 IEEE Symposium on Security and Privacy, Traynor and his co-authors assert that current AIGT detectors are not effective or robust tools for determining the presence of AI-generated text. The results, the researchers said, indicate that commercially available AIGT detectors are “poorly suited for deployment in academic or high-stakes contexts.”

In other words, when the results matter, the tools are not effective. The paper, “AI Wrote My Paper and All I Got Was This False Negative: Measuring the Efficacy of Commercial AI Text Detectors,” co-authored by Seth Layton, Ph.D., Bernardo B.P. Madeiros and Kevin Butler, Ph.D., examined common commercially available solutions that test for AIGT and found wildly inconsistent error rates: false positive rates between 0.05% and 68.6%, and false negative rates between 0.3% and 99.6%.
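For readers unfamiliar with the two error rates, the short Python sketch below shows how they are commonly computed. It is an illustration only, not the researchers' code: the function name, the `Rates` type and the toy labels are assumptions made for this example. A false positive is a human-written paper flagged as AI-generated; a false negative is an AI-generated paper that passes as human.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    false_positive_rate: float  # share of human-written texts flagged as AI
    false_negative_rate: float  # share of AI-generated texts that were missed


def detector_error_rates(labels, predictions):
    """Compute error rates; True means AI-generated, False means human-written.

    Hypothetical helper for illustration, not taken from the paper.
    """
    pairs = list(zip(labels, predictions))
    fp = sum(1 for truth, guess in pairs if not truth and guess)
    tn = sum(1 for truth, guess in pairs if not truth and not guess)
    fn = sum(1 for truth, guess in pairs if truth and not guess)
    tp = sum(1 for truth, guess in pairs if truth and guess)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return Rates(fpr, fnr)


# Toy data (made up): four human-written papers, then four AI-generated ones.
labels = [False, False, False, False, True, True, True, True]
predictions = [False, False, False, True, True, False, False, False]

rates = detector_error_rates(labels, predictions)
print(f"False positive rate: {rates.false_positive_rate:.1%}")
print(f"False negative rate: {rates.false_negative_rate:.1%}")
```

In this toy run, one of four human-written papers is wrongly flagged and three of four AI-generated papers slip through, so a detector can look acceptable on one measure while failing badly on the other.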

Researchers also found that with a simple tweak to the LLM, the detectors were rendered basically useless, incapable of distinguishing AIGT from human-generated text.

“These current tools are not reliable or robust enough to use to measure the problem,” co-author Traynor said. “We really can’t use them to adjudicate these decisions. People’s careers are on the line here.”

A recent article in Nature asserts the problem is widespread.

“The fear of many in the research community is that poor-quality or entirely fabricated research produced by large language models could overwhelm the ability of current quality-control systems to detect it, thereby polluting the scientific canon,” Nature reporter Miryam Naddaf noted in the article.

However, as Traynor and his team point out, this conclusion could not be reliably reached using current tools. In short, while it may feel like the use of AI in academic writing is widespread, current evidence to support that claim is inadequate.

The issue, of course, is personal for Traynor. He pointed out that in academia, individual merit is literally measured by an author’s intellectual output and publications. Suspicion or accusations of having used AI-generated text in scholarly submissions can stain a researcher’s reputation and can adversely affect their career.

The study’s methodology was very meta. Starting with a set of all papers submitted to top-tier security conferences prior to the advent of ChatGPT (about 6,000 papers), the researchers directed LLMs to create AIGT clones of the very same papers. The combined dataset was then evaluated by the five most popular commercial AIGT detectors on the market.

While two of the five detectors performed well, researchers found that trivial changes to the AIGT caused their reliability to fall dramatically. Researchers simply asked the LLM to generate the AI version of the papers using more complex vocabulary (researchers call this a lexical complexity attack), and the detectors were more easily fooled.

AIGT detectors, it seems, are easily swayed by fancy words.

And while the research concludes AIGT detectors pose high-stakes risks in academia, don’t think Traynor and his team are AI naysayers. They believe LLMs and AI have great potential to speed up science, to help us find new insights. But Traynor cautions against a prevailing notion that AI is all-knowing.

“It’s not an oracle. It doesn’t always know the answer,” Traynor said. “It’s happy to give us answers, but whether or not those answers have value, we still need people to figure that out. This paper shows us that for as many studies as we see claiming that a certain percentage of academic work is AI-generated, we actually don’t have tools to measure any of that.”

Co-author Layton agreed, adding that the research should remind the public to view all AI claims skeptically, just as scientists view all evidence with skepticism.

“We demand that such claims include substantial proof that they are correct,” Traynor reiterated.
