IU scientists devise means to test for phony technical papers

April 24, 2006

BLOOMINGTON, Ind. - Authors of bogus technical articles beware. A team of researchers at the Indiana University School of Informatics has designed a tool that distinguishes between real and fake papers.

It's called the Inauthentic Paper Detector -- one of the first of its kind anywhere -- and it uses compression to determine whether technical texts are generated by man or machine.

"This is a potential problem since no existing systems, the Web for example, can or do discriminate between content that is meaningful or bogus," says assistant professor Mehmet Dalkilic, a data mining expert. "We believe that there are subtle, short- and long-range word or even word string repetitions that exist in human texts, but not in many classes of computer-generated texts that can be used to discriminate based on meaning."

Joining Dalkilic on the IPD project are Assistant Professor Predrag Radivojac, informatics doctoral student James Costello, and Wyatt T. Clark, who will graduate in May with a bachelor's degree in informatics.

The IPD system is based on a combination of compression algorithms, which reduce the size of data to save space and speed up transmission.
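The link between repetition and compression can be illustrated with a minimal sketch. It is not the IPD's actual feature set, which this release does not describe. The sketch assumes Python's standard zlib compressor as a stand-in, and the function name compression_ratio and both sample strings are hypothetical. It shows only that text containing repeated word strings shrinks far more under compression than text without them.

```python
import random
import zlib


def compression_ratio(text: str) -> float:
    """Return compressed size / original size in bytes (lower = more redundancy)."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    # Level 9 asks zlib for its strongest (slowest) compression setting.
    return len(zlib.compress(raw, 9)) / len(raw)


# Hypothetical sample with heavy word-string repetition.
repetitive = "the cache stores the page and the cache serves the page. " * 40

# Hypothetical sample of the same length with no repeated structure:
# characters drawn at random with a fixed seed so the run is reproducible.
rng = random.Random(0)
alphabet = "abcdefghijklmnopqrstuvwxyz "
scrambled = "".join(rng.choice(alphabet) for _ in range(len(repetitive)))

ratio_repetitive = compression_ratio(repetitive)
ratio_scrambled = compression_ratio(scrambled)
print(f"repetitive: {ratio_repetitive:.3f}  scrambled: {ratio_scrambled:.3f}")
```

A ratio like this could serve as one numeric feature per document. Real machine-generated papers are grammatical rather than random characters, so the gap between classes would be far smaller than in this toy comparison.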

To begin their study, the team defined the two kinds of texts they would analyze. "Authentic text" (or document) is a collection of several hundreds or thousands of syntactically correct sentences that are wholly meaningful. "Inauthentic text" (or document) is a collection of several hundreds or thousands of syntactically correct sentences that, taken all together, have no meaning.

The researchers' work is documented in the very authentic paper, "Using Compression to Identify Classes of Inauthentic Texts," which they presented at the Society for Industrial and Applied Mathematics Conference on Data Mining in Bethesda, Md., this weekend.

The informatics study was largely inspired by a prank pulled by three Massachusetts Institute of Technology students, who in 2004 developed a computer program that churned out randomly generated fake computer science language. They submitted one result, essentially a four-page compilation of gibberish, as a research paper to an international conference on computer science and informatics - and it was accepted without review.

Radivojac, whose research expertise is machine learning, says the IPD easily detected numerous inauthentic technical papers in testing, including the MIT students' spurious submission.

"We hypothesized we could build a reliable and fast model that recognizes fake papers automatically," says Radivojac. "We combined these with machine-learning methods to build a predictor of these kinds of papers."
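As a rough illustration of what building a predictor from such measurements can mean, the sketch below learns a single cut-off on one compression-based feature. It is an assumption-laden toy, not the team's model: the function names fit_threshold and predict are hypothetical, the feature values are invented for the example, and the rule that authentic texts have the lower ratio is assumed from Dalkilic's remark about repetition in human writing.

```python
from typing import List, Tuple


def fit_threshold(samples: List[Tuple[float, bool]]) -> float:
    """Choose the cut on one feature that best separates the training labels.

    Each sample is (feature_value, is_authentic). The assumed rule is:
    a document is predicted authentic when feature_value <= threshold.
    """
    values = sorted({value for value, _ in samples})
    # Candidate cuts are the midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    if not candidates:
        raise ValueError("need at least two distinct feature values")

    def accuracy(cut: float) -> float:
        correct = sum((value <= cut) == label for value, label in samples)
        return correct / len(samples)

    return max(candidates, key=accuracy)


def predict(value: float, threshold: float) -> bool:
    """Return True when the document is predicted to be authentic."""
    return value <= threshold


# Invented toy training data: (compression ratio, is_authentic).
toy_samples = [
    (0.31, True), (0.33, True), (0.36, True),
    (0.44, False), (0.47, False), (0.52, False),
]

threshold = fit_threshold(toy_samples)
print(f"learned threshold: {threshold:.3f}")
print("ratio 0.30 authentic?", predict(0.30, threshold))
print("ratio 0.50 authentic?", predict(0.50, threshold))
```

A practical system would combine several features and validate on held-out documents; a one-feature cut-off is used here only because it is the simplest learned predictor to read.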

In general, identifying meaning in a technical document is difficult, Dalkilic says. "We don't claim we have found a way to distinguish between meaning and nonsense, but we do emphasize that there are many nontrivial classes of inauthentic documents that can be easily distinguished based on compression algorithms."
-end-
Costello's and Clark's involvement in the IPD project earned them travel expenses to the SIAM Conference, compliments of the Lawrence Livermore National Laboratory in California.

To see how the Inauthentic Paper Detector works, visit its Web site at http://montana.informatics.indiana.edu/fsi/about.html.

To speak with Dalkilic or Radivojac, please contact Joe Stuteville, IU School of Informatics, at 812-856-3141 (office), 317-946-9930 (cell), or jstutevi@indiana.edu.

Indiana University School of Informatics
The Indiana University School of Informatics offers a unique, interdisciplinary curriculum that focuses on developing specialized skills and knowledge of information technology. The School has a variety of undergraduate degrees and specialized master's and doctoral degrees in bioinformatics, chemical informatics, health informatics, human-computer interaction, laboratory informatics, new media and computer science. Each degree is an interdisciplinary endeavor that combines course work and field experiences from a traditional subject area or discipline with intensive study of information and technology.

Indiana University
