New system solves the 'who is J. Smith' puzzle

December 14, 2006

Penn State researchers have developed an automated system that can determine which "J. Smith" is authoring papers on computer science (the one who teaches at Penn State or the one who teaches at MIT), as well as whether "J. Smith" is John Smith, Jane Smith, Joanna L. Smith or James H. Smith.

The system, which retrieves classes of authors with similar names, considers not just names in making its determination but also other information such as co-authors, dates of publications, citations and keywords.
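As a rough illustration of how such evidence might be combined, the Python sketch below scores two author records on name compatibility, shared co-authors, shared keywords and closeness of publication year. The `Record` fields, the weights and the 10-year window are all assumptions made for this example; the article does not describe the researchers' actual scoring, which is learned rather than hand-set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One author mention on one paper (hypothetical structure)."""
    name: str
    coauthors: frozenset = frozenset()
    keywords: frozenset = frozenset()
    year: int = 0


def split_name(name):
    """Return (first token, last token) of a name, lowercased, dots removed."""
    tokens = name.replace(".", " ").lower().split()
    return tokens[0], tokens[-1]


def names_compatible(a, b):
    """True if surnames match and the first names do not contradict.

    An initial matches any first name starting with that letter, so
    "J. Smith" fits both "John Smith" and "Jane Smith", while
    "John Smith" and "Jane Smith" do not fit each other.
    """
    first_a, last_a = split_name(a)
    first_b, last_b = split_name(b)
    if last_a != last_b:
        return False
    if len(first_a) == 1 or len(first_b) == 1:
        return first_a[0] == first_b[0]
    return first_a == first_b


def jaccard(a, b):
    """Set overlap |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(a, b):
    """Score in [0, 1]; the weights are illustrative assumptions."""
    if not names_compatible(a.name, b.name):
        return 0.0
    year_closeness = max(0.0, 1.0 - abs(a.year - b.year) / 10.0)
    return (0.5 * jaccard(a.coauthors, b.coauthors)
            + 0.3 * jaccard(a.keywords, b.keywords)
            + 0.2 * year_closeness)


r1 = Record("J. Smith", frozenset({"A. Lee"}), frozenset({"clustering"}), 2005)
r2 = Record("John Smith", frozenset({"A. Lee"}), frozenset({"clustering"}), 2006)
r3 = Record("Jane Smith", frozenset({"C. Wong"}), frozenset({"databases"}), 2006)
print(similarity(r1, r2), similarity(r1, r3), similarity(r2, r3))
```

In this toy data the ambiguous "J. Smith" record scores high against "John Smith" because the two share a co-author and a keyword, and scores low against "Jane Smith" even though the names are compatible.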

When tested with 3,355 academic papers written by 490 authors, the system correctly identified authors 90.6 percent of the time.

"It works very similarly to how humans would figure out authors' identity--by looking at affiliations, topics, publications," said C. Lee Giles, the David Reese Professor of Information Sciences and Technology and principal researcher.

"The system works by using machine-learning methods to cluster together names that the system believes to be similar. If you think there's another parameter that's relevant, you can change the algorithm and include it," Giles said.

The system is explained in a paper, "Efficient Name Disambiguation for Large-Scale Databases," presented at the recent 17th European Conference on Machine Learning and the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases in Berlin. Co-authors were Jian Huang, a doctoral student in the College of Information Sciences and Technology, and Seyda Ertekin, a doctoral student in the Department of Computer Science and Engineering.

Even in academic publications, figuring out an author's identity can be difficult because publications vary in how they present individuals' names. For instance, some publications opt for just a first initial and last name, as in "J. Smith." Others include the full name, such as "C. Lee Giles." But if the surname is common, as with "Smith" or "Chen," even first names may not suffice to identify the author accurately.

Confusion can also arise from how entities are listed: the same university may appear as Penn State, The Pennsylvania State University or PSU, depending on the publication. The researchers' algorithm can clear up ambiguities surrounding such entities, whether they are institutions, businesses, funding agencies or other organizations.
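To make the entity problem concrete, the sketch below maps variant spellings of an affiliation to one canonical form using a hand-built lookup table. This is not the researchers' method, and the `ALIASES` table and `canonical_affiliation` function are hypothetical names introduced only for illustration.

```python
# Hypothetical alias table, written by hand for this example.
ALIASES = {
    "penn state": "The Pennsylvania State University",
    "psu": "The Pennsylvania State University",
    "the pennsylvania state university": "The Pennsylvania State University",
}


def canonical_affiliation(raw):
    """Return the canonical name for a known variant, else the cleaned input."""
    # Normalize case, drop periods and collapse whitespace before lookup.
    key = " ".join(raw.lower().replace(".", "").split())
    return ALIASES.get(key, " ".join(raw.split()))


for variant in ["Penn State", "P.S.U.", "The  Pennsylvania State University"]:
    print(variant, "->", canonical_affiliation(variant))
```

A fixed table like this breaks down quickly. "PSU" is also used by Portland State University, for example, so a bare abbreviation cannot be resolved without surrounding evidence such as co-authors or topics, which is the kind of context a clustering approach can draw on.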

"This method will work on many entity disambiguation problems," Giles said.

The algorithm uses a clustering method that trains computers to group records by the properties they share. Each successive round of clustering narrows the records into smaller groupings.
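The grouping idea can be sketched in a few lines of Python. The example below is a simple single-linkage grouping built on a union-find structure, used here as a stand-in because the article does not detail the clustering method in the paper. The sample records and the `share_coauthor` rule are invented for illustration.

```python
from itertools import combinations


def cluster(items, similar):
    """Group items so that any two items judged similar end up together.

    `similar` is any function that takes two items and returns True or False.
    """
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair that the similarity rule accepts.
    for i, j in combinations(range(len(items)), 2):
        if similar(items[i], items[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


# Hypothetical records: (name as printed, set of co-authors).
papers = [
    ("J. Smith", {"A. Lee", "B. Kumar"}),
    ("John Smith", {"B. Kumar"}),
    ("J. Smith", {"C. Wong"}),
    ("Jane Smith", {"C. Wong", "D. Park"}),
]


def share_coauthor(a, b):
    """Toy rule: two records match if they have a co-author in common."""
    return bool(a[1] & b[1])


clusters = cluster(papers, share_coauthor)
for group in clusters:
    print([name for name, _ in group])
```

With this toy data, the first "J. Smith" lands in the same group as "John Smith" and the second lands with "Jane Smith," because the co-author overlap separates them even though the printed names are identical. A real system would replace `share_coauthor` with a learned similarity that weighs several kinds of evidence.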

The algorithm will be a part of the next-generation CiteSeer, the largest academic search engine for computer and information-science literature. Giles was co-creator of CiteSeer when he was at NEC.

The research was supported by the National Science Foundation and Microsoft.

Penn State
