Carnegie Mellon project boosts book digitization efforts

May 24, 2007

PITTSBURGH--A Carnegie Mellon University computer scientist is enlisting the unwitting help of thousands, if not millions, of Web users each day to eliminate a technical bottleneck that has slowed efforts to transform books, newspapers and other printed materials into digitized text that is computer searchable. Luis von Ahn, an assistant professor of computer science and recipient of a MacArthur Foundation "genius grant," says the project will also improve Web security systems used to reduce spam and make it possible for individuals to safeguard their own email addresses from spammers.

Key to the new project is assigning a new, dual use to existing technology: CAPTCHAs, the distorted-letter tests found at the bottom of registration forms on Yahoo, Hotmail, PayPal, Wikipedia and hundreds of other sites worldwide. CAPTCHAs, an acronym for Completely Automated Public Turing Test to Tell Computers and Humans Apart, distinguish between legitimate human users and malevolent computer programs designed by spammers to harvest thousands of free email accounts. The tests require users to type the distorted letters they see inside a box -- a task that is difficult for computers, but easy for humans.

Working with a team that includes computer science professor Manuel Blum, undergraduate student Ben Maurer and research programmer Mike Crawford, von Ahn invented a new version of the tests, called reCAPTCHAs, that will help convert printed text into computer-readable letters on behalf of the Internet Archive. The San Francisco-based non-profit group administers the Open Content Alliance and is one of several large initiatives working to digitize books and other printed materials under open principles, making the text searchable by computer and capable of being reformatted for new uses.

Optical character recognition (OCR) systems that automatically perform this conversion are often stumped by underlined text, scribbles and fuzzy or otherwise poorly printed letters. ReCAPTCHAs will use words from these troublesome passages to replace the artificially distorted letters and numbers typically used in CAPTCHAs.

The new tests continue to distinguish between humans and machines because they use text that OCR systems have already failed to read. And because people must decipher these words to pass the reCAPTCHA test, they will help complete the expensive digitization process.

"I think it's a brilliant idea -- using the Internet to correct OCR mistakes," said Brewster Kahle, director of the Internet Archive. ReCAPTCHAs will speed the digitization process while also helping to improve OCR methods and perhaps extend them to additional languages, he said. "This is an example of why having open collections in the public domain is important," he added. "People are working together to build a good, open system."

Von Ahn hopes to substitute his reCAPTCHAs for as many conventional CAPTCHAs as possible. "It is estimated that 60 million or more CAPTCHAs are solved each day, with each test taking about 10 seconds," he said. "That's more than 150,000 precious hours of human work that are lost each day, but that we can put to good use with reCAPTCHAs."
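
The arithmetic behind that estimate can be checked in a few lines. This is only a back-of-the-envelope sketch that takes the two quoted figures (60 million tests per day, about 10 seconds each) at face value; the variable names are illustrative and not part of any reCAPTCHA software.

```python
# Figures quoted by von Ahn; both are estimates, not measurements.
CAPTCHAS_PER_DAY = 60_000_000
SECONDS_PER_CAPTCHA = 10
SECONDS_PER_HOUR = 3600

# Total human time spent on CAPTCHAs each day, converted to hours.
total_seconds = CAPTCHAS_PER_DAY * SECONDS_PER_CAPTCHA
hours_per_day = total_seconds / SECONDS_PER_HOUR

print(f"{hours_per_day:,.0f} hours per day")
```

Under those assumptions the total comes to roughly 167,000 hours per day, which is consistent with the "more than 150,000" figure in the quote.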

With support from Intel Corp., von Ahn's team has devised a free, Web-based service that allows individual webmasters to install reCAPTCHAs to protect their sites. Individuals can also use the service to protect their own email addresses, or lists of addresses they post on personal Web pages. In the case of some commercial Web sites with heavy traffic, reCAPTCHA may charge a fee to pay for additional bandwidth.

To make certain that people are correctly deciphering the printed text, the reCAPTCHA system will require Web site visitors to type two words, one of which the system already knows. Each unknown word will be submitted to multiple visitors. If the visitor types the known word correctly, the system has greater confidence that the unknown word is being typed correctly. If several visitors type the same answer for the unknown word, that answer will be assumed to be correct.
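
The two-word check described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the reCAPTCHA team's implementation: the class name `WordChallengeLedger`, the `agreement_threshold` parameter, and the case-insensitive comparison are all hypothetical, and the release says only that "several" matching answers are needed, without giving a number.

```python
from collections import Counter, defaultdict


class WordChallengeLedger:
    """Hypothetical bookkeeping for known/unknown word pairs."""

    def __init__(self, agreement_threshold=3):
        # Assumption: how many matching answers count as "several".
        self.agreement_threshold = agreement_threshold
        self.votes = defaultdict(Counter)  # unknown word id -> answer tallies
        self.resolved = {}                 # unknown word id -> accepted text

    @staticmethod
    def _normalize(text):
        # Assumption: answers are compared ignoring case and outer spaces.
        return text.strip().lower()

    def submit(self, known_answer, typed_known, unknown_id, typed_unknown):
        """Return True if the visitor passes the test.

        The visitor passes only by typing the known word correctly. Only
        then is the answer for the unknown word recorded as a vote.
        """
        if self._normalize(typed_known) != self._normalize(known_answer):
            return False  # failed the control word; discard the other answer

        if unknown_id not in self.resolved:
            guess = self._normalize(typed_unknown)
            self.votes[unknown_id][guess] += 1
            # Accept the answer once enough visitors have agreed on it.
            if self.votes[unknown_id][guess] >= self.agreement_threshold:
                self.resolved[unknown_id] = guess
        return True


ledger = WordChallengeLedger(agreement_threshold=2)
ledger.submit("morning", "morning", "scan-17", "upon")   # first vote
ledger.submit("morning", "mourning", "scan-17", "upon")  # control failed, ignored
ledger.submit("morning", "Morning", "scan-17", "Upon")   # second matching vote
print(ledger.resolved)
```

In this sketch the known word acts as the gatekeeper: an answer for the unknown word counts as a vote only when the same visitor has typed the control word correctly, and the unknown word is accepted once enough independent visitors agree on it.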

An audio version of reCAPTCHA, which will transcribe portions of radio programs that have defied speech recognition programs, will also be available for blind Web users.
-end-
Note to Editors: To download a high-resolution image of a reCAPTCHA, visit www.cs.cmu.edu/~biglou/redSpaceCrop3.pdf.

About Carnegie Mellon: Carnegie Mellon is a private research university with a distinctive mix of programs in engineering, computer science, robotics, business, public policy, fine arts and the humanities. More than 10,000 undergraduate and graduate students receive an education characterized by its focus on creating and implementing solutions for real problems, interdisciplinary collaboration, and innovation. A small student-to-faculty ratio provides an opportunity for close interaction between students and professors. While technology is pervasive on its 144-acre Pittsburgh campus, Carnegie Mellon is also distinctive among leading research universities for the world-renowned programs in its College of Fine Arts. A global university, Carnegie Mellon has campuses in Silicon Valley, Calif., and Qatar, and programs in Australia, Greece, Japan, Portugal, Singapore and Taiwan. For more, see www.cmu.edu.

Carnegie Mellon University
