
Researchers produce first major database of non-native English

August 01, 2016

After thousands of hours of work, MIT researchers have released the first major database of fully annotated English sentences written by non-native speakers.

The researchers who led the project had already shown that the grammatical quirks of non-native speakers writing in English could be a source of linguistic insight. But they hope that their dataset could also lead to applications that would improve computers' handling of language spoken or written by non-native English speakers.

"English is the most used language on the Internet, with over 1 billion speakers," says Yevgeni Berzak, a graduate student in electrical engineering and computer science, who led the new project. "Most of the people who speak English in the world or produce English text are non-native speakers. This characteristic is often overlooked when we study English scientifically or when we do natural-language processing for English."

Most natural-language-processing systems, which enable smartphone and other computer applications to process requests phrased in ordinary language, are based on machine learning, in which computer systems look for patterns in huge sets of training data. "If you want to handle noncanonical learner language, in terms of the training material that's available to you, you can only train on standard English," Berzak explains.

Systems trained on nonstandard English, on the other hand, could better handle the idiosyncrasies of non-native English speakers, such as tendencies to drop or add prepositions, to substitute particular tenses for others, or to misuse particular auxiliary verbs. Indeed, the researchers hope that their work could lead to grammar-correction software targeted at native speakers of other languages.

Diagramming sentences

The researchers' dataset consists of 5,124 sentences culled from exam essays written by students of English as a second language (ESL). The sentences were drawn, in approximately equal distribution, from native speakers of 10 languages that are the primary tongues of roughly 40 percent of the world's population.

Every sentence in the dataset includes at least one grammatical error. The original source of the sentences was a collection made public by Cambridge University, which included annotation of the errors, but no other grammatical or syntactic information.

To provide that additional information, Berzak recruited a group of MIT undergraduate and graduate students from the departments of Electrical Engineering and Computer Science (EECS), Linguistics, and Mechanical Engineering, led by Carolyn Spadine, a graduate student in both EECS and linguistics.

After eight weeks of training in how to annotate both grammatically correct and error-ridden sentences, the students began working directly on the data. There are three levels of annotation. The first involves basic parts of speech -- whether a word is a noun, a verb, a preposition, and so on. The next is a more detailed description of parts of speech -- plural versus singular nouns, verb tenses, comparative and superlative adjectives, and the like.
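As a rough illustration, the first two layers might be represented as in the sketch below. The sentence, tags, and feature strings are invented for this article and follow the conventions of the Universal Dependencies scheme described next; they are not drawn from the released dataset.

```python
# A minimal sketch of the first two annotation layers for one
# hypothetical learner sentence ("She have two sister ."). The tags
# and feature strings follow Universal Dependencies conventions; the
# example is invented, not taken from the dataset.
sentence = [
    # (token, basic part of speech, detailed morphological features)
    ("She",    "PRON",  "Case=Nom|Number=Sing|Person=3"),
    ("have",   "VERB",  "Tense=Pres|VerbForm=Fin"),  # agreement error: "has"
    ("two",    "NUM",   "NumType=Card"),
    ("sister", "NOUN",  "Number=Sing"),              # number error: "sisters"
    (".",      "PUNCT", "_"),
]

for token, pos, feats in sentence:
    print(f"{token}\t{pos}\t{feats}")
```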

Next, the annotators charted the syntactic relationships between the words of the sentences, using a relatively new annotation scheme called the Universal Dependencies (UD) formalism. Syntactic relationships include things like which nouns are the objects of which verbs, which verbs are auxiliaries of other verbs, which adjectives modify which nouns, and so on.

The annotators created syntactic charts for both the corrected and uncorrected versions of each sentence. That required some prior conceptual work, since grammatical errors can make words' syntactic roles difficult to interpret.
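A simplified sketch of what such dual annotation might look like is below, with rows in the spirit of the CoNLL-U format used by UD treebanks (token ID, word, ID of the word's syntactic head, and dependency relation). The sentence is hypothetical, not drawn from the dataset; annotating both versions the same way makes it possible to see exactly which relations an error disturbs.

```python
# Simplified, CoNLL-U-inspired rows for a hypothetical learner sentence
# and its corrected version: (token id, form, head id, relation).
# Relation labels follow Universal Dependencies; head id 0 marks the root.
original = [
    (1, "I",      2, "nsubj"),  # "I" is the subject of "went"
    (2, "went",   0, "root"),
    (3, "in",     4, "case"),   # preposition error: "to" expected
    (4, "school", 2, "obl"),    # "school" is an oblique modifier of "went"
]
corrected = [
    (1, "I",      2, "nsubj"),
    (2, "went",   0, "root"),
    (3, "to",     4, "case"),
    (4, "school", 2, "obl"),
]

for version, rows in (("original", original), ("corrected", corrected)):
    print(f"# {version}")
    for idx, form, head, rel in rows:
        print(f"{idx}\t{form}\t{head}\t{rel}")
```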

Berzak and Spadine wrote a 20-page guide to their annotation scheme, much of which dealt with the handling of error-ridden sentences. Consistency in the treatment of such sentences is essential to any envisioned application of the dataset: A machine-learning system can't learn to recognize an error if the error is described differently in different training examples.

Repeatable results

The researchers' methodology, however, provides good evidence that annotators can chart ungrammatical sentences consistently. For every sentence, one evaluator annotated it completely; another reviewed the annotations and flagged any areas of disagreement; and a third ruled on the disagreements.

There was some disagreement on how to handle ungrammatical sentences -- but there was some disagreement on how to handle grammatical sentences, too. In general, levels of agreement were comparable for both types of sentences.
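One simple way to quantify such agreement is the fraction of tokens on which two annotators assign the same label. The sketch below is purely illustrative and is not the project's evaluation code.

```python
# A minimal sketch of per-token agreement between two annotators.
# The labels are illustrative UD part-of-speech tags.
def agreement_rate(labels_a, labels_b):
    """Fraction of tokens on which two annotators assign the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Example: tags from two annotators for the same five tokens.
annotator_1 = ["PRON", "VERB", "NUM", "NOUN", "PUNCT"]
annotator_2 = ["PRON", "AUX",  "NUM", "NOUN", "PUNCT"]
print(agreement_rate(annotator_1, annotator_2))  # 0.8
```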

The researchers report these and other results in a paper being presented at the Association for Computational Linguistics annual conference in August. Joining Berzak and Spadine on the paper are Boris Katz, who is Berzak's advisor and a principal research scientist at MIT's Computer Science and Artificial Intelligence Laboratory, and the undergraduate annotators: Jessica Kenney, Jing Xian Wang, Lucia Lam, Keiko Sophie Mori, and Sebastian Garza.

The researchers' dataset is now one of the 59 datasets available from the organization that oversees the UD standard. Berzak also created an online interface for the dataset, so that researchers can search for particular kinds of errors, sentences produced by native speakers of particular languages, and the like.

The work was funded in part by the National Science Foundation, under the auspices of MIT's Center for Brains, Minds, and Machines.
-end-

Additional background

ARCHIVE: Essays in English yield information about other languages

http://news.mit.edu/2014/essays-english-yield-information-about-other-languages-0723

ARCHIVE: Computer system automatically solves word problems

http://news.mit.edu/2014/computer-system-automatically-solves-word-problems-0502

ARCHIVE: Mining physicians' notes for medical insights

http://news.mit.edu/2012/digital-medical-records-offer-insights-1031

ARCHIVE: Explaining the origins of word order using information theory

http://news.mit.edu/2012/applying-information-theory-to-linguistics-1010
