
New framework brings accuracy, efficiency to identifying stop words

December 02, 2019

A research team led by Northwestern Engineering's Luis Amaral has developed an algorithmic approach for data analysis that automatically recognizes uninformative words -- known as stop words -- in a large collection of text. The findings could save substantial time during natural language processing and reduce its energy footprint.

"One of the challenges in machine learning and artificial intelligence approaches is that you don't know which data is useful to an algorithm and which data is unhelpful," said Amaral, Erastus Otis Haven Professor of Chemical and Biological Engineering at the McCormick School of Engineering. "Using information theory, we created a framework that reveals which words are uninformative for the task at hand."

The trouble with stop words

One of the most common techniques data scientists use in natural language processing is the bag-of-words model, which analyzes the words in a given text without considering the order in which they appear. To streamline the process, researchers filter out stop words, the words that add no context to the data analysis. Many stop word lists are manually curated by researchers, making them time-consuming to develop and maintain, as well as difficult to generalize across languages and disciplines.
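As a rough illustration of the bag-of-words idea with a manually curated stop word list, here is a minimal Python sketch. The `STOP_WORDS` set, the `bag_of_words` helper, and the sample sentence are made up for this example and do not come from the researchers' work.

```python
from collections import Counter

# Hypothetical, hand-curated stop word list (illustrative only).
STOP_WORDS = {"the", "you", "a", "is", "of"}


def bag_of_words(text, stop_words=frozenset()):
    """Count word occurrences, ignoring word order and any listed stop words."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t not in stop_words)


doc = "the qubit is the basic unit of a quantum computer"
unfiltered = bag_of_words(doc)            # every token is counted
filtered = bag_of_words(doc, STOP_WORDS)  # stop words are dropped first
```

A list like this has to be rewritten by hand for every new language or subject area, which is the maintenance burden described above.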

"Imagine you analyze millions of blog posts and want to learn what topic each post addresses," said Amaral, who codirects the Northwestern Institute on Complex Systems. "You would typically filter out common words like 'the' and 'you,' which don't provide any background about the topic."

However, the majority of words that are not useful for that specific task depend on the language and the blog's particular subject area. "For a collection of blogs on electronics, for example, there are many words that could not enable an algorithm to determine whether a blog post is about quantum computing or semiconductors," he added.

An information-theoretic framework

The research team used information theory to develop a model that more accurately and efficiently identifies stop words. Central to the model is a 'conditional entropy' metric that quantifies how informative a given word is: the more informative the word, the lower its conditional entropy. By comparing the observed and the expected values of conditional entropy, the researchers could measure the information content of specific words.
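The article does not give the formula, so the following Python sketch is only one plausible reading of the idea. It assumes the entropy is taken over how a word's occurrences are spread across documents, and that the "expected" value comes from randomly shuffling all words across documents while keeping document lengths fixed. The function names, the toy corpus, and the `trials` and `seed` parameters are assumptions made for this example.

```python
import math
import random


def conditional_entropy(word, docs):
    """Entropy in bits of how `word`'s occurrences are spread over documents."""
    counts = [doc.count(word) for doc in docs]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def expected_entropy(word, docs, trials=200, seed=0):
    """Average entropy after shuffling all tokens across documents.

    Assumed null model: document lengths are kept, word positions are random.
    """
    rng = random.Random(seed)
    tokens = [t for doc in docs for t in doc]
    lengths = [len(doc) for doc in docs]
    total = 0.0
    for _ in range(trials):
        rng.shuffle(tokens)
        shuffled, start = [], 0
        for n in lengths:
            shuffled.append(tokens[start:start + n])
            start += n
        total += conditional_entropy(word, shuffled)
    return total / trials


def information_content(word, docs, trials=200, seed=0):
    """Expected minus observed entropy; larger means more informative."""
    return expected_entropy(word, docs, trials, seed) - conditional_entropy(word, docs)


# Toy corpus: each document is a list of tokens.
docs = [
    "the qubit and the qubit entangle qubit".split(),
    "the wafer and the silicon chip".split(),
    "the blog post and the reader".split(),
]
h_the = conditional_entropy("the", docs)      # spread evenly over all documents
h_qubit = conditional_entropy("qubit", docs)  # concentrated in one document
```

In this toy corpus, "the" is spread evenly and so has the highest possible entropy for three documents, while "qubit" appears in only one document and has zero entropy. Under these assumptions, a word whose observed entropy is close to its shuffled expectation would be a stop word candidate.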

To test the model, the researchers compared its performance to common topic modelling approaches, which infer the words most related to a given topic by comparing them to other text in the data set. The framework produced improved accuracy and reproducibility across the texts studied, and it can be applied to other languages in a straightforward manner. Additionally, the system achieved optimal performance using significantly less data.

"Using our approach, we could filter 80 percent or more of the data and actually increase the performance of existing algorithms for topic classification of text corpora," Amaral said. "In addition, by filtering so much of the data, we are able to dramatically reduce the amount of computational resources needed."

Beyond saving time, the filtering system could lead to long-term energy savings, combating the negative impact large-scale computing has on climate change.

A paper describing the work was published December 2 in the journal Nature Machine Intelligence. Amaral was a co-corresponding author on the paper along with Martin Gerlach, a postdoctoral fellow in Amaral's lab.

While the researchers' analysis was restricted to bag-of-words approaches, Amaral is confident that his system could be expanded to account for additional structural features of language, including sentences and paragraphs.

In addition, since information theory provides a general framework for the analysis of any sequence of symbols, the researchers' system could be applicable beyond text analysis, supporting pre-processing methods for analyzing audio, images -- even genes.

"We have begun applying this approach to the analysis of data from experiments measuring gene-specific RNA-molecules in individual cells as a way to automatically identify different cell types," Gerlach said. "Filtering uninformative genes -- think of them as 'stop genes' -- is particularly promising to increase accuracy. Those measurements are much more difficult compared to texts and current heuristics are not nearly as well developed."

Northwestern University
