New framework brings accuracy, efficiency to identifying stop words

December 02, 2019

A research team led by Northwestern Engineering's Luis Amaral has developed an algorithmic approach for data analysis that automatically recognizes uninformative words -- known as stop words -- in a large collection of text. The findings could save substantial time in natural language processing and reduce its energy footprint.

"One of the challenges in machine learning and artificial intelligence approaches is that you don't know which data is useful to an algorithm and which data is unhelpful," said Amaral, Erastus Otis Haven Professor of Chemical and Biological Engineering at the McCormick School of Engineering. "Using information theory, we created a framework that reveals which words are uninformative for the task at hand."

The trouble with stop words

One of the most common techniques data scientists use in natural language processing is the bag-of-words model, which analyzes the words in a given text without considering the order in which they appear. To streamline the process, researchers filter out stop words, the words that add no context to the data analysis. Many stop word lists are manually curated by researchers, making them time-consuming to develop and maintain, as well as difficult to generalize across languages and disciplines.
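As a rough illustration of the bag-of-words model with a manually curated stop word list, consider the minimal Python sketch below. The function name `bag_of_words`, the tiny `STOP_WORDS` set and the sample sentence are all hypothetical, chosen only for this example; real lists contain hundreds of entries and tokenization is usually more careful than splitting on whitespace.

```python
from collections import Counter

# Hypothetical hand-curated stop word list (illustrative only).
STOP_WORDS = frozenset({"the", "you", "a", "of", "is"})


def bag_of_words(text, stop_words=frozenset()):
    """Count word occurrences, ignoring word order and any stop words."""
    tokens = text.lower().split()  # naive whitespace tokenization
    return Counter(t for t in tokens if t not in stop_words)


doc = "the qubit is the basic unit of a quantum computer"
bow = bag_of_words(doc, STOP_WORDS)
# Five words survive the filter, each counted once:
# qubit, basic, unit, quantum, computer
```

The list has to be written and maintained by hand, and it says nothing about words that are uninformative only within a particular subject area.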

"Imagine you analyze millions of blog posts and want to learn what topic each post addresses," said Amaral, who codirects the Northwestern Institute on Complex Systems. "You would typically filter out common words like 'the' and 'you,' which don't provide any background about the topic."

However, the majority of words that are not useful for that specific task depend on the language and the blog's particular subject area. "For a collection of blogs on electronics, for example, there are many words that could not enable an algorithm to determine whether a blog post is about quantum computing or semiconductors," he added.

An information theoretic framework

The research team used information theory to develop a model that more accurately and efficiently identifies stop words. Central to the model is a 'conditional entropy' metric that quantifies how informative a given word is. The more informative the word, the lower its conditional entropy. By comparing the observed and the expected values of conditional entropy, the researchers could measure the information content of specific words.
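One way to picture the comparison of observed and expected conditional entropy is the Python sketch below. It is not the researchers' implementation. It assumes the entropy is taken over how a word's occurrences are spread across documents, and it assumes the expected value comes from randomly shuffling all tokens across documents while keeping document lengths fixed. The function names and the two toy documents are hypothetical.

```python
import math
import random
from collections import Counter


def conditional_entropy(counts):
    """Entropy in bits of a word's occurrence distribution over documents."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def word_entropies(docs):
    """Map each word to its conditional entropy; docs is a list of token lists."""
    per_word = {}
    for i, doc in enumerate(docs):
        for tok, c in Counter(doc).items():
            per_word.setdefault(tok, [0] * len(docs))[i] = c
    return {w: conditional_entropy(cs) for w, cs in per_word.items()}


def expected_entropies(docs, n_shuffles=50, seed=0):
    """Average entropy per word when tokens are shuffled across documents
    (assumed null model; document lengths are preserved)."""
    rng = random.Random(seed)
    tokens = [t for d in docs for t in d]
    lengths = [len(d) for d in docs]
    sums = {}
    for _ in range(n_shuffles):
        rng.shuffle(tokens)
        shuffled, start = [], 0
        for n in lengths:
            shuffled.append(tokens[start:start + n])
            start += n
        for w, h in word_entropies(shuffled).items():
            sums[w] = sums.get(w, 0.0) + h
    return {w: s / n_shuffles for w, s in sums.items()}


docs = ["the qubit the gate".split(), "the wafer the doping".split()]
observed = word_entropies(docs)      # "the" -> 1.0 bit, "qubit" -> 0.0 bits
expected = expected_entropies(docs)
# Information content: how far below the shuffled baseline a word falls.
information = {w: expected[w] - observed[w] for w in observed}
```

In this sketch a word that is spread evenly across documents, such as "the", has an observed entropy at or near the shuffled baseline and so carries little information. A word that occurs only once always has zero entropy, observed or shuffled, which is why the raw entropy alone is not enough and the comparison with the expected value matters.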

To test the model, the researchers compared its performance to common topic modelling approaches, which infer the words most related to a given topic by comparing them to other text in the data set. The framework produced improved accuracy and reproducibility across the texts studied, and it could be applied more readily to other languages. Additionally, the system achieved optimal performance using significantly less data.

"Using our approach, we could filter 80 percent or more of the data and actually increase the performance of existing algorithms for topic classification of text corpora," Amaral said. "In addition, by filtering so much of the data, we are able to dramatically reduce the amount of computational resources needed."

Beyond saving time, the filtering system could lead to long-term energy savings, combating the negative impact large-scale computing has on climate change.

A paper describing the work was published December 2 in the journal Nature Machine Intelligence. Amaral was a co-corresponding author on the paper along with Martin Gerlach, a postdoctoral fellow in Amaral's lab.

While the researchers' analysis was restricted to bag-of-words approaches, Amaral is confident that his system could be expanded to account for additional structural features of language, including sentences and paragraphs.

In addition, since information theory provides a general framework for the analysis of any sequence of symbols, the researchers' system could be applicable beyond text analysis, supporting pre-processing methods for analyzing audio, images -- even genes.

"We have begun applying this approach to the analysis of data from experiments measuring gene-specific RNA-molecules in individual cells as a way to automatically identify different cell types," Gerlach said. "Filtering uninformative genes -- think of them as 'stop genes' -- is particularly promising to increase accuracy. Those measurements are much more difficult compared to texts and current heuristics are not nearly as well developed."

Northwestern University
