# Technique shrinks data sets for easier analysis

December 15, 2016One way to handle big data is to shrink it. If you can identify a small subset of your data set that preserves its salient mathematical relationships, you may be able to perform useful analyses on it that would be prohibitively time consuming on the full set.

The methods for creating such "coresets" vary according to application, however. Last week, at the Annual Conference on Neural Information Processing Systems, researchers from MIT's Computer Science and Artificial Intelligence Laboratory and the University of Haifa in Israel presented a new coreset-generation technique that's tailored to a whole family of data analysis tools with applications in natural-language processing, computer vision, signal processing, recommendation systems, weather prediction, finance, and neuroscience, among many others.

"These are all very general algorithms that are used in so many applications," says Daniela Rus, the Andrew and Erna Viterbi Professor of Electrical Engineering and Computer Science at MIT and senior author on the new paper. "They're fundamental to so many problems. By figuring out the coreset for a huge matrix for one of these tools, you can enable computations that at the moment are simply not possible."

As an example, in their paper the researchers apply their technique to a matrix -- that is, a table -- that maps every article on the English version of Wikipedia against every word that appears on the site. That's 1.4 million articles, or matrix rows, and 4.4 million words, or matrix columns.

That matrix would be much too large to analyze using low-rank approximation, an algorithm that can deduce the topics of free-form texts. But with their coreset, the researchers were able to use low-rank approximation to extract clusters of words that denote the 100 most common topics on Wikipedia. The cluster that contains "dress," "brides," "bridesmaids," and "wedding," for instance, appears to denote the topic of weddings; the cluster that contains "gun," "fired," "jammed," "pistol," and "shootings" appears to designate the topic of shootings.

Joining Rus on the paper are Mikhail Volkov, an MIT postdoc in electrical engineering and computer science, and Dan Feldman, director of the University of Haifa's Robotics and Big Data Lab and a former postdoc in Rus's group.

The researchers' new coreset technique is useful for a range of tools with names like singular-value decomposition, principal-component analysis, and nonnegative matrix factorization. But what they all have in common is dimension reduction: They take data sets with large numbers of variables and find approximations of them with far fewer variables.

In this, these tools are similar to coresets. But coresets simply reduce the size of a data set, while the dimension-reduction tools change its description in a way that's guaranteed to preserve as much information as possible. That guarantee, however, makes the tools much more computationally intensive than coreset generation -- too computationally intensive for practical application to large data sets.

The researchers believe that their technique could be used to winnow a data set with, say, millions of variables -- such as descriptions of Wikipedia pages in terms of the words they use -- to merely thousands. At that point, a widely used technique like principal-component analysis could reduce the number of variables to mere hundreds, or even lower.

The researchers' technique works with what is called sparse data. Consider, for instance, the Wikipedia matrix, with its 4.4 million columns, each representing a different word. Any given article on Wikipedia will use only a few thousand distinct words. So in any given row -- representing one article -- only a few thousand matrix slots out of 4.4 million will have any values in them. In a sparse matrix, most of the values are zero.

Crucially, the new technique preserves that sparsity, which makes its coresets much easier to deal with computationally. Calculations become lot easier if they involve a lot of multiplication by and addition of zero.

The new coreset technique uses what's called a merge-and-reduce procedure. It starts by taking, say, 20 data points in the data set and selecting 10 of them as most representative of the full 20. Then it performs the same procedure with another 20 data points, giving it two reduced sets of 10, which it merges to form a new set of 20. Then it does another reduction, from 20 down to 10.

Even though the procedure examines every data point in a huge data set, because it deals with only small collections of points at a time, it remains computationally efficient. And in their paper, the researchers prove that, for applications involving an array of common dimension-reduction tools, their reduction method provides a very good approximation of the full data set.

That method depends on a geometric interpretation of the data, involving something called a hypersphere, which is the multidimensional analogue of a circle. Any piece of multivariable data can be thought of as a point in a multidimensional space. In the same way that the pair of numbers (1, 1) defines a point in a two-dimensional space -- the point one step over on the X-axis and one step up on the Y-axis -- a row of the Wikipedia table, with its 4.4 million numbers, defines a point in a 4.4-million-dimensional space.

The researchers' reduction algorithm begins by finding the average value of the subset of data points -- let's say 20 of them -- that it's going to reduce. This, too, defines a point in a high-dimensional space; call it the origin. Each of the 20 data points is then "projected" onto a hypersphere centered at the origin. That is, the algorithm finds the unique point on the hypersphere that's in the direction of the data point.

The algorithm selects one of the 20 data projections on the hypersphere. It then selects the projection on the hypersphere farthest away from the first. It finds the point midway between the two and then selects the data projection farthest away from the midpoint; then it finds the point midway between those two points and selects the data projection farthest away from it; and so on.

The researchers were able to prove that the midpoints selected through this method will converge very quickly on the center of the hypersphere. The method will quickly select a subset of points whose average value closely approximates that of the 20 initial points. That makes them particularly good candidates for inclusion in the coreset.

-end-

**Additional background**

PAPER: Dimensionality reduction of massive sparse datasets using coresets

ARCHIVE: To handle big data, shrink it

ARCHIVE: Collecting just the right data

ARCHIVE: Speeding algorithms by shrinking data

Massachusetts Institute of Technology

**Related Algorithm Articles:**

New algorithm could unleash the power of quantum computers

A new algorithm that fast forwards simulations could bring greater use ability to current and near-term quantum computers, opening the way for applications to run past strict time limits that hamper many quantum calculations.

A new algorithm that fast forwards simulations could bring greater use ability to current and near-term quantum computers, opening the way for applications to run past strict time limits that hamper many quantum calculations.

QUT algorithm could quash Twitter abuse of women

Online abuse targeting women, including threats of harm or sexual violence, has proliferated across all social media platforms but QUT researchers have developed a sophisticated statistical model to identify misogynistic content and help drum it out of the Twittersphere.

Online abuse targeting women, including threats of harm or sexual violence, has proliferated across all social media platforms but QUT researchers have developed a sophisticated statistical model to identify misogynistic content and help drum it out of the Twittersphere.

New learning algorithm should significantly expand the possible applications of AI

The e-prop learning method developed at Graz University of Technology forms the basis for drastically more energy-efficient hardware implementations of Artificial Intelligence.

The e-prop learning method developed at Graz University of Technology forms the basis for drastically more energy-efficient hardware implementations of Artificial Intelligence.

Algorithm predicts risk for PTSD after traumatic injury

With high precision, a new algorithm predicts which patients treated for traumatic injuries in the emergency department will later develop posttraumatic stress disorder.

With high precision, a new algorithm predicts which patients treated for traumatic injuries in the emergency department will later develop posttraumatic stress disorder.

New algorithm uses artificial intelligence to help manage type 1 diabetes

Researchers and physicians at Oregon Health & Science University have designed a method to help people with type 1 diabetes better manage their glucose levels.

Researchers and physicians at Oregon Health & Science University have designed a method to help people with type 1 diabetes better manage their glucose levels.

A new algorithm predicts the difficulty in fighting fire

The tool completes previous studies with new variables and could improve the ability to respond to forest fires.

The tool completes previous studies with new variables and could improve the ability to respond to forest fires.

New algorithm predicts optimal materials among all possible compounds

Skoltech researchers have offered a solution to the problem of searching for materials with required properties among all possible combinations of chemical elements.

Skoltech researchers have offered a solution to the problem of searching for materials with required properties among all possible combinations of chemical elements.

New algorithm to help process biological images

Skoltech researchers have presented a new biological image processing method that accurately picks out specific biological objects in complex images.

Skoltech researchers have presented a new biological image processing method that accurately picks out specific biological objects in complex images.

Skoltech scientists break Google's quantum algorithm

In the near term, Google has devised new quantum enhanced algorithms that operate in the presence of realistic noise.

In the near term, Google has devised new quantum enhanced algorithms that operate in the presence of realistic noise.

The most human algorithm

A team from the research group SEES:lab of the Department of Chemical Engineering of the Universitat Rovira I Virgili and ICREA has made a breakthrough with the development of a new algorithm that makes more accurate predictions and generates mathematical models that also make it possible to understand these predictions.

A team from the research group SEES:lab of the Department of Chemical Engineering of the Universitat Rovira I Virgili and ICREA has made a breakthrough with the development of a new algorithm that makes more accurate predictions and generates mathematical models that also make it possible to understand these predictions.

## Trending Science News

**Current Coronavirus (COVID-19) News**

## Top Science Podcasts

We have hand picked the**top science podcasts of 2020**.

**Now Playing: TED Radio Hour**

**Warped Reality**

False information on the internet makes it harder and harder to know what's true, and the consequences have been devastating. This hour, TED speakers explore ideas around technology and deception. Guests include law professor Danielle Citron, journalist Andrew Marantz, and computer scientist Joy Buolamwini.

**Now Playing: Science for the People**

**#576 Science Communication in Creative Places**

When you think of science communication, you might think of TED talks or museum talks or video talks, or... people giving lectures. It's a lot of people talking. But there's more to sci comm than that. This week host Bethany Brookshire talks to three people who have looked at science communication in places you might not expect it. We'll speak with Mauna Dasari, a graduate student at Notre Dame, about making mammals into a March Madness match. We'll talk with Sarah Garner, director of the Pathologists Assistant Program at Tulane University School of Medicine, who takes pathology instruction out of...

**Now Playing: Radiolab**

**What If?**

There's plenty of speculation about what Donald Trump might do in the wake of the election. Would he dispute the results if he loses? Would he simply refuse to leave office, or even try to use the military to maintain control? Last summer, Rosa Brooks got together a team of experts and political operatives from both sides of the aisle to ask a slightly different question. Rather than arguing about whether he'd do those things, they dug into what exactly would happen if he did. Part war game part choose your own adventure, Rosa's Transition Integrity Project doesn't give us any predictions, and it isn't a referendum on Trump. Instead, it's a deeply illuminating stress test on our laws, our institutions, and on the commitment to democracy written into the constitution. This episode was reported by Bethel Habte, with help from Tracie Hunte, and produced by Bethel Habte. Jeremy Bloom provided original music. Support Radiolab by becoming a member today at Radiolab.org/donate. You can read The Transition Integrity Project's report here.