Nav: Home

Finding patterns in corrupted data

October 26, 2016

CAMBRIDGE, Mass -- Data analysis -- and particularly big-data analysis -- is often a matter of fitting data to some sort of mathematical model. The most familiar example of this might be linear regression, which finds a line that approximates a distribution of data points. But fitting data to probability distributions, such as the familiar bell curve, is just as common.

If, however, a data set has just a few corrupted entries -- say, outlandishly improbable measurements -- standard data-fitting techniques can break down. This problem becomes much more acute with high-dimensional data, or data with many variables, which is ubiquitous in the digital age.

Since the early 1960s, it's been known that there are algorithms for weeding corruptions out of high-dimensional data, but none of the algorithms proposed in the past 50 years are practical when the variable count gets above, say, 12.

That's about to change. Earlier this month, at the IEEE Symposium on Foundations of Computer Science, a team of researchers from MIT's Computer Science and Artificial Intelligence Laboratory, the University of Southern California, and the University of California at San Diego presented a new set of algorithms that can efficiently fit probability distributions to high-dimensional data.

Remarkably, at the same conference, researchers from Georgia Tech presented a very similar algorithm.

The pioneering work on "robust statistics," or statistical methods that can tolerate corrupted data, was done by statisticians, but both new papers come from groups of computer scientists. That probably reflects a shift of attention within the field, toward the computational efficiency of model-fitting techniques.

"From the vantage point of theoretical computer science, it's much more apparent how rare it is for a problem to be efficiently solvable," says Ankur Moitra, the Rockwell International Career Development Assistant Professor of Mathematics at MIT and one of the leaders of the MIT-USC-UCSD project. "If you start off with some hypothetical thing -- 'Man, I wish I could do this. If I could, it would be robust' -- you're going to have a bad time, because it will be inefficient. You should start off with the things that you know that you can efficiently do, and figure out how to piece them together to get robustness."

Resisting corruption

To understand the principle behind robust statistics, Moitra explains, consider the normal distribution -- the bell curve, or in mathematical parlance, the one-dimensional Gaussian distribution. The one-dimensional Gaussian is completely described by two parameters: the mean, or average, value of the data, and the variance, which is a measure of how quickly the data spreads out around the mean.

If the data in a data set -- say, people's heights in a given population -- is well-described by a Gaussian distribution, then the mean is just the arithmetic average. But suppose you have a data set consisting of height measurements of 100 women, and while most of them cluster around 64 inches -- some a little higher, some a little lower -- one of them, for some reason, is 1,000 inches. Taking the arithmetic average will peg a woman's mean height at 6 feet 4 inches, not 5 feet 4 inches.

One way to avoid such a nonsensical result is to estimate the mean, not by taking the numerical average of the data, but by finding its median value. This would involve listing all the 100 measurements in order, from smallest to highest, and taking the 50th or 51st. An algorithm that uses the median to estimate the mean is thus more robust, meaning it's less responsive to corrupted data, than one that uses the average.

The median is just an approximation of the mean, however, and the accuracy of the approximation decreases rapidly with more variables. Big-data analysis might require examining thousands or even millions of variables; in such cases, approximating the mean with the median would often yield unusable results.

Identifying outliers

One way to weed corrupted data out of a high-dimensional data set is to take 2-D cross sections of the graph of the data and see whether they look like Gaussian distributions. If they don't, you may have located a cluster of spurious data points, such as that 80-foot-tall woman, which can simply be excised.

The problem is that, with all previously known algorithms that adopted this approach, the number of cross sections required to find corrupted data was an exponential function of the number of dimensions. By contrast, Moitra and his coauthors -- Gautam Kamath and Jerry Li, both MIT graduate students in electrical engineering and computer science; Ilias Diakonikolas and Alistair Stewart of USC; and Daniel Kane of USCD -- found an algorithm whose running time increases with the number of data dimensions at a much more reasonable rate (or, polynomially, in computer science jargon).

Their algorithm relies on two insights. The first is what metric to use when measuring how far away a data set is from a range of distributions with approximately the same shape. That allows them to tell when they've winnowed out enough corrupted data to permit a good fit.

The other is how to identify the regions of data in which to begin taking cross sections. For that, the researchers rely on something called the kurtosis of a distribution, which measures the size of its tails, or the rate at which the concentration of data decreases far from the mean. Again, there are multiple ways to infer kurtosis from data samples, and selecting the right one is central to the algorithm's efficiency.

The researchers' approach works with Gaussian distributions, certain combinations of Gaussian distributions, another common distribution called the product distribution, and certain combinations of product distributions. Although they believe that their approach can be extended to other types of distributions, in ongoing work, their chief focus is on applying their techniques to real-world data.

ARCHIVE: Dueling algorithms

Massachusetts Institute of Technology

Related Algorithm Articles:

Scientists use algorithm to peer through opaque brains
A new algorithm helps scientists record the activity of individual neurons within a volume of brain tissue.
Algorithm generates origami folding patterns for any shape
A new algorithm generates practical paper-folding patterns to produce any 3-D structure.
New algorithm tracks neurons in bendy brain of freely crawling worm
Scientists at Princeton University have developed a new algorithm to track neurons in the brain of the worm Caenorhabditis elegans while it crawls.
Does my algorithm work? There's no shortcut for community detection
Community detection is an important tool for scientists studying networks, but a new paper published in Science Advances calls into question the common practice of using metadata for ground truth validation.
'Cyclops' algorithm spots daily rhythms in cells
Humans, like virtually all other complex organisms on Earth, have adapted to their planet's 24-hour cycle of sunlight and darkness.
An algorithm that knows when you'll get bored with your favorite mobile game
Researchers from the Tokyo-based company Silicon Studio, led by Spanish data scientist África Periáñez, have developed a new algorithm that predicts when a user will leave a mobile game.
Algorithm identified Trump as 'not-married'
Scientists from Russia and Singapore created an algorithm that predicts user marital status with 86% precision using data from three social networks instead of one.
A novel positioning algorithm based on self-adaptive algorithm
Much attention has been paid to the Taylor series expansion (TSE) method these years, which has been extensively used for solving nonlinear equations for its good robustness and accuracy of positioning.
Algorithm can create a bridge between Clinton and Trump supporters
The article that received the best student-paper award in the Tenth International Conference on Web Search and Data Mining (WSDM 2017) builds algorithmic techniques to mitigate the rising polarization by connecting people with opposing views -- and evaluates them on Twitter.
Deep learning algorithm does as well as dermatologists in identifying skin cancer
In hopes of creating better access to medical care, Stanford researchers have trained an algorithm to diagnose skin cancer.

Related Algorithm Reading:

Introduction to Algorithms, 3rd Edition (The MIT Press)
by Thomas H. Cormen (Author), Charles E. Leiserson (Author), Ronald L. Rivest (Author), Clifford Stein (Author)

Algorithms (4th Edition)
by Robert Sedgewick (Author), Kevin Wayne (Author)

The Algorithm Design Manual
by Steven S Skiena (Author)

Grokking Algorithms: An illustrated guide for programmers and other curious people
by Aditya Bhargava (Author)

by Sanjoy Dasgupta Algorithms (Author), Christos H. Papadimitriou Algorithms (Author), Umesh Vazirani Algorithms (Author)

Introduction to Algorithms (Eastern Economy Edition)
by Thomas H. Cormen (Author), Charles E. Leiserson (Author), Ronald L. Rivest (Author), Clifford Stein (Author)

Data Structures and Algorithm Analysis in Java (3rd Edition)
by Mark A. Weiss (Author)

Algorithm Design
by Jon Kleinberg (Author), Éva Tardos (Author)

Computer Vision: Algorithms and Applications (Texts in Computer Science)
by Richard Szeliski (Author)

Data Structures & Algorithm Analysis in C++
by Mark A. Weiss (Author)

Best Science Podcasts 2019

We have hand picked the best science podcasts for 2019. Sit back and enjoy new science podcasts updated daily from your favorite science news services and scientists.
Now Playing: TED Radio Hour

Approaching With Kindness
We often forget to say the words "thank you." But can those two words change how you — and those around you — look at the world? This hour, TED speakers on the power of gratitude and appreciation. Guests include author AJ Jacobs, author and former baseball player Mike Robbins, Dr. Laura Trice, Professor of Management Christine Porath, and former Danish politician Özlem Cekic.
Now Playing: Science for the People

#509 Anisogamy: The Beginning of Male and Female
This week we discuss how the sperm and egg came to be, and how a difference of reproductive interest has led to sexual conflict in bed bugs. We'll be speaking with Dr. Geoff Parker, an evolutionary biologist credited with developing a theory to explain the evolution of two sexes, about anisogamy, sexual reproduction through the fusion of two different gametes: the egg and the sperm. Then we'll speak with Dr. Roberto Pereira, research scientist in urban entomology at the University of Florida, about traumatic insemination in bed bugs.