# Mining for gold in a mountain of data

February 15, 2018

After shopping at your favorite grocery store week after week, you finally earned a free turkey.

The cashier scanned your loyalty card at every checkout, rewarding you with points toward a holiday turkey or ham - while at the same time sending an itemization of everything you bought to a database.

The grocery analyzes the "big data" it collects from you and other shoppers to uncover hidden patterns, correlations and other insights. The result is smarter business moves, more efficient operations, higher profits and the promise of happier customers.

Researchers estimate that more than 2.5 exabytes (that's 2.5 billion gigabytes) of data are generated throughout the world every day. The use of loyalty cards, fitness trackers, web-based email, public services and social media - including every post, comment, like, photo and geotag - all contribute to this vast warehouse of information.

Big data involves not only collecting data, but storing, analyzing, searching, sharing, transferring, visualizing, querying and updating it. In fact, big data is so voluminous and complex that traditional ways of processing have proved inadequate. Hundreds or even thousands of computers running in parallel are needed for proper analyses.

To help address these computational bottlenecks, a team from the Industrial and Systems Engineering department at Lehigh University gathered with their colleagues at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia Feb. 5-7, 2018.

The KAUST Research Workshop on Optimization and Big Data brought researchers from across academia and industry to discuss big data optimization algorithms, theory, applications and systems.

Tamás Terlaky, the George N. and Soteria Kledaras '87 Endowed Chair professor, was the keynote speaker at KAUST. Terlaky opened the workshop with his presentation, "60 Years of Interior Point Methods: From Periphery to Glory."

Terlaky's keynote focused on the Interior Point Method (IPM), a technique that rose to prominence in 1984. This methodology ignited far-reaching, intensive research toward discovering effective ways to solve large-scale optimization problems such as those found in big data analytics.

"Increasingly, we are getting different kinds of solutions in optimization," Terlaky said. "Computation has become ubiquitous, and thanks also to the 'Interior Point Revolution' we have seen tremendous advances in computing."

The concepts of IPMs and "machine learning" - where computers acquire the ability to learn and make decisions - were first proposed in the '50s and were ahead of their time, Terlaky said. With computer technology still in its infancy, they failed to make any real impact. By the '80s, however, the stars aligned to make the IPM revolution possible.

Now that we are in the era of big data, Terlaky said, recent advances in computer and information technology both enable and require revolutionary advances in machine learning methodologies. "History always repeats itself," Terlaky said. "You should learn from it."

Terlaky concluded the day's session with "A Polynomial-time Rescaled von Neumann Algorithm for Linear Feasibility Problems."

Furthering the discussion on optimization was Katya Scheinberg, Harvey E. Wagner Endowed Chair professor, with her presentation "Direct and Efficient Optimization of Prediction Error and AUC of Linear Classifiers." Also presenting was Martin Takáč, assistant professor, with "SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient."

Xi He, a fourth-year Ph.D. candidate, gave the poster presentation "An Inexact Regularized Stochastic Newton Method for Nonconvex Optimization." Takáč is his advisor.

**Abstracts of the four talks and the poster presentation are given below.**

**60 Years of Interior Point Methods: From Periphery to Glory - Tamás Terlaky**

The basic concepts of Interior Point Methods (IPMs) were introduced by Frisch in the 1950s, and further developed in the 1960s, among others by Fiacco-McCormick (SUMT) and Dikin (affine scaling). By the early 1970s it was concluded that, mostly due to numerical instability, IPMs most probably would not be viable algorithms for solving large-scale optimization problems. Karmarkar's 1984 paper and the subsequent "Interior Point Revolution" fundamentally changed the landscape of optimization. IPMs became the method of choice to solve large-scale linear optimization problems, and new classes of conic and convex optimization problems became efficiently solvable. The new powerful algorithmic and software tools opened new areas of applications. In this talk we walk through the history of IPMs and highlight the scientific and computer technology advances that made the Interior Point Revolution possible.

**A Polynomial-time Rescaled von Neumann Algorithm for Linear Feasibility Problems - Tamás Terlaky**

The perceptron and von Neumann algorithms are known to be closely related, like duals. A deterministic rescaled version of the perceptron algorithm was proved to be polynomial by Peña and Soheili. Recently, Chubanov proposed a method which solves homogeneous linear equality systems with positive variables in polynomial time. Chubanov's method can be considered as a column-wise rescaling procedure. We adapt Chubanov's method to the von Neumann problem, and so we design a polynomial-time column-wise rescaling von Neumann algorithm. This algorithm is the first variant of the von Neumann algorithm with polynomial complexity. Joint work with Dan Li and Kees Roos.
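For readers unfamiliar with the starting point of this work, the following is a minimal sketch of the classic von Neumann algorithm, not the rescaled polynomial-time variant described in the abstract. It assumes the columns are already normalized to unit length; the function name `von_neumann`, the tolerance, and the toy data are illustrative choices that do not come from the talk.

```python
import math

def von_neumann(columns, tol=0.05, max_iter=1000):
    """Classic von Neumann algorithm (illustrative sketch, not the rescaled variant).

    Looks for weights x >= 0 with sum(x) == 1 such that the weighted sum of the
    unit-norm columns is approximately the zero vector. Returns the weights
    reached when the residual norm drops to tol (or after max_iter steps), or
    None when a separating direction shows that no such weights exist.
    """
    n = len(columns)
    x = [0.0] * n
    x[0] = 1.0
    b = list(columns[0])  # invariant: b == sum_j x[j] * columns[j]
    for _ in range(max_iter):
        if math.sqrt(sum(v * v for v in b)) <= tol:
            return x
        dots = [sum(ai * bi for ai, bi in zip(a, b)) for a in columns]
        s = min(range(n), key=dots.__getitem__)
        if dots[s] > 0.0:
            return None  # every column has a positive inner product with b
        # Step to the point of minimum norm on the segment between b and column s.
        bb = sum(v * v for v in b)
        lam = (1.0 - dots[s]) / (bb - 2.0 * dots[s] + 1.0)
        b = [lam * bi + (1.0 - lam) * ai for bi, ai in zip(b, columns[s])]
        x = [lam * xi for xi in x]
        x[s] += 1.0 - lam
    return x

# Toy data (assumed): three unit vectors 120 degrees apart surround the origin.
h = math.sqrt(3.0) / 2.0
cols = [(1.0, 0.0), (-0.5, h), (-0.5, -h)]
weights = von_neumann(cols)

# Two columns in the same quadrant cannot combine to zero.
no_solution = von_neumann([(1.0, 0.0), (0.0, 1.0)])
```

The residual norm of this classic scheme shrinks only slowly in general, which is the kind of behavior that rescaling procedures aim to improve.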

**Direct and Efficient Optimization of Prediction Error and AUC of Linear Classifiers - Katya Scheinberg**

The predictive quality of most machine learning models is measured by expected prediction error or so-called Area Under the Curve (AUC). However, these functions are not used in empirical loss minimization, because their empirical approximations are nonconvex and discontinuous, and more importantly have zero derivative almost everywhere. Instead, other loss functions are used, such as logistic loss. In this work, we show that in the case of linear predictors, and under the assumption that the data has a normal distribution, the expected error and the expected AUC are not only smooth, but have well-defined derivatives, which depend on the first and second moments of the distribution. We show that these derivatives can be approximated and used in empirical risk minimization, thus proposing a gradient-based optimization method for direct optimization of prediction error and AUC. Moreover, the proposed algorithm has no dependence on the size of the dataset, unlike logistic regression and all other well-known empirical risk minimization techniques.
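To illustrate why the empirical AUC is awkward for gradient-based training, here is a small sketch that computes it for a linear classifier as the fraction of correctly ranked (positive, negative) pairs. The function name `empirical_auc` and the toy data are assumptions made for illustration; this is not the method proposed in the talk.

```python
def empirical_auc(w, positives, negatives):
    """Fraction of (positive, negative) pairs ranked correctly by the score w.x."""
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x))

    correct = 0.0
    for p in positives:
        for q in negatives:
            sp, sq = score(p), score(q)
            if sp > sq:
                correct += 1.0
            elif sp == sq:
                correct += 0.5  # count ties as half a correct pair
    return correct / (len(positives) * len(negatives))

# Toy data (assumed): three positive and three negative examples in 2-D.
positives = [(2.0, 1.0), (1.5, 2.0), (0.5, 0.2)]
negatives = [(0.0, 0.5), (1.0, -1.0), (1.0, 1.0)]

auc_a = empirical_auc((1.0, 1.0), positives, negatives)
auc_b = empirical_auc((1.0, 1.001), positives, negatives)  # slightly perturbed weights
auc_c = empirical_auc((-1.0, -1.0), positives, negatives)  # reversed ranking
```

Because the value depends only on how the pairs are ordered, a small change in the weights leaves it unchanged (`auc_a` and `auc_b` are equal here), while a large change makes it jump. That is the "zero derivative almost everywhere" behavior the abstract refers to.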

**SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient - Martin Takáč**

In this paper, we propose a StochAstic Recursive grAdient algoritHm (SARAH), as well as its practical variant SARAH+, as a novel approach to finite-sum minimization problems. Unlike vanilla SGD and other modern stochastic methods such as SVRG, S2GD, SAG and SAGA, SARAH admits a simple recursive framework for updating stochastic gradient estimates; compared with SAG/SAGA, SARAH does not require storage of past gradients. The linear convergence rate of SARAH is proven under a strong convexity assumption. We also prove a linear convergence rate (in the strongly convex case) for an inner loop of SARAH, a property that SVRG does not possess. The convergence rates for the convex and non-convex cases are also discussed. Numerical experiments demonstrate the efficiency of our algorithm. See the paper published in *Proceedings of Machine Learning Research*.
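The recursive gradient update at the heart of SARAH can be sketched in a few lines. The version below is a simplified illustration on a one-dimensional least-squares problem: it keeps the last inner iterate at the end of each outer loop, which is a simplification rather than the selection rule analyzed in the paper. The step size, loop counts, and toy data are assumed values.

```python
import random

def sarah(grad_i, n, w0, step, inner_steps, outer_steps, seed=0):
    """Sketch of SARAH for min_w (1/n) * sum_i f_i(w) with a scalar w.

    grad_i(i, w) returns the gradient of the i-th component function at w.
    """
    rng = random.Random(seed)
    w = w0
    for _ in range(outer_steps):
        # Outer loop: one full gradient evaluation starts the estimate v.
        v = sum(grad_i(i, w) for i in range(n)) / n
        w_prev, w = w, w - step * v
        for _ in range(inner_steps - 1):
            i = rng.randrange(n)
            # Recursive update: add the change in one sampled component
            # gradient to the previous estimate; no past gradients are stored.
            v = grad_i(i, w) - grad_i(i, w_prev) + v
            w_prev, w = w, w - step * v
    return w

# Toy least-squares problem (assumed data): f_i(w) = 0.5 * (a[i] * w - b[i]) ** 2
a = [1.0, 2.0, 3.0, 4.0]
b = [2.1, 3.9, 6.2, 7.8]

def grad_i(i, w):
    return a[i] * (a[i] * w - b[i])

# Closed-form minimizer, used only to check the sketch.
w_star = sum(x * y for x, y in zip(a, b)) / sum(x * x for x in a)
w_hat = sarah(grad_i, n=len(a), w0=0.0, step=0.02, inner_steps=20, outer_steps=50)
```

With `inner_steps=1` the inner loop is skipped and the sketch reduces to plain gradient descent, which makes the role of the recursive estimate easy to isolate. On this toy problem `w_hat` should land close to `w_star`.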

**An Inexact Regularized Stochastic Newton Method for Nonconvex Optimization - Xi He**

-end-

**Key Links:**

Department of Industrial and Systems Engineering, Lehigh University

Faculty profile and research website: Tamás Terlaky

Faculty profile and research website: Katya Scheinberg

Faculty profile and research website: Martin Takáč

Student profile and research website: Xi He

Lehigh University
