
Democratizing data science

January 15, 2019

MIT researchers are hoping to advance the democratization of data science with a new tool for nonstatisticians that automatically generates models for analyzing raw data.

Democratizing data science is the notion that anyone, with little to no expertise, can do data science if provided ample data and user-friendly analytics tools. Supporting that idea, the new tool ingests datasets and generates sophisticated statistical models typically used by experts to analyze, interpret, and predict underlying patterns in data.

The tool currently runs in Jupyter Notebook, an open-source web framework that lets users run programs interactively in their browsers. Users need only write a few lines of code to uncover insights into, for instance, financial trends, air travel, voting patterns, and the spread of disease.

In a paper presented at this week's ACM SIGPLAN Symposium on Principles of Programming Languages, the researchers show their tool can accurately extract patterns and make predictions from real-world datasets, and even outperform manually constructed models in certain data-analytics tasks.

"The high-level goal is making data science accessible to people who are not experts in statistics," says first author Feras Saad '15, MEng '16, a PhD student in the Department of Electrical Engineering and Computer Science (EECS). "People have a lot of datasets that are sitting around, and our goal is to build systems that let people automatically get models they can use to ask questions about that data."

Ultimately, the tool addresses a bottleneck in the data science field, says co-author Vikash Mansinghka '05, MEng '09, PhD '09, a researcher in the Department of Brain and Cognitive Sciences (BCS) who runs the Probabilistic Computing Project. "There is a widely recognized shortage of people who understand how to model data well," he says. "This is a problem in governments, the nonprofit sector, and places where people can't afford data scientists."

The paper's other co-authors are Marco Cusumano-Towner, an EECS PhD student; Ulrich Schaechtle, a BCS postdoc with the Probabilistic Computing Project; and Martin Rinard, an EECS professor and researcher in the Computer Science and Artificial Intelligence Laboratory.

Bayesian modeling

The work uses Bayesian modeling, a statistical method that continuously updates the probability of a variable as more information about that variable becomes available. For instance, statistician and writer Nate Silver uses Bayesian-based models for his popular website FiveThirtyEight. Leading up to a presidential election, the site's models make an initial prediction that one of the candidates will win, based on various polls and other economic and demographic data. This prediction is the variable. On Election Day, the model takes that initial prediction and weighs incoming votes and other data to continuously update the probability that a candidate will win.
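The updating idea can be illustrated with a minimal Python sketch. It is not the FiveThirtyEight model or the researchers' tool: it assumes a simple Beta-Binomial setup in which the unknown quantity is a candidate's vote share, and the prior and vote counts are made-up numbers chosen only for illustration.

```python
from fractions import Fraction

def update_beta(alpha, beta, votes_for, votes_against):
    """Conjugate Beta-Binomial update: add the observed counts to the prior counts."""
    return alpha + votes_for, beta + votes_against

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution, kept exact with Fraction."""
    return Fraction(alpha, alpha + beta)

# Hypothetical prior belief about the candidate's vote share: Beta(2, 2), mean 1/2.
posterior = (2, 2)
history = [beta_mean(*posterior)]

# Hypothetical batches of counted votes: (for the candidate, against the candidate).
batches = [(6, 4), (55, 45), (530, 470)]
for votes_for, votes_against in batches:
    posterior = update_beta(*posterior, votes_for, votes_against)
    history.append(beta_mean(*posterior))

for step, estimate in enumerate(history):
    print(f"after {step} batches: estimated vote share = {float(estimate):.3f}")
```

Each batch of counted votes shifts the estimate, and later batches move it less because the accumulated evidence outweighs any single update.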

More generally, Bayesian models can be used to "forecast" -- predict an unknown value in the dataset -- and to uncover patterns in data and relationships between variables. In their work, the researchers focused on two types of datasets: time-series, a sequence of data points in chronological order; and tabular data, where each row represents an entity of interest and each column represents an attribute.

Time-series datasets can be used to predict, say, airline traffic in the coming months or years. A probabilistic model crunches large volumes of historical traffic data and produces a time-series chart with future traffic patterns plotted along the line. The model may also uncover periodic fluctuations correlated with other variables, such as time of year.

On the other hand, a tabular dataset used for, say, sociological research may contain hundreds to millions of rows, each representing an individual person, with variables characterizing occupation, salary, home location, and answers to survey questions. Probabilistic models could be used to fill in missing variables, such as predicting someone's salary based on occupation and location, or to identify variables that inform one another, such as finding that a person's age and occupation are predictive of their salary.
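To make "filling in missing variables" concrete, here is a small Python sketch. It uses plain conditional-mean imputation as a stand-in, not the probabilistic models described in the article, and the table, column names, and the `impute_salary` helper are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey table: one row per person, None marks a missing salary.
rows = [
    {"occupation": "nurse", "location": "Boston", "salary": 80000},
    {"occupation": "nurse", "location": "Boston", "salary": 84000},
    {"occupation": "nurse", "location": "Austin", "salary": 70000},
    {"occupation": "teacher", "location": "Boston", "salary": 60000},
    {"occupation": "teacher", "location": "Austin", "salary": None},
    {"occupation": "nurse", "location": "Boston", "salary": None},
]

def impute_salary(rows):
    """Fill missing salaries with the mean of people sharing occupation and location,
    falling back to the occupation-wide mean when no exact match exists."""
    by_pair = defaultdict(list)
    by_occupation = defaultdict(list)
    for r in rows:
        if r["salary"] is not None:
            by_pair[(r["occupation"], r["location"])].append(r["salary"])
            by_occupation[r["occupation"]].append(r["salary"])

    filled = []
    for r in rows:
        salary = r["salary"]
        if salary is None:
            key = (r["occupation"], r["location"])
            if key in by_pair:
                salary = mean(by_pair[key])
            elif r["occupation"] in by_occupation:
                salary = mean(by_occupation[r["occupation"]])
        filled.append({**r, "salary": salary})
    return filled

filled = impute_salary(rows)
for r in filled:
    print(r)
```

A Bayesian model would go further than this sketch by also reporting how uncertain each filled-in value is.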

Statisticians view Bayesian modeling as a gold standard for constructing models from data. But Bayesian modeling is notoriously time-consuming and challenging. Statisticians first take an educated guess at the necessary model structure and parameters, relying on their general knowledge of the problem and the data. Using a statistical programming environment, such as R, a statistician then builds models, fits parameters, checks results, and repeats the process until they strike an appropriate trade-off between model complexity and model quality.
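The build, fit, and check loop can be sketched in a few lines of Python. This is an illustration only: it assumes two hand-picked candidate models (a constant and a straight line), made-up data, and the Bayesian information criterion (BIC) as one common way to score the trade-off between fit and complexity. The article does not say which criterion statisticians or the tool actually use.

```python
import math

def fit_constant(xs, ys):
    """Fit y = mu. Returns (predict function, number of parameters)."""
    mu = sum(ys) / len(ys)
    return (lambda x: mu), 1

def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return (lambda x: intercept + slope * x), 2

def bic(predict, num_params, xs, ys):
    """BIC under Gaussian errors: n*ln(RSS/n) + k*ln(n). Lower is better."""
    n = len(xs)
    rss = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    return n * math.log(rss / n) + num_params * math.log(n)

# Made-up observations that roughly follow y = 2x + 1.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.8, 15.2]

candidates = {"constant": fit_constant, "line": fit_line}
scores = {}
for name, fit in candidates.items():
    predict, num_params = fit(xs, ys)
    scores[name] = bic(predict, num_params, xs, ys)

best = min(scores, key=scores.get)
print(scores, "-> selected:", best)
```

The extra parameter of the line is penalized, but the much better fit more than pays for it, so the line is selected here.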

The researchers' tool automates a key part of this process. "We're giving a software system a job you'd have a junior statistician or data scientist do," Mansinghka says. "The software can answer questions automatically from the data -- forecasting predictions or telling you what the structure is -- and it can do so rigorously, reporting quantitative measures of uncertainty. This level of automation and rigor is important if we're trying to make data science more accessible."

Bayesian synthesis

With the new approach, users write a line of code detailing the raw data's location. The tool loads that data and creates multiple probabilistic programs that each represent a Bayesian model of the data. All these automatically generated models are written in domain-specific probabilistic programming languages -- coding languages developed for specific applications -- that are optimized for representing Bayesian models for a specific type of data.

The tool works using a modified version of a technique called "program synthesis," which automatically creates computer programs given data and a language to work within. The technique is basically computer programming in reverse: Given a set of input-output examples, program synthesis works its way backward, filling in the blanks to construct an algorithm that produces the example outputs based on the example inputs.
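A toy Python sketch shows what "programming in reverse" means for ordinary, non-probabilistic program synthesis. It assumes a hypothetical miniature language of the form `(x OP a) OP b` and simply enumerates every program until one reproduces all the examples. Real synthesizers, including the researchers' Bayesian variant, search far richer languages with far smarter strategies.

```python
from itertools import product

# Hypothetical tiny language: programs have the shape (x OP1 a) OP2 b.
OPS = {
    "+": lambda u, v: u + v,
    "*": lambda u, v: u * v,
    "-": lambda u, v: u - v,
}
CONSTANTS = range(0, 5)

def synthesize(examples):
    """Return (source text, callable) for the first program consistent with
    every (input, output) example, or None if no program in the language fits."""
    for op1, a, op2, b in product(OPS, CONSTANTS, OPS, CONSTANTS):
        # Bind the loop variables as defaults so each candidate keeps its own values.
        def program(x, op1=op1, a=a, op2=op2, b=b):
            return OPS[op2](OPS[op1](x, a), b)
        if all(program(x) == y for x, y in examples):
            return f"(x {op1} {a}) {op2} {b}", program
    return None

# Input-output examples consistent with the hidden rule y = 2*x + 3.
examples = [(1, 5), (2, 7), (3, 9)]
source, program = synthesize(examples)
print("synthesized program:", source)
print("prediction for x = 10:", program(10))
```

Because the result is readable source text rather than an opaque set of weights, a person can inspect it directly.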

The approach is different from ordinary program synthesis in two ways. First, the tool synthesizes probabilistic programs that represent Bayesian models for data, whereas traditional methods produce programs that do not model data at all. Second, the tool synthesizes multiple programs simultaneously, while traditional methods produce only one at a time. Users can pick and choose which models best fit their application.

"When the system makes a model, it spits out a piece of code written in one of these domain-specific probabilistic programming languages ... that people can understand and interpret," Mansinghka says. "For example, users can check if a time series dataset like airline traffic volume has seasonal variation just by reading the code -- unlike with black-box machine learning and statistics methods, where users have to trust a model's predictions but can't read it to understand its structure."

Probabilistic programming is an emerging field at the intersection of programming languages, artificial intelligence, and statistics. In 2018, MIT hosted the first International Conference on Probabilistic Programming, which had more than 200 attendees, including leading industry players in probabilistic programming such as Microsoft, Uber, and Google.

Written by Rob Matheson, MIT News Office

Related links

Paper: "Bayesian synthesis of probabilistic programs for automatic data modeling"

ARCHIVE: Graphics in reverse

ARCHIVE: Machines that learn better

Massachusetts Institute of Technology
