Stanford students deploy machine learning to aid environmental monitoring

October 01, 2018

As Hurricane Florence ground its way through North Carolina, it released what might politely be called an excrement storm. Massive hog farm manure pools washed a stew of dangerous bacteria and heavy metals into nearby waterways.

More efficient oversight might have prevented some of the worst effects, but even in the best of times, state and federal environmental regulators are overextended and underfunded. Help is at hand, however, in the form of machine learning - training computers to automatically detect patterns in data - according to Stanford researchers.

Their study, published in Nature Sustainability, finds that machine learning techniques could catch two to seven times as many infractions as current approaches, and suggests far-reaching applications for public investments.

"Especially in an era of decreasing budgets, identifying cost-effective ways to protect public health and the environment is critical," said study coauthor Elinor Benami, a graduate student in the Emmett Interdisciplinary Program on Environment and Resources (E-IPER) in Stanford's School of Earth, Energy & Environmental Sciences.

Optimizing resources

Just as the IRS can't audit every taxpayer, most government agencies must constantly make decisions about how to allocate resources. Machine learning methods can help optimize that process by predicting where funds can yield the most benefit. The researchers focused on the Clean Water Act, under which the U.S. Environmental Protection Agency and state governments are responsible for regulating more than 300,000 facilities but are able to inspect fewer than 10 percent of those in a given year.

Using data from past inspections, the researchers deployed a series of models to predict the likelihood of failing an inspection, based on facility characteristics, such as location, industry and inspection history. Then, they ran their models on all facilities, including ones that had yet to be inspected.

This technique generated a risk score for every facility, indicating how likely it was to fail an inspection. The group then created four inspection scenarios reflecting different institutional constraints - varying inspection budgets and inspection frequencies, for example - and used the score to prioritize inspections and predict violations.
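The workflow described above (learn from past inspections, score every facility, then inspect the highest-risk ones within a budget) can be illustrated with a minimal Python sketch. It is not the study's method: the researchers used a series of machine learning models on several facility characteristics, while this toy version swaps in a smoothed failure rate per industry. The records, facility IDs, industry labels and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical past inspection records: (industry, failed_inspection).
past_inspections = [
    ("hog_farm", True), ("hog_farm", True), ("hog_farm", False),
    ("wastewater", False), ("wastewater", True),
    ("wastewater", False), ("wastewater", False),
    ("mining", False), ("mining", False),
]


def fit_failure_rates(records, prior_fail=1.0, prior_pass=1.0):
    """Estimate a smoothed failure rate per industry from past inspections.

    The priors act as pseudo-counts so that an industry with very few
    inspections is not scored as exactly 0 or 1.
    """
    counts = defaultdict(lambda: [0, 0])  # industry -> [failures, total]
    for industry, failed in records:
        counts[industry][0] += int(failed)
        counts[industry][1] += 1
    return {
        industry: (fails + prior_fail) / (total + prior_fail + prior_pass)
        for industry, (fails, total) in counts.items()
    }


def risk_score(facility, rates, default=0.5):
    """Return the estimated probability that a facility fails an inspection."""
    return rates.get(facility["industry"], default)


def prioritize(facilities, rates, budget):
    """Pick the `budget` facilities with the highest risk scores."""
    ranked = sorted(facilities, key=lambda f: risk_score(f, rates), reverse=True)
    return ranked[:budget]


# Score all facilities, including ones that have never been inspected.
facilities = [
    {"id": "A", "industry": "mining"},
    {"id": "B", "industry": "hog_farm"},
    {"id": "C", "industry": "wastewater"},
    {"id": "D", "industry": "hog_farm"},
]
rates = fit_failure_rates(past_inspections)
selected = prioritize(facilities, rates, budget=2)
print([f["id"] for f in selected])
```

Changing `budget` mimics one of the institutional constraints the scenarios vary: a smaller budget means fewer of the top-ranked facilities get inspected.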

Under the scenario with the fewest constraints - unlikely in the real world - the researchers predicted catching up to seven times the number of violations compared to the status quo. When they accounted for more constraints, the number of violations detected was still double the status quo.

Limits of algorithms

Despite its potential, machine learning has flaws to guard against, the researchers warn. "Algorithms are imperfect, they can perpetuate bias at times and they can be gamed," said study lead author Miyuki Hino, also a graduate student in E-IPER.

For example, agents, such as hog farm owners, may manipulate their reported data to influence the likelihood of receiving benefits or avoiding penalties. Others may alter their behavior - relaxing standards when the risk of being caught is low - if they know their likelihood of being selected by the algorithm. Institutional, political and financial constraints could limit machine learning's ability to improve upon existing practices. The approach could potentially exacerbate environmental justice concerns if it systematically directs oversight away from facilities located in low-income or minority areas. Also, the machine learning approach does not account for potential changes over time, such as in public policy priorities and pollution control technologies.

The researchers suggest remedies to some of these challenges. Selecting some facilities at random, regardless of their risk scores, and occasionally re-training the model to reflect up-to-date risk factors could help keep low-risk facilities on their toes about compliance. Environmental justice concerns could be built into inspection targeting practices. Examining the value and trade-offs of using self-reported data could help manage concerns about strategic behavior and manipulation by facilities.
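One of these remedies, reserving part of the inspection budget for randomly chosen facilities, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the researchers' procedure: the 20 percent random share, the facility IDs, the scores and the function name are all hypothetical.

```python
import random


def select_inspections(risk_scores, budget, random_share=0.2, seed=None):
    """Split an inspection budget between risk-targeted and random picks.

    risk_scores maps a facility ID to its estimated failure probability.
    Returns (targeted, randomly_chosen) as two lists of facility IDs.
    """
    rng = random.Random(seed)
    n_random = min(budget, round(budget * random_share))
    n_targeted = budget - n_random

    # Highest-risk facilities first.
    ranked = sorted(risk_scores, key=risk_scores.get, reverse=True)
    targeted = ranked[:n_targeted]

    # Draw the random inspections from everyone not already targeted, so
    # even low-scoring facilities keep some chance of being inspected.
    remaining = ranked[n_targeted:]
    randomly_chosen = rng.sample(remaining, min(n_random, len(remaining)))
    return targeted, randomly_chosen


# Hypothetical scores for ten facilities, F0 (riskiest) through F9.
risk_scores = {f"F{i}": 1 - i / 10 for i in range(10)}
targeted, randomly_chosen = select_inspections(risk_scores, budget=5, seed=0)
print(targeted, randomly_chosen)
```

A larger `random_share` gives up some expected detections in exchange for less predictable targeting, which is the trade-off behind keeping low-risk facilities attentive to compliance.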

The researchers suggest future work could examine additional complexities of integrating a machine learning approach into the EPA's broader enforcement efforts, such as incorporating specific enforcement priorities or identifying technical, financial and human resource limitations. In addition, these methods could be applied in other contexts within the U.S. and beyond where regulators are seeking to make efficient use of limited resources.

"This model is a starting point that could be augmented with greater detail on the costs and benefits of different inspections, violations and enforcement responses," said co-author and fellow E-IPER graduate student Nina Brooks.

The researchers received support from the National Science Foundation, the Stanford Department of Earth System Science and the Stanford Graduate Fellowship/David and Lucile Packard Foundation.

Miyuki Hino
Emmett Interdisciplinary Program in Environment and Resources

Elinor Benami
Emmett Interdisciplinary Program in Environment and Resources

Nina Brooks
Emmett Interdisciplinary Program in Environment and Resources

Stanford University
