Carnegie Mellon, Stanford researchers devise method to safely share password data

February 22, 2016

PITTSBURGH--An unfortunate reality for cybersecurity researchers is that real-world data for their research too often comes via a security breach. Now computer scientists have devised a way to let organizations share statistics about their users' passwords without putting those same customers at risk of being hacked.

The work at Carnegie Mellon University and Stanford University, part of an emerging field on rigorous human authentication, persuaded Yahoo! to publicly share password frequency statistics for about 70 million of its users.

"This is the first time a major company has released frequency information on user passwords," said Anupam Datta, associate professor of computer science and electrical and computer engineering at CMU. "It's the kind of information that legitimate researchers can use to assess the impact of a security breach and to make informed decisions about password defenses. This is extremely valuable, so we hope other organizations will follow Yahoo's lead."

The researchers are presenting their method on Wednesday at the Network and Distributed System Security Symposium in San Diego. Their method distorts numbers in the dataset so the list is "differentially private," a precise mathematical definition that guarantees the released statistics don't reveal whether any specific individual's password is included in the dataset.

The information at issue isn't actual passwords or user IDs, but password frequency lists -- the number of times passwords are selected by a group of users. In a simplified case involving 10 users, if eight users select "123456" as a password, and two users select "abc123," the frequency list would be (8,2).

Password frequency lists for large user groups can be analyzed to help organizations set authentication policies that balance security with usability, or to predict which user accounts are most vulnerable, said Jeremiah Blocki, a post-doctoral researcher at Microsoft Research who began this study while a post-doc at Carnegie Mellon.

But getting access to frequency lists is difficult because of the potential for misuse. Alone, frequency lists don't help hackers identify individual passwords, Blocki said, but they could potentially provide important clues if cross-referenced to other databases. For instance, in the earlier example, if an adversary knew the passwords for nine of 10 users, it would be child's play to figure out the 10th password knowing that the frequency was (8,2).

Most companies are reluctant to provide access to their frequency lists, so researchers make do with data that has been inadvertently released, such as the 32 million user accounts of the defunct RockYou social app site, which suffered a data breach in 2009.

Several years ago, Joseph Bonneau, a Stanford post-doctoral researcher and a technology fellow with the Electronic Frontier Foundation, obtained samples of password frequency from Yahoo. He was able to publish some aggregate statistics, but Yahoo wouldn't let him publicly share the raw data because of potential privacy concerns.

"Here was this data that was incredibly useful to people like me, but we couldn't get access to it," Blocki said.

So Blocki, Datta and Bonneau created a new algorithm to add just enough distortion to the frequency lists to make them useless to hackers, but still enable researchers to see the high-level patterns they seek in the data.

Their algorithm is based on a powerful differentially private tool called the exponential mechanism, which introduces minimal distortion but is not computationally efficient in general. By exploiting the inherent mathematical structure of a password frequency list, the researchers were able to develop a computationally efficient version of the exponential mechanism tailored to the lists.

"With our new approach, we can provide precise guarantees about privacy," Bonneau said. "I hope this convinces more organizations to share data publicly about passwords and potentially other data that might be useful for security."

Blocki said getting additional organizations to release password frequency lists would enable researchers to explore the impact of differing password policies. The method also might be extended to social networks -- enabling the study of degree distribution lists that track the number of friends users have -- and to more complicated data structures.
This research was supported by the National Science Foundation, the Air Force Office of Scientific Research, the Simons Institute for the Theory of Computing and the Open Technology Fund.

About Carnegie Mellon University: Carnegie Mellon is a private, internationally ranked research university with programs in areas ranging from science, technology and business, to public policy, the humanities and the arts. More than 13,000 students in the university's seven schools and colleges benefit from a small student-to-faculty ratio and an education characterized by its focus on creating and implementing solutions for real problems, interdisciplinary collaboration and innovation.

Carnegie Mellon University

Related Data Articles from Brightsurf:

Keep the data coming
A continuous data supply ensures data-intensive simulations can run at maximum speed.

Astronomers are bulging with data
For the first time, over 250 million stars in our galaxy's bulge have been surveyed in near-ultraviolet, optical, and near-infrared light, opening the door for astronomers to reexamine key questions about the Milky Way's formation and history.

Novel method for measuring spatial dependencies turns less data into more data
Researcher makes 'little data' act big through, the application of mathematical techniques normally used for time-series, to spatial processes.

Ups and downs in COVID-19 data may be caused by data reporting practices
As data accumulates on COVID-19 cases and deaths, researchers have observed patterns of peaks and valleys that repeat on a near-weekly basis.

Data centers use less energy than you think
Using the most detailed model to date of global data center energy use, researchers found that massive efficiency gains by data centers have kept energy use roughly flat over the past decade.

Storing data in music
Researchers at ETH Zurich have developed a technique for embedding data in music and transmitting it to a smartphone.

Life data economics: calling for new models to assess the value of human data
After the collapse of the blockchain bubble a number of research organisations are developing platforms to enable individual ownership of life data and establish the data valuation and pricing models.

Geoscience data group urges all scientific disciplines to make data open and accessible
Institutions, science funders, data repositories, publishers, researchers and scientific societies from all scientific disciplines must work together to ensure all scientific data are easy to find, access and use, according to a new commentary in Nature by members of the Enabling FAIR Data Steering Committee.

Democratizing data science
MIT researchers are hoping to advance the democratization of data science with a new tool for nonstatisticians that automatically generates models for analyzing raw data.

Getting the most out of atmospheric data analysis
An international team including researchers from Kanazawa University used a new approach to analyze an atmospheric data set spanning 18 years for the investigation of new-particle formation.

Read More: Data News and Data Current Events is a participant in the Amazon Services LLC Associates Program, an affiliate advertising program designed to provide a means for sites to earn advertising fees by advertising and linking to