2 methods to de-identify large patient datasets greatly reduced risk of re-identification

July 28, 2017

Bottom Line: Two de-identification methods, k-anonymization and adding a "fuzzy factor," significantly reduced the risk of re-identification of patients in a dataset of 5 million patient records from a large cervical cancer screening program in Norway.

Journal in Which the Study was Published: Cancer Epidemiology, Biomarkers & Prevention, a journal of the American Association for Cancer Research.

Author: Giske Ursin, MD, PhD, director of the Cancer Registry of Norway, Institute of Population-based Research.

Background: "Researchers typically get access to de-identified data, that is, data without any personal identifying information, such as names, addresses, and Social Security numbers. However, this may not be sufficient to protect the privacy of individuals participating in a research study," said Ursin.

Patient datasets often contain sensitive data, such as information about a person's health and disease diagnosis, that an individual may not want to share publicly, and data custodians are responsible for safeguarding such information, Ursin added. "People who have the permission to access such datasets have to abide by the laws and ethical guidelines, but there is always this concern that the data might fall into the wrong hands and be misused," she said. "As a data custodian, that's my worst nightmare."

How the Study Was Conducted: To test the strength of their de-identification techniques, Ursin and colleagues used screening data containing 5,693,582 records from 911,510 women in the Norwegian Cervical Cancer Screening Program. The data included patients' dates of birth; cervical screening dates and results; the names of the labs that ran the tests; subsequent cancer diagnoses, if any; and dates of death, if deceased.

The researchers used a tool called ARX to evaluate the risk of re-identification by approaching the dataset using a "prosecutor scenario," in which the tool assumes the attacker knows that some data about an individual are in the dataset. An attack is considered successful if a large portion of individuals in the dataset could be re-identified by someone who had access to some of the information about these individuals.
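As a rough illustration of how a prosecutor-scenario risk can be scored, the sketch below groups records that share the same identifying values and assigns each record a risk of one divided by the size of its group. This is a minimal sketch under stated assumptions, not ARX's implementation: the function name, the field names, and the toy records are hypothetical and do not come from the study.

```python
from collections import Counter


def prosecutor_risk(records, quasi_identifiers):
    """Return (average_risk, unique_fraction) for a list of dict records.

    Assumption: a record's risk is 1 / (number of records sharing the same
    quasi-identifier values), a common way to express prosecutor risk.
    """
    if not records:
        raise ValueError("records must not be empty")
    # Build one key per record from the chosen quasi-identifier fields.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    risks = [1.0 / class_sizes[k] for k in keys]
    average_risk = sum(risks) / len(risks)
    unique_fraction = sum(1 for k in keys if class_sizes[k] == 1) / len(keys)
    return average_risk, unique_fraction


# Hypothetical toy data: two records share values, two are unique.
toy_records = [
    {"birth": "1970-03", "screen": "2010-05"},
    {"birth": "1970-03", "screen": "2010-05"},
    {"birth": "1981-11", "screen": "2012-01"},
    {"birth": "1965-07", "screen": "2009-09"},
]
avg, uniq = prosecutor_risk(toy_records, ["birth", "screen"])
```

In this toy example, the two records that share a birth month and screening month each carry half the risk of the two records that stand alone.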

The team assessed the re-identification risk in three different ways: First, they used the original data to create a realistic dataset that contained all the abovementioned patient information (D1). Next, they "k-anonymized" the data by changing all the dates in the records to the 15th of the month (D2). Third, they fuzzied the data by adding a random factor between -4 and +4 months (except zero) to each month in the dataset (D3).
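The date change described for D2 can be sketched in a few lines. This is a minimal illustration only, assuming dates are held as Python date objects; the function name and the sample dates are hypothetical and are not taken from the study.

```python
from datetime import date


def to_mid_month(d):
    """Replace the day of the month with the 15th, keeping year and month."""
    return d.replace(day=15)


# Hypothetical dates: the first two fall in the same month, so they
# become indistinguishable once the day is collapsed.
original = [date(1970, 3, 2), date(1970, 3, 29), date(2010, 5, 15)]
generalized = [to_mid_month(d) for d in original]
```

Collapsing the day means that every event in the same calendar month looks identical, which is why fewer records remain unique.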

By adding a fuzzy factor to each patient's records, the months of birth, screening, and other events are changed; however, the intervals between the procedures and the sequence of the procedures are retained, which ensures that the dataset is still usable for research purposes.
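One way to picture the fuzzy factor is as a single random month offset applied to every date belonging to the same patient. The sketch below assumes that reading, and also assumes the dates have already been set to the 15th of the month; the function names and sample dates are hypothetical and are not the study's code.

```python
import random
from datetime import date


def shift_months(d, months):
    """Move a date by a whole number of months, pinning the day to the 15th."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    return date(year, month0 + 1, 15)


def fuzz_patient(dates, rng):
    """Apply one random offset in -4..+4 months (never zero) to all dates."""
    offset = rng.choice([m for m in range(-4, 5) if m != 0])
    return [shift_months(d, offset) for d in dates], offset


def months_between(a, b):
    """Whole months from date a to date b."""
    return (b.year - a.year) * 12 + (b.month - a.month)


# Hypothetical patient: birth, first screening, second screening.
rng = random.Random(0)
patient = [date(1970, 3, 15), date(2010, 5, 15), date(2013, 6, 15)]
fuzzed, offset = fuzz_patient(patient, rng)
```

Because the same offset is used for every date of a given patient, the number of months between any two events and their order stay the same, while the calendar months themselves no longer match the original records.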

Results: "We found that changing the dates using the standard procedure of k-anonymization drastically reduced the chances of re-identifying most individuals in the dataset," Ursin noted.

In D1, the average risk of a prosecutor identifying a person was 97.1 percent. More than 94 percent of the patient records were unique, and therefore those patients ran the risk of being re-identified. In D2, the average risk of a prosecutor identifying a person dropped to 9.7 percent; however, 6 percent of the records were still unique and ran the risk of being re-identified. Adding a fuzzy factor, in D3, did not lower the risk of re-identification further: The average risk of a prosecutor identifying a person was 9.8 percent, and 6 percent of the records ran the risk of being re-identified.

This meant that there were as many unique records in D3 as in D2. However, scrambling the months of all records in a dataset by adding a fuzzy factor makes it more difficult for a prosecutor to link a record from this dataset to the records in other datasets and re-identify an individual, Ursin explained.

Author Comment: "Every time a research group requests permission to access a dataset, data custodians should ask the question, 'What information do they really need and what are the details that are not required to answer their research question,' and make every effort to collapse and fuzzy the data to ensure protection of patients' privacy," Ursin said.

Patient data are in general very well safeguarded and re-identification is not yet a major threat, Ursin added. "However, given the recent trend in sharing data and combining datasets for big-data analyses--which is a good development--there is always a chance of information falling into the hands of someone with malicious intent. Data custodians are, therefore, rightly concerned about potential future challenges and continue to test preventive measures."

Limitations: According to Ursin, the main limitation of the study is that the approaches to anonymize data in this study are specific to the dataset used; such approaches are unique for each dataset and should be designed based on the nature of the data.

Disclosures: Ursin declares no conflicts of interest.
To interview Giske Ursin, contact Julia Gunther at julia.gunther@aacr.org or 215-446-6896.

Follow us: Cancer Research Catalyst http://blog.aacr.org; Twitter @AACR; and Facebook http://www.facebook.com/aacr.org

About the American Association for Cancer Research

Founded in 1907, the American Association for Cancer Research (AACR) is the world's first and largest professional organization dedicated to advancing cancer research and its mission to prevent and cure cancer. AACR membership includes more than 37,000 laboratory, translational, and clinical researchers; population scientists; other health care professionals; and patient advocates residing in 108 countries. The AACR marshals the full spectrum of expertise of the cancer community to accelerate progress in the prevention, biology, diagnosis, and treatment of cancer by annually convening more than 30 conferences and educational workshops, the largest of which is the AACR Annual Meeting with more than 21,900 attendees. In addition, the AACR publishes eight prestigious, peer-reviewed scientific journals and a magazine for cancer survivors, patients, and their caregivers. The AACR funds meritorious research directly as well as in cooperation with numerous cancer organizations. As the Scientific Partner of Stand Up To Cancer, the AACR provides expert peer review, grants administration, and scientific oversight of team science and individual investigator grants in cancer research that have the potential for near-term patient benefit. The AACR actively communicates with legislators and other policymakers about the value of cancer research and related biomedical science in saving lives from cancer. For more information about the AACR, visit http://www.AACR.org.

American Association for Cancer Research
