Researchers simulate privacy leaks in functional genomics studies

November 12, 2020

The functional genomics field, which looks at the activities of the genome and levels of gene expression rather than particular gene mutations, generally relies on aggregating information from many samples for its statistical power. This makes broad sharing of raw data vital. Sharing these data is currently difficult, however, because of privacy concerns for the individuals in those datasets, and as a result the data remain largely inaccessible behind firewalls. In a study publishing November 12 in the journal Cell, a team of investigators demonstrates that it's possible to de-identify those data to protect patient privacy. They also demonstrate how, if these sanitization measures are not put in place, the raw data could be linked back to specific individuals through their gene variants by something as simple as an abandoned coffee cup.

"The purpose of this study is to come up with practical ways to broadly share the raw data without creating undue privacy concerns," says senior author Mark Gerstein (@MarkGerstein), a professor of bioinformatics at Yale University.

Functional genomics research is frequently tied to a specific disease. For example, an investigation into a particular psychiatric condition might look at the expression of certain genes in a type of neuron. And, by nature of having their genetic material included in such a study, an individual's medical status with regard to that condition could inadvertently be revealed.

This can happen through what's known as a quasi-identifier. The way a quasi-identifier works is that if someone has enough individual data points about you, even if those data on their own are not sensitive or unique, they can be combined to create an identifier that is unique to you. In a non-genetic setting, this means if someone has your zip code, birthday, the model of car you drive, and other similar data that might not be considered private or sensitive on their own, they might eventually be able to combine them and create a unique profile that would link you to other data that you wouldn't want public--data like financial records that were collected when you applied for a car loan. The same thing could happen if someone were able to obtain some of your genetic variants and link those variants to the presence of your genetic material in a study on a particular disease. This could in turn reveal a diagnosis, such as HIV status or an inherited cancer predisposition, that you would prefer to keep private.
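The way individually harmless attributes combine into a quasi-identifier can be illustrated with a minimal sketch. The records, field names, and the `unique_fraction` helper below are hypothetical and are not drawn from the study; they only show that each field alone is shared by several people while the combination singles out every person.

```python
from collections import Counter

# Hypothetical toy records: no single field is sensitive or unique on its own.
records = [
    {"name": "A", "zip": "06511", "birth_year": 1980, "car": "sedan"},
    {"name": "B", "zip": "06511", "birth_year": 1980, "car": "hatchback"},
    {"name": "C", "zip": "06511", "birth_year": 1975, "car": "sedan"},
    {"name": "D", "zip": "06520", "birth_year": 1980, "car": "sedan"},
]


def unique_fraction(rows, fields):
    """Fraction of rows whose values on `fields` are shared with no other row."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Each attribute alone identifies only one person in four...
for field in ("zip", "birth_year", "car"):
    print(field, unique_fraction(records, [field]))

# ...but the three together identify everyone.
print("combined", unique_fraction(records, ["zip", "birth_year", "car"]))
```

In this toy table each single field isolates a quarter of the people, while the three fields together isolate all of them. A set of genetic variants can behave like the combined key.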

In their study, the researchers constructed a "linkage attack" scenario to demonstrate how someone could make these kinds of connections from functional genomics studies' data by using DNA obtained from a discarded coffee cup. After adding samples from two consenting participants to a functional genomics database, the researchers gathered used coffee cups from the same individuals. They sequenced genetic material left on the cups and were able to successfully match that material to the samples in the database and infer sensitive health information about the participants. The researchers were also able to use DNA information "stolen" from a genotyping database to match the identities of 421 people with phenotypic information found in a test functional-genomics dataset that the researchers constructed for 436 people.
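The matching step in such a linkage attack can be sketched in a few lines. This is an illustration under stated assumptions, not the researchers' method: the variant IDs, genotype coding (0, 1, or 2 copies of the alternate allele), sample names, and phenotype labels are all made up, and the score is a plain fraction of agreeing genotypes.

```python
def concordance(query, candidate):
    """Fraction of variants present in both profiles whose genotypes agree."""
    shared = [variant for variant in query if variant in candidate]
    if not shared:
        return 0.0
    return sum(query[v] == candidate[v] for v in shared) / len(shared)


def link(query, database):
    """Return the database sample that best matches the query, with its score."""
    scores = {
        sample_id: concordance(query, entry["genotypes"])
        for sample_id, entry in database.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical study database: genotypes inferred from functional genomics
# reads, stored next to a sensitive phenotype label.
database = {
    "sample_1": {
        "genotypes": {"v1": 0, "v2": 1, "v3": 2, "v4": 0, "v5": 1},
        "phenotype": "case",
    },
    "sample_2": {
        "genotypes": {"v1": 1, "v2": 1, "v3": 0, "v4": 2, "v5": 0},
        "phenotype": "control",
    },
}

# Hypothetical noisy profile from a discarded item: v2 was not recovered
# and v5 carries a genotyping error.
cup_profile = {"v1": 1, "v3": 0, "v4": 2, "v5": 1}

sample_id, score = link(cup_profile, database)
print(sample_id, score, database[sample_id]["phenotype"])
```

Even with a missing site and one error, the partial profile agrees far better with one sample than with the other, and the phenotype stored alongside that sample is then exposed.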

However, the researchers also identified steps that can be taken to thwart these kinds of linkage attacks and safeguard participants' health information when functional genomics datasets are shared. "Functional genomics is special because variants are usually not needed for data processing," says first author Gamze Gürsoy, a postdoctoral researcher at the Gerstein lab. "Because of this, we can sanitize the variants to prevent data being linked back to the private information connected to the phenotypes included in these studies, while still retaining the utility of the data."

To achieve this balance between privacy and data usefulness, the researchers propose a file-format manipulation that allows raw functional genomics data to be shared while greatly reducing leakage of sensitive information by generalizing information about phenotypic variants. The file format is based on a widely used standard file-format system, is compatible with a range of software and pipelines, and, when tested, showed little loss of utility. The researchers have also developed a framework that lets other researchers tune the balance between privacy and utility they want to achieve with the file format, based on the policies and consents of the donors.
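The article does not describe the file format in detail, so the following is only a sketch of the general idea under a simplifying assumption: aligned reads are sanitized by overwriting their bases with the reference sequence, which hides substitutions while keeping where each read maps. The reference string, reads, and function names are hypothetical, and insertions, deletions, and alignment metadata, which a real format would also have to handle, are ignored.

```python
REFERENCE = "ACGTACGTACGTACGTACGT"  # hypothetical reference segment

# Hypothetical aligned reads as (start position, sequence).
# The first two reads carry the same T->A substitution at position 7.
reads = [(0, "ACGTACGA"), (4, "ACGAACGT"), (10, "GTACGTAC")]


def mismatches(start, seq, ref):
    """Positions and bases where a read differs from the reference."""
    return [(start + i, base) for i, base in enumerate(seq) if ref[start + i] != base]


def sanitize(start, seq, ref):
    """Replace the read's bases with the reference bases it aligns to."""
    return start, ref[start:start + len(seq)]


def coverage(read_list, length):
    """Per-position read depth, a stand-in for an expression-style signal."""
    depth = [0] * length
    for start, seq in read_list:
        for pos in range(start, start + len(seq)):
            depth[pos] += 1
    return depth


sanitized = [sanitize(start, seq, REFERENCE) for start, seq in reads]

variants_before = [m for start, seq in reads for m in mismatches(start, seq, REFERENCE)]
variants_after = [m for start, seq in sanitized for m in mismatches(start, seq, REFERENCE)]

print(variants_before)  # the substitution is visible in the raw reads
print(variants_after)   # no substitutions remain after sanitization
print(coverage(reads, len(REFERENCE)) == coverage(sanitized, len(REFERENCE)))
```

In this toy case the per-position read depth is identical before and after sanitization, while the variant that could serve as a quasi-identifier is gone, which is the kind of privacy and utility trade-off the paragraph above describes.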

"As more data are released for these kinds of functional genomics studies, concerns about security and privacy shouldn't be lost," Gerstein says. "At the dawn of the Internet, people didn't realize how important their online activities would become. Now that type of digital privacy has become so important to us. If we move into an era where getting your genome sequenced becomes routine, we don't want these worries about health privacy to become dominating."

This work was supported by the National Institutes of Health, the AL Williams Professorship fund, and the Chan Zuckerberg Initiative Donor-Advised Fund.

Cell, Gürsoy et al.: "Data sanitization to reduce private information leakage from functional genomics"

Cell (@CellCellPress), the flagship journal of Cell Press, is a bimonthly journal that publishes findings of unusual significance in any area of experimental biology, including but not limited to cell biology, molecular biology, neuroscience, immunology, virology and microbiology, cancer, human genetics, systems biology, signaling, and disease mechanisms and therapeutics.

Cell Press
