Biomedical data is widely collected in medicine, but sharing such data can raise privacy concerns about the re-identification of seemingly anonymous records. Formal re-identification risk assessment frameworks can inform data-sharing decisions, yet current methods focus on scenarios where a data recipient uses only one resource for identification purposes. This can undermine privacy when adversaries can access multiple resources to improve their chance of success. In a new report in Science Advances, Zhiyu Wan and a team of scientists in electrical engineering, computer engineering and biomedical informatics in the U.S. modeled re-identification as a two-player Stackelberg game of perfect information to assess risk. They propose an optimal data-sharing strategy based on a privacy-utility trade-off. Through experiments with large-scale genomic datasets, the team showed that their game-theoretic model can account for adversarial capabilities and support data sharing with low re-identification risk.
A new method for data collection
Researchers collect biomedical data on a large scale across a wide range of settings, where personal health data is routinely stored as electronic health records. Biomedical research now supports studies that recruit a diverse array of participants, and recent developments include ventures such as direct-to-consumer genetic testing companies that collect data from consumers to build repositories. Sharing this data beyond its initial point of collection is considered critical to maximizing its social value. However, privacy concerns surround such practices, notably the identifiability of data subjects, that is, the individuals to whom the data correspond. The sharing of genomic data across various settings in the United States clearly illustrates the threat: linking genomic data to identifiers poses a risk to the anonymity of data subjects. In this work, Wan et al. introduced a new method to assess and strategically mitigate risks by explicitly modeling and quantifying the privacy-utility trade-off for subjects during multistage attacks. In this way, the team bridged the gap between more complex threat models and informed data-sharing decisions.
Maintaining the privacy of genomic data
Computer scientists have developed many methods to prevent the re-identification of biomedical data, from both regulatory and technical perspectives. Nevertheless, most approaches focus on worst-case scenarios, which can lead to overestimating the privacy risk. To avert this problem, researchers have introduced risk assessment and mitigation based on game-theoretic models. Wan et al. showed how such a model could reveal the optimal sharing strategy for data subjects, conducting experiments on protection against a multistage attack using both real-world data and large-scale simulations. The results showed that the game-theoretic model could efficiently assess and effectively mitigate privacy risks. The recommended sharing strategy minimizes the chance of successfully re-identifying a data subject while maximizing data utility, keeping the released dataset useful and the process of data sharing fair.
Experiments with game theory models
During the experiments, the scientists considered a scenario in which a data subject chooses how much of their genomic data to share in a public repository such as the 1000 Genomes Project or the Personal Genome Project. The subject could share the entire sequence, a subset of short tandem repeats, or nothing at all. The goal of this work was to determine the subject's optimal sharing decision, weighing the monetary benefit of data sharing against the risk of re-identification by an adversary. In this Stackelberg model, the subject acts as the leader, choosing how much of their genomic data to share via a masking strategy, and the adversary acts as the follower, observing the shared data and then deciding whether to execute an attack; the adversary's incentive to attack depends on the masking strategy chosen.
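The leader-follower structure described above can be sketched in a few lines of Python. This is a minimal illustration solved by backward induction, not the paper's actual model: all strategy names, payoff values and probabilities below are hypothetical assumptions chosen only to show how the subject anticipates the adversary's best response.

```python
# Minimal sketch of a two-player Stackelberg game between a data subject
# (leader) and an adversary (follower). All numbers are illustrative
# assumptions, not values from the study.

LEADER_STRATEGIES = ["share_all", "share_strs", "share_nothing"]

# Hypothetical parameters: sharing benefit to the subject, re-identification
# probability per strategy, subject's loss if re-identified, and the
# adversary's gain from a successful attack versus the cost of attacking.
BENEFIT = {"share_all": 100.0, "share_strs": 40.0, "share_nothing": 0.0}
REID_PROB = {"share_all": 0.30, "share_strs": 0.05, "share_nothing": 0.0}
LOSS_IF_REID = 300.0
ATTACK_GAIN = 200.0
ATTACK_COST = 20.0

def follower_best_response(strategy):
    """Adversary attacks only if the expected gain exceeds the attack cost."""
    return REID_PROB[strategy] * ATTACK_GAIN > ATTACK_COST

def leader_payoff(strategy):
    """Subject's payoff: sharing benefit minus expected re-identification loss."""
    attacked = follower_best_response(strategy)
    expected_loss = REID_PROB[strategy] * LOSS_IF_REID if attacked else 0.0
    return BENEFIT[strategy] - expected_loss

def solve_stackelberg():
    """Backward induction: the leader anticipates the follower's best response."""
    return max(LEADER_STRATEGIES, key=leader_payoff)

print(solve_stackelberg())  # with these numbers, partial sharing wins: share_strs
```

With these illustrative numbers, sharing everything invites an attack (expected loss 90 outweighs part of the benefit), while sharing only a subset of short tandem repeats makes attacking unprofitable for the adversary, so the masked release yields the subject's best payoff.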
Outlook: Data analysis and anonymization
The team then simulated populations in which each record carried 20 attributes, including an ID and surname, as well as 16 genomic characteristics. The simulation compared several scenarios in which the adversary attempted to re-identify all records. Among them, the 'no-protection' scenario showed the highest variation relative to data utility. The team used a machine with a six-core, 64-bit central processing unit to compute the resulting strategies for all 1,000 data subjects in each scenario.
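One common way to score re-identification risk in simulated populations like the one above is by quasi-identifier uniqueness: a record that shares its demographic values with many others is harder to single out. The sketch below illustrates that idea; the attribute names and records are hypothetical, and this equivalence-class scoring is a standard technique rather than the paper's exact risk measure.

```python
from collections import Counter

def reid_risk(records, quasi_identifiers):
    """Risk of each record = 1 / size of its equivalence class, i.e. the
    number of records sharing the same quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [1.0 / counts[k] for k in keys]

# Illustrative population: two indistinguishable records and one unique one.
population = [
    {"year_of_birth": 1960, "state": "TN", "sex": "M"},
    {"year_of_birth": 1960, "state": "TN", "sex": "M"},
    {"year_of_birth": 1985, "state": "CA", "sex": "F"},
]

risks = reid_risk(population, ["year_of_birth", "state", "sex"])
print(risks)  # [0.5, 0.5, 1.0] — the unique record is fully identifiable
```

Masking an attribute (for example, generalizing year of birth to a decade) enlarges the equivalence classes and lowers each record's risk, which is the privacy side of the privacy-utility trade-off the study optimizes.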
Wan et al. also tested the sensitivity of the model against eight parameters in three experimental settings and conducted a case study to show the model's application to real-world datasets. Using Craig Venter's demographic attributes, including year of birth, state of residence, and gender, the method allowed the subject to make informed data-sharing decisions when faced with complex, state-of-the-art re-identification attacks. The flexible method can help answer questions about the risks of sharing de-identified data in an open data repository and help individuals decide which portion of their data to share. Zhiyu Wan and colleagues discuss the game-theoretic model, including its limitations, and note how it can be expanded to provide directions for future work. For instance, the team envisions integrating the solution as a service into existing anonymization software.
The best way to protect personal biomedical data from hackers could be to treat the problem like a game
1. Zhiyu Wan et al, Using game theory to thwart multistage privacy intrusions when sharing data, Science Advances (2021). DOI: 10.1126/sciadv.abe9986
2. W. Nicholson Price et al, Privacy in the age of medical big data, Nature Medicine (2018). DOI: 10.1038/s41591-018-0272-7
© 2021 Science X Network
Using game theory to thwart multistage privacy intrusions when sharing data (2021, December 23)
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.