Interview: On data protection risks with medical research data
Disease combinations are often unique – anonymization of health data is therefore particularly complex. An interview on reconstruction risks.
The electronic patient record (elektronische Patientenakte, ePA) and the planned European Health Data Space (EHDS) are intended to make medical information usable for research and care across Europe via national contact points. The vision: more knowledge about disease progression, better therapies, faster research. To this end, data is collected at the Forschungsdatenzentrum Gesundheit (FDZ Gesundheit), which is located at the Federal Institute for Drugs and Medical Devices (Bundesinstitut fĂĽr Arzneimittel und Medizinprodukte, BfArM). In the future, more data is to be added, for example from medical registries and from the electronic patient records of the statutorily insured.
The BfArM currently holds the billing data of almost all statutorily insured individuals dating back to 2022. According to the BfArM, 36 applications for access to the research data have already been received. To speed up the research process, the BfArM provides data models, test data, and instructions on GitHub and Zenodo in addition to the dataset description.
BfArM wants to improve anonymization
The BfArM is currently working with researchers on stronger data protection within the project "Anonymization for Medical Applications" (Anomed 2). "The goal of the FDZ Gesundheit in this project is to develop innovative approaches for further improved anonymization of health data together with the project partners – including through the use of synthetic data and differential privacy methods," the BfArM stated at the end of October. Whether the first companies have already gained access to the research data remains unclear; the BfArM has not yet answered a question about this.
Jorge Andresen and Esfandiar Mohammadi from the Institute for IT Security at the University of LĂĽbeck, which is also involved in the project, have investigated how easily individual information can be reconstructed from supposedly anonymized health statistics. The two presented their as-yet-unpublished study "Reconstructing Health Data from Published Statistics" at this year's AnoSiDat conference.
They simulated a huge health dataset, ran algorithmic reconstruction attacks against it – and found: even aggregated data is not automatically secure. We spoke with Jorge Andresen about the background of the research.
What was the focus of your work?
We wanted to show that even seemingly harmless health statistics can pose a data protection risk. The electronic patient record and the EHDS are intended to provide large amounts of anonymized data for research. The idea is, of course, sensible – but without additional protective mechanisms, details about individual people can be derived from aggregated results. This has already been observed with the US census, and we have now been able to transfer this to medical data.
I asked myself: what do we need to do to keep the data secure? Can it simply be used to answer basic questions, for example: how many people have colon cancer? How many of them also have lung cancer? At first, these are just aggregate statistics that say nothing about individual people. But similar approaches were tested against the US census – and it turned out that individuals could indeed be reconstructed from supposedly anonymous statistics. Of course, you want to avoid that, especially with health data.
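Such counting queries are easy to picture in code. A minimal sketch, assuming a hypothetical table of patient records – the column names and values are invented for illustration:

```python
# Aggregate counting queries of the kind described above, run against a
# fictitious table of patient records. Column names are assumptions.
import pandas as pd

records = pd.DataFrame({
    "colon_cancer": [True, False, True, False, False],
    "lung_cancer":  [True, False, False, False, True],
})

# "How many people have colon cancer?"
n_colon = int(records["colon_cancer"].sum())

# "How many of them also have lung cancer?"
n_both = int((records["colon_cancer"] & records["lung_cancer"]).sum())

print(n_colon, n_both)  # only aggregate counts are released, no single rows
```

Each query on its own reveals only a count – the risk discussed in this interview arises when many such counts are combined.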
How did you recreate the attacks?
We created a synthetic health dataset – based on publicly available data from the RKI, health insurance companies, and more than 100 medical studies. Using a so-called Bayesian network, we were able to generate four million fact-based but fictitious records. Each record contains 44 attributes – age, various cardiovascular diseases, cancer, and so on. This allowed us to simulate a realistic population.
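To illustrate the generation step: a Bayesian network encodes how attributes depend on one another and is sampled root-first. The following is a minimal sketch with an invented three-variable network – the structure, variable names, and probabilities are assumptions, not the study's actual model:

```python
# Forward sampling from a tiny, hand-built Bayesian network:
# age -> smoking -> heart disease. All probabilities are invented.
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000  # the study generated four million records with 44 attributes

# Root node: age group (0 = young, 1 = middle, 2 = old), assumed marginals.
age = rng.choice(3, size=N, p=[0.3, 0.4, 0.3])

# Smoking depends on age group: assumed P(smoker | age).
smoker = rng.random(N) < np.array([0.15, 0.30, 0.20])[age]

# Heart disease depends on both parents: assumed P(disease | age, smoking).
p_heart = np.where(smoker, 0.05, 0.02) * (1 + age)  # risk rises with age
heart = rng.random(N) < p_heart

# Each row (age[i], smoker[i], heart[i]) is one fictitious, fact-shaped record.
```

Because every record is drawn from the network rather than copied from real people, the population is realistic in its statistics but fictitious in its individuals.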
What did the concrete attack look like then?
We used the so-called RAP-Rank reconstruction attack, which was originally developed against the US census. The attack trains several AI models to reconstruct a plausible dataset from aggregated results – such as "48.8 percent are male" or "40 percent of smokers have heart disease." In principle, it's puzzle work: you know many individual statistics and try to guess the original distribution from them.
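The puzzle logic can be shown in a drastically simplified form. The sketch below is not the authors' RAP-Rank implementation; it replaces the trained models with plain hill climbing on a toy dataset, but the objective is the same: find records whose aggregate statistics match the published ones.

```python
# Toy reconstruction from published statistics via hill climbing.
# Sizes, statistics, and the search strategy are simplifications.
import numpy as np

rng = np.random.default_rng(1)
n_rows, n_cols = 200, 5  # toy sizes; the study used millions of records

# Hidden "true" data and the statistics that get published about it:
truth = rng.random((n_rows, n_cols)) < 0.3
published = truth.mean(axis=0)                        # one-way marginals
published_pair = (truth[:, 0] & truth[:, 1]).mean()   # one two-way statistic

def error(candidate):
    # Distance between the candidate's statistics and the published ones.
    e = np.abs(candidate.mean(axis=0) - published).sum()
    e += abs((candidate[:, 0] & candidate[:, 1]).mean() - published_pair)
    return e

# Start from a random guess and keep any single-cell flip that helps.
candidate = rng.random((n_rows, n_cols)) < 0.5
best = error(candidate)
for _ in range(20_000):
    i, j = rng.integers(n_rows), rng.integers(n_cols)
    candidate[i, j] = ~candidate[i, j]        # propose a flip
    e = error(candidate)
    if e <= best:
        best = e                              # keep it
    else:
        candidate[i, j] = ~candidate[i, j]    # revert it

# Count guessed records that actually occur in the hidden data -
# the toy analogue of "reconstructed and correct".
truth_set = {tuple(row) for row in truth}
hits = sum(tuple(row) in truth_set for row in candidate)
print(best, hits)
```

The actual attack replaces this brute-force search with trained models and, as the name suggests, ranks the reconstructed records by confidence.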
And how well does it work?
Surprisingly well. We were able to reconstruct about six percent of all data points, of which 90 percent are correct. With an assumed target population of 73 million records, that would correspond to more than three million people. It is particularly concerning that many of these are unique records – combinations of diseases and characteristics that exist only once in the dataset. These are precisely the ones that would be easiest to identify in real life.
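The scaling behind "more than three million" follows directly from the interview's numbers:

```python
# Scaling the reconstruction rate to the assumed target population.
reconstructed = 0.06 * 73_000_000   # ~4.4 million records reconstructed
correct = 0.90 * reconstructed      # ~3.9 million of them correct
print(round(correct))               # 3942000
```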
In your study, you show that attacks remain successful even if only part of the statistics is known. How do you explain that?
This is because the published statistics are strongly correlated with one another. Even if you only know some distributions and combinations, an algorithm can draw surprisingly many conclusions. In our experiments, attackers were still able to reconstruct about one percent of the data even with reduced knowledge – that's not trivial.
You don't assign names, but the risk remains that individuals can be identified?
Yes, exactly. My attack does not assign names. But reconstructing such unique combinations would be a further step in that direction. If you then also have publicly available information or leaks, you could theoretically assign names – as happened with the US census, for example.
That sounds like aggregations are not a sufficient protection mechanism?
Correct. Publishing only aggregated results does not automatically protect anyone. That is the main conclusion of our study. Additional mathematical protective measures are needed – for example differential privacy, which deliberately injects statistical noise into the data so that conclusions about individuals are no longer possible, while the statistics themselves are only minimally altered.
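The textbook instance of this is the Laplace mechanism for counting queries. A minimal sketch – the epsilon value and the query are illustrative assumptions, not parameters from the study:

```python
# Laplace mechanism: make a counting query epsilon-differentially private.
import numpy as np

rng = np.random.default_rng(2)

def dp_count(true_count: int, epsilon: float) -> float:
    # Adding or removing one person changes a count by at most 1
    # (sensitivity 1), so Laplace noise with scale 1/epsilon suffices.
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

print(dp_count(12_345, epsilon=0.5))  # e.g. 12,343.1: the statistic barely
                                      # moves, but single individuals blur
```

Smaller epsilon means more noise and stronger protection; the art lies in choosing it so that research utility survives.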
What role does the EHDS play in this?
The EHDS is intended to enable such research access at the European level. If data protection is not handled properly from the outset, such reconstruction attacks could also become applicable to real health data. That would be disastrous, because health data is highly sensitive. One of the challenges is likely to be that the data accumulates over decades.
Did you include this aspect in your simulation?
Not yet fully. So far, we are looking at the static perspective. But of course: If data is collected over decades and new factors are added, the risk of re-identification increases significantly. A time-extended analysis is therefore a logical next step.
Would such attacks also be relevant for other data sources, such as social or educational data?
Wherever aggregated statistics with many cross-references are published, there are risks. For our main attack, we needed about 2400 different queries to the dataset. How well the attack works in other areas ultimately also depends on the complexity of the dataset. Therefore, it cannot be said across the board that an attack works just as well in other areas.
In your opinion, what urgently needs to happen before data from the ePA, the FDZ Gesundheit, and the EHDS is actually released for research?
On the one hand, it must be clear how "anonymous" is defined. The Federal Statistical Office has set out definitions for itself, but these have not yet been widely adopted. The protective mechanisms themselves should instead be evaluated against the respective dataset. On the other hand, there needs to be institutional control over which queries may be run against the data. In many EU drafts, these are still very open concepts. In Germany, too, the FDZ Gesundheit is, for example, still researching how the data from the ePA is to be protected later on. Yet precisely this clarity would strengthen the population's trust.
What are your plans?
We plan to test different data protection technologies against each other to show which offers the best protection while maintaining high research utility. The goal is that data protection and research do not exclude each other, but can function together technologically.
It's not about hindering research, but about realistically understanding risks. Only those who know where the weaknesses lie can reliably close them. Ultimately, it is of course desirable to find a solution that makes everyone happy.
(mack)