Secure handling of medical data: AnoMed enters its second round
There are various methods for securely handling health data – some are still too computationally intensive, others not yet secure enough.
(Image: metamorworks/Shutterstock.com)
“The classic anonymization procedures have not worked,” says Prof. Esfandiar Mohammadi, head of the “AnoMed” project, at an event in Berlin. The project is about using health data for research and AI development without endangering patients' privacy. At the kick-off meeting for its continuation, “AnoMed 2”, funded by the Federal Ministry of Research, Technology, and Space (BMFTR), the participating teams demonstrated how this is to be achieved.
For Mohammadi, this is not only a technical but also a societal issue. “Privacy does not only concern the individual but is also a matter of a liberal democracy,” he told heise online. “If we all become transparent people and some companies or governments have a perfect personality profile of every person, then they can manipulate masses of voters in a targeted way.” That is precisely why research must now step in and develop procedures for using particularly sensitive data without endangering patient privacy.
Critical view from outside: anonymization in practice
However, a presentation by Prof. Dr. Fabian Prasser from Charité Berlin made clear how far research still is from reaching medical practice. Prasser has been researching for more than ten years how to make health data more accessible for medical research, and he draws a sobering interim conclusion: despite decades of research, extensive literature, and numerous conferences, privacy-enhancing technologies (PETs) have so far hardly found their way into everyday practice. “There are so many ideas, but if you look at what is actually used, it is a tiny fraction of all scientific approaches,” said Prasser.
He attributes this not to a lack of research quality but to structural hurdles: high infrastructure costs, a lack of expertise in data-producing institutions such as hospitals, legal uncertainties, and the limited flexibility of many procedures. The immediate benefit for data-providing institutions often simply does not outweigh the effort.
In addition, there is a core methodological problem that concerns the AnoMed consortium: data protection through anonymization always comes at the cost of information content. Prasser illustrated this with a concrete example from the coronavirus pandemic, when his team published anonymized patient registry data. It turned out that the case fatality rate calculated from the anonymized data deviated from the actual rate by up to ten percent – not tolerable for many clinical questions. Another study on the reproducibility of medical research results confirmed that none of the anonymization methods tested could fully replicate all the results of the original study.
Anonymization has so far worked well for feasibility studies, exploratory analyses, hypothesis generation, software tests, and as a supplement for training AI models – but it is not suitable as a substitute for original data in primary clinical studies with clear evidence requirements. Prasser sees the solution in tiered data usage that combines different access levels – an approach he outlined using the example of the Medical Informatics Initiative, where federated analyses, differential privacy, and pseudonymization are interlinked.
For the future, Prasser is relying on the European Health Data Space (EHDS) and secure processing environments where researchers do not receive the data but gain protected access to the data infrastructure. “The fact that it is now playing such a prominent role in the EHDS also says something about the hurdles that other methods have faced in practice.”
29 million euros for a new AI computing center
The goal of AnoMed is also a matter of infrastructure. As part of the project, the University of Lübeck has inaugurated a new AI computing center that will provide significantly more computing power for research work in the future. The Federal Ministry of Research is funding its construction with 29 million euros. A GPU cluster based on the latest water-cooled NVIDIA servers is being built on around 400 square meters, with an expected computing power of over 3,000 petaflops – enough to train very large AI models under high-security conditions.
Digital sovereignty
As a public institution, the computing center is intended to enable partners such as hospitals to process sensitive data locally for research – without dependence on commercial cloud services. “In the spirit of digital sovereignty, we are building a computing center that is large enough to run agentic systems and conduct machine learning research,” Mohammadi explained to heise online. “We will thus offer local services for our research and our research partners, such as hospitals. Unlike large cloud providers, our obligations are clear: we are a public institution and have the mandate to advance public research.”
Together with the research projects of the first and second phases, the BMFTR is thus funding the AnoMed research center with around 46 million euros.
Numerous projects
In the second funding phase, numerous projects ranging from cryptographic foundations to concrete medical application fields will be continued. Among others, the algorithm “DP-Hype” developed in AnoMed 1 was presented, a hyperparameter search that works in a privacy-preserving and federated manner. The special feature lies in the underlying cryptographic protocol: clients perform all calculations locally and then share only aggregated statistics. In AnoMed 2, DP-Hype is to be integrated into the open-source federated learning framework Flower to make the method easier to use.
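How such a protocol can keep raw results local is easy to illustrate. The Python sketch below is not DP-Hype's actual algorithm or its Flower integration (neither is detailed here); the candidate grid, the local_score() placeholder, and the Laplace-noise calibration are assumptions for illustration. Each client evaluates every hyperparameter candidate on its own data and releases only noise-perturbed statistics, which the server merely aggregates.

```python
# Minimal sketch of federated, privacy-preserving hyperparameter search in the
# spirit described above. This is NOT DP-Hype's published protocol; all names
# and parameters here are illustrative assumptions.
import random

CANDIDATES = [0.001, 0.01, 0.1]  # hyperparameter candidates (e.g., learning rates)
EPSILON = 1.0                    # per-client privacy budget
SENSITIVITY = 1.0                # assumed bound on how much one record moves a score

def local_score(local_data, candidate):
    """Placeholder: a client evaluates one candidate purely on its own data."""
    return random.random()  # stand-in for a local validation metric in [0, 1]

def laplace_noise(scale):
    """Laplace(0, scale) noise, built as the difference of two exponentials."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def client_report(local_data):
    """Client side: compute everything locally, release only noisy statistics."""
    scale = SENSITIVITY / EPSILON
    return [local_score(local_data, c) + laplace_noise(scale) for c in CANDIDATES]

# Server side: only aggregates the clients' noisy reports and picks a winner.
reports = [client_report(data) for data in range(10)]  # 10 stand-in clients
totals = [sum(col) for col in zip(*reports)]
print("selected candidate:", CANDIDATES[totals.index(max(totals))])
```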
However, anyone who wants to train models must not only keep the model's parameters under control but also protect the data itself. AnoMed pursues two paths here: on the one hand, machine learning on encrypted data is to become possible. While fully homomorphic encryption is considered the gold standard, it is still too computationally intensive for everyday use, which is why the project is investigating alternative cryptographic approaches. On the other hand, sensitive material should not have to be passed on in its original form at all: synthetic data is to reflect the properties of sensitive original data without allowing conclusions to be drawn about individuals. At the same time, targeted attacks on these synthesis methods are being developed in order to find vulnerabilities before others do.
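As one illustration of computing on data that never leaves a site in readable form, the toy below uses additive secret sharing, a standard lightweight alternative to fully homomorphic encryption; it is not necessarily one of the approaches AnoMed is pursuing, and all names and numbers are assumptions. Three hospitals jointly compute a total count without any party seeing another's raw value.

```python
# Toy additive secret sharing: a secret is split into random shares that only
# reveal anything when ALL shares are combined. Illustrative only; not one of
# AnoMed's actual protocols.
import random

MODULUS = 2**61 - 1  # arithmetic is done modulo a large prime

def share(secret, n_parties):
    """Split a secret integer into n random shares summing to it mod MODULUS."""
    shares = [random.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

# Three hospitals each hold a sensitive count; no single share reveals anything.
counts = [120, 45, 300]
all_shares = [share(c, 3) for c in counts]

# Each party sums the shares it received; combining the partial sums yields
# the true total without exposing any individual hospital's count.
partial_sums = [sum(col) % MODULUS for col in zip(*all_shares)]
print(sum(partial_sums) % MODULUS)  # 465
```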
Identifying groups of people
The work of Jorge Andresen, also a researcher at the University of Lübeck, shows how real this danger is. Using a simulated health dataset with four million entries, he was able to show that individual records can be reconstructed from aggregated statistics, making specific groups of people identifiable within a supposedly anonymous overall population. Linked to this is the cooperation with the Federal Institute for Drugs and Medical Devices (BfArM), which stores highly sensitive billing data at the Research Data Center (FDZ) Health and wants to test more robust methods for secure data sharing in the future.
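The underlying risk can be demonstrated with a textbook differencing attack, sketched below in Python; the register and the queries are hypothetical and unrelated to Andresen's simulated dataset or the FDZ's actual holdings. Two aggregate counts that differ by a single person are enough to expose that person's sensitive attribute.

```python
# Classic differencing attack: individual attributes leak from two
# "anonymous" aggregate counts. The register below is entirely made up.
records = {  # hypothetical register: person -> has_condition
    "A": True, "B": False, "C": True, "D": True,
}

def count_with_condition(names):
    """An aggregate query: how many people in the group have the condition?"""
    return sum(records[n] for n in names)

everyone = ["A", "B", "C", "D"]
all_but_d = ["A", "B", "C"]

# The attacker never sees individual rows, only the two published counts.
leak = count_with_condition(everyone) - count_with_condition(all_but_d)
print("D has the condition:", bool(leak))  # True, recovered from aggregates
```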
The MammothDP project therefore combines differential privacy, constant-time implementations, trusted execution environments, and role-based access controls into a holistic protection system. The group around Prof. Thomas Eisenbarth is also investigating how AI systems can be attacked through fault injection, for example via the Rowhammer vulnerability or voltage glitching, or via microarchitectural side channels.
Security, he explains, does not start with the algorithm but with the chip. The German Research Center for Artificial Intelligence (DFKI), in turn, is working on preparing input data so that classifiers function more reliably. On the medical application side, the focus is on generative models for synthetic ECG data, for example with regard to atrial fibrillation.
Anonymization of images
The project around Prof. Thomas Martinetz from the Institute for Neuro- and Bioinformatics at the University of Lübeck also showed how complex real-world anonymization, or more precisely privacy preservation, can be. The team is working on processing facial images so that they no longer allow conclusions about gender, without the alteration being visible. “Changing individual pixels is easy. The challenge is to holistically remove sensitive information while preserving everything else as much as possible so that the data remains usable for research,” said Martinetz.
All technical projects are accompanied by legal and regulatory analyses as well as studies on user acceptance of anonymized health statistics. Whether new methods are ultimately adopted is just as crucial as whether they work.
Board games for science communication
Another initiative from the AnoMed environment shows that data protection and privacy need not be topics confined to expert conferences. The internationally recognized privacy researcher Dr. Sebastian Meiser has already developed the educational board game “Spurensuche in der KI – Privatsphäreangriffe auf neuronale Netze” (Trace Search in AI – Privacy Attacks on Neural Networks), which can also be played online at anomed.de/anomed-brettspiel. A second game now follows, dealing with what lies behind the mathematical concept of differential privacy. In the game, which is based on the randomized response technique, each participant draws a person card with fictitious characteristics such as profession, hobby, or whether they snore.
Cards for privacy poker: epsilon indicates the probability of a truthful answer.
When asked a question, a person first decides how much privacy they want to give up and draws a card from a deck with a specific epsilon value. The smaller the epsilon, the less can be inferred from the answer. The larger the epsilon, the more is revealed. In the end, all players try to de-anonymize their table neighbors based on the collected answers.
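The exact card rules are not spelled out above, but the classic randomized response mechanism the game is based on is easy to sketch: with privacy parameter epsilon, a player answers truthfully with probability e^ε / (1 + e^ε) and lies otherwise. Each individual answer remains deniable, yet the true share in the group can still be estimated, as the illustrative Python snippet below shows (the 30 percent snoring rate and all function names are assumptions).

```python
# Classic randomized response: deniable individual answers, estimable totals.
import math
import random

def randomized_response(truth: bool, epsilon: float) -> bool:
    """Answer truthfully with probability e^eps / (1 + e^eps), else lie."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    return truth if random.random() < p_truth else not truth

def estimate_share(answers, epsilon):
    """Debias the noisy 'yes' rate to estimate the true fraction of 'yes'."""
    p = math.exp(epsilon) / (1 + math.exp(epsilon))
    observed = sum(answers) / len(answers)
    return (observed + p - 1) / (2 * p - 1)

# 1000 players, 30% of whom really snore; a small epsilon means each single
# answer reveals little, but the aggregate estimate still lands near 0.30.
truths = [random.random() < 0.3 for _ in range(1000)]
answers = [randomized_response(t, epsilon=1.0) for t in truths]
print(round(estimate_share(answers, 1.0), 2))
```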
(mack)