AI models: data protection does not prevent abusive secondary use
Can AI's hunger for data simply be satisfied with anonymized data? A lawyer and an ethicist explain why anonymous data actually poses a greater risk.
Data is valuable, especially for training AI models. There is regular discussion about whether and how complete anonymization can be guaranteed so that the data can be passed on.
However, another aspect is often neglected: anonymization does not protect the data against abusive reuse. This also emerges from recent research by lawyer Hannah Ruschemeier and philosopher Rainer MĂĽhlhoff, which sheds light on companies that reuse data with questionable intentions. One company, for example, offers its software as a service in the HR sector to recognize depressed applicants by their voice.
To prevent this and similar practices, Ruschemeier and MĂĽhlhoff are calling for AI models to be earmarked for a specific purpose. We spoke to the two of them for an assessment of the potential risks involved in the careless sharing and use of sensitive data.
The interview was conducted in German.
The regulation for a European Health Data Space is intended, among other things, to ensure that large amounts of data are available to researchers for training AI models; at the national level, the Health Data Use Act paves the way. Can you first say a few words about the positive aspects and about where the limits lie?
Ruschemeier: Of course, there is potential to improve medical treatment processes or even therapies with artificial intelligence. This is already happening to some extent, for example in the field of imaging procedures.
However, we also need to be aware that unless we define precisely what the positive benefit of something is in terms of the common good, we open the door to misuse even with supposedly good applications. The data that is made available, and the AI models built from it, can then also be used for purposes that are not oriented towards the common good or are outright harmful, for example discriminatory or purely profit-driven applications.
Does that then no longer serve the common good?
MĂĽhlhoff: It is a common and realistic phenomenon that data is collected, or AI models are built, for charitable purposes but then put to a secondary use that no longer serves the common good.
For example: if patient data is used to build an AI that can diagnose, based on a person's voice, whether they have depression, this initially offers a medical benefit. In principle, everyone can benefit from it as long as it is used in medicine. However, it is conceivable that an AI model of this kind could change its application context: if it is used in a job interview, for example, it can serve to discriminate against people. Unfortunately, this example is not fictitious; precisely such AI systems are currently in demand in the field of personnel management.
In order to understand this risk of abusive secondary use, we first need to be aware that there are not only the positive applications of medical data and AI that usually dominate public discussion, but also abusive or harmful ones. Defining exactly where the dividing line between them lies is important if we want regulation that promotes the potential for innovation in the interest of the common good. We want regulation that enables positive applications and restricts misuse.
Is this not defined for the European Health Data Space?
Ruschemeier: The EHDS explicitly sets out permitted and prohibited purposes for the secondary use of health data. The permitted purposes are very broad and include education and health-related research. It is very important that certain commercial secondary uses are prohibited: concluding credit and insurance contracts and carrying out advertising activities.
It is unclear, however, how the further use of anonymized data beyond this secondary use can be controlled, and in particular how compliance by health data users who are permitted to access the EHDS can be ensured. This is because passing data on to third parties is not in itself an unauthorized secondary use.
Two aspects are relevant for us. Firstly, a democratic understanding of purposes is required: what are good, public-welfare-oriented purposes, and what are bad ones? Certain purposes should be prohibited, while others are decidedly worthy of support. The EHDS already takes a very rudimentary approach to this, but we would argue that such an approach should not be limited to health data.
One could argue that the EHDS gives data subjects a right to object to secondary use; if they do not object, it is permitted. We would say that this is not enough to limit the risks of secondary use, for several reasons. Firstly, data subjects must be effectively informed that this right to object exists and what consequences secondary use of their anonymized data may have for third parties. Secondly, the EHDS also provides for exceptions to the right to object, for example for "scientific research for important reasons of public interest". Thirdly, and this is the crucial point: the effects of using anonymized health data, or the resulting AI models, potentially affect all members of society. This is because the data and tools can then, in principle, be applied to any third party.
However, this collective dimension of responsibility is not reflected at all in the right of objection held by the individuals contained in the data. Nobody is aware that they are not deciding only for themselves but for the entire community – and no individual can make such a far-reaching decision alone. Consent and objection are therefore the wrong instruments in situations like this, where the collective impact of data processing or AI is at stake.
So the data is effectively unregulated then?
MĂĽhlhoff: The anonymization of data does not effectively prevent its harmful use, especially in applications that could affect any third party, i.e. all of us. Anonymized data can be used to train models that make predictions about third parties who are not even in the data set. This makes a new type of privacy violation possible: in our research we speak of "predictive privacy", which can be violated through prediction, i.e. through predicted information rather than through leaked or stolen information.
Anonymizing the data therefore does not prevent the training of AI models from creating what mathematician Catherine Helen O'Neil would call "weapons of math destruction": very powerful AI tools that make it possible to predict, for example, diseases for any third party who is not included in the data. Such a tool in the hands of third parties can enable discrimination, for example by the insurance industry. In job interviews, it can lead to people with a higher health risk no longer being offered jobs.
The new AI Regulation prohibits emotion recognition systems in the workplace and classifies AI systems used in the work environment as high-risk. However, it is questionable whether these rules are sufficient – private providers, for example, are not even subject to the obligation to carry out a fundamental rights impact assessment.
What is not considered in the debate is that this risk of misuse is not only present with anonymous data, but is even greater, because anonymous data is less well regulated. The General Data Protection Regulation does not apply to anonymous data. As long as data is not anonymized, there are precise rules for handling it, such as purpose limitation. The moment the data is anonymized, or an AI model is trained from the anonymous data, this purpose limitation falls away and there are no longer any rules. In the age of AI, anonymization therefore has only a limited protective function: data protection regulations no longer apply, but the risks remain.
Karl Lauterbach announced at the end of 2024 that companies such as OpenAI, Google and Microsoft are already queuing up for the pseudonymized data at the Health Research Data Center. How do you assess this?
Ruschemeier: Private companies have a considerable economic interest in such data. These players are also the only ones who have the computing power, the technical infrastructure and the know-how to quickly build very powerful AI tools. All other players who want to do this would have to fall back on their infrastructure. In the state sector, this can lead to dependencies and problems for digital sovereignty. At the same time, small companies that may actually have something good in mind are disadvantaged. Here we see the strong concentration of power in the AI sector, which means that the big players will initially benefit most from the health data that is now available.
The current explorative approach that is being pursued with data from electronic patient records, for example, must clearly be limited to research that pursues public welfare-oriented purposes. It becomes difficult when the data is then used in other areas such as the insurance industry, in HR or by the state in law enforcement. Then there is an entirely different purpose behind it, which was not recognizable to the individuals concerned. Appropriate regulation is therefore very important, but it must also be implemented effectively.
We see that Google, for example, has to pay very high fines time and again. Do such measures only affect small companies?
MĂĽhlhoff: We believe that the existing benchmark of four percent of annual turnover is sensible, as it scales with the size of the company. However, it is also important that infringements are effectively enforced.
Will we be able to control the risk of misuse of data as a result?
Ruschemeier: We are very aware that regulating data sets is much more difficult than regulating AI models. With AI models, at least we know that they can't suddenly sprout up all over the world because they require a certain infrastructure.
It's completely different with data. There is no constitutional basis for regulating anonymized data. You can't simply regulate something and restrict freedom at will: we live in a liberal society where everything that is not prohibited is allowed, and anyone who wants to prohibit something has to justify it. That means you always have to say what the risk and the potential for harm are. Harm cannot be derived from anonymized data in general terms.
What can be defined, however, is the target group of such regulation and the places where you have to look. Big Tech should be obliged to use data for training models only if they know where it comes from and can prove it. At the moment, though, we are seeing how OpenAI and others have scraped almost the entire web and simply trained models on it despite copyright infringements. I'm not particularly hopeful.
What about open data approaches?
MĂĽhlhoff: A widespread idea is that open data is beneficial for democracy. That is true in many respects. However, we have to rethink things with AI because negative or abusive uses of published data suddenly become possible. We also need purpose limitation and corresponding licenses in the area of open data. For moral reasons, we need to license data in such a way that it cannot be used completely freely and without purpose limitation.
(mack)