"AI anonymization of judgments only makes sense if it surpasses humans"

AI systems are intended to anonymize court judgments. In an interview with heise online, Stephanie Evert explains why this only makes sense if it is done fully automatically and under strict evaluation.


Stephanie Evert is a professor of corpus and computational linguistics at the Friedrich-Alexander-Universität Erlangen-Nürnberg.


Increasingly, AI systems such as JANO are being used to anonymize court judgments. In several research projects, Prof. Stephanie Evert and her team have investigated whether and under what conditions court decisions can be anonymized fully automatically – reliably enough to be published at scale. In this interview, she talks about technical limitations and why, in her view, semi-automatic solutions are not sufficient.

Several federal states are currently working on AI-supported anonymization of court judgments. How do you assess these developments?

Stephanie Evert: What is communicated publicly often does not reflect the actual state of research and development. As early as 2023, there were high-profile press releases from Hesse and Baden-Württemberg, even though JANO was merely a pilot project. What exactly is being used technically remains fairly unclear to this day. To our knowledge, these are predominantly assistive systems – not fully automatic anonymization, but tools meant to support the manual processing of judgments.

Your own project started much earlier. How did that come about?

We had a research mandate from the Bavarian Ministry of Justice, which began in early 2020. Interestingly, both sides initially assumed at the time that fully automatic anonymization would probably not be achievable reliably. The goal was therefore to verify this scientifically. Our focus from the outset was on evaluation: we wanted to be able to state reliably what is possible – and what is not.


What does that mean in practice?

We created very high-quality gold standards. That means real judgments in which sensitive text passages were manually annotated and cross-checked by several people. This is extremely time-consuming, but necessary for making reliable statements. In computational linguistics, 95 or 97 percent accuracy is often considered very good. For anonymization, that is not enough: we are dealing with highly sensitive personal data. If you want to ensure that a system finds and masks nearly all such data, you also need a gold standard that can back up that claim.

Yet you still achieved high scores.

Yes, that actually turned out to be feasible, but only with a highly specialized model. In a very narrow domain – district court judgments in rental and traffic law – we achieved around 99 percent recall for direct identifiers such as names, addresses, or dates of birth (editor's note: recall is the metric indicating what proportion of the sensitive text passages that actually exist the system finds). This was possible by fine-tuning pre-trained language models (so-called LLMs) specifically for the anonymization task in this domain. Importantly, this quality is not achieved out of the box, but only when the system is trained very specifically on a particular type of text – and with an extremely high-quality gold standard. In a follow-up project funded by the BMFTR, we were able to extend this quality to a number of other legal areas.
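To illustrate the metric for readers less familiar with it: recall can be computed by comparing the text passages marked as sensitive in the gold standard with those the system actually masks. The following is a minimal sketch with made-up character offsets, not the evaluation code used in the project.

# Minimal illustration of recall (hypothetical data, not the project's evaluation code).
# Gold-standard and system annotations are assumed to be sets of (start, end) character offsets.

def recall(gold_spans, system_spans):
    """Share of gold-standard sensitive spans that the system also masked."""
    if not gold_spans:
        return 1.0  # nothing sensitive to find
    found = gold_spans & system_spans  # exact-match comparison for simplicity
    return len(found) / len(gold_spans)

# Toy example: three sensitive passages in the gold standard, the system misses one.
gold = {(10, 18), (42, 55), (120, 131)}
system = {(10, 18), (42, 55), (200, 210)}
print(recall(gold, system))  # 0.67 -> two of three sensitive passages were found

A recall of 99 percent thus means that, on average, only about one in a hundred sensitive passages is missed.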

Nevertheless, many judicial administrations rely on semi-automatic processes where a human checks at the end. Why do you consider that problematic?

That sounds very reasonable at first. You think: if a human looks at it in the end, it is safer. With anonymization, however, that is not automatically the case. Humans actually make more mistakes here than machines do – above all oversight errors. And in anonymization, a single overlooked name is enough to make a person identifiable.

We have also seen this empirically, both in our corpora and in already published judgments. In manual anonymization, information regularly slips through. Especially with long texts or when names appear multiple times, attention decreases. An automatic system is often more consistent here: Either it recognizes a name – then usually every time – or it doesn't recognize it at all.

That contradicts the widespread intuition that humans are more careful?

Yes, that's counterintuitive. But in practice, this is what happens: The better a semi-automatic system works, the more people trust it and accept its suggestions. Then attention decreases for the few cases where the system actually makes mistakes.

An example from our evaluation is "Witness Wiese." "Wiese" – the German word for "meadow" – doesn't look like a typical name. The system therefore doesn't recognize it, consistently throughout the judgment.

So, human control doesn't automatically increase security?

Exactly. Especially not if it's only intended as a downstream control. If someone rereads an already anonymized document, the probability of finding exactly the few remaining errors is low. The task is too monotonous and too error-prone for that.

Therefore, we say: If a system is used, it must have been evaluated so thoroughly beforehand that it demonstrably performs better than human work in a clearly defined domain. Only then is fully automatic use justifiable.

Why do ministries opt for semi-automatic solutions?

A key point is responsibility. As long as a human is involved, the administration feels legally on the safe side. If something goes wrong, you can say: A human checked it. With a fully automatic system, it's unclear who is responsible – the manufacturer, the ministry, the judges?

That is administratively understandable. Technically, however, it is not a convincing argument. A poorly evaluated semi-automatic system is not safer than a well-evaluated automatic one – quite the opposite.

Does this argument also apply to other AI applications, for example in medicine?

No, that must be clearly distinguished. In medicine, it's about decisions that require individual consideration – diagnoses, therapies, risks. Human responsibility is central there.

With anonymization, the task is much more clearly defined: There are relatively clear criteria for what needs to be anonymized – at least since our research projects. That's precisely why this task can be evaluated so well. And that's precisely why one can argue here that an automatic system is more suitable than human processing under certain conditions.

What would be the right path for the judiciary in your view?

Fully automatic anonymization – but only where it is demonstrably reliable. That means: for specific court levels (AG, LG, or OLG, i.e. local, regional, or higher regional courts), specific legal areas, and specific text types. And with accompanying procedures that detect changes, for example in writing style or document format.

Semi-automatic systems can help gain initial experience. But they will not get us to the point of publishing truly large volumes of judgments. For that, we need systems that, after careful evaluation, can be trusted to work independently.

(mack)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.