Data quality: BSI sets high bar for training AI systems

The BSI has published a catalog for the quality assurance of training data in AI applications. It is primarily concerned with documentation and data management.

The quality of training data is a decisive factor for AI projects, from both a technical and a regulatory perspective. Compliance with quality requirements is no longer merely voluntary: the EU's AI Regulation (AI Act) now makes them mandatory, especially for high-risk systems. The German Federal Office for Information Security (Bundesamt für Sicherheit in der Informationstechnik, BSI) has therefore published a catalog for the quality assurance of training data in AI applications (Quaidal). With the guidelines, the authority aims to translate requirements that cover aspects such as relevance, accuracy and completeness into concrete action modules.

For decision-makers, quality assurance is about "the reliability and legal compliance of AI applications", while for developers it is about the basis of "powerful, robust and comprehensible models", explains the BSI. The office has now published the guidelines on its overview page on artificial intelligence. Insufficient data quality can "not only lead to inefficient or distorted results", it says. It also poses risks to safety, fairness and social acceptance.

According to the AI Act, training, validation and test data for high-risk systems must be "relevant, representative, accurate and complete", the authors explain. In particular, they must not contain any biases "that could lead to discriminatory or harmful results". This puts data quality at the center of regulatory attention and makes it a measurable prerequisite for the permissibility and marketability of many AI systems.

Quality-assured training data is also essential from a technical perspective, explains the BSI: it makes it possible to develop models that "learn efficiently, make robust decisions and behave in a comprehensible manner". To achieve this, the source material must be "correct, complete and free of systematic errors". Equally important is resistance to manipulation, i.e. the ability to withstand cyberattacks. Weaknesses in the data could serve as attack vectors with massive consequences, for example in autonomous driving, finance or medical diagnostics.

It is important to take these quality requirements into account "right from the early stages of the AI life cycle", the office says, referring to the collection, cleansing and processing of data. This is where it is decided "whether a system is based on a stable, fair and legally viable data foundation". This requires "targeted measures, a structured approach and close cooperation between specialist departments, data managers and development teams".

Based on common norms and standards, the creators of the catalog define ten central quality criteria such as representativeness or diversity. They map these to 143 metrics and methods to allow a "detailed and holistic assessment of data quality". To reproduce the target population realistically, it is essential to record as many relevant characteristics as possible. In addition, deliberate weighting of subgroups and sufficient data coverage help to represent even rare constellations accurately. To avoid distortions, systematic imbalances in the distribution must also be recognized and reduced.
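
To make the idea of subgroup weighting and coverage more tangible, the following minimal Python sketch compares the observed share of each subgroup in a sample with an assumed target share in the population. It is an illustration only and not one of the catalog's 143 metrics; the function names, the subgroup labels and the target shares are hypothetical.

```python
from collections import Counter


def subgroup_shares(labels):
    """Return the observed share of each subgroup in the sample."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def reweighting_factors(labels, target_shares):
    """Compute one weight per subgroup as target share / observed share.

    Subgroups that appear in the target population but not in the sample
    cannot be fixed by weighting; they are returned as coverage gaps.
    """
    observed = subgroup_shares(labels)
    weights = {}
    coverage_gaps = []
    for group, target in target_shares.items():
        if group not in observed:
            coverage_gaps.append(group)
        else:
            weights[group] = target / observed[group]
    return weights, coverage_gaps


# Hypothetical sample: "urban" is over-represented, "remote" is missing.
sample = ["urban"] * 80 + ["rural"] * 20
# Assumed target shares of the population (illustrative values only).
target = {"urban": 0.6, "rural": 0.3, "remote": 0.1}

weights, coverage_gaps = reweighting_factors(sample, target)
print("weights:", weights)
print("coverage gaps:", coverage_gaps)
```

A weight below 1 marks an over-represented subgroup and a weight above 1 an under-represented one. A subgroup listed as a coverage gap has no records at all, so it can only be addressed by collecting more data, not by weighting.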

"We must ensure that applications with artificial intelligence meet high quality requirements," emphasized BSI President Claudia Plattner on the publication of the draft. "This is the only way we can create and use trustworthy AI." She invited the community to provide comments and suggestions. The office is also making the first version available in machine-readable form in two GitHub repositories.

(mki)

This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.