Learning from software errors – Part 1: Units. When numbers become misleading

Mars missions, stock market crashes and medical disasters – the most spectacular IT mishaps keep following the same patterns. But we can learn from them.



By Golo Roden

Most software errors go unnoticed and cause little fuss, but occasionally they are highly spectacular and costly. From failed space missions and stock market crashes to faulty medical devices, there is a long list of famous software mishaps. Studying them quickly reveals that most of these mistakes look like one-off disasters at first glance, but are in fact repetitions of familiar patterns.

This series of articles presents nine of these typical error classes, which occur again and again in practice regardless of industry or technology. For each category, it examines a specific example, analyzes its causes and deduces what software developers can learn from it in the long term.

One of the most popular anecdotes in software history is that of Grace Hopper and the famous moth. In 1947, Grace Hopper was working on the Harvard Mark II computing system when a real moth became trapped in one of its relays. After a team member found and removed it, he taped it into the logbook with the comment "first actual case of bug being found". The logbook is now held by the Smithsonian's National Museum of American History in Washington.

This episode has become a legend of IT culture, but the word "bug" was not invented back then: Thomas Edison was already writing about "bugs" in machines, meaning small, hard-to-find faults, in letters as early as 1878. Nevertheless, Grace Hopper popularised the idea that software errors are something that can be found and removed like an insect.

But the reality is often more subtle: bugs are usually symptoms of systematic weaknesses. Behind almost every spectacular glitch lies a pattern that recurs again and again in varied forms. It is therefore worth looking not only at the individual incidents, but above all at the error categories they represent.


This series begins with a topic that developers deal with every day as a matter of course – which is perhaps precisely why it is so dangerous and so easily underestimated: handling numbers.

Few examples of software errors are cited as frequently as the loss of NASA's Mars Climate Orbiter probe in 1999. The aim of the mission was to study the Martian atmosphere. After a months-long journey, the probe approached the planet – and burned up. The cause was almost grotesquely simple: the teams had mixed up metric and imperial units of measurement; one piece of ground software supplied thruster impulse data in pound-force seconds, while the navigation software expected newton-seconds. The result was a systematic navigation error that put the probe on a trajectory far too low in the Martian atmosphere.

This incident shows that numbers without context are dangerous. A number like 350 can represent a speed, a force, an energy – or something completely different. As long as software treats data as raw numbers, there is a risk that someone will misinterpret them. In large projects with several teams, this risk increases when each side makes implicit assumptions that no one has explicitly documented or technically validated anywhere.
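A minimal sketch of how such a misunderstanding can look in code – the function name and the figure of 350 are purely illustrative and not taken from the actual orbiter software:

    # One team's module reports a thruster impulse as a bare number.
    # Internally, the figure is meant in pound-force seconds (lbf*s).
    def reported_impulse() -> float:
        return 350.0  # illustrative value only

    LBF_S_TO_N_S = 4.448222  # 1 lbf*s corresponds to about 4.448222 N*s

    impulse = reported_impulse()
    assumed_newton_seconds = impulse               # silently wrong by a factor of ~4.45
    actual_newton_seconds = impulse * LBF_S_TO_N_S

    print(f"assumed: {assumed_newton_seconds:.1f} N*s")
    print(f"actual:  {actual_newton_seconds:.1f} N*s")

Nothing in this code is wrong in a way a compiler or test could detect; the error lives entirely in the unstated assumption about the unit.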

From a quality assurance perspective, such errors are particularly treacherous: unit tests and integration tests alike can pass as long as the wrong units are used consistently. The problem often only becomes apparent when sensors, actuators or external systems come into play whose data does not match the assumptions made beforehand. The lesson from this incident is clear: numbers need meaning. Modern programming languages and frameworks offer various ways of making this meaning explicit:

  • Value Objects or wrapper types: Instead of double or float, a dedicated type such as ForceInNewton or VelocityInMetersPerSecond is used. This makes the unit part of the type information (the first sketch after this list illustrates the idea). Some programming languages, such as F#, even support units of measure as part of the language.
  • Libraries for physical units: They enable automatic conversions and enforce correct combinations of units (see the second sketch below).
  • Interface contracts and end-to-end tests: API definitions should name units. Tests with real data uncover discrepancies before they lead to disasters.
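A minimal value-object sketch, here in Python; the class and function names are chosen for illustration and are not part of any particular framework:

    from dataclasses import dataclass


    @dataclass(frozen=True)
    class NewtonSeconds:
        """An impulse whose unit is part of the type."""
        value: float

        @classmethod
        def from_pound_force_seconds(cls, value: float) -> "NewtonSeconds":
            return cls(value * 4.448222)


    def plan_trajectory(impulse: NewtonSeconds) -> None:
        print(f"planning with {impulse.value:.1f} N*s")


    plan_trajectory(NewtonSeconds.from_pound_force_seconds(350.0))  # explicit conversion
    # plan_trajectory(350.0)  # a type checker such as mypy rejects this call

The same discipline applies at interface boundaries: an API field named impulse_in_newton_seconds documents its unit just as explicitly as a wrapper type does inside the code.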
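For the second point, a small sketch using pint, one of several such libraries for Python: units travel with the values, conversions are explicit, and dimensionally nonsensical operations fail loudly instead of silently.

    from pint import UnitRegistry

    ureg = UnitRegistry()

    # The same 350, but this time the unit is attached to the value.
    impulse = 350.0 * ureg.force_pound * ureg.second

    # Conversions are explicit and automatic ...
    print(impulse.to("newton * second"))  # roughly 1556.9 newton * second

    # ... and mixing incompatible dimensions raises an error.
    try:
        impulse + 5 * ureg.meter
    except Exception as error:  # pint raises a DimensionalityError here
        print(error)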

In addition to these technical measures, team culture also plays a major role. Projects that establish a common language for their data early on – be it through Domain-Driven Design (DDD) or simply through consistent documentation – are much more likely to avoid such errors.

The loss of the Mars Climate Orbiter hit NASA hard. However, it also led developers, at least in some projects, to pay more attention to unit errors and (hopefully) to take this class of error more seriously. The same applies to everyday software teams: anyone who passes on numbers without context is practically inviting the next bug.

In the next part you can read: Overflow, arithmetic and precision – when numbers go awry.

(mki)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.