Learning from software errors – Part 2: Why did Ariane 5 explode after take-off?

A 64-bit floating-point value threw Ariane 5 off course. IT mishaps keep following the same patterns, but we can learn from them.


By Golo Roden

Numbers seem so natural in programs that many developers treat them intuitively as they would in mathematics: addition, subtraction, multiplication—what could possibly go wrong? In practice, however, there are numerous pitfalls lurking here. Memory limitations, rounding errors, conversions between data types, and the different treatment of integers and floating-point numbers repeatedly lead to catastrophic errors.


An iconic example of this is the failure of the first Ariane 5 launch in 1996. Just 37 seconds after launch, the rocket left its trajectory, began to spin uncontrollably, and eventually self-destructed. The cause was not a material problem or a hardware defect in the rocket, but a software error in the inertial navigation system.


Specifically, the system attempted to convert a 64-bit floating-point value into a 16-bit signed integer. On Ariane 5, however, the value was too large for the 16-bit target, so the conversion overflowed and raised an exception. As a result, the affected system shut down completely. Because the rocket had two identical systems running the same software in parallel, the same error occurred on the backup, so the redundancy could not help.
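The mechanism can be reproduced in a few lines. The following is only a minimal Python sketch, not the actual flight code; the function names and the sample value 40000.0 are hypothetical. It contrasts a conversion that silently keeps only the low 16 bits with one that checks the range first and raises an exception.

```python
import ctypes

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(value: float) -> int:
    """Truncate to an integer and keep only the low 16 bits (silent wraparound)."""
    # ctypes integer types perform no overflow checking.
    return ctypes.c_int16(int(value)).value


def to_int16_checked(value: float) -> int:
    """Convert to a signed 16-bit integer, raising OverflowError if it does not fit."""
    as_int = int(value)
    if not INT16_MIN <= as_int <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit into a signed 16-bit integer")
    return as_int


# Hypothetical reading that exceeds the 16-bit range.
reading = 40000.0
print(to_int16_unchecked(reading))  # -25536: plausible-looking but wrong
try:
    to_int16_checked(reading)
except OverflowError as error:
    print(f"conversion rejected: {error}")
```

Neither variant is safe on its own: the first corrupts data silently, and the second is only helpful if something sensible handles the exception.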

This raises the question: Why does something like this happen in a multibillion-dollar space program? The answer is instructive: the Ariane 5 software was based in part on that of its predecessor, Ariane 4, where the value ranges were smaller and a 16-bit integer was perfectly adequate. Ariane 5, however, flew a different trajectory with considerably higher horizontal velocity values. The old assumptions no longer applied, but nobody re-verified the corresponding code paths. After all, the software had already been working reliably for years.
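One way to make such inherited assumptions visible is to write the expected value range down next to the conversion and re-check it for every new operating profile. The sketch below assumes a hypothetical `ValueRange` type and purely illustrative numbers; none of them are taken from the Ariane flight data.

```python
from dataclasses import dataclass

INT16_MIN, INT16_MAX = -32768, 32767


@dataclass(frozen=True)
class ValueRange:
    """Documented minimum and maximum of a quantity in one operating profile."""
    minimum: float
    maximum: float

    def fits_int16(self) -> bool:
        return INT16_MIN <= self.minimum and self.maximum <= INT16_MAX


def review_reuse(name: str, expected: ValueRange) -> str:
    """Return a finding for a reused 16-bit conversion under a given profile."""
    if expected.fits_int16():
        return f"{name}: 16-bit conversion still safe"
    return (f"{name}: range {expected.minimum}..{expected.maximum} "
            "exceeds 16 bits, re-verify")


# Purely illustrative numbers, not real flight data.
old_vehicle = ValueRange(-20_000.0, 20_000.0)
new_vehicle = ValueRange(-50_000.0, 50_000.0)

print(review_reuse("old vehicle", old_vehicle))
print(review_reuse("new vehicle", new_vehicle))
```

A check like this costs almost nothing, but it only works if someone supplies the new profile's ranges, which is exactly the step that is skipped when code is reused "because it has always worked".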

This pattern can still be found in countless projects today:

  • Developers adopt old code paths without checking their validity for new operating conditions.
  • Implicit type conversions or missing range checks lead to overflows in borderline cases.
  • Error handling is missing or too global, as in the case of Ariane, where a single exception led to total failure.
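The last point, error handling that is too global, can be illustrated with a small Python sketch. The function names are hypothetical, and marking a failed reading as `None` is just one possible reaction; which fallback is actually safe depends entirely on the domain. `struct.pack` with the `h` format is used here because it raises `struct.error` for values outside the signed 16-bit range.

```python
import struct
from typing import List, Optional


def encode_int16(value: float) -> bytes:
    """Pack a value as a big-endian signed 16-bit integer.

    struct.pack raises struct.error if the truncated value is out of range.
    """
    return struct.pack(">h", int(value))


def encode_all_or_nothing(readings: List[float]) -> List[bytes]:
    """One bad reading aborts everything: the exception propagates to the caller."""
    return [encode_int16(reading) for reading in readings]


def encode_isolated(readings: List[float]) -> List[Optional[bytes]]:
    """A bad reading is marked as None; the remaining readings are still encoded."""
    encoded: List[Optional[bytes]] = []
    for reading in readings:
        try:
            encoded.append(encode_int16(reading))
        except struct.error:
            encoded.append(None)  # explicit "invalid" marker instead of a crash
    return encoded


# Hypothetical readings; the second one does not fit into 16 bits.
print(encode_isolated([100.0, 40000.0, -5.0]))
```

The point is not that errors should be swallowed, but that the scope of a failure should match the scope of the problem: one unusable value should not automatically take down everything else.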

In practice, developers encounter this risk time and again, even in completely everyday projects. Typical symptoms are:

  • Sudden jumps or negative values in counters,
  • NaN results or Inf values in floating-point calculations and
  • silent rounding errors that only become noticeable with large numbers or after a long runtime.
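All three symptoms are easy to provoke. The following Python sketch is only an illustration: because Python's own integers do not overflow, it uses `ctypes.c_int32` to imitate a fixed-width counter, and the variable names are arbitrary.

```python
import ctypes
import math

# 1. Counter overflow: a signed 32-bit counter jumps to a negative value.
max_int32 = 2**31 - 1
wrapped = ctypes.c_int32(max_int32 + 1).value
print(wrapped)  # -2147483648

# 2. Inf and NaN: an overflowing multiplication yields inf, and inf - inf is NaN.
overflowed = 1e308 * 10
undefined = overflowed - overflowed
print(math.isinf(overflowed), math.isnan(undefined))  # True True

# 3. Silent rounding: ten additions of 0.1 do not add up to exactly 1.0 ...
total = 0.0
for _ in range(10):
    total += 0.1
print(total == 1.0)  # False

# ... and above 2**53 a double can no longer represent every integer.
large = float(2**53)
print(large + 1.0 == large)  # True
```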

The frustrating part is that the countermeasures are well known but often neglected for reasons of time and cost:

  1. Explicit range analysis: Check whether all value ranges still fit, especially when reusing existing code.
  2. Saturating arithmetic or clamping: If a value exceeds the permissible range, set it to the nearest limit or abort the operation instead of allowing it to overflow unnoticed.
  3. “Fail fast” for critical conversions: Better a targeted error that shows up early than silent data corruption.
  4. Telemetry and monitoring: Monitor value ranges during operation and report conspicuous outliers.
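Clamping and monitoring in particular take very little code. The following Python sketch shows one possible shape; `clamp_to_int16`, `RangeMonitor`, and the 80 percent warning threshold are assumptions made for illustration, not an established API.

```python
import logging
import math

logger = logging.getLogger("range_monitor")

INT16_MIN, INT16_MAX = -32768, 32767


def clamp_to_int16(value: float) -> int:
    """Saturate: out-of-range values are pinned to the nearest 16-bit limit."""
    if math.isnan(value):
        # There is no meaningful limit to clamp NaN to, so fail fast instead.
        raise ValueError("cannot clamp NaN")
    return int(max(INT16_MIN, min(INT16_MAX, value)))


class RangeMonitor:
    """Tracks the largest magnitude seen and warns before the limit is reached."""

    def __init__(self, limit: float, warn_fraction: float = 0.8) -> None:
        self.warn_at = limit * warn_fraction
        self.peak = 0.0

    def observe(self, value: float) -> bool:
        """Record a value; return True if it is conspicuously close to the limit."""
        magnitude = abs(value)
        self.peak = max(self.peak, magnitude)
        if magnitude >= self.warn_at:
            logger.warning("value %s is approaching the 16-bit limit", value)
            return True
        return False


monitor = RangeMonitor(limit=INT16_MAX)
for sample in (1000.0, 30000.0, 1e9):  # hypothetical samples
    monitor.observe(sample)
    print(sample, "->", clamp_to_int16(sample))
```

Whether clamping is acceptable depends on the use case: a saturated value is still a wrong value, just a bounded one, which is why the monitor reports it rather than hiding it.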

The psychological component is also interesting here: many teams rely on their test coverage and overlook the fact that their test data is often too well-behaved. Boundary values, extreme ranges, and unusual combinations are frequently missing. Only property-based testing, fuzzing, or targeted boundary tests uncover the critical cases.
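What such tests might look like can be sketched with the standard library alone. The example below is a hand-rolled approximation of the idea, not a replacement for a dedicated property-based testing or fuzzing tool; the function under test, `clamp_to_int16`, and all other names are hypothetical.

```python
import random

INT16_MIN, INT16_MAX = -32768, 32767


def clamp_to_int16(value: float) -> int:
    """Function under test: pin a value to the signed 16-bit range."""
    return int(max(INT16_MIN, min(INT16_MAX, value)))


# Targeted boundary values: just below, at, and just above each limit,
# plus extremes that "nice" test data rarely contains.
BOUNDARY_VALUES = [
    INT16_MIN - 1, INT16_MIN, INT16_MIN + 1,
    -1, 0, 1,
    INT16_MAX - 1, INT16_MAX, INT16_MAX + 1,
    1e308, -1e308, float("inf"), float("-inf"),
]


def check_properties(value: float) -> None:
    """Properties that must hold for every input, not just for chosen examples."""
    result = clamp_to_int16(value)
    assert INT16_MIN <= result <= INT16_MAX, f"out of range for {value!r}"
    if INT16_MIN <= value <= INT16_MAX:
        assert result == int(value), f"in-range value changed: {value!r}"


def run_checks(seed: int = 0, random_cases: int = 1000) -> int:
    """Check the boundary values plus reproducible random inputs."""
    rng = random.Random(seed)
    cases = list(BOUNDARY_VALUES)
    cases += [rng.uniform(-1e6, 1e6) for _ in range(random_cases)]
    for value in cases:
        check_properties(value)
    return len(cases)


print(f"{run_checks()} cases checked")
```

The fixed seed keeps failures reproducible; a real property-based testing library would additionally shrink a failing input to a minimal counterexample.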

The Ariane 5 incident shows that even in highly critical projects with a seemingly endless budget, an apparently trivial numerical problem can lead to a billion-dollar catastrophe. For everyday business IT, the lesson is that every number is a model, and models have limits. Knowing and safeguarding these limits not only prevents crashes but also saves hours of hunting for subtle rounding errors.

This series of articles presents nine typical error categories that occur again and again in practice, regardless of industry or technology. In each category, the series will present a specific example, analyze its causes, and deduce what software developers can learn from it in the long term.

In the next part, you can read: Concurrency and scheduling: when processes block each other.

(mack)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.