Learning from software errors – Part 4: Patriot missile fatally misses target

A rounding error adds up tragically over days of continuous operation. IT mishaps keep following the same patterns, but we can learn from them.

By Golo Roden

Time is one of the biggest illusions in software development. For many developers, it is just a numerical value – a data type such as int and long or a DateTime object. But in practice, time is complex, unreliable, and full of pitfalls. In any application that lives for more than a few seconds or interacts with the real world, time zones, daylight saving time changes, leap seconds, clock drift and latencies are an issue. If you ignore this, you risk making fatal mistakes.

A classic and tragic example is the Patriot missile incident of February 25, 1991, during the Gulf War. The Patriot system was designed to intercept incoming Scud missiles. In Dhahran, Saudi Arabia, however, the defense failed, and a Scud hit a US barracks: 28 soldiers died and many more were injured.

The investigation revealed an accumulating rounding inaccuracy in the system. The Patriot counted time internally in tenths of a second. To convert this count to seconds, the software multiplied it by 1/10, a value that cannot be represented exactly in binary and was therefore stored with finite precision. The tiny error per tick added up over time. After roughly 100 hours of continuous operation, the calculated position of the Scud missile had shifted so far that the interception logic failed.
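
The arithmetic behind this drift can be reproduced in a few lines. The following is a minimal sketch, not the Patriot's actual code. The register layout (1/10 chopped to 23 binary digits after the point), the 100 hours of uptime and the Scud speed of about 1,676 m/s are assumptions taken from commonly cited analyses of the incident, not from this article, and the function names are made up for illustration.

```python
from fractions import Fraction

# Assumptions taken from commonly cited analyses, not from this article:
FRACTION_BITS = 23          # binary digits kept after the point in the 24-bit register
TICKS_PER_SECOND = 10       # the clock counts tenths of a second
SCUD_SPEED_M_PER_S = 1676   # approximate Scud velocity


def stored_tenth(bits: int = FRACTION_BITS) -> Fraction:
    """Return 1/10 chopped (not rounded) to the given number of binary digits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def clock_drift_seconds(hours_up: int, bits: int = FRACTION_BITS) -> float:
    """Seconds by which the converted uptime lags behind the true uptime."""
    ticks = hours_up * 3600 * TICKS_PER_SECOND
    true_seconds = Fraction(ticks, TICKS_PER_SECOND)
    converted_seconds = ticks * stored_tenth(bits)
    return float(true_seconds - converted_seconds)


for hours in (1, 8, 100):
    drift = clock_drift_seconds(hours)
    print(f"{hours:>3} h uptime: clock off by {drift:.4f} s, "
          f"about {drift * SCUD_SPEED_M_PER_S:.0f} m of Scud travel")
```

Under these assumptions, each tick loses slightly less than a ten-millionth of a second. After 100 hours that adds up to about a third of a second, during which a Scud covers more than half a kilometer.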

The lesson: time calculations are error-prone and can drift, even if the hardware works perfectly. This phenomenon is not limited to military technology. Similar effects occur in financial systems, distributed databases or IoT devices. Even a routine event such as the daylight saving time changeover can cause scheduled tasks to run twice or not at all.

Errors that only occur after a long period of operation are particularly insidious. A system that runs stably for a few hours in the laboratory can suddenly perform incorrect calculations after days or weeks in production. Standard test strategies fall short here because hardly any team runs complete long-term simulations.

Typical sources of error relating to time and geography are:

  • System time instead of monotonic time: Many developers simply subtract two readings of the current system time to compute a time difference. If the clock shifts (e.g. through an NTP sync or a manual correction), the calculated difference jumps with it.
  • Missing time zone logic: One server runs in UTC, another in local time, and the database has no time zone field. This can lead to incorrect calculations for billing or deadlines.
  • Daylight saving time and leap seconds: A nightly batch job scheduled for 2 a.m. can suddenly run twice or not at all on the night the clocks change.
  • Global distributions: Systems that replicate around the globe need to account for latency and different local times.
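
The first item in the list above can be made concrete with a small simulation. This is a sketch under stated assumptions: `SimulatedHost`, `measure` and `job_with_clock_correction` are hypothetical names invented for this example, and the 30-second correction is an arbitrary value. Only `time.monotonic` is a real standard-library call.

```python
import time
from typing import Callable


def measure(clock: Callable[[], float], work: Callable[[], None]) -> float:
    """Return how long work() took according to the given clock."""
    start = clock()
    work()
    return clock() - start


class SimulatedHost:
    """Hypothetical test double with one wall clock and one monotonic clock."""

    def __init__(self) -> None:
        self.wall = 1_000_000.0   # seconds since some epoch, can be adjusted
        self.monotonic = 500.0    # arbitrary origin, only ever moves forward

    def run(self, seconds: float) -> None:
        """Real time passes: both clocks advance."""
        self.wall += seconds
        self.monotonic += seconds

    def step_wall_clock(self, seconds: float) -> None:
        """An NTP step or a manual correction moves only the wall clock."""
        self.wall += seconds


def job_with_clock_correction(host: SimulatedHost) -> None:
    host.run(2.0)                # two seconds of real work
    host.step_wall_clock(-30.0)  # the clock is set back 30 seconds mid-job
    host.run(1.0)                # one more second of real work


host = SimulatedHost()
wall_elapsed = measure(lambda: host.wall, lambda: job_with_clock_correction(host))

host = SimulatedHost()
mono_elapsed = measure(lambda: host.monotonic, lambda: job_with_clock_correction(host))

print(f"wall clock says {wall_elapsed} s, monotonic clock says {mono_elapsed} s")
```

In this simulation, the wall clock reports a negative duration for three seconds of work, while the monotonic reading is unaffected. In production code, the same `measure` function would simply be called with `time.monotonic` as its clock.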

What can be concluded from these findings to get this class of error under control?

  1. Treat time as a separate domain: Consistently distinguish between the "wall time" visible to users and the "monotonic time" used for calculations.
  2. Use monotonic clocks: Never use the system clock ("wall clock") for runtime measurements or timeouts. Modern languages and frameworks offer monotonic time sources that are independent of time jumps.
  3. Explicit time zones: Always store timestamps in UTC and apply time zones only for display or when parsing user input.
  4. Long-term tests and simulations: Test environments should be able to simulate days or weeks, including clock jumps. This uncovers accumulating errors and problems due to clock changes.
  5. Protection against accumulating rounding errors: Calculate time differences with high precision, keep internal ticks as integers and perform conversions as late as possible.
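
Points 4 and 5 can be combined in one sketch: a counter that keeps whole ticks as an integer and converts to seconds only when asked, plus a helper that fast-forwards through simulated operating hours. The class and function names are hypothetical. The tick length of a tenth of a second follows the Patriot example, and the float accumulator is included only as an anti-pattern for comparison.

```python
TICKS_PER_SECOND = 10  # tick length from the Patriot example: a tenth of a second


class TickCounter:
    """Hypothetical uptime counter that stores whole ticks as an int."""

    def __init__(self) -> None:
        self.ticks = 0

    def advance(self, ticks: int = 1) -> None:
        self.ticks += ticks  # integer addition, no rounding at all

    def seconds(self) -> float:
        # One single conversion, performed as late as possible.
        return self.ticks / TICKS_PER_SECOND


class FloatAccumulator:
    """Anti-pattern for comparison: adds an inexact 0.1 on every tick."""

    def __init__(self) -> None:
        self.elapsed = 0.0

    def advance(self, ticks: int = 1) -> None:
        for _ in range(ticks):
            self.elapsed += 0.1  # each addition may round a little

    def seconds(self) -> float:
        return self.elapsed


def simulate(counter, hours: int) -> float:
    """Fast-forward a counter through simulated operating hours."""
    counter.advance(hours * 3600 * TICKS_PER_SECOND)
    return counter.seconds()


exact = simulate(TickCounter(), 100)
drifting = simulate(FloatAccumulator(), 100)
print(f"integer ticks: {exact!r} s, accumulated floats: {drifting!r} s")
```

With 64-bit floats, the accumulated error is far smaller than in the Patriot's register, but the principle is the same: the integer counter stays exact no matter how long the simulation runs. A test like this covers 100 hours of operation in well under a second of real time.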

The Patriot incident was certainly an extreme example with a tragic outcome. But the underlying insight applies to any system that calculates with time or has to work reliably over long periods: Time is not a trivial number. Anyone who treats it like an ordinary variable is inviting errors. At this point, we would also like to refer you to the fascinating and instructive video "The Problem with Time & Timezones" by Tom Scott.

This series of articles presents nine typical error classes that occur again and again in practice – regardless of industry or technology. In each category, the series will present a specific example, analyse its causes and deduce what software developers can learn from it in the long term.

In the next part you will read: Deployment, configuration and flags: When a switch costs millions

(emw)

This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.