Thursday, 3 October 2013

Blindspots in code programming

It is hardly news that we all get stuck on silly typos or "mistakes" while debugging code: big-endianness? An N-1 ring buffer?! A wrong datatype conversion/interaction?! Hmmm...
Since "developing some blindspots" is something that affects us all from time to time when writing code, I decided to share some nice examples that are usually cited as the main ones (though, to be honest, there is even a journalistic column entitled "IT Hiccups of the Week", which speaks for itself :).

1- The Ariane 5 rocket, which got confused and got lost

On 4 June 1996 the maiden flight of the Ariane 5 launcher ended in failure, about 40 seconds after initiation of the flight sequence. At an altitude of about 3700 m, the launcher veered off its flight path, broke up and exploded. The failure was caused by "complete loss of guidance and attitude information" 30 seconds after lift-off.
A program segment for converting a floating-point number to a signed 16-bit integer was executed with an input value outside the range representable by a signed 16-bit integer. This run-time error (out of range, overflow), which arose in both the main and backup computers at about the same time, was detected and both computers shut themselves down. This resulted in the total loss of attitude control. The Ariane 5 turned uncontrollably and aerodynamic forces broke it apart. This was detected by an on-board monitor, which ignited the explosive charges to destroy the vehicle in the air.

* Funnily enough, the result of this conversion was no longer needed after lift-off.
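
To make the failure mode concrete, here is a minimal Python sketch of an unprotected float to signed 16-bit conversion next to a protected one. It is not the flight code: the function names and the sample values are made up for illustration, and saturating is only one possible way of protecting the conversion.

```python
# Range of a signed 16-bit integer.
INT16_MIN, INT16_MAX = -(2**15), 2**15 - 1


def to_int16_unprotected(value: float) -> int:
    """Convert to a signed 16-bit integer; raise if the value does not fit.

    Hypothetical helper: mimics a conversion whose out-of-range case
    surfaces as a run-time error that the caller has to handle.
    """
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a signed 16-bit integer")
    return result


def to_int16_saturating(value: float) -> int:
    """Protected variant (an assumption, one option among several):
    clamp to the nearest representable value instead of failing."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


# Illustrative inputs only, not real flight data.
for sample in (1234.5, 40000.0):
    try:
        print(sample, "->", to_int16_unprotected(sample))
    except OverflowError as err:
        print(sample, "-> run-time error:", err)
    print(sample, "-> saturating:", to_int16_saturating(sample))
```

The point is less the clamping itself than the fact that an out-of-range input needs an explicit decision somewhere; if nobody catches the error, the whole computation stops.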

reference: http://en.wikipedia.org/wiki/Ariane_5
http://www.rvs.uni-bielefeld.de/publications/Incidents/DOCS/Research/Rvs/Misc/Additional/Reports/ariane.html



2- The Patriot missile that was late...

On February 25, 1991, during the Gulf War, an American Patriot missile battery in Dhahran, Saudi Arabia, failed to track and intercept an incoming Iraqi Scud missile. The Scud struck an American Army barracks, killing 28 soldiers and injuring around 100 other people. A report of the General Accounting Office, GAO/IMTEC-92-26, entitled "Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia", reported on the cause of the failure.

The bug occurs in the calculation of the next location of the incoming target by the range gate. The prediction is calculated based on the target’s velocity and the time of the last radar detection.
Velocity is stored as a whole number and a decimal, and time is a continuous integer or whole number (i.e. the longer the system has been running, the larger the value) measured in tenths of a second.
The algorithm used to predict the next air space to scan by the radar requires that both velocity and time be expressed as real numbers. However, the Patriot's computer only has 24-bit fixed-point registers. Because time was measured as the number of tenth-seconds, the value 1/10, which has a non-terminating binary expansion, was chopped at 24 bits after the radix point. The error in precision grows as the time value increases, and the inaccuracy resulting from this is directly proportional to the target's velocity.
When the Patriot system was first designed, the primary targets were Soviet aircraft and cruise missiles travelling at speeds around Mach 2, and the system was only expected to operate for a few hours at a time. However, in Operation Desert Storm, the batteries were deployed as static defences (operating continuously), tracking and intercepting Scud missiles travelling at speeds of approximately Mach 5.

"It turns out that the cause was an inaccurate calculation of the time since boot due to computer arithmetic errors.
Specifically, the time in tenths of a second as measured by the system's internal clock was multiplied by 1/10 to produce the time in seconds. This calculation was performed using a 24-bit fixed-point register. In particular, the value 1/10, which has a non-terminating binary expansion, was chopped at 24 bits after the radix point.
The small chopping error, when multiplied by the large number giving the time in tenths of a second, led to a significant error. 
Indeed, the Patriot battery had been up around 100 hours, and an easy calculation shows that the resulting time error due to the magnified chopping error was about 0.34 seconds. (The number 1/10 equals 1/2^4+1/2^5+1/2^8+1/2^9+1/2^12+1/2^13+.... In other words, the binary expansion of 1/10 is 0.0001100110011001100110011001100.... Now the 24-bit register in the Patriot stored instead 0.00011001100110011001100, introducing an error of 0.0000000000000000000000011001100... binary, or about 0.000000095 decimal. Multiplying by the number of tenths of a second in 100 hours gives 0.000000095×100×60×60×10=0.34.)
A Scud travels at about 1,676 meters per second, and so travels more than half a kilometer in this time. This was far enough that the incoming Scud was outside the "range gate" that the Patriot tracked.
Ironically, the fact that the bad time calculation had been improved in some parts of the code, but not all, contributed to the problem, since it meant that the inaccuracies did not cancel."
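
The arithmetic in the quote is easy to replay. Below is a small Python sketch that chops 1/10 to a fixed number of fractional bits and shows how the clock drift grows with uptime. The helper names and the list of uptimes are mine; the sketch keeps 23 fractional bits because that is how many digits the stored bit pattern quoted above has after the radix point, and the Scud speed is the 1,676 m/s figure from the quote.

```python
from fractions import Fraction

# Assumption: 23 fractional bits, matching the quoted stored pattern
# 0.00011001100110011001100 (23 digits after the radix point).
FRACTION_BITS = 23
SCUD_SPEED_M_PER_S = 1676  # figure taken from the quote above


def chop(value: Fraction, bits: int) -> Fraction:
    """Truncate a non-negative value to `bits` binary fractional digits."""
    scale = 1 << bits
    return Fraction(value.numerator * scale // value.denominator, scale)


TENTH = Fraction(1, 10)
STORED_TENTH = chop(TENTH, FRACTION_BITS)
CHOP_ERROR = TENTH - STORED_TENTH  # about 0.000000095


def clock_drift_seconds(uptime_hours: int) -> float:
    """Accumulated time error after the given uptime."""
    ticks = uptime_hours * 60 * 60 * 10  # clock counts tenths of a second
    return float(CHOP_ERROR * ticks)


print(f"chop error per tick: {float(CHOP_ERROR):.3e} s")
for hours in (1, 8, 20, 48, 72, 100):
    drift = clock_drift_seconds(hours)
    shift = drift * SCUD_SPEED_M_PER_S
    print(f"{hours:>3} h uptime: drift {drift:.4f} s, Scud moves {shift:6.1f} m")
```

For 100 hours this gives a drift of about 0.34 s, which at the quoted Scud speed is roughly 575 m, in line with the "more than half a kilometer" above. After one hour the drift is only a few milliseconds, which is why the problem stayed hidden in short runs.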

reference: "Failure at Dhahran", http://en.wikipedia.org/wiki/MIM-104_Patriot

3- Therac-25: at a costly price...

This one is hard to write about, as it was the most costly of all, especially because the victims believed this very device would help improve their health, which makes the end result all the worse!





Feel free to contact me with any suggestions, doubts or requests.

Bless