Safety-critical software is generally found in so-called embedded systems; that is, computer systems that control vehicles or machinery. “Catastrophic consequences” may then include loss of life and limb if the system malfunctions. Not all computer-controlled machinery is considered “critical”; certain industries solve the safety issue by requiring that no humans come close to computerized machinery while it is in operation (in a car factory, for instance, humans cannot walk among the robotic arms during production). Examples of computer-controlled machinery considered “critical” include the fly-by-wire controls of airplanes, automatic driving in trains and subway systems, the controllers of radiation therapy equipment, and surgical robots.

There have already been deaths caused by software bugs. I have listed a few of them in the introduction of my habilitation thesis: the best known case is probably the faulty software of the Therac-25 radiation therapy machine, which massively overdosed 6 people. A less well known case involving radiation therapy software occurred in Panama, where at least 5 people, and perhaps as many as 21, died from overdoses. In 1991, during the first Gulf War, a Patriot battery in Dhahran, Saudi Arabia failed to intercept an incoming Iraqi Scud missile, which struck an American Army barracks, killing 28 soldiers and injuring others; the incorrect behavior was traced to the accumulation of small rounding errors in time computations, which had passed unnoticed before because the Patriot batteries were not intended to be left running for such lengthy periods of time.
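
To give an idea of how such tiny errors add up, here is a back-of-the-envelope sketch in C based on the figures commonly cited in the public analyses of the incident (the constant 1/10 chopped to 23 fractional bits, a 10 Hz clock, about 100 hours of uptime, a Scud closing speed of roughly 1676 m/s). It is illustrative only, not the actual flight code.

    /* Illustrative sketch of the Patriot clock drift; not the actual flight
     * code. Assumes the commonly cited figures: 1/10 chopped (truncated) to
     * 23 bits after the radix point, a 10 Hz clock, and about 100 hours of
     * continuous operation. */
    #include <stdio.h>

    int main(void) {
        double exact = 0.1;
        /* 1/10 truncated to 23 fractional bits, as in the cited analyses */
        double chopped = (double)(long)(exact * (1L << 23)) / (double)(1L << 23);
        double err_per_tick = exact - chopped;        /* about 9.5e-8 s */

        long ticks = 100L * 3600L * 10L;              /* 100 hours at 10 ticks/s */
        double drift = err_per_tick * (double)ticks;  /* about 0.34 s */

        printf("chopping error per tick : %.3g s\n", err_per_tick);
        printf("clock drift after 100 h : %.2f s\n", drift);
        printf("range error at ~1676 m/s: %.0f m\n", drift * 1676.0);
        return 0;
    }

A third of a second does not sound like much, but at the speed of an incoming Scud it translates into a tracking error of several hundred meters, enough to miss the target.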

There have assuredly been many more bugs that caused, if not outright deaths or injuries, at least close calls. It is difficult to comment on such issues, because neither the manufacturers of the faulty products nor the authorities are necessarily willing to disclose technical facts about them. This is understandable as a way of avoiding bad publicity and leaks of trade secrets, but it is very bad for learning from experience as well as for academic study. All too often, one hears of such issues only during un-citable informal discussions or through allusions in the media (and we know the general media is unreliable on technical matters).

I should say that civil aviation is not the industry that worries me most. The industry is heavily regulated, and, from what I have seen, the people who work on safety-critical airplane software tend to be careful. A jetliner crash is a spectacular event, sure to appear on television worldwide and to raise fears in some passengers; but jetliner crashes are few and far between. They kill hundreds at a time, yet they pale in comparison to car accidents in terms of total deaths. Speaking of automobiles, what about the software in cars?

Modern cars have computerized fuel injection, computerized steering assistance, and so on; I'm unsure whether brake-by-wire and steering-by-wire are already deployed. If your steering assistance stiffens while you're taking a sharp turn, or softens while you're overtaking at freeway speed, you're in trouble; so are you if your fuel injection fails on the freeway. I have never worked on car systems, but from what I've heard in discussions with professionals from that area and academics working with them, I would trust them considerably less than airplane systems, for several reasons:

  • Airplanes have “black boxes” that may record evidence of a computer failure. Cars don't, and manufacturers resist the option of having them (on the pretext that drivers would not want to carry records that could be used against them as proof of speeding or other improper driving).

  • If an airplane crashes, there is a lengthy investigation; if a car crashes, in general, the blame is laid upon some driver. In practice, the burden of proof is on the driver to prove that the car malfunctioned.

  • Car manufacturers tend to limit costs in electronics, cutting margins of safety. There is less incentive to do so on airplane systems: given the list price of a jetliner, on the order of $100,000,000, one PowerPC processor more or less is not significant.

  • There is less redundancy in cars than in airplanes. The only redundant system that I know of in cars is braking (two separate circuits); for the rest, if something breaks, you are told to drive to a repair shop, or even to stop and call for assistance.

The other industry that worries me is medical devices: radiation therapy machines, obviously, but also surgical robots (I hear they're much in vogue in the US for prostate surgery, with unclear benefits), and more mundane devices such as infusion pumps. Why worry?

  • According to the FDA, there have been a number of cases of faulty infusion pumps in the United States, some due to faulty software. As expected, the blame, in many cases, was initially laid on doctors and nurses.

  • Some of the software bugs in question are basic problems that no software package or electronic device should exhibit, including the infamous “buffer overrun” bug, or failing to de-bounce keys (a key is said to “bounce” if pressing it once may result in the system seeing multiple presses; de-bouncing techniques have been known for ages, as the sketch after this list illustrates). Furthermore, it seems that some of these systems use error-prone programming techniques, such as multithreading, with unclear benefit (*).

  • Even though drugs and airplanes need advance approval from the authorities before being brought to market, medical devices and software do not, at least in the United States. A woman needs a prescription to get on the contraceptive pill, but apparently, one needs no authorization to market medical devices that can kill patients if they malfunction!

  • There do not even seem to be standards for programming such safety-critical systems. In civil aviation, there is DO-178B; in other fields, there are similar criteria; but curiously, there seems to be no equivalent for medical devices!
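
Since de-bouncing comes up above, here is a minimal sketch, in C, of the classic counter-based approach, with a small simulated test in place of real hardware; the polling interval and the number of stable samples are assumptions for illustration, not values from any particular device.

    #include <stdbool.h>
    #include <stdio.h>

    #define DEBOUNCE_SAMPLES 5   /* reading must be stable for 5 consecutive polls */

    /* Returns the debounced key state, given the raw (possibly bouncing) reading;
     * meant to be called at a fixed polling interval, e.g. every few milliseconds. */
    static bool debounce(bool raw)
    {
        static bool stable_state = false;
        static int counter = 0;

        if (raw == stable_state) {
            counter = 0;                 /* no change pending */
        } else if (++counter >= DEBOUNCE_SAMPLES) {
            stable_state = raw;          /* new reading persisted long enough: accept it */
            counter = 0;
        }
        return stable_state;
    }

    int main(void)
    {
        /* Simulated raw samples: a press that bounces, then settles. */
        bool samples[] = { 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1 };
        int n = sizeof samples / sizeof samples[0];

        for (int i = 0; i < n; i++)
            printf("raw=%d debounced=%d\n", samples[i], debounce(samples[i]));
        return 0;
    }

The debounced output only changes once the raw reading has stayed the same for several consecutive polls, so the spurious transitions of a bouncing contact are filtered out. This is decades-old textbook material, which is precisely the point.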

I was told about the infusion pump issue during informal talks at an NSF-sponsored workshop at Microsoft Research in Redmond last October (proof that informal talks are often more valuable than presentations), and have since done a little Web research. A summary of FDA actions is found here. More interesting is this transcript from a 2009 meeting where an FDA official essentially threatens manufacturers, especially those with a history of incidents and recalls: the agency may request all their software, run static analysis tools on it, search for bugs, and report them to the appropriate authorities (more information about this project here).

The transcript, especially Richard Chapman's remarks (p. 154 and following), is sobering (**). Some manufacturers exhibit clear amateurism about software, writing unmaintainable, undocumented code or even being unable to supply complete source code to the authorities. Such lax programming practices are inadmissible for safety-critical devices. The resemblance to the Therac-25 accidents is striking: in both cases, programming practice seems totally at odds with what is taught in computer engineering courses, perhaps because the software is not written by software engineers, but by employees with other backgrounds who taught themselves programming. Therac-25 was 25 years ago; don't we ever learn?

I'm unsure how many people have died or at least suffered adverse effects from such problems. The good news is that the FDA is now applying static analysis techniques, including software from Coverity and PolySpace. I don't know whether they have tried Astrée; maybe they should.
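
For readers who have never met one, here is an illustrative example, in C, of the kind of “buffer overrun” defect such static analysis tools are designed to flag; it is a made-up fragment, not code from any actual device.

    #include <stdio.h>

    #define N_READINGS 8

    int main(void) {
        int readings[N_READINGS];

        /* Off-by-one bug: "<=" makes the last iteration write readings[8],
         * one element past the end of the array. A static analyzer reports
         * this as an out-of-bounds write; on a running device it silently
         * corrupts whatever happens to sit next to the array in memory. */
        for (int i = 0; i <= N_READINGS; i++) {
            readings[i] = 0;
        }

        printf("initialized %d readings\n", N_READINGS);
        return 0;
    }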

(*) Coincidentally, I've just seen Brion Vibber posting the following definition for multithreading:

threading (n): a programming model in which every app developer is forced to solve computer science's most difficult problems

(**) I'd like to see the corresponding slides.