Sunday, October 5, 2014

Dr. Koopman on Toyota's notion of safety from "realistic" ETCS faults

 Toyota bakes a yummy-looking safety cake. But it's fake. 


Dr. Koopman, in his Sept 18 lecture at Carnegie-Mellon:
NASA did not disclose a 'written safety argument' from Toyota.
A written safety argument, according to Dr. Koopman, is a statement that explains why Toyota thinks its system is safe.
Toyota's entire safety argument, expressed in Bookout and in the Exponent public report, is that there is no "realistic" ETCS fault that would not be caught by the failsafe system.

To elucidate why merely guarding against "realistic" faults actually equals "unsafe," 

here are some excerpts from Dr. Koopman's Blog, Better Embedded Systems, in a post from Monday, March 31, 2014 


If you design a system that doesn't eliminate all single points of failure, eventually your safety-critical system will kill someone.  How often is that?  Depends on your exposure, but in real-world systems it could easily be on a daily basis.

"Realistic" Events Happen Every Million Hours

Understanding software safety requires reasoning about extremely low probability events, which can often lead to results that are not entirely intuitive. Let's attempt to put into perspective where to set the bar for what is “realistic” and why single point fault vulnerabilities are inherently dangerous for systems that can kill people if they fail.

If a human being lives to be 90 years old, that is 90 years * 365.25 days/yr * 24 hrs/day = 788,940 hours.  While the notion of what a "realistic" fault is can be subjective, perhaps an individual will think something is a realistic failure type if (s)he can expect to see it happen in one human lifetime. 

This definition has some problems since everyone’s experience varies. For example, some would say that total loss of dual redundant hydraulic service brakes in a car is unrealistic. But that has actually happened to me due to a common-mode cracking mechanical failure in the brake fluid reservoir of my vehicle, and I had to stop my vehicle using my parking brake. So I’d have to call loss of both service brake hydraulic systems a realistic fault from my point of view. But I have met automotive engineers who have told me it is “impossible” for this failure to happen. The bottom line: intuition as to what is "realistic" isn’t enough for these types of matters -- you have to crunch the numbers and take into account that if you haven't seen it yourself that doesn't mean it can't happen.

So perhaps it is also realistic if a friend has told you a story about such a fault. From a probability point of view let’s just say it's "realistic" if it is likely to happen every 1,000,000 hours (directly to you in your life, plus 25% extra to account for second-hand stories).  

Probability math in the safety critical system area is usually only concerned with rounded powers of ten, and 1 million is just a rounding up of a human lifespan. Put another way, if it happens once per million hours, let’s say it’s “realistic.”

"Realistic" Catastrophic Failures Will Happen To A Deployed Fleet Daily

[paraphrasing and editing] Obermaisser gives permanent hardware failure rates as about 1 hardware failure per 10 million operating hours of drive-by-wire automobiles. But transient failures, such as those caused by cosmic rays, voltage surges, and so on, are much more common, reaching rates of 100 per million hours). That means that any particular component will fail in a way that can be fixed by rebooting or the like about 10 to 100 times per million hours – well within the realm of realistic by individual human standards. 

Obermaisser also gives a frequency for arbitrary faults as 1/50th of all faults (id., p. 8). Thus, any particular component, can be expected to fail 10 to 100 times per million hours, and to do so in an “arbitrary” way (which is software safety lingo for “dangerous”) about 0.2 to 2 times per million hours.  This is right around the range of “realistic,” straddling the once per million hours frequency.

 In other words, if you continuously own vehicles for your entire life, and they each have 100 computer chips in them, then you can expect to see about one potentially dangerous failure per year since there are 100 of them to fail -- and whether any such failure actually kills you depends upon how well the car's fault tolerance architecture successfully deals with that failure and how lucky you are that day. That is why it is so important to have redundancy in safety-critical automotive systems. These are general examples to indicate the scale of the issues at hand, and not specifically intended to correspond to exact numbers in any particular vehicle, especially one in which failsafes are likely to somewhat reduce the mishap rate by providing partial redundancy.

************************************************************
This excellent post continues here.