Sunday, August 3, 2014

Aircraft safety engineer Geoff Barrance grasps the nettle of Dr. Koopman's testimony

[Image: a stinging nettle]
Once I had a side-job harvesting nettles. They sting like a million tiny bee stings, so I wore long gloves. But nettle tea can cure illness, so I picked them diligently for the health and well-being of people unknown to me.

Geoff Barrance, summarizing Dr. Koopman, has grasped the documented and reported unsafe hardware architecture of the 2004 Toyota Camry, and, unlike others, has faced its sting for the sake of public health.

Here is his report, freshly produced yesterday. Tom, Lisa, and Kevin, I dare you to read this. With your bare hands.

Geoff Barrance Extracts Some Nuggets from

Dr. Koopman’s Testimony at the Bookout vs. Toyota Trial


In an earlier email to Betsy that she quoted, I referred to Dr. Philip Koopman’s testimony in the Bookout vs. Toyota trial.  I said that Dr. Koopman’s testimony was as important, and as damning of Toyota’s throttle control system design, as that of Mr. Barr.  But Mr. Barr’s seems to have garnered more attention (and got Toyota legal people all steamed up about what seems to me to be a trivial redaction in the slides that Mr. Barr used in his testimony).  Now I am not downplaying how damning Barr’s testimony was of the quality of Toyota’s ETCS software – indeed I find their lack of adherence to state-of-the-practice quality for safety critical system software to be quite astounding.  But I think we should also be aware that, as I said before (in no uncertain terms!), the hardware on which that software runs also ignores the requirements for a safety critical system.  Even if they’d had perfect software it’d be no good, because the hardware architecture is unsafe.

So is the ETCS a safety critical system?  Sure is.  For example, the stuck-at-wide-open-throttle condition is the most obvious case, and is in fact the case in point in the Bookout trial.  So, as Dr. Koopman’s testimony stresses, the ETCS needs to be designed to the standards for safety critical systems, and it isn’t.

It is a bit of a slog to read through all of Dr. Koopman’s verbatim testimony, in its question and answer form, as he was led through his presentation to the court, and I don’t have access to the slides he was using.  But he’s a university professor, so he says the things that I have been thinking, and says them rather well – probably a lot more understandably than my engineering approach conveys.  Anyway, I will give some highly pertinent extracts.  Links to the full text are given below.  His testimony starts on page 14 of the AM transcript; all of the following is from the AM part.

It starts with Dr. K giving several reasons why he thinks the ETCS design is unsafe.[1]  I quote him with some condensation in places and some explanatory insertions by me, indicated by [ ]; two short, generic code sketches (illustrations only, not Toyota’s code) follow the list to make points 4 and 5 concrete:

1. Random hardware and software faults are a fact of life.  Random has a special meaning … it means even if you think it was designed perfectly something always goes wrong anyway.  The defective safety architecture has an obvious single point of failure.  A single point of failure is a critical concept in safety critical systems.  I will explain where one is and why that is important.  And reading the NASA report, they came to the same conclusion.

2. Toyota’s methods to ensure safety were themselves defective.  You have to exercise great care when you’re doing safety critical software [and hardware].  You can’t just wing it.  And Toyota exercised some care, but they did not reach the accepted practice in how you need to design safety critical systems.

[Image: Gambling - with lives]

3. Third opinion is that the Toyota safety culture is defective.  So safety culture is how the organization as a whole treats safety.  Do they take it seriously, do they have professionals in place to make sure that even if you’re having a bad day you will not make a mistake that day, that still things are going to work OK.  And I saw several signs of a defective safety culture.  And one example that I will talk about is that when they’re investigating an accident they don’t seem to take the possibility that the software can be defective very seriously.  They just say, No, you know, that can’t be defective.


[Image: Culture]
4. My next opinion is that Toyota’s source code is of poor quality.  … Even at a high level there are some tell-tale signs – you don’t need to look at the individual lines of code to know that there are some severe problems here.  One of them is 10,000 global variables.  If you talk to a safety person, and that number is above 100, they will right there say, you know, that’s it.  There is no way this can be safe.  … [The] academic standard is that there should be zero.

5. Toyota’s approach to concurrency and timing is defective.  That means that when you’re driving a car and the engine is spinning around and the spark is firing to ignite the fuel, it has to happen in a very precise time line. … And in a safety critical system you have to meet deadlines. … If you miss those deadlines the system is generally considered unsafe.
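First, the global-variable point.  The fragment below is a generic C illustration with invented names – it is not anything from the ETCS.  A bare global can be written from anywhere in the program, so proving it always holds a sane value means auditing every file that can see it; the encapsulated version funnels every write through one range-checked function that a reviewer can actually inspect.

```c
/* Illustrative only -- generic C, not Toyota's code.  A global variable
 * can be written from anywhere in the program, so proving it always
 * holds a sane value means auditing every file that can see it. */
int g_throttle_cmd;                       /* global: any module may write it */

void cruise_control_step(void) { g_throttle_cmd = 47; }   /* who else writes this? */
void diagnostics_reset(void)   { g_throttle_cmd = 0;  }   /* ...and when? */

/* The encapsulated alternative: one owner, one narrow interface, so the
 * range check and the review effort live in exactly one place. */
static int throttle_cmd;                  /* file scope: visible only to this module */

void throttle_set(int pct)
{
    if (pct < 0)   pct = 0;               /* single choke point for validation */
    if (pct > 100) pct = 100;
    throttle_cmd = pct;
}

int throttle_get(void)
{
    return throttle_cmd;
}
```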
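Second, the deadline point.  Again this is a minimal, generic sketch with hypothetical platform hooks (millis, enter_limp_home), not a description of Toyota’s scheduler: a periodic control task checks how long it has been since its previous run and drops to a safe mode when its deadline has been blown, rather than silently carrying on with stale timing assumptions.

```c
#include <stdint.h>

/* Minimal sketch of deadline monitoring for a periodic control task.
 * The hooks below (millis, enter_limp_home, control_step) are
 * hypothetical stand-ins, not anything from the Toyota ETCS. */
#define PERIOD_MS   10u    /* the task is meant to run every 10 ms */
#define DEADLINE_MS 12u    /* any gap longer than this is a missed deadline */

extern uint32_t millis(void);        /* hypothetical free-running millisecond counter */
extern void enter_limp_home(void);   /* hypothetical degraded-but-safe mode */
extern void control_step(void);      /* the actual throttle/engine calculation */

void control_task(void)
{
    static uint32_t last_start = 0u;
    uint32_t start = millis();

    /* If the gap since the previous run exceeds the deadline, the timing
     * assumptions behind the control law no longer hold; treat that as a
     * fault instead of silently computing with stale data. */
    if (last_start != 0u && (start - last_start) > DEADLINE_MS) {
        enter_limp_home();
        return;
    }
    last_start = start;

    control_step();
}
```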

A bit later in his testimony Dr. Koopman is asked about the defective safety architecture with an obvious single point of failure, which he mentioned in his first opinion.  He said:

A single point of failure is one place that, if it has a problem, the system is unsafe.  … this is probably the most important point of safety critical design.  If you have any single point of failure the system is by definition unsafe.  All the safety standards say you cannot have any single point of failure.

A single point of failure is some piece of hardware or software that has complete control over whether the system is safe or not.  And so if it fails due to a random hardware event or a software bug, if it fails, then the system is unsafe.  And it is kind of tricky because you don’t say, well, I can think of five ways for it to fail, and I protect against all those five; that is not good enough.  It doesn’t matter whether you’re smart enough to think about how it is going to fail.  When you have millions of vehicles on the road it will find a way to fail you didn’t think about.  So the rule is simply you cannot have a single point of failure.

He then points to the Analog to Digital (A/D) converter that takes in (among other data) the two supposedly independent (but that’s another story) values from the accelerator pedal.  Note that he is not talking about the software here; the A/D converter is a piece of hardware.  Two more generic sketches, showing the kind of cross-check he describes and its blind spot, follow the quoted passage.  Again I quote in a condensed form:

So there are two voltages that indicate accelerator pedal fully depressed.  This is not a fault mode right now, we’re just talking about normal operation.  In this case it goes into the A/D portion.  [It is] converted to digital bits that say, Hey, the gas pedal is all the way down.  [This information] is sent to both the sub CPU and the main CPU.  And it says the gas pedal is all the way down.  Okay, let’s get the throttle more open because the driver wants to speed up.



If one of these two wires goes bad then you’re okay because there are two of them.  And this [computer] will, if it’s working properly, notice they don’t match and invoke one of the failsafes.  If there is a failure here, for some of the failures it will detect that it has failed.  For some of the failures it will result in the voltages not matching.  But whether we’re smart enough to think about it or not, there is a single point of failure, in that there is always the possibility that something in here will cause the two voltages to be read as though the gas pedal is all the way down [when it isn’t] without noticing there is a problem.  I don’t know of a failsafe that will catch all possible, all single point faults in the A/D converter.  My concern with that is that it makes the system unsafe.  For example, there could be a fault where the A/D converter just decides to say, Do you know what, the gas pedal is all the way down, even though it’s not.

So the failsafes [designed by Toyota] are based on this … analysis that basically says we are never going to have a situation in which these signals come through in a way – in a way that is wrong but undetectable.  They’re assuming that you can always detect that something is wrong.  Making that assumption limits your fault model to only faults that are detectable, not any possible fault.  So that falls short of the requirements of the safety standards.  It could result in unintended acceleration if, for example, you have your foot on the accelerator and you release it and this [A/D converter] keeps shoving out stale data.  It just stops updating and keeps doing the old accelerator position that you used to have.

It could fail that way but it can also fail by spitting out an arbitrary number.  It is a single point of failure.  And when you look at these [arbitrary failures], you say, What is the worst thing it could do?  Well, the worst thing it could do is probably command wide open throttle.  And there is no independent check and balance to stop it doing that, and that makes it unsafe.  [Toyota’s failsafes cannot catch those failures] because it is basically trusting that it will be able to detect any difference, and that’s a restricted fault model.
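To make the architecture concrete, here is a minimal sketch of the kind of cross-check described above, in generic C with invented names, channel numbers and thresholds – it is emphatically not Toyota’s code.  The comparison between the two pedal channels catches a broken wire or a drifted sensor, but a fault inside the one shared A/D converter can freeze or corrupt both readings consistently, so they still “match”, no failsafe fires, and the bad value is taken as the driver’s demand.

```c
#include <stdint.h>

/* Sketch of the kind of cross-check described above: two pedal-position
 * signals read through the SAME A/D converter and compared.  All names,
 * channel numbers and thresholds are invented for illustration. */

extern uint16_t adc_read(uint8_t channel);   /* hypothetical: the one shared A/D converter */
extern void invoke_failsafe(void);           /* hypothetical: drop to a limp-home mode */

#define CH_PEDAL_1  0u
#define CH_PEDAL_2  1u
#define OFFSET      400u    /* expected fixed offset between the two sensor tracks */
#define TOLERANCE   50u     /* allowed disagreement before declaring a fault */

uint16_t read_pedal_counts(void)
{
    uint16_t v1 = adc_read(CH_PEDAL_1);
    uint16_t v2 = adc_read(CH_PEDAL_2);

    /* This catches a broken wire or a drifting sensor: the two readings
     * stop tracking each other and the failsafe fires. */
    uint16_t expected = (uint16_t)(v1 + OFFSET);
    uint16_t diff = (v2 > expected) ? (uint16_t)(v2 - expected)
                                    : (uint16_t)(expected - v2);
    if (diff > TOLERANCE) {
        invoke_failsafe();
        return 0u;
    }

    /* What it cannot catch: a fault inside the one shared converter.  If it
     * freezes and keeps returning the last (high) pair, or corrupts both
     * channels consistently, the readings still "match", no failsafe fires,
     * and the bad value is taken as the driver's demand.  The comparison
     * only covers faults that show up as a detectable difference -- the
     * restricted fault model Dr. Koopman objects to. */
    return v1;
}
```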
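For completeness, one generic technique that narrows (but does not eliminate) this blind spot is to have the converter periodically digitize a channel wired to a known reference voltage and check the answer.  This is purely illustrative; the testimony does not say whether Toyota did anything of the sort, and even where such a check is used, the standards’ answer to a true single point of failure is independent hardware, not more checks routed through the same device.

```c
#include <stdint.h>
#include <stdbool.h>

/* Generic A/D plausibility check: convert a channel wired to a fixed
 * reference voltage and verify the result.  Names and values are
 * invented for illustration; nothing here is attributed to Toyota. */

extern uint16_t adc_read(uint8_t channel);   /* hypothetical shared A/D converter */
extern void invoke_failsafe(void);           /* hypothetical limp-home entry */

#define CH_VREF       7u      /* channel tied to a known reference voltage */
#define VREF_COUNTS   2048u   /* expected conversion result for that reference */
#define VREF_TOL      32u     /* allowed error before declaring the converter bad */

bool adc_self_check(void)
{
    uint16_t v = adc_read(CH_VREF);
    uint16_t err = (v > VREF_COUNTS) ? (uint16_t)(v - VREF_COUNTS)
                                     : (uint16_t)(VREF_COUNTS - v);

    if (err > VREF_TOL) {
        /* The converter itself is suspect, so no reading that came through
         * it -- including both pedal channels -- can be trusted. */
        invoke_failsafe();
        return false;
    }
    return true;
}
```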

So there, in plain and unvarnished language, is why the ETCS in the 2004 Camry is unsafe.  Somehow the NASA report could not bring itself to say this so forthrightly (and one certainly wonders why), but it did say that it could not prove there was no case that would result in unintended acceleration, though of course the DOT’s boss at the time ignored that and said the opposite.  And the NAS report also failed to grasp the nettle and say what I had been telling them.  Politics, I guess.
Anyway, thank you Dr. Koopman for explaining it so well.
Links:




Can you grasp that?

[emphasis added by Betsy]


[1] Remember that the present tense he is using refers to Toyota in the early 2000s.  We may assume that much has improved since then.  But there are still a lot of those 2004 vehicles in use today, and nothing has been done to bring them up to an acceptable standard.