Veteran software developer David A. Kruger offered some thoughts on computer security recently at Expensivity and we appreciate the opportunity to republish them here. He starts with “Root Cause Analysis 101”
The classic line “I have a bad feeling about this” is repeated in every Star Wars movie. It’s become a meme for that uneasy feeling that as bad as things are now, they are about to get much worse. That’s an accurate portrayal of how many of us feel about cybersecurity. Our bad feeling has a sound empirical basis. Yearly cybersecurity losses and loss rates continually increase and never decrease despite annual US cybersecurity expenditures in the tens of billions of dollars and tens of millions of skilled cybersecurity man-hours. Cybersecurity’s record of continuously increasing failure should prompt thoughtful observers to ask questions like “Why are cybersecurity losses going up? Why isn’t cybersecurity technology reducing them? Are there things we don’t understand or are overlooking?”
That’s easy to answer: Of course, there are! After spending this much time, money, and brainpower on cybersecurity without managing to decrease losses, much less eliminating them, it’s painfully obvious something isn’t right.
This article explains what we get wrong about cybersecurity, how and why we get it wrong, and how to fix it. Fair warning: it’s a long and bumpy road. There a healthy dose of counterintuitive assertions, cybersecurity heresy, and toes stepped on, but at roads end you’ll know what the true cause of cybersecurity failure is and how to fix it.
Part One – Cybersecurity Technology
The Heart of the Matter
When confronted with a chronic problem, we human beings are prone to err by trying solutions without first asking the right questions. We tend to ask, “How do we stop this now?” and fail to ask, “What’s causing this?” Then we are shocked when our fixes don’t last. This tendency is so common that safety engineers developed a formal analytical method called a root cause analysis to prevent this error. Root cause analysis is designed to find unidentified causes of recurring failure. A root cause analysis starts with an effect, in this context, a failure, and works upstream all the way through the chain of causation until the root cause is found. In complex systems like computers, finding the root cause of failure is critically important because an unidentified root cause makes multiple downstream elements of the system much more prone to fail. You can tell when you’ve found the root cause, because if you fix it, the downstream recurring failures cease.
Identifying the root cause in complex systems can be hard because:
- A single root cause can spawn multiple instances and types of failure because a single root cause can spawn multiple chains of cause and effect. The chains can be long, having many intermediate cause and effect links between the root cause and the failure. The more links in the chain, the longer the “distance” between the root cause and the failure. Long chains branch and intersect with other chains, which makes it even more difficult to identify the root cause.
- Usually, the longer the distance is between an unidentified root cause and the failures it’s causing, the harder the root cause it is to identify. The shorter the distance between an intermediate cause and the failures, the easier the intermediate cause it is to identify. Intermediate causes are obvious, unidentified root causes are not—and that’s why root causes are so often overlooked.
Because of these difficulties, problem solvers can easily fall prey to the symptomatic solution fallacy, a mistaken belief that solving intermediate problems can permanently stop long distance failures. It’s called the “symptomatic” solution fallacy because it’s the engineering equivalent of a doctor believing that a treatment is curative when it only temporarily alleviates symptoms of an undiagnosed chronic disease. For example, a dose of pain medication can temporarily alleviate suffering, but it can’t cure the cancer that’s causing the pain.
To see how root cause analysis aids in finding and fixing unidentified root causes, we’ll review a common real world root cause analysis and then take the lessons learned and apply them to cybersecurity technology and then to cybersecurity policy.
Root Cause Analysis 101
The purpose of automaker safety recalls is to prevent recurrent failures attributable to a previously unidentified root cause. Recently, 700,000 Nissan Rogue SUVs were recalled because:
“In affected vehicles, if water and salt collect in the driver’s side foot well, it may wick up the dash side harness tape and enter the connector. If this occurs, the dash side harness connector may corrode and possibly cause issues such as driver’s power window or power seat inoperative, AWD warning light ON, battery discharge, and/or thermal damage to the connector. In rare cases, a fire could potentially occur, increasing the risk of injury.”
Lesson Learned 1. A root cause analysis, and ultimately the recall, was initiated by the automaker because it observed a pattern of multiple types of recurring failure that appear to be related, in this case multiple types of electrical failures.
Lesson Learned 2. From the perspective of the driver, if your power windows or seats stop working, or your care won’t start because the battery is dead or wiring in the dashboard of your 2014-2016 Nissan Rogue catches fire, it’s apparent that the problem is electrical. The root cause analysis revealed that the closest cause to these electrical failures was obvious, a corroded wiring harness connector.
Now, imagine the automaker had identified the wiring connector as the root cause and declared that replacing it was a permanent fix. It would soon be evident that the automaker had fallen prey to the symptomatic solution fallacy because replacing the connector would not be a permanent solution. The still unidentified and unfixed root cause would cause the replacement connector to corrode again, which, in turn, would cause one or more of the related failures to recur.
Key Point: After a fix has been applied, if related failures continue recurring, it’s evident that an intermediate cause was erroneously identified as the root cause.
Lesson Learned 3. Working the chain of causation backwards, the automaker deduced the cause of corrosion was exposure to moisture and a corrosive. What was the source? They deduced that the wiring harness tape wicked moisture and salt up to the connector, but where did the water and salt come from? They deduced the wiring harness was being wetted as it traversed the footwell.
The potential presence of water and salt in the footwell of an SUV is a known operating condition. A given vehicle may or may not encounter salt and water during its lifetime, but it is a known potential operating condition for all SUVs. The automaker neglected to take this known operating condition into account when selecting the routing and the physical characteristics of the tape used to wrap the wiring harness. Therefore, the root cause of failure is that the automaker neglected to compensate for a known operating condition in its design. Note that this finding is axiomatic; truly unforeseeable root causes are rare.
Key Point: In complex systems, it is axiomatic that recurring failures attributable to a previously unidentified root cause nearly always results from neglecting to compensate for known operating conditions in the design.
Lesson Learned 4. Now that the root cause has been identified, the automaker will conduct a requirements analysis to clarify operating conditions, needs, and goals of the fix, and then redesign to compensate for overlooked operating condition and minimize their and their customers’ risk and expense.
Lesson Learned 5. Since the automaker neglected to compensate for a known operating condition—potential exposure of an SUV to water and salt—in their design, the automaker is responsible legally, financially, and morally, for fixing the affected vehicles and making certain that the overlooked operating condition is compensated for in the design of all future models.
Summary of Root Cause Analysis Lessons Learned:
- Lesson Leaned 1: A pattern of multiple types of recurring related failures indicates the presence of an unidentified root cause.
- Lesson Learned 2: If repeated fixes fail to stop recurring failures, it indicates fixes are being applied to intermediate causes (symptoms), rather than to the root cause.
- Lesson Learned 3: It is axiomatic that neglecting to compensate for a known operating condition in the design is nearly always the root cause.
- Lesson Learned 4: To fix the root cause, a redesign compensating for the overlooked operating condition is required.
- Lesson Learned 5: The designers neglected to compensate for a known operating condition, therefore, they are responsible for fixing existing and new designs.
Next: What’s Wrong with Cybersecurity Technology?