Jul 30 2007
On the Importance of Failing Well
Well-engineered devices tend to share similar characteristics, from durability to repairability, among other attributes. Of particular interest in software is another characteristic of a well-designed system: failing well.
It’s very important for devices to fail well. One great example of a well-designed failure state is the modern intersection traffic light (at least in the United States): when a traffic light loses power, the light doesn’t turn off. Instead, a battery kicks in and the light blinks red.
The traffic light has lost access to its power source and to its traffic detectors, but even in failure the lights can keep an intersection operational and safe.
Software, too, needs to fail well. Let’s say you built a software product with a module that sends out email. Maybe that email module runs as a service, in the background, invisible to users.
When that email service fails, what happens? It might be hours or even days before a user notices that email isn’t working in your product. You may have taken the time to answer your customers’ inquiries, but your replies are stuck in a queue.
The solution to this problem? Design the email module to fail well: have it send a notification when it fails. Or, conversely, have the email module send a regular status notification while it’s operational. If you don’t get a status notification, you know something is wrong.
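Both approaches can be sketched in a few lines of Python. This is only an illustration under assumptions, not a real mail library: `send_with_alert`, `HeartbeatMonitor`, and the `send` and `alert` callables are hypothetical names made up for this example. The first function raises an alert the moment a send fails; the monitor covers the opposite case, treating prolonged silence as a failure.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional


def send_with_alert(send: Callable[[], None], alert: Callable[[str], None]) -> bool:
    """Try to send; on any failure, notify someone instead of failing silently.

    `send` and `alert` are hypothetical callables supplied by the caller.
    """
    try:
        send()
    except Exception as exc:
        alert(f"email send failed: {exc!r}")
        return False
    return True


class HeartbeatMonitor:
    """Treats a missing status notification as a sign that something is wrong."""

    def __init__(self, max_silence: timedelta) -> None:
        self.max_silence = max_silence
        self.last_seen: Optional[datetime] = None

    def record_heartbeat(self, when: datetime) -> None:
        # Called each time the email module reports that it is operational.
        self.last_seen = when

    def is_healthy(self, now: datetime) -> bool:
        # No heartbeat ever received counts as unhealthy.
        if self.last_seen is None:
            return False
        return now - self.last_seen <= self.max_silence
```

The alert should travel over a channel that does not depend on the email module itself (a pager, a log watcher, a dashboard), otherwise the notification gets stuck in the same queue as everything else. The heartbeat check has the advantage of also catching the case where the service has died so completely that it can’t report its own failure.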
The point here is simple: when you design software, spend some time thinking about how it can fail. Imagine what kinds of outliers can cause it to fail, and imagine the implications of that failure for the workflow of the product. Then implement a routine to ensure that when your application fails, it fails well.
We will call this missing features axiom 3: make your software fail well.
[...] for and eliminate dead end phone paths. Phone menu systems, like software, need to fail well. In this case, that means routing a person back to a customer service rep (at the top of the queue) [...]