Like other high-tech
companies, a major networking company (we'll call it "BigNet") has found that
keeping up with rapid growth presents an interesting and significant challenge. In
particular, growth stressed the Customer Service Organization's escalation process to the point where it
could no longer effectively meet either customer needs or BigNet's own internal needs.
Fundamentally, the process was not designed to handle the growing numbers of customers or
the increasing diversity in customer and BigNet requirements.
The result was increasing dissatisfaction and frustration
on the part of everyone who participated in identifying and resolving escalated problems.
Increasing numbers of problems were treated as exceptions, which is expensive and
time-consuming for all concerned. Relief was needed urgently, but where to begin? This
article describes the true story of how BigNet approached this business process problem.
Peter, the Vice President of Customer Service,
decided to take a different approach than usual for such projects. Rather than chartering
a team to draw "as is" and "should be" maps of the escalation process,
Peter decided to choose a specific incident that was a representative demonstration of
what happened to the overloaded process. With this approach, if the escalation process
wasn't really causing the problem or was only a partial contributor, we would find
out. He would analyze that incident using a method called TØ Root Cause Analysis, a tool
specifically designed to learn the truth about complicated incidents. The method focuses
actions on preventing recurrences and improving recovery when problems occur. This
approach promised, and delivered, a quick and accurate way to understand exactly
what was really happening and to identify the most immediate opportunities for improvement.
Peter and his team chose the most recent
automatic teller machine (ATM) outage in a particular customer's network to be the
TØ event. This particular customer had experienced a series of outages, and was so angry
and frustrated by the constant problems that BigNet was worried about losing a major
customer. Such outages are particularly complicated to resolve because so many physical
components, software products and different companies are involved.
In this case, BigNet provided the network
management hardware and software. Other participants included the bank that owned and
operated the ATM, three phone companies, contractors responsible for installing and
maintaining the cabling, and several contractors responsible for parts delivery and
providing field engineers. There had been numerous meetings among technical staff and
executives from the participating companies, but outages continued to occur and take too
long to resolve. The outage selected for analysis was clearly representative of an
TØ, pronounced "tee-zero" for time-zero, is a way of designating a key point in time.
From TØ, events are traced backward or forward in time.
Peter wanted to
use TØ Root Cause Analysis (RCA) because it focuses on an event, or specific
incident. Since an incident is something undesirable that happens at an identifiable time
and place (the TØ event), the analysis begins from an objective viewpoint. Everybody
could easily agree that ATM outages are undesirable. In an already volatile relationship,
this was an important advantage over other root cause analysis methods, which require an
agreed problem definition before you can begin.
Beyond discovering the cause of the incident,
TØ RCA uncovers what happens during recovery. Speedy recovery when a crisis occurs can
substantially reduce the impact of the incident, so it is highly beneficial to identify
root causes and prevent recurrence of problems in the recovery process. In this example,
the ATM outage lasted 22 hours, and several months later, BigNet, the customer and the
other participating companies were still dealing with follow-on problems associated with
the outage. Thus the ability to improve the recovery process was also a major motivation
in choosing to perform TØ RCA.
In the picture that TØ RCA creates, "still water" before the TØ incident
represents the time when everything is completely normal, with no ripple hinting at what
is to come. Still water after the TØ incident represents the time when recovery is
complete and there are no lingering traces of what occurred.
TØ RCA Pays Off in Improved Application Availability and Improved Productivity
Performing TØ RCA is intensive work and, like
any such investment, must yield specific benefits. TØ RCA, when applied to an appropriate
incident, improves application availability for end users and improves productivity
for all parties concerned. In the case discussed here, that includes technical staff and
executives from BigNet as well as from the customer and all the other participating
companies. The productivity improvements derive from the following specific benefits of
using TØ RCA:
1 - Problem prevention. This is always the
expected benefit of root cause analysis. Learning the basic causes of a problem gives you
the information needed to eliminate these causes and thereby prevent future incidents.
2 - Quicker problem resolution. TØ RCA
recognizes that substantial reduction in the number of incidents takes time and that,
given the changing environments that we work in, we cannot eliminate crisis incidents
entirely. Therefore, containment of problems to limit the impact is also a significant
focus of TØ RCA. The method identifies root causes of problems in the recovery process,
providing information needed to eliminate the things that delay and confuse recovery.
3 - Reduced contention over non-issues.
Because the method starts from an objective basis and creates a picture of the incident
based on facts, everyone can focus on the real problems rather than placing blame.
4 - Clear identification of solution owners.
The systematic analysis process shows a logical connection between events that occur and
what causes them. The analysis then focuses on solutions: what can be done to
eliminate the causes, and who can perform those actions. The clarity of who is responsible
for specific actions and the focus on solution ownership (rather than problem ownership)
is key to changing a working relationship from blaming and reactive mode to constructive
on Problem Occurrences
TØ RCA takes advantage of the fact that problem
incidents don't just happen for no reason. By studying an incident as the direct outcome
of a series of events, we can break the problem into understandable and manageable pieces:
we draw the picture of how the precursor events interact and lead directly to the TØ
incident. Similarly, recovery from an incident is the direct outcome of the activities
that take place following the TØ incident. The recovery can be speedy and effective
or less than terrific.
The TØ incident is the basis for a discovery
process that gives us the information we need to prevent future occurrences of a whole
class of incidents. The process has seven steps:
1. Define the TØ event
Proper selection of the TØ event is critical
because the success of the entire analysis depends on this step. The event must be
significant, i.e., cause enough pain, to be worth the investment to prevent future
occurrences. It must also provide leverage so that identifying and addressing root causes
of the one incident truly will eliminate the whole class of incidents.
In this case, the TØ event was defined to be the
time when the specific ATM outage was reported to BigNet. A 22-hour outage was certainly
significant in impact, especially since we knew that it was not an isolated incident.
Without intervention, additional outages were essentially guaranteed, certainly at that
customer site and probably at others.
2. Construct the event line
The event line is a graphic representation of all
the precursor events that lead up to the TØ event and the follow-on events that occur
during recovery. An event line typically has 70-100 events represented, with multiple
threads showing parallel kinds of activity. The event line for the BigNet case had 126
events, reflecting the complexity of the situation, with threads showing routine, setup,
investigation, corrective action and communication activities. Many problems became
obvious simply from examining the event line. For example, simultaneous corrective actions
that are uncoordinated and possibly conflicting show up clearly on an event line.
Although examination of the event line suggests
some root causes, it is necessary to analyze the entire event line systematically to get
complete understanding. Therefore, each event on the line is individually classified as
problematic, unclear or OK. The problematic events are the ones we examine more closely.
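The event line and its three-way classification can be sketched as a small data structure. This is only an illustration, not the tooling used in the BigNet analysis: the names `Event`, `Status`, `phase` and `problematic_events` are hypothetical, and the sample entries (including their hour offsets and classifications) are invented for the sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """The three classifications applied to each event on the line."""
    PROBLEMATIC = "problematic"
    UNCLEAR = "unclear"
    OK = "OK"


@dataclass(frozen=True)
class Event:
    hours_from_t0: float  # negative = precursor, positive = recovery
    thread: str           # e.g. "setup", "investigation", "corrective action"
    description: str
    status: Status


def phase(event):
    """Label an event by its position relative to the time-zero event."""
    if event.hours_from_t0 < 0:
        return "precursor"
    if event.hours_from_t0 == 0:
        return "time-zero"
    return "recovery"


def problematic_events(event_line):
    """Return the events flagged as problematic, in time order."""
    flagged = [e for e in event_line if e.status is Status.PROBLEMATIC]
    return sorted(flagged, key=lambda e: e.hours_from_t0)


# Hypothetical sample entries; offsets and classifications are assumptions.
event_line = [
    Event(2.0, "corrective action", "Equipment replaced without evidence of cause", Status.PROBLEMATIC),
    Event(-720.0, "setup", "ATM initially installed", Status.PROBLEMATIC),
    Event(0.0, "communication", "Outage reported to BigNet", Status.OK),
    Event(-600.0, "investigation", "Backup circuit problem identified", Status.PROBLEMATIC),
    Event(22.0, "routine", "ATM back in service", Status.OK),
]

for e in problematic_events(event_line):
    print(f"{e.hours_from_t0:+8.1f} h  [{phase(e)}] {e.thread}: {e.description}")
```

Keeping the thread name on each event makes it possible to line up parallel activity, which is where uncoordinated corrective actions tend to become visible.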
3. Identify "key" events
To make the entire effort manageable and to
ensure that the actions taken will have the largest impact, the next step is to focus on
the most important events. We review all the events, identifying those that are the most
important contributors to TØ or recovery problems, or that represent a class of
problematic events with similar characteristics. Typically, approximately ten events are
selected as key events.
The full event line for the ATM outage reveals
several interesting points, which illustrate the kinds of things that key problematic
- Although the outage occurred in January, it was
"set up" in December when the ATM was initially installed. Examining the event
line shows that problems with the backup circuit were identified almost a month before the
TØ outage incident.
- When an outage occurred, the bank called BigNet
and everyone else who might possibly help. This resulted in lots of activity and finger
pointing, but not in effective problem resolution. Everyone involved replaced equipment to
demonstrate responsiveness, regardless of technical evidence about the cause of the
problem. The companies involved had not successfully identified the specific cause of the
outage, so they weren't even applying the right band-aids.
- There was no still water. The ATM had repeated a
cycle of outages since the day it was installed. Although the ATM was back in service at
the time the TØ analysis was performed, there were still many recovery activities
underway, and another outage occurred before they were complete. There would be no still
water for the specific ATM until the correct apparent causes were identified and
corrected. There would be no reduction in ATM outages in general until the correct root
causes were identified and corrected.
4. Analyze apparent causes
Specific conditions that allowed a key
problematic event to occur are important to correct, but also form the basis for further
analysis to reach root causes. An example of an apparent cause would be a particular
outstanding bug report that, if fixed, could have prevented an outage. In this case, an
especially important key problematic event was the initial installation of the ATM.
Analyzing the apparent causes revealed that the internal wiring was the wrong gauge. Once
understood, fixing the apparent cause for that ATM and for all other installed ATMs
5. Identify root causes.
Root causes are the systemic causes that allow
the apparent causes to develop. Often the root causes are areas of concern that are
already known, but the true impact of not correcting them was not previously recognized.
Performing this systematic analysis makes clear the direct connection between a given root
cause and a crisis event, making it easier to attach an appropriate priority and take
In the example, the root causes behind the
incorrect initial installation included inadequate installation specifications, inadequate
testing procedures for new installations and inadequate turnover procedures to transfer
responsibility from the organizations responsible for new installations to those
responsible for ongoing operations.
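The chain from key event to apparent cause to root causes can be written down explicitly, which makes it easy to see which events a given root cause helped produce. The following is a minimal sketch using the installation example from the text; the class names and the `events_by_root_cause` helper are hypothetical and are not part of the method itself.

```python
from dataclasses import dataclass, field


@dataclass
class ApparentCause:
    description: str
    root_causes: list = field(default_factory=list)


@dataclass
class KeyEvent:
    description: str
    apparent_causes: list = field(default_factory=list)


def events_by_root_cause(key_events):
    """Map each root cause to the key events it contributed to."""
    index = {}
    for event in key_events:
        for cause in event.apparent_causes:
            for root in cause.root_causes:
                linked = index.setdefault(root, [])
                if event.description not in linked:
                    linked.append(event.description)
    return index


# The one chain described in the article.
installation = KeyEvent(
    "Initial installation of the ATM",
    [
        ApparentCause(
            "Internal wiring was the wrong gauge",
            [
                "Inadequate installation specifications",
                "Inadequate testing procedures for new installations",
                "Inadequate turnover procedures",
            ],
        )
    ],
)

for root, events in events_by_root_cause([installation]).items():
    print(f"{root} -> {events}")
```

With more key events recorded, a root cause that appears against several of them is a natural candidate for higher priority.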
6. Identify corrective actions
Success in preventing crises and improving
recovery depends on following through with actions that correct apparent causes as well as
root causes. Correcting apparent causes is usually fairly easy since they are very
specific and often well enough defined to identify the appropriate action quickly.
Correcting root causes is typically more difficult because it takes more work to determine
exactly what actions to take and to verify that the corrective actions work. However,
correcting root causes is how we leverage the effort of analyzing one incident into
preventing whole classes of incidents.
7. Do them.
Unless corrective actions are actually carried
out, we don't realize the benefits of doing the analysis. Although this seems obvious, it
is all too common for new crises to divert attention, letting these actions join the long
list of good intentions.
The escalation process, originally targeted as
the culprit, had little to do with the problem. Redesigning it would have had no impact on
the real issues that were frustrating the bank. Instead, the real root causes were
discovered. ATMs are now installed correctly the first time and other issues that
were uncovered in the analysis have also been addressed. The bank, which was almost ready
to change vendors, is now BigNet's largest support customer.
TØ Root Cause Analysis is a straightforward way
to find out why something happened and leverage that knowledge to prevent similar
incidents in the future. It also enables speedier recovery when incidents do occur. One
significant incident is the basis for a discovery process: we analyze it as the result of
a chain of events. By preventing recurrence of individual problematic events, we reduce
the chances of the TØ outage incident recurring. Similarly, we eliminate individual
problematic events in the recovery process. This practical approach keeps things
manageable. The end result is to improve service and save money for customers and