TØ Root Cause Analysis
Learning the Truth
The situation was a
muddled swamp of political and technical issues. The customer had experienced an
application outage that lasted 48 hours. This would be a serious disruption at any time.
To make things worse, it occurred while the company was trying to complete year-end
and month-end processing. Users were forced to revert to a workaround system that was much
slower and more labor intensive. In addition, work done using the workaround had to be
redone after service was restored. Everyone agreed that the situation was nasty, and
should not be allowed to happen again. The unanimous opinion ended there.
The customer believed the client-server based document
imaging and processing application was the culprit. The application vendor agreed that
some changes were necessary in the product, but felt the customer was not managing the
system properly. Both parties needed a way to learn the truth about what happened and, more
importantly, what the real causes were: all of them. In this article, we use this real
case to see how we can discover the truth and use it to prevent unwanted incidents and
speed recovery when they do occur.
TØ, pronounced "tee-zero" for time-zero, is a way of designating a key point in time.
From TØ, events are traced backward or forward in time.
The vendor decided to apply a specific
method, called TØ Root Cause Analysis (RCA), to gain a clear picture of what
happened and what the underlying causes were. They chose this method because it focuses on
an event, or specific incident. Since an incident is something undesirable that happens at
an identifiable time and place (the TØ event), the analysis begins from an objective
viewpoint. In an already volatile relationship, this was an important advantage over other
root cause analysis methods, which require agreement on what problem you are
trying to solve.
Beyond discovering the cause of the incident, TØ
RCA uncovers what happens during recovery. Speedy recovery when a crisis does occur can
substantially reduce the impact of the incident, so it is also beneficial to identify root
causes and prevent recurrence of problems in the recovery process. In this example, the
application outage itself lasted 48 hours, and two weeks later the vendor and customer
were still dealing with follow-on problems associated with the outage. Thus the ability to
improve the recovery process was also a major motivation in choosing to perform TØ RCA.
TØ RCA Pays Off in Improved Application Availability and Improved Productivity
Performing TØ RCA is intensive work and, like
any such investment, must yield specific benefits. TØ RCA, when applied to an appropriate
incident, improves application availability for end users and productivity
for all parties concerned. In the case discussed here, that includes the vendor technical
and sales staff, vendor executives, customer technical staff, customer executives and end
users. The productivity improvements derive from the following specific benefits of using
TØ RCA:
- Problem prevention. This is the usual
purpose of root cause analysis. Learning the basic cause of a problem gives you the
information needed to eliminate the cause and thereby prevent future incidents.
- Quicker problem resolution. TØ RCA
recognizes that substantial reduction in the number of incidents takes time and that,
given the changing environments that we work in, we cannot eliminate crisis incidents
entirely. Therefore, containment of problems to limit the impact is also a significant
focus of TØ RCA. The method identifies root causes of problems in the recovery process,
providing information needed to eliminate the things that delay and confuse recovery.
- Reduced contention over non-issues. Because
the method starts from an objective basis and creates a picture of the incident based on
facts, everyone can focus on the real problems rather than placing blame.
- Clear identification of solution owners.
The systematic analysis process shows a logical connection between events that occur and
what causes them. The analysis then focuses on solutions: what can be done to
eliminate the causes, and who can perform those actions. The clarity of who is responsible
for specific actions and the focus on solution ownership (rather than problem ownership)
is key to changing the working relationship from a blaming, reactive mode to a
constructive and proactive one.
Capitalizing on Problem Occurrences
TØ RCA builds on the recognition that problem
incidents don't just happen. By studying an incident as the direct outcome of a series of
events, we can break the problem into understandable and manageable pieces: we actually
draw the picture of how the precursor events interact and lead directly to the TØ
incident. Similarly, recovery from an incident is the direct outcome of the activities
that take place following the TØ incident. The recovery can be speedy and effective, or
less than optimal.
One major incident is the basis for a discovery
process that gives us the information we need to prevent future occurrences of a whole
class of incidents. The process has seven steps:
1. Define the TØ event
Proper selection of the TØ event is critical
because the success of the entire analysis depends on this step. The event must be
significant, i.e., cause a high enough pain level, to be worth investing in to prevent
future occurrences. It must also provide leverage so that identifying and addressing root
causes of the one incident will actually address a whole class of incidents.
In the example case, the TØ event was defined to
be the time when end users reported the application outage to the internal technical
support staff. A 48-hour outage was certainly significant in impact! Assessing leverage
was harder, since at the time we selected the TØ event we didn't yet know what the cause
was. However, many of the technical and operational details were already known. Clearly,
without intervention, additional outages were effectively guaranteed, certainly at that
customer site and probably at others.
2. Construct the event line
The event line is a graphic representation of all
the precursor events that lead up to the TØ event and the follow-on events that occur
during recovery. An event line typically has 70-100 events represented, with multiple
threads showing parallel kinds of activity. The event line for the example case had 88
events, with threads showing end-user, investigation, corrective action and communication
activities. Many problems became obvious simply from examining the event line. For
example, simultaneous corrective actions that are uncoordinated and possibly conflicting
show up clearly on an event line.
The overall structure of the event line suggests
some root causes, but it is necessary to analyze the event line systematically. Therefore,
each event on the line is individually classified as problematic, unclear or OK. The
problematic events are the ones we examine more closely.
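The event line described above can be sketched as a small data structure. This is only an illustrative sketch, not the tooling used in the case: the `Event` and `Status` types, the helper names, and the sample events with their hour offsets are all assumptions made for the example. Times are stored as offsets from TØ, so precursor events are negative and recovery events are positive.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """The three classifications applied to each event on the line."""
    PROBLEMATIC = "problematic"
    UNCLEAR = "unclear"
    OK = "ok"


@dataclass(frozen=True)
class Event:
    offset_hours: float  # hours relative to T-zero: negative before, positive after
    thread: str          # e.g. end-user, investigation, corrective action, communication
    description: str
    status: Status


def build_event_line(events):
    """Order events in time relative to T-zero."""
    return sorted(events, key=lambda e: e.offset_hours)


def split_at_t0(event_line):
    """Separate precursor events from follow-on (recovery) events."""
    precursors = [e for e in event_line if e.offset_hours < 0]
    follow_on = [e for e in event_line if e.offset_hours > 0]
    return precursors, follow_on


def problematic_by_thread(event_line):
    """Group the problematic events by thread for closer examination."""
    threads = {}
    for e in event_line:
        if e.status is Status.PROBLEMATIC:
            threads.setdefault(e.thread, []).append(e)
    return threads


# Hypothetical sample events; offsets and classifications are illustrative only.
events = [
    Event(48.0, "end-user", "Service restored", Status.OK),
    Event(0.0, "end-user", "Outage reported to internal support (T-zero)", Status.OK),
    Event(-2000.0, "corrective action", "Objects located incorrectly", Status.PROBLEMATIC),
    Event(2.0, "investigation", "Server found to be out of disk space", Status.UNCLEAR),
    Event(6.0, "corrective action", "Uncoordinated fixes attempted in parallel", Status.PROBLEMATIC),
]

event_line = build_event_line(events)
precursors, follow_on = split_at_t0(event_line)
for thread, items in problematic_by_thread(event_line).items():
    print(thread, [e.description for e in items])
```

Keeping the thread name on each event is what lets parallel activity, such as simultaneous corrective actions, be laid side by side and compared.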
3. Identify "key" events
To make the entire effort manageable and to
ensure that the actions taken will have the largest impact, the next step is to focus on
the most important events. Approximately ten events are selected as the key events, either
because they are the most important contributors to TØ or recovery problems, or because
they represent a class of events with similar characteristics.
The event line depicting the key problem events for the example case (provided as a
linked figure) reveals several interesting points:
- Although the outage occurred in February, it was
"set up" by actions taken in November that resulted in objects being located
incorrectly. In fact, those actions did initiate a problem report, but the potential
impact was not understood and the problem was not addressed until after the TØ crisis.
- The apparent cause of the outage determined in
the initial troubleshooting steps was running out of disk space on the server. In fact,
the disk space problem was a trigger rather than the real cause.
- Troubleshooting events are representative rather
than significant individual events, so the patterns are easier to see in the full event
line. The troubleshooting area of the event line is labeled a "cloud" because,
although there was plenty of activity, it was not effectively coordinated or focused to
resolve the problem quickly.
- The final event emphasizes the finding that there
was no agreed-to definition of recovery, and it was unclear at the time the analysis was
done whether recovery was complete.
4. Analyze pre-conditions
Specific conditions that allowed a key
problematic event to occur are important to correct, but also form the basis for further
analysis to reach root causes. An example of a pre-condition could be a particular
outstanding bug report that, if fixed, could have prevented an outage. In the sample case,
one pre-condition was the lack of follow-up in December when the problem with objects
located incorrectly was reported.
5. Identify root causes
Root causes are the systemic causes that allow
the pre-conditions to develop. There are a limited number of root causes that regularly
appear in computer-related situations, but the TØ RCA identifies which ones to address to
prevent the particular class of outages being analyzed. Often the root causes are areas of
concern that are already known, but the true impact of not correcting them was not
previously visible. Performing this systematic analysis makes clear the direct connection
between a given root cause and a crisis event, making it easier to attach an appropriate
priority and take action.
In the example, "less-than-adequate work
practices" is a root cause of every key problematic event. Other root causes of some,
but not all, key events included "less-than-adequate product robustness", among others.
6. Identify corrective actions
Success in preventing crises and improving
recovery depends on following through with actions that correct specific pre-conditions as
well as the root causes. Correcting pre-conditions is usually fairly easy since the
pre-conditions are very specific and often well enough defined to identify the appropriate
action quickly. Correcting root causes is typically more difficult because it takes more
work to determine exactly what actions to take and to verify that the corrective actions
work. However, correcting root causes is how we leverage the effort of analyzing one
incident into preventing whole classes of incidents.
One root-cause correction in the
example case was to define, document and ensure use of an effective escalation procedure.
This included defining the roles and responsibilities of the key players, establishing
criteria for when and how to escalate an issue, and specifying whom to notify of what
kinds of problems.
7. Do it
Unless corrective actions are actually carried
out, we don't realize the benefits of doing the analysis. Although this seems obvious,
it is all too easy to let these actions join the long list of good
intentions. A key aspect of implementation is establishing and using success criteria, or
objective measurements, to verify the actual impact of the changes.
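To make the chain from key events through pre-conditions to root causes more concrete, here is a minimal sketch of how the findings might be recorded and tallied so that corrective actions can be prioritized. The `KeyEvent` type, the function names, and the assignment of root causes to individual events are hypothetical; the article states only that "less-than-adequate work practices" applied to every key event and that "less-than-adequate product robustness" applied to some.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class KeyEvent:
    description: str
    preconditions: list  # specific conditions that allowed the event to occur
    root_causes: list    # systemic causes behind those pre-conditions


def tally_root_causes(key_events):
    """Count how many key events each root cause contributes to."""
    counts = Counter()
    for event in key_events:
        counts.update(set(event.root_causes))  # count each cause once per event
    return counts


def common_to_all(key_events):
    """Return the root causes that appear in every key event."""
    counts = tally_root_causes(key_events)
    return sorted(c for c, n in counts.items() if n == len(key_events))


WORK_PRACTICES = "less-than-adequate work practices"
ROBUSTNESS = "less-than-adequate product robustness"

# Hypothetical mapping for illustration; the real analysis had about ten key events.
key_events = [
    KeyEvent(
        "Problem report on incorrectly located objects not addressed",
        ["No follow-up on the report in December"],
        [WORK_PRACTICES],
    ),
    KeyEvent(
        "Server ran out of disk space",
        ["Disk usage not acted on before it triggered the outage"],
        [WORK_PRACTICES, ROBUSTNESS],
    ),
    KeyEvent(
        "Troubleshooting activity not coordinated",
        ["No agreed escalation procedure"],
        [WORK_PRACTICES],
    ),
]

# Root causes ranked by how many key events they touch.
for cause, count in tally_root_causes(key_events).most_common():
    print(f"{count} of {len(key_events)} key events: {cause}")
```

A tally like this is only an aid to judgment: a root cause that touches every key event is a natural first candidate for corrective action, but the owner and the success criteria still have to be agreed between the parties.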
TØ Root Cause Analysis is a straightforward way
to find out why something happened and leverage that knowledge to prevent similar
incidents in the future. It also enables speedier recovery when incidents do occur. It
uses one significant incident as the basis for a discovery process by analyzing it as the
result of a chain of events. By preventing recurrence of individual problematic events, we
reduce the chances of the TØ outage incident recurring. Similarly, we eliminate
individual problematic events in the recovery process. The result is to improve service
and save money for customer and vendor alike.