"Stuff" Doesn't Just Happen:  Using the Truth to Prevent Unwanted Incidents and Speed Recovery

by Jeanne Sawyer

Using TØ Root Cause Analysis

Learning the Truth

TØ RCA Pays Off

Capitalizing on Problem Occurrences

Seven Steps to prevent a whole class of incidents.

1. Define the TØ event precisely

2. Construct the event line

3. Identify "key problematic events"

4. Analyze pre-conditions

5. Identify root causes

6. Identify corrective actions

7. Do it

Summary

The situation was a muddled swamp of political and technical issues. The customer had experienced an application outage that lasted 48 hours. This would be a serious disruption at any time. To make things worse, it occurred while the company was trying to complete year-end and month-end processing. Users were forced to revert to a workaround system that was much slower and more labor-intensive. In addition, work done using the workaround had to be redone after service was restored. Everyone agreed that the situation was nasty and should not be allowed to happen again. The agreement ended there.

The customer believed the client-server based document imaging and processing application was the culprit. The application vendor agreed that some changes were necessary in the product, but felt the customer was not managing the system properly. They needed a way to learn the truth about what happened and, more importantly, what the real causes were—all of them. In this article, we use this real case to see how we can discover the truth and use it to prevent unwanted incidents and speed recovery when they do occur.

Learning the Truth

TØ, pronounced "tee-zero" for time-zero, is a way of designating a key point in time. From TØ, events are traced backward or forward in time.

The vendor decided to apply a specific method, called TØ Root Cause Analysis (RCA), to gain a clear picture of what happened and what the underlying causes were. They chose this method because it focuses on an event, or specific incident. Since an incident is something undesirable that happens at an identifiable time and place (the TØ event), the analysis begins from an objective viewpoint. In an already volatile relationship, this was an important advantage over other root cause analysis methods, which require agreement on what the problem is that you are trying to solve.

Beyond discovering the cause of the incident, TØ RCA uncovers what happens during recovery. Speedy recovery when a crisis does occur can substantially reduce the impact of the incident, so it is also beneficial to identify root causes and prevent recurrence of problems in the recovery process. In this example, the application outage itself lasted 48 hours, and two weeks later the company and customer were still dealing with follow-on problems associated with the outage. Thus the ability to improve the recovery process was also a major motivation in choosing to perform TØ RCA.

TØ RCA Pays Off in Improved Application Availability and Improved Productivity

Performing TØ RCA is intensive work and, like any such investment, must yield specific benefits. When applied to an appropriate incident, TØ RCA improves application availability for end users and improves productivity for all parties concerned. In the case discussed here, that includes the vendor's technical and sales staff, vendor executives, customer technical staff, customer executives and end users. The productivity improvements derive from the following specific benefits of using TØ RCA:

  1. Problem prevention. This is the usual purpose of root cause analysis. Learning the basic cause of a problem gives you the information needed to eliminate the cause and thereby prevent future incidents.
  2. Quicker problem resolution. TØ RCA recognizes that substantial reduction in the number of incidents takes time and that, given the changing environments that we work in, we cannot eliminate crisis incidents entirely. Therefore, containment of problems to limit the impact is also a significant focus of TØ RCA. The method identifies root causes of problems in the recovery process, providing information needed to eliminate the things that delay and confuse recovery.
  3. Reduced contention over non-issues. Because the method starts from an objective basis and creates a picture of the incident based on facts, everyone can focus on the real problems rather than placing blame.
  4. Clear identification of solution owners. The systematic analysis process shows a logical connection between events that occur and what causes them. The analysis then focuses on solutions: what can be done to eliminate the causes, and who can perform those actions. The clarity about who is responsible for specific actions and the focus on solution ownership (rather than problem ownership) are key to changing the working relationship from a blaming, reactive mode to a constructive, proactive one.

Capitalizing on Problem Occurrences

TØ RCA builds on the recognition that problem incidents don't just happen. By studying an incident as the direct outcome of a series of events, we can break the problem into understandable and manageable pieces: we actually draw the picture of how the precursor events interact and lead directly to the TØ incident. Similarly, recovery from an incident is the direct outcome of the activities that take place following the TØ incident. The recovery can be speedy and effective, or less than optimal.

One major incident is the basis for a discovery process that gives us the information we need to prevent future occurrences of a whole class of incidents. The process has seven steps:

1. Define the TØ event precisely.

Proper selection of the TØ event is critical because the success of the entire analysis depends on this step. The event must be significant, i.e., cause a high enough pain level, to be worth investing in to prevent future occurrences. It must also provide leverage so that identifying and addressing root causes of the one incident will actually address a whole class of incidents.

In the example case, the TØ event was defined to be the time when end users reported the application outage to the internal technical support staff. A 48-hour outage was certainly significant in impact! Assessing leverage was harder, since at the time we selected the TØ event we didn't yet know what the cause was. However, many of the technical and operational details were already known. Clearly, without intervention, additional outages were effectively guaranteed, certainly at that customer site and probably at others.

2. Construct the event line.

The event line is a graphic representation of all the precursor events that lead up to the TØ event and the follow-on events that occur during recovery. An event line typically has 70-100 events represented, with multiple threads showing parallel kinds of activity. The event line for the example case had 88 events, with threads showing end-user, investigation, corrective action and communication activities. Many problems became obvious simply from examining the event line. For example, simultaneous corrective actions that are uncoordinated and possibly conflicting show up clearly on an event line.

The overall structure of the event line suggests some root causes, but it is necessary to analyze the event line systematically. Therefore, each event on the line is individually classified as problematic, unclear or OK. The problematic events are the ones we examine more closely.
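As a rough illustration of the event line and its classification step, the following Python sketch models events on named threads, orders them in time, and pulls out the problematic ones. The class and function names (Event, Status, build_event_line and so on), the year, and all of the sample events are assumptions made for illustration; they are not taken from the actual analysis.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Status(Enum):
    """The three classifications applied to each event on the line."""
    PROBLEMATIC = "problematic"
    UNCLEAR = "unclear"
    OK = "ok"


@dataclass(frozen=True)
class Event:
    when: datetime
    thread: str       # e.g. "end-user", "investigation", "corrective action"
    description: str
    status: Status


def build_event_line(events):
    """Return the events in chronological order."""
    return sorted(events, key=lambda e: e.when)


def offset_from_t0(event, t0):
    """Signed time between an event and T0 (negative means precursor)."""
    return event.when - t0


def problematic(events):
    """Keep only the events classified as problematic."""
    return [e for e in events if e.status is Status.PROBLEMATIC]


# Hypothetical sample data: the year, times, threads and statuses are invented.
T0 = datetime(2000, 2, 1, 9, 0)
events = [
    Event(T0 + timedelta(hours=2), "investigation",
          "Disk space shortage found", Status.UNCLEAR),
    Event(T0 - timedelta(days=80), "corrective action",
          "Objects located incorrectly", Status.PROBLEMATIC),
    Event(T0, "end-user",
          "Outage reported to support", Status.PROBLEMATIC),
    Event(T0 + timedelta(hours=48), "communication",
          "Service restoration announced", Status.OK),
]

line = build_event_line(events)
precursors = [e for e in line if offset_from_t0(e, T0) < timedelta(0)]
to_examine = problematic(line)
```

A real event line would carry far more events, and the classification would come from the people reconstructing the incident rather than from code; the sketch only shows how the pieces relate.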

3. Identify "key problematic events."

To make the entire effort manageable and to ensure that the actions taken will have the largest impact, the next step is to focus on the most important events. Approximately ten events are selected as the key events, either because they are the most important contributors to TØ or recovery problems, or because they represent a class of events with similar characteristics.

A separate event line diagram depicts the key problem events for the example case. Examining the event line reveals several interesting points:

Although the outage occurred in February, it was "set up" by actions taken in November that resulted in objects being located incorrectly. In fact, those actions did initiate a problem report, but the potential impact was not understood and the problem was not addressed until after the TØ crisis.

The apparent cause of the outage determined in the initial troubleshooting steps was running out of disk space on the server. In fact, the disk space problem was a trigger rather than the real cause.

Troubleshooting events are representative rather than significant individual events, so the patterns are easier to see in the full event line. The troubleshooting area of the event line is labeled a "cloud" because, although there was plenty of activity, it was not effectively coordinated or focused to resolve the problem quickly.

The final event emphasizes the finding that there was no agreed-to definition of recovery, and it was unclear at the time the analysis was done whether recovery was complete.

4. Analyze pre-conditions.

Specific conditions that allowed a key problematic event to occur are important to correct, but they also form the basis for further analysis to reach root causes. An example of a pre-condition could be a particular outstanding bug report that, if fixed, could have prevented an outage. In the sample case, one pre-condition was the lack of follow-up in December when the problem with objects located incorrectly was reported.

5. Identify root causes.

Root causes are the systemic causes that allow the pre-conditions to develop. There are a limited number of root causes that regularly appear in computer-related situations, but the TØ RCA identifies which ones to address to prevent the particular class of outages being analyzed. Often the root causes are areas of concern that are already known, but the true impact of not correcting them was not previously visible. Performing this systematic analysis makes clear the direct connection between a given root cause and a crisis event, making it easier to attach an appropriate priority and take action.

In the example, "less-than-adequate work practices" is a root cause of every key problematic event. Other root causes of some, but not all, key events included "less-than-adequate product robustness" and "unclear communications."
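One way to see which root causes deserve priority is to tally them across the key problematic events. The Python sketch below does this for a small, hypothetical mapping: the root cause labels are the ones quoted above, but the event names and the assignment of causes to events are invented for illustration, as are the function names.

```python
from collections import Counter

# Hypothetical mapping of key problematic events to their root causes.
root_causes_by_event = {
    "objects located incorrectly": {
        "less-than-adequate work practices",
        "less-than-adequate product robustness",
    },
    "problem report not followed up": {
        "less-than-adequate work practices",
        "unclear communications",
    },
    "uncoordinated troubleshooting": {
        "less-than-adequate work practices",
        "unclear communications",
    },
    "no agreed definition of recovery": {
        "less-than-adequate work practices",
    },
}


def common_root_causes(mapping):
    """Root causes that appear in every key problematic event."""
    sets = list(mapping.values())
    return set.intersection(*sets) if sets else set()


def rank_root_causes(mapping):
    """Root causes ordered by how many key events they contribute to."""
    counts = Counter(cause for causes in mapping.values() for cause in causes)
    return counts.most_common()


shared = common_root_causes(root_causes_by_event)
ranking = rank_root_causes(root_causes_by_event)
```

A cause that appears in every key event, as "less-than-adequate work practices" did in the example case, is a natural first target because correcting it touches the whole class of incidents.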

6. Identify corrective actions.

Success in preventing crises and improving recovery depends on following through with actions that correct specific pre-conditions as well as the root causes. Correcting pre-conditions is usually fairly easy since the pre-conditions are very specific and often well enough defined to identify the appropriate action quickly. Correcting root causes is typically more difficult because it takes more work to determine exactly what actions to take and to verify that the corrective actions work. However, correcting root causes is how we leverage the effort of analyzing one incident into preventing whole classes of incidents.

An example of correcting a root cause from the example case was to define, document and ensure use of an effective escalation procedure. This included defining the roles and responsibilities of the key players, establishing criteria for when and how to escalate an issue, and specifying whom to notify of what kinds of problems.

7. Do it.

Unless corrective actions are actually carried out, we don't realize the benefits of doing the analysis. Although this seems obvious, unfortunately it is all too easy to let these actions join the long list of good intentions. A key aspect of implementation is establishing and using success criteria, or objective measurements, to verify the actual impact of the changes.
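Follow-through can be tracked with something as simple as a list of actions, each with a solution owner and a success criterion. The Python sketch below flags actions that lack an owner or criterion, have not been carried out, or have not had their impact verified. The record layout, the owners and the criteria are all hypothetical; only the escalation procedure action comes from the example case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorrectiveAction:
    description: str
    owner: Optional[str] = None              # solution owner, if assigned
    success_criterion: Optional[str] = None  # objective measurement, if defined
    done: bool = False
    verified: bool = False


def needs_attention(actions):
    """Return (description, reasons) for each action that is not fully closed."""
    flagged = []
    for action in actions:
        reasons = []
        if not action.owner:
            reasons.append("no owner")
        if not action.success_criterion:
            reasons.append("no success criterion")
        if not action.done:
            reasons.append("not carried out")
        elif not action.verified:
            reasons.append("impact not verified")
        if reasons:
            flagged.append((action.description, reasons))
    return flagged


# Hypothetical tracking list; owners and criteria are invented.
actions = [
    CorrectiveAction(
        "Define, document and ensure use of an escalation procedure",
        owner="support manager",
        success_criterion="issues escalated within the agreed time",
        done=True,
    ),
    CorrectiveAction("Follow up on the report of incorrectly located objects"),
    CorrectiveAction(
        "Agree on a definition of recovery",
        owner="customer technical lead",
        success_criterion="definition accepted by both parties",
        done=True,
        verified=True,
    ),
]

open_items = needs_attention(actions)
```

Reviewing such a list regularly is one way to keep corrective actions from joining the long list of good intentions.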

Summary

TØ Root Cause Analysis is a straightforward way to find out why something happened and leverage that knowledge to prevent similar incidents in the future. It also enables speedier recovery when incidents do occur. It uses one significant incident as the basis for a discovery process by analyzing it as the result of a chain of events. By preventing recurrence of individual problematic events, we reduce the chances of the TØ outage incident recurring. Similarly, we eliminate individual problematic events in the recovery process. The result is to improve service and save money for customer and vendor alike.


Copyright ©2010.  The Sawyer Partnership.  All rights reserved.
Jeanne Sawyer, Ph.D.