Consulting, training, and coaching to help you solve messy problems that disrupt your operations or cause customers to take their business elsewhere.

Leveraging a Crisis:   How to Learn the Truth About What Really Happened and Use It to Change How You Do Business

by

Jeanne Sawyer

Leveraging a Crisis

Learning the Truth

TØ RCA Pays Off

Capitalizing on Problem Occurrences

Seven Steps to prevent a whole class of incidents.

1. Define the TØ event precisely

2. Construct the event line

3. Identify "key problematic events"

4. Analyze apparent causes

5. Identify root causes

6. Identify corrective actions

7. Do them

Summary

Like other high-tech companies, a major networking company (we’ll call it "BigNet") found that keeping up with rapid growth presented a significant challenge. In particular, growth stressed the Customer Service Organization's escalation process to the point where it could no longer effectively meet either customer needs or BigNet's own internal needs. Fundamentally, the process was not designed to handle the growing numbers of customers or the increasing diversity in customer and BigNet requirements.

The result was increasing dissatisfaction and frustration on the part of everyone who participated in identifying and resolving escalated problems. Increasing numbers of problems were treated as exceptions, which is expensive and time-consuming for all concerned. Relief was needed urgently, but where to begin? This article describes the true story of how BigNet approached this business process problem.

Peter, the Vice President of Customer Service, decided to take a different approach than usual for such projects. Rather than chartering a team to draw "as is" and "should be" maps of the escalation process, Peter chose a specific incident that was representative of what happened when the process was overloaded. With this approach, if the escalation process wasn’t really causing the problem or was only a partial contributor, the analysis would reveal that. He would analyze the incident using a method called TØ Root Cause Analysis, a tool specifically designed to learn the truth about complicated incidents. The method focuses actions on preventing recurrences and improving recovery when problems occur. This approach promised, and delivered, a quick and accurate way to understand exactly what was really happening and to identify the most immediate opportunities for improvement.

Peter and his team chose the most recent automatic teller machine (ATM) outage in a particular customer’s network to be the TØ event. This customer had experienced a series of outages and was so angry and frustrated by the constant problems that BigNet was worried about losing the account. Such outages are particularly complicated to resolve because so many physical components, software products and different companies are involved.

In this case, BigNet provided the network management hardware and software. Other participants included the bank that owned and operated the ATM, three phone companies, contractors responsible for installing and maintaining the cabling, and several contractors responsible for parts delivery and providing field engineers. There had been numerous meetings among technical staff and executives from the participating companies, but outages continued to occur and take too long to resolve. The outage selected for analysis was clearly representative of an escalated situation!

Learning the Truth

TØ, pronounced "tee-zero" for time-zero, is a way of designating a key point in time. From TØ, events are traced backward or forward in time.

Peter wanted to use TØ Root Cause Analysis (RCA) because it focuses on an event, or specific incident. Since an incident is something undesirable that happens at an identifiable time and place (the TØ event), the analysis begins from an objective viewpoint. Everybody could easily agree that ATM outages are undesirable. In an already volatile relationship, this was an important advantage over other root cause analysis methods, which require an agreed problem definition before you can begin.

Beyond discovering the cause of the incident, TØ RCA uncovers what happens during recovery. Speedy recovery when a crisis occurs can substantially reduce the impact of the incident, so it is highly beneficial to identify root causes and prevent recurrence of problems in the recovery process. In this example, the ATM outage lasted 22 hours, and several months later, BigNet, the customer and the other participating companies were still dealing with follow-on problems associated with the outage. Thus the ability to improve the recovery process was also a major motivation in choosing to perform TØ RCA.

Click here to see a diagram that illustrates the picture that TØ RCA creates. "Still water" before the TØ incident represents the time when everything is completely normal, with no ripple hinting at what is to come. Still water after the TØ incident represents the time when recovery is complete and there are no lingering traces of what occurred.

TØ RCA Pays Off in Improved Application Availability and Improved Productivity

Performing TØ RCA is intensive work and, like any such investment, must yield specific benefits. When applied to an appropriate incident, TØ RCA improves both application availability for end users and productivity for all parties concerned. In the case discussed here, that includes technical staff and executives from BigNet as well as from the customer and all the other participating companies. The productivity improvements derive from the following specific benefits of using TØ RCA:

1 - Problem prevention. This is always the expected benefit of root cause analysis. Learning the basic causes of a problem gives you the information needed to eliminate these causes and thereby prevent future incidents.

2 - Quicker problem resolution. TØ RCA recognizes that substantial reduction in the number of incidents takes time and that, given the changing environments that we work in, we cannot eliminate crisis incidents entirely. Therefore, containment of problems to limit the impact is also a significant focus of TØ RCA. The method identifies root causes of problems in the recovery process, providing information needed to eliminate the things that delay and confuse recovery.

3 - Reduced contention over non-issues. Because the method starts from an objective basis and creates a picture of the incident based on facts, everyone can focus on the real problems rather than placing blame.

4 - Clear identification of solution owners. The systematic analysis process shows a logical connection between events that occur and what causes them. The analysis then focuses on solutions—what can be done to eliminate the causes, and who can perform those actions. The clarity of who is responsible for specific actions and the focus on solution ownership (rather than problem ownership) is key to changing a working relationship from blaming and reactive mode to constructive and proactive.

Capitalizing on Problem Occurrences

TØ RCA takes advantage of the fact that problem incidents don't just happen for no reason. By studying an incident as the direct outcome of a series of events, we can break the problem into understandable and manageable pieces: we draw the picture of how the precursor events interact and lead directly to the TØ incident. Similarly, recovery from an incident is the direct outcome of the activities that take place following the TØ incident. The recovery can be speedy and effective, or less than terrific.

The TØ incident is the basis for a discovery process that gives us the information we need to prevent future occurrences of a whole class of incidents. The process has seven steps:

1. Define the TØ event precisely.

Proper selection of the TØ event is critical because the success of the entire analysis depends on this step. The event must be significant enough, i.e., cause enough pain, to be worth the investment to prevent future occurrences. It must also provide leverage so that identifying and addressing root causes of the one incident truly will eliminate the whole class of incidents.

In this case, the TØ event was defined to be the time when the specific ATM outage was reported to BigNet. A 22-hour outage was certainly significant in impact, especially since we knew that it was not an isolated incident. Without intervention, additional outages were essentially guaranteed, certainly at that customer site and probably at others.

2. Construct the event line.

The event line is a graphic representation of all the precursor events that lead up to the TØ event and the follow-on events that occur during recovery. An event line typically has 70-100 events represented, with multiple threads showing parallel kinds of activity. The event line for the BigNet case had 126 events, reflecting the complexity of the situation, with threads showing routine, setup, investigation, corrective action and communication activities. Many problems became obvious simply from examining the event line. For example, simultaneous corrective actions that are uncoordinated and possibly conflicting show up clearly on an event line. Click here to see a small section of the event line for this case.

Although examination of the event line suggests some root causes, it is necessary to analyze the entire event line systematically to get complete understanding. Therefore, each event on the line is individually classified as problematic, unclear or OK. The problematic events are the ones we examine more closely.
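As a rough illustration of steps 2 and 3, an event line can be thought of as a list of events, each with a time relative to TØ, a thread and a classification. The Python sketch below is only one possible way to model this and is not part of the TØ RCA method itself. The names `Event`, `Status`, `problematic_events` and `split_at_t0` are hypothetical, and the hour offsets in the sample data are invented for illustration rather than taken from the BigNet case.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """The three classifications applied to every event on the line."""
    PROBLEMATIC = "problematic"
    UNCLEAR = "unclear"
    OK = "ok"


@dataclass(frozen=True)
class Event:
    hours_from_t0: float  # negative = precursor, zero or positive = recovery
    thread: str           # e.g. "setup", "investigation", "communication"
    description: str
    status: Status


def problematic_events(event_line):
    """Return only the problematic events, in chronological order."""
    flagged = [e for e in event_line if e.status is Status.PROBLEMATIC]
    return sorted(flagged, key=lambda e: e.hours_from_t0)


def split_at_t0(event_line):
    """Split the line into precursor events and recovery events."""
    precursors = [e for e in event_line if e.hours_from_t0 < 0]
    recovery = [e for e in event_line if e.hours_from_t0 >= 0]
    return precursors, recovery


# Illustrative data only: the hour offsets are assumptions, not case facts.
event_line = [
    Event(-720.0, "setup", "ATM initially installed", Status.PROBLEMATIC),
    Event(-600.0, "investigation", "Backup circuit problem identified",
          Status.PROBLEMATIC),
    Event(0.0, "communication", "Outage reported to BigNet (T0)", Status.OK),
    Event(2.0, "corrective action",
          "Equipment replaced without evidence of cause", Status.PROBLEMATIC),
    Event(22.0, "routine", "ATM back in service", Status.UNCLEAR),
]

for e in problematic_events(event_line):
    print(f"{e.hours_from_t0:+8.1f} h  [{e.thread}] {e.description}")
```

Keeping the thread on every event is what would let a reviewer spot, for example, several corrective actions running in parallel without coordination.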

3. Identify "key problematic events."

To make the entire effort manageable and to ensure that the actions taken will have the largest impact, the next step is to focus on the most important events. We review all the events, identifying those that are the most important contributors to TØ or recovery problems, or that represent a class of problematic events with similar characteristics. Typically, approximately ten events are selected as key events.

The full event line for the ATM outage reveals several interesting points, which illustrate the kinds of things that key problematic events highlight:

  • Although the outage occurred in January, it was "set up" in December when the ATM was initially installed. Examining the event line shows that problems with the backup circuit were identified almost a month before the TØ outage incident.
  • When an outage occurred, the bank called BigNet and everyone else who might possibly help. This resulted in lots of activity and finger pointing, but not in effective problem resolution. Everyone involved replaced equipment to demonstrate responsiveness, regardless of technical evidence about the cause of the problem. The companies involved had not successfully identified the specific cause of the outage, so they weren’t even applying the right band-aids.
  • There was no still water. The ATM had repeated a cycle of outages since the day it was installed. Although the ATM was back in service at the time the TØ analysis was performed, there were still many recovery activities underway, and another outage occurred before they were complete. There would be no still water for the specific ATM until the correct apparent causes were identified and corrected. There would be no reduction in ATM outages in general until the correct root causes were identified and corrected.

4. Analyze apparent causes.

Specific conditions that allowed a key problematic event to occur are important to correct, but they also form the basis for further analysis to reach root causes. An example of an apparent cause would be a particular outstanding bug report that, if fixed, could have prevented an outage. In this case, an especially important key problematic event was the initial installation of the ATM. Analyzing the apparent causes revealed that the internal wiring was the wrong gauge. Once understood, fixing the apparent cause for that ATM and for all other installed ATMs was straightforward.

5. Identify root causes.

Root causes are the systemic causes that allow the apparent causes to develop. Often the root causes are areas of concern that are already known, but the true impact of not correcting them was not previously recognized. Performing this systematic analysis makes clear the direct connection between a given root cause and a crisis event, making it easier to attach an appropriate priority and take action.

In the example, the root causes behind the incorrect initial installation included inadequate installation specifications, inadequate testing procedures for new installations and inadequate turnover procedures to transfer responsibility from the organizations responsible for new installations to those responsible for ongoing operations.

6. Identify corrective actions.

Success in preventing crises and improving recovery depends on following through with actions that correct apparent causes as well as root causes. Correcting apparent causes is usually fairly easy since they are very specific and often well enough defined to identify the appropriate action quickly. Correcting root causes is typically more difficult because it takes more work to determine exactly what actions to take and to verify that the corrective actions work. However, correcting root causes is how we leverage the effort of analyzing one incident into preventing whole classes of incidents.

7. Do them.

Unless corrective actions are actually carried out, we don't realize the benefits of doing the analysis. Although this seems obvious, it is all too common for new crises to divert attention, letting these actions join the long list of good intentions.
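One way to keep steps 6 and 7 from joining that list of good intentions is to record each corrective action against the cause it addresses, with a named solution owner and a completion flag. The Python sketch below is a minimal illustration under stated assumptions, not a prescribed part of the method. The causes come from the example above, but the `CorrectiveAction` structure, the action wording and the owner names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CorrectiveAction:
    cause: str        # the apparent or root cause this action addresses
    cause_kind: str   # "apparent" or "root"
    action: str       # hypothetical wording, for illustration only
    owner: str        # solution owner (hypothetical names)
    done: bool = False


def open_actions(actions):
    """Return the actions that have not been carried out yet."""
    return [a for a in actions if not a.done]


def open_by_owner(actions):
    """Group the descriptions of outstanding actions by solution owner."""
    grouped = {}
    for a in open_actions(actions):
        grouped.setdefault(a.owner, []).append(a.action)
    return grouped


# Causes are from the article; actions and owners are invented examples.
actions = [
    CorrectiveAction("Internal wiring was the wrong gauge", "apparent",
                     "Rewire installed ATMs", "Cabling contractor", done=True),
    CorrectiveAction("Inadequate installation specifications", "root",
                     "Revise installation specification", "Vendor engineering"),
    CorrectiveAction("Inadequate testing procedures for new installations",
                     "root", "Define acceptance test", "Vendor engineering"),
    CorrectiveAction("Inadequate turnover procedures", "root",
                     "Write turnover checklist", "Bank operations"),
]

for owner, todo in open_by_owner(actions).items():
    print(f"{owner}: {', '.join(todo)}")
```

Grouping by owner rather than by problem mirrors the article's emphasis on solution ownership over problem ownership.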

Summary

The escalation process, originally targeted as the culprit, had little to do with the problem. Redesigning it would have had no impact on the real issues that were frustrating the bank. Instead, the real root causes were discovered. ATMs are now installed correctly the first time, and other issues that were uncovered in the analysis have also been addressed. The bank, which was almost ready to change vendors, is now BigNet’s largest support customer.

TØ Root Cause Analysis is a straightforward way to find out why something happened and leverage that knowledge to prevent similar incidents in the future. It also enables speedier recovery when incidents do occur. One significant incident is the basis for a discovery process: we analyze it as the result of a chain of events. By preventing recurrence of individual problematic events, we reduce the chances of the TØ outage incident recurring. Similarly, we eliminate individual problematic events in the recovery process. This practical approach keeps things manageable. The end result is to improve service and save money for customers and suppliers alike.


Solving Problems Permanently (SM)

JSawyer@SawyerPartnership.com
tel. 408-929-3622

Copyright ©2010.  The Sawyer Partnership.  All rights reserved.
Jeanne Sawyer, Ph.D.