Say Again? #35: Lessons Unlearned
The National Airspace System has lots of redundancies, and some would say they take more effort than is necessary. Why fly on airways when direct can be faster? Why read back the entire clearance or radio frequency change when you can just give your callsign? AVweb's Don Brown tells of a time when a perfect storm of at least 11 separate, small errors built up to bring two planes mere feet from each other at high speed.
Once again it's time for me to take a trip down the road to unpopularity and write an article about a subject we'd rather not talk about. I never have been able to leave well enough alone. It's probably the reason I don't get invited to dinner more often than I do. I like to talk about politics and religion, too.
Let's cut to the chase: On March 2, 1999, an L-1011 and a DC-10 came within 1,500 feet of each other near Salina, Kan. That would be 1,500 ft. laterally. They were both at FL330. You can read the short version of the incident here.
The first thing you'll notice about the report is this catchy phrase: "Incident Evaluation: CRITICAL." That's one way of putting it, I guess. I like to use a term that is a little less formal: Scary. Because that is what this type of incident does. It scares people. It scares the FAA as an institution. It scares them into action.
"Winning" the Lottery
Accordingly, the FAA made a videotape of the events leading up to the incident and started showing it to controllers. The videotape is entitled "Collision Course: What Are the Odds?" I remember another briefing just like it from when I was just starting out as a controller. Two other large aircraft had come within 150 ft. of each other (vertically). That one scared folks too; hence the briefing.
The NTSB investigated this incident and issued a report too.
I've used all three items (both reports and the videotape) to write this article. There were some questions raised about the accuracy of the videotape and, having been involved (unfortunately) in a few incidents that the NTSB investigated, I know that a report can never tell the whole story. But this is the best information available to me and I believe it's accurate enough that we can learn something from it all.
First and foremost, this incident represents a system failure. I want you to keep that term in mind while you're reading this. I have intentionally refrained from using the company names of the aircraft involved in this incident. I haven't used the names of the facilities involved, either. Yes, I know all that information is in the reports but I want you to stay focused on the system. I don't believe there is any mistake made by an individual in this incident that hasn't been made by other individuals hundreds (if not thousands) of times. Try to stay focused on the system, not the individuals.
The first aircraft in this incident is a DC-10 flying from Portland, Oreg., to Memphis, Tenn., at FL330. The second is an L-1011 that departed Los Angeles, Calif., en route to Indianapolis, Ind., also at FL330. At the time, the aircraft involved weren't required to have TCAS (Traffic Collision Avoidance System) and neither aircraft was so equipped. Due to a number of small and seemingly inconsequential errors, neither aircraft was in communication with ATC when their flight paths crossed. The flight crew of the DC-10 only became aware of the L-1011 after encountering its wake turbulence. The NTSB report mentions the L-1011 crew detected the DC-10 visually at approximately 1/2 mile.
The First Roll
According to the FAA's analysis, the first error occurred when a controller failed to switch the DC-10 to the next sector's frequency. This is a fairly common occurrence. Especially if you happen to be me. I forget to switch airplanes at least 2-3 times a day. Minimum. It's one of my greatest weaknesses. The trick is discovering the error. Usually, the next sector calls me to remind me to switch the aircraft. The other "backup" (if you will) is the fact that I mark strips. I try to make myself methodically review my Flight Progress Strips on a regular basis. That's easier said than done, especially when it's busy. You'd be surprised how many times I discover I haven't switched someone using this method. It's a part of the system and it works.
Unfortunately, neither portion of the system worked in this instance. The FAA's videotape makes mention of the fact that the controller was distracted while trying to adapt to a new piece of equipment called DSR. DSR is the Display System Replacement. In short, it's the new radar scopes that were installed during this time frame. I remember my first few days using the system. I was distracted too.
In the case of the L-1011, again, it was the most common of errors cited by the FAA. The controller switched the L-1011 to the wrong frequency. The L-1011 was switched to 125.67 instead of 127.65. It's enlightening to note that 125.67 is a frequency used by another sector on the opposite side of the sector in question. I can relate because I have the same situation in my airspace. The sector north of the BRISTOL sector where I work uses 121.32. The sector to the south uses 121.35. I wish I had a nickel for every time I transposed those two frequencies.
Just a quick recap for those who might feel a little lost ... we've got two different airplanes that are NORDO (no radio) because of two of the most common errors in ATC: A controller forgot to switch one airplane to a new frequency, and another controller sent the second plane to the wrong frequency. I see both types of errors every single day. As a matter of fact, I usually make both of these errors every day myself. The FAA hasn't made a training video about me (yet) so what else went wrong?
The answer is plenty. The initial communications error with the DC-10 occurred two centers away from where the near mid-air collision (NMAC) occurred. I'll get to why the error wasn't discovered and corrected in a moment but first I want to talk a little bit more about why it happened.
Take a look at the route of flight (as depicted in the training video):
PDX ./. DLN201070..GQE.GQE3.MEM
In plain English it reads:
Portland, Oreg., as the departure point
A route truncation symbol (./.)
The Dillon, Mont., VOR 201 radial 70 mile DME fix
Direct Gilmore, Ark., VOR
And the Gilmore3 arrival to Memphis.
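For anyone who wants to see that decoding done mechanically, here is a rough sketch in Python. It is only an illustration: the function names, the pattern used to spot a fix-radial-distance token (three letters, a three-digit radial, a three-digit distance) and the choice to treat the dots purely as separators are my assumptions, not anything taken from FAA software. The place names are the ones given in the plain-English reading above.

```python
import re

# Identifier meanings taken from the article's own plain-English reading.
NAMES = {
    "PDX": "Portland, Oreg.",
    "DLN": "Dillon, Mont., VOR",
    "GQE": "Gilmore, Ark., VOR",
    "GQE3": "Gilmore3 arrival",
    "MEM": "Memphis",
}

# Assumed token shape: 3-letter identifier, 3-digit radial, 3-digit distance.
FRD = re.compile(r"^([A-Z]{3})(\d{3})(\d{3})$")

def describe(token):
    """Turn one route token into a plain-English phrase."""
    if token == "./.":
        return "route truncation"
    match = FRD.match(token)
    if match:
        ident, radial, dist = match.groups()
        return f"{NAMES.get(ident, ident)} {radial} radial {int(dist)} mile DME fix"
    return NAMES.get(token, token)

def decode_route(route):
    """Split a route string on spaces and dots, keeping './.' whole."""
    parts = []
    for chunk in route.split():
        if chunk == "./.":
            parts.append(describe(chunk))
        else:
            parts.extend(describe(t) for t in re.split(r"\.+", chunk) if t)
    return parts

for line in decode_route("PDX ./. DLN201070..GQE.GQE3.MEM"):
    print(line)
```

Unknown identifiers simply pass through unchanged, so the sketch never guesses at a fix it has not been told about.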
Whether or not that route is an accurate recreation, I tend to believe that the aircraft was indeed on a direct routing.
The reason I believe that is simple. The initial communication error was a little more complicated than it appears. The DC-10 was cutting through the corners of several sectors. Airways don't do that. Sector borders are designed around (or to contain) airways. Because the DC-10 was cutting through the corner it led to the "I don't need to talk to him" syndrome. The sector controller receiving the aircraft (the handoff) calls back and says "I don't need to talk to him; put him on the next sector." That controller immediately flashes the handoff to the next sector and the controller that is talking to the aircraft continues to keep him on his frequency until the next sector takes the handoff.
Only this time, there's a twist. The next sector doesn't want to talk to the DC-10 either because the aircraft is only going to be in his airspace for a few miles too. That sector adds fuel to the fire by calling the next sector and has them "steal" the handoff. Confused yet?
OK, let's run through it again. "Specialist A" (that's how we refer to controllers involved in an error) is in communication with the aircraft and flashes the handoff to "Specialist B." "Specialist B" flashes the handoff to "Specialist C." "Specialist C" calls "Specialist D" and has "Specialist D" steal the handoff. "Specialist A" is now supposed to switch the aircraft to "Specialist D." But he's probably forgotten about the aircraft by this time. Keep in mind that "Specialist A" is already struggling with a new piece of equipment. He needs this additional distraction like you need to fly an ILS back course at maximum speed on your first day in a new airplane.
"Specialist D" fails to notice that he isn't talking to the aircraft. The DC-10 pilots are blissfully ignorant that all this has transpired (of course) and continue on their way to Memphis listening to the sounds of "Specialist A's" radio chatter with other aircraft.
The Long Shot
Before anyone thinks, "Wow, no wonder it takes so long to become a controller ... that's complicated," it isn't supposed to be this complicated. The only reason it got this complicated is because someone in Montana decided to clear the DC-10 direct Gilmore, Ark. It's either that or somebody needs a major airspace redesign. I'll put my money on the "direct" explanation.
In my opinion, this direct clearance was as much a factor in this incident as any other "link" in the chain of events. That statement will probably bring howls of protest from the "Direct is Good" crowd. So be it.
What I find significant is that it wasn't mentioned as a factor. I'm reminded of the old software joke: It isn't a software "bug," it's a software "enhancement." Pilots want direct clearances, the FAA encourages controllers to give them, and the controllers do. Thousands upon thousands of times a day. It doesn't change the fact that this system wasn't designed for unlimited direct routings. The fact that we get away with giving them the vast majority of the time doesn't change it either. We get away with forgetting to switch an airplane most of the time too, yet that was cited as a factor in this incident. I see the direct route as being as significant (no more, but certainly no less) as any other factor in this event. I've beaten this horse before, so we'll move on.
Racing the Clock
Just as a reminder, we're still several minutes away from this event becoming a NMAC at this point. Although radio contact with the DC-10 has been lost (and the initial opportunity to correct the error has also been lost) there is still plenty of time to discover and correct the error. In fact, there's enough time to discover, forget, and discover again that the DC-10 is now NORDO.
As a matter of fact, that is exactly what happened. A controller down the line recognized that he wasn't talking to the DC-10. The initial effort to re-contact the airplane (through ARINC -- a commercial voice/data service used by many airlines) was unsuccessful. Due to a failure in coordination, a later controller had to rediscover that the DC-10 was NORDO when he tried to turn the DC-10 around the L-1011.
Although none of the sources I've used to research this article describe the process, I'm sure it quickly becomes obvious that the L-1011 is now NORDO also. When the controller was unsuccessful in getting the DC-10 to turn, I feel certain an effort was made to move the L-1011 (which was in another sector at the time). The gravity of the situation was now becoming painfully obvious to all the controllers.
Numerous attempts were made and different methods were used to try to reestablish radio contact with either aircraft. ARINC was called again to try to relay a message to both aircraft; however, neither aircraft involved used this service. Both aircraft were called on the emergency frequency (121.5). An effort was made to contact the dispatchers for each company by telephone to relay a message. One was a wrong number and the other used an automated phone answering service, which placed the supervisor making the telephone call on hold.
All efforts proved unsuccessful and the aircraft passed each other at 1640Z. While I'm mentioning the time, the event the FAA points to as the initial event (failure to switch the DC-10) occurred at 1548Z. The L-1011 was switched to the wrong frequency at 1617Z. I'll let you figure out the times each aircraft was out of contact.
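If you'd rather check your arithmetic than trust it, a few lines of Python will do it. This is just a sketch: the helper name is made up for illustration, it assumes all three times fall on the same day, and it measures from each cited error to the moment the aircraft passed, which is only an approximation of how long each crew was actually out of contact.

```python
def minutes_between(start_z, end_z):
    """Minutes between two same-day Zulu times given as 'HHMM' strings."""
    def to_minutes(hhmm):
        return int(hhmm[:2]) * 60 + int(hhmm[2:])
    return to_minutes(end_z) - to_minutes(start_z)

PASSING = "1640"  # time the aircraft passed each other

# Times of the two errors cited by the FAA, as given in the article.
errors = {"DC-10": "1548", "L-1011": "1617"}

for aircraft, error_time in errors.items():
    elapsed = minutes_between(error_time, PASSING)
    print(f"{aircraft}: {elapsed} minutes from the error to the pass")
```

That works out to 52 minutes for the DC-10 and 23 minutes for the L-1011.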
I think it's worth noting how contact was eventually reestablished with both aircraft. The DC-10 received an ACARS (Aircraft Communications Addressing and Reporting System) message from their company, as they were passing the L-1011, advising them of the frequency to contact ATC. The L-1011 did it the old-fashioned way. They contacted FSS on 122.2 and got their help in finding the right frequency to contact ATC.
In that I am hoping you might learn something from all this, I want to mention something else. Back in the old days, when we had someone go NORDO, we would call FSS and have them broadcast a message on the VOR voice channels. I'm not sure if it would have worked (or was even tried) in this incident. A lot of people are navigating without using VORs these days. Still, FSS is a great resource when things go wrong. For controllers and pilots.
For those who are still with me and want even more information about this incident, you can read the full report, including a CVR transcript.
If you make it to the bottom of that report you'll note that the NTSB cited this incident in their recommendation to require TCAS on large cargo aircraft.
So, have you learned anything from this incident? I hope so. Thinking of buying TCAS? That's fine but I hope you realize that TCAS, by itself, isn't a cure-all. There will be another incident report out soon (from the German equivalent of the NTSB) that will make that painfully obvious. I know that you might be tired of hearing me talk about "the system" but I can't overstress how important the concept is to your safety.
Hedging Your Bet
No person, no machine -- no system -- is infallible. All of them have limitations. You can't continuously push a machine past its limits and not expect it to fail. You can't give people an unusual number of tasks to accomplish on a regular basis and not expect them to fail. We can't make an unlimited number of errors -- even small ones -- and not expect the system to fail. The FAA, in the training video, notes 11 separate "links" (or errors) in the chain of events. Each error was what we tend to think of as a "small" error. But as the magnitude of this event sinks in you realize that there isn't any such thing as a "small" error in this business.
It seems like a contradiction but we must continually strive to remain error free while acknowledging the fact that we never will eliminate all the errors. This is the reason I urge you to take a "systems approach" to your usage of the National Airspace System. I'm not qualified to tell you how to do that as a pilot but there are people out there who are.
Critically evaluate the way you fly -- your habits, your procedures, and your techniques. See if you have any redundancies -- any "backups"-- built into your "system." One of the most important areas you need to evaluate is the system you use to recognize task saturation.
Becoming a Better Player
As an example, these are the signals I use to alert me that I am reaching task saturation. First is when I start missing radio transmissions. If I'm continually missing what a pilot is saying because I'm too busy trying to think, I realize that I'm reaching an unacceptable level. Next is missing handoffs. If I have more than one or two controllers that have to call and remind me to take an automated handoff, I take that as a signal that I need to start shedding some workload. I get another person to help (a D-side or a Tracker) or take some other action (such as refusing requests for VFR advisories) to lessen the workload. Finally, the last danger sign for me is if I'm unable to complete my strip marking. If I reach that state, I'll start taking drastic measures (such as stopping the departures) until I can get the workload down to an acceptable (i.e., safe) level.
See if you can take my examples and translate them into clear, recognizable signs that you are reaching your task saturation point. Recognizing that you are "busy" isn't enough. Break it down, giving yourself clear signs of just how busy you are. When you recognize one of those signs, use it to remind yourself you've reached a point where you are susceptible to making an error. Develop a system that allows you a way to discover you've made an error and fix it before it becomes a serious problem. Don't let these lessons go unlearned.
Have a safe flight.
Facility Safety Representative
National Air Traffic Controllers Association