The Savvy Aviator #47: Reliability-Centered Maintenance (Part 1)
More than 30 years ago, in 1974, the U.S. Department of Defense commissioned United Airlines to prepare a report on the techniques used by the airline industry to develop cost-efficient maintenance programs for civil airliners. The resulting report, titled Reliability-Centered Maintenance (F. S. Nowlan & H. Heap, National Technical Information Service, 1978), described a radically different way of looking at aircraft maintenance, based on rigorous analysis of traditional maintenance practices and evaluation of their shortcomings.
Traditionally, a major emphasis of aircraft maintenance programs had been defining specific overhaul and retirement intervals (TBOs) in order to achieve a satisfactory level of reliability. However, engineering analysis of reams of operational data from a number of major air carriers produced fascinating insights into the conditions that must exist for scheduled maintenance to be effective. Two discoveries were especially surprising:
- For a complex item (like an engine), scheduled overhaul has little effect on the overall reliability unless the item has a single, dominant, failure mode.
- There are many items for which there is simply no form of scheduled, preventive maintenance that is technically and economically feasible.
For example, it was determined that scheduled overhauls on turbine engines do not produce any reliability or economic benefit whatsoever, and that maintaining such powerplants strictly on-condition provides longer life, reduced maintenance costs, and improved reliability. (I am becoming convinced that the same is true of piston aircraft engines, and will discuss this further in Part 2 of this column next month.)
Reliability-centered maintenance (RCM) has resulted in immense cost savings for the airlines. Here are some examples:
- The initial maintenance program for the Douglas DC-8 (developed before the advent of RCM) required scheduled overhaul of 339 items. The larger and far more complex DC-10 (whose maintenance program was developed using RCM) required scheduled overhaul of just seven items, none of them engines.
- The pre-RCM DC-8 required 4,000,000 man-hours of structural inspections during its initial 20,000 hours of operation. The post-RCM Boeing 747 required only 66,000 man-hours over the same interval.
Not only are these cost savings huge, but they are achieved with no decrease in reliability. To the contrary, reliability has actually improved in most cases where emphasis shifted from time-directed maintenance (TDM) like scheduled overhauls to condition-directed maintenance (CDM).
In the remainder of this article, I will talk about some of the fundamental principles of RCM. Then, in a follow-up article next month, I will explore how these principles might be applied to our piston-powered airplanes.
Functions And Failures
Each system and component of an airplane performs one or more functions. The purpose of maintenance is to ensure that those items continue to perform their functions to an acceptable standard of performance. In some cases (e.g., ability to withstand g-loads), the acceptable standard of performance is established by the FAA during aircraft certification. In other cases (e.g., dispatch reliability), the acceptable standard is established by the aircraft owner or operator. We perform maintenance in order to ensure that the aircraft and each of its systems and components continue to meet performance standards.
Before we can establish a rational performance standard for an item, we need to examine the consequences when that item fails. For an item whose failure is likely to result in death or injury (e.g., the main wing spar), the likelihood of failure must be very close to zero. On the other hand, for an item whose failure is simply an inconvenience (e.g., the #2 communication radio), a higher failure probability is acceptable.
Often, the consequences of failure depend on the item's operating context. The failure of an alternator is much less critical if the aircraft has a standby alternator. The failure of an engine is considerably less critical on a four-engine airplane than on a single-engine airplane. The failure of a wing spar is less critical if the wing has a fail-safe, multiple-spar design.
RCM classifies the consequences of failure into four categories, in descending order of importance:
- Safety Consequences: A failure has safety consequences if it could kill or injure someone.
- Operational Consequences: A failure has operational consequences if it prevents the aircraft from being operated (i.e., it leaves the aircraft on the ground, or AOG).
- Hidden Consequences: A failure has hidden consequences if it is not apparent to the flight crew, but could cause a subsequent failure to have more serious consequences (e.g., failure of a standby alternator, voltage regulator or vacuum pump.)
- Non-Operational Consequences: Failures in this category are evident to the flight crew, but impact neither safety nor operation, and involve only the cost of repair (e.g., failure of the #2 comm radio.)
Feasible? Worth Doing?
RCM recognizes that maintenance resources should be focused on reducing failures that really matter. Therefore, RCM does not demand that all failures be prevented. Instead, it concentrates on preventing failure of items with safety or serious operational consequences and detecting hidden failures so that they can be corrected in a timely fashion. For failures with non-operational consequences, the optimum course of action is often reactive rather than proactive: Run to failure (RTF) and repair the item only when it fails.
For a failure with safety or serious operational consequences, RCM attempts to prevent the failure by identifying a proactive maintenance task to be undertaken before the failure occurs. Such proactive tasks may involve scheduled overhaul or replacement (TDM) or on-condition maintenance (CDM). However, in order for such a proactive task to be adopted, it must first be shown to be both technically feasible and worth doing:
- A task is considered technically feasible if it reduces the consequences of the associated failure to an extent that is acceptable to the owner or operator of the aircraft. (In other words, it gets the job done.)
- A task is considered worth doing if it reduces the consequences of the associated failure to an extent that justifies the direct and indirect costs of doing the task. (In other words, it is cost-effective.)
If it is not possible to find a proactive task that is both technically feasible and worth doing, then the failure must be dealt with by means of reactive maintenance tasks, including corrective maintenance (fix it only when it breaks), failure finding (scheduled functional checks to detect hidden failures), and redesign (e.g., install a backup).
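As a rough illustration, the selection logic described above can be sketched in a few lines of Python. This is a minimal sketch, not the formal RCM decision diagram: the names `Consequence` and `choose_strategy` are hypothetical, and the mapping from each consequence category to one particular fallback action is a simplifying assumption made for the example.

```python
from enum import Enum


class Consequence(Enum):
    """The four RCM consequence categories, in descending order of importance."""
    SAFETY = "safety"
    OPERATIONAL = "operational"
    HIDDEN = "hidden"
    NON_OPERATIONAL = "non-operational"


def choose_strategy(consequence, technically_feasible, worth_doing):
    """Return a maintenance strategy label for one failure mode."""
    # A proactive task (TDM or CDM) is adopted only if it passes both tests.
    if technically_feasible and worth_doing:
        return "proactive task"
    # Otherwise fall back to a reactive action. The category-to-action
    # mapping below is an illustrative assumption, not a rule from the report.
    if consequence is Consequence.SAFETY:
        return "redesign"
    if consequence is Consequence.HIDDEN:
        return "failure finding"
    return "corrective maintenance"
```

For example, `choose_strategy(Consequence.NON_OPERATIONAL, True, False)` returns `"corrective maintenance"`: a task that works but costs more than the failure it prevents is not adopted, and the item is simply run to failure.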
Many aircraft owners, mechanics, and even aeronautical engineers still believe that the best way to optimize reliability of complex aircraft systems (e.g., engines) is to do some kind of proactive maintenance on a routine basis. Conventional wisdom is that this should consist of overhauls or replacement at fixed intervals. The graph below illustrates this fixed-interval view of failure:
This traditional view assumes that most items operate reliably for some fixed period of time, after which the probability of failure starts to increase rapidly. It further assumes that analysis of failures will allow us to predict the useful life of an item and take scheduled action to overhaul or replace it before it reaches the "wear-out zone" where risk of failure becomes unacceptable.
It turns out that this traditional view is valid for certain simple items, and for complex items that have a single dominant failure mode. For example, the failure pattern illustrated in the graph above is appropriate when considering an item that normally fails from metal fatigue due to repetitive stress. Examples include a wing spar and a cylinder head.
In this traditional view, the probability of failure during the item's useful life is usually small but nonzero. Therefore, a modest number of premature failures can be expected before the item reaches the end of its useful life, at which point the probability of failure starts increasing.
For a critical item like a wing spar whose failure has extreme safety consequences (if it fails, you could die), the traditional approach is to establish a very conservative "safe life limit" that ensures the item is retired before the probability of failure reaches some very low threshold (e.g., 1 in 1,000 or 1 in 10,000). This is illustrated in the following graph:
However, RCM researchers have determined that very few nonstructural aircraft components and systems exhibit a pattern of failure that corresponds to the traditional view. For example, some complex items have a failure pattern that looks like the following:
In this pattern, known as the "bathtub curve," the item exhibits a high risk of failure when the item is first placed in service, commonly known as "infant mortality." Once the infant mortality period has passed, the probability of failure drops to a low level for the remainder of the item's useful life, after which it rises as the item is continued in service into the wear-out zone. This is commonly accepted to be the failure pattern associated with piston aircraft engines, although I believe that the significance of the wear-out zone for such engines tends to be greatly overrated. (Much more about this next month.)
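For readers who prefer numbers to pictures, the bathtub shape can be approximated by adding three failure-rate terms: one that falls with age (infant mortality), one that is constant (random failures), and one that rises with age (wear-out). The sketch below uses Weibull hazard functions for the falling and rising terms. This is a common reliability-engineering modeling choice, not something taken from the RCM report, and every parameter value is a made-up assumption chosen only to produce the shape.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Illustrative bathtub-shaped failure rate per hour at age t hours (t > 0).

    All parameter values are hypothetical and describe no real component.
    """
    infant_mortality = weibull_hazard(t, shape=0.5, scale=2000.0)  # falls with age
    random_failures = 1e-4                                         # independent of age
    wear_out = weibull_hazard(t, shape=5.0, scale=3000.0)          # rises with age
    return infant_mortality + random_failures + wear_out


# Failure rate early in life, in mid-life, and in the wear-out zone.
for age in (10, 1000, 3000):
    print(age, bathtub_hazard(age))
```

With these assumed parameters, the rate at 10 hours is several times the rate at 1,000 hours, and the rate climbs again by 3,000 hours. Dropping the `wear_out` term leaves a curve that starts high and then flattens, with no wear-out zone.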
The Six Failure Patterns
One of the most fascinating findings by RCM researchers is that there are actually six different failure patterns exhibited by various mechanical, electrical and electronic aircraft components. These are illustrated below:
- Pattern B depicts a constant or very slowly increasing failure probability, followed by a pronounced "wear-out zone" where the probability of failure increases rapidly. This corresponds to the traditional view of age-related failures, and is valid for structural items like wing spars, whose dominant failure mode is repetitive-stress metal fatigue. However, RCM studies of civil aircraft found that only 2 percent of nonstructural items actually conform to this failure pattern. For such items, a fixed-age limit (safe life or TBO) may be appropriate and desirable.
- Pattern A, the bathtub curve, accounts for another 4 percent of nonstructural items. This failure pattern depicts a high-risk infant-mortality period, followed by a constant or very slowly increasing failure probability, and then followed by a pronounced "wear-out zone." Such items may also benefit from a fixed-age limit, provided the number of premature failures is small enough that the majority of items survive to TBO.
- Pattern C depicts a failure probability that gradually increases with age, but with no obvious wear-out zone or useful life. Approximately 5 percent of nonstructural items exhibited this pattern. It is not usually desirable to impose a fixed-age limit on such items.
- Pattern D depicts a failure probability that is low when the item is new or newly overhauled, then increases to a constant level that continues as long as the item remains in service. This pattern accounted for 7 percent of nonstructural items.
- Pattern E depicts a constant failure probability; in other words, the conditional probability of failure is unrelated to age. Fourteen percent of nonstructural items exhibited this pattern.
- Finally, Pattern F depicts a high-risk infant-mortality period, followed by a constant or very slowly increasing failure probability, with no apparent wear-out zone or useful life. RCM studies showed that a whopping 68 percent of nonstructural items in civil aircraft exhibit this pattern, including turbine engines. (I believe that piston engines also exhibit this failure pattern.)
These findings contradict the traditional belief that reliability is predominantly age-related, and that the more often an item is overhauled or replaced, the less likely it is to fail. RCM studies show clearly that unless there is a dominant age-related failure mode (e.g., metal fatigue), age limits and scheduled overhauls do little or nothing to improve reliability. In fact, for the 72 percent of items that exhibit failure patterns A and F, scheduled overhaul or replacement will almost certainly increase overall failure rates by introducing infant-mortality risk into an otherwise reliable system or component.
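The arithmetic behind these figures is easy to check. The short sketch below tallies the percentages quoted in the list of six patterns. The dictionary and set names are my own labels, and which patterns count as having infant mortality or a wear-out zone is read directly from the pattern descriptions.

```python
# Percent of nonstructural items exhibiting each failure pattern.
PATTERN_SHARE = {"A": 4, "B": 2, "C": 5, "D": 7, "E": 14, "F": 68}

# Patterns that begin with a high-risk infant-mortality period.
INFANT_MORTALITY = {"A", "F"}
# Patterns that end in a pronounced wear-out zone.
WEAR_OUT_ZONE = {"A", "B"}

total = sum(PATTERN_SHARE.values())                              # 100
infant_share = sum(PATTERN_SHARE[p] for p in INFANT_MORTALITY)   # 72
wear_out_share = sum(PATTERN_SHARE[p] for p in WEAR_OUT_ZONE)    # 6

print(total, infant_share, wear_out_share)
```

In other words, only 6 percent of nonstructural items show the wear-out zone that a fixed overhaul interval is meant to avoid, while 72 percent are at their most failure-prone when new or freshly overhauled.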
RCM analysis demonstrates that fixed-age limits and scheduled overhauls (i.e., TDM) are technically feasible only if:
- There is an identifiable age (TBO) after which the item shows a rapid increase in the conditional probability of failure (i.e., an obvious wear-out zone); and
- Most of the items survive to that age; i.e., there are relatively few premature failures.
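Why does the wear-out zone matter so much? A simple model makes the point. The sketch below is an assumption-laden illustration, not a method from the RCM report: it assumes a Weibull cumulative hazard, assumes each failure is repaired without changing the item's age ("minimal repair"), and assumes each overhaul resets the item to as-new condition, infant mortality included. The function name and all parameter values are hypothetical.

```python
def expected_failures(horizon, overhaul_interval, shape, scale):
    """Expected number of failures over `horizon` hours.

    Model assumptions (illustrative only):
      - cumulative hazard H(t) = (t / scale) ** shape (Weibull);
      - a failure is repaired without changing the item's age;
      - an overhaul resets the item's age to zero.
    Pass overhaul_interval=None for no scheduled overhaul.
    """
    def cumulative_hazard(t):
        return (t / scale) ** shape

    if overhaul_interval is None:
        return cumulative_hazard(horizon)
    if overhaul_interval <= 0:
        raise ValueError("overhaul_interval must be positive")
    full_cycles, remainder = divmod(horizon, overhaul_interval)
    return full_cycles * cumulative_hazard(overhaul_interval) + cumulative_hazard(remainder)


# Compare no overhaul against overhaul every 2,000 hours over 6,000 hours.
for label, shape in (("falling rate", 0.5), ("constant rate", 1.0), ("rising rate", 3.0)):
    without = expected_failures(6000, None, shape, 2000)
    with_overhaul = expected_failures(6000, 2000, shape, 2000)
    print(label, round(without, 2), round(with_overhaul, 2))
```

Under these assumptions, scheduled overhaul reduces expected failures only when the failure rate rises with age (shape greater than 1). With a constant rate it changes nothing, and with a falling rate, a crude stand-in for the infant-mortality portion of patterns A and F, it makes matters worse.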
Condition-Directed Maintenance (CDM)
Although RCM has revealed that there is often little or no relationship between time-in-service and likelihood of failure, most failures give some sort of warning that they are about to occur. If we can detect these warnings in time, we may be able to take maintenance action to prevent the failure and avoid its consequences; see the following graph:
If a developing failure can be detected somewhere between point P (where it first becomes detectable) and point F (where total failure occurs), it may be possible to take action to prevent the consequences of the failure. Whether or not it is technically feasible to do this depends on how quickly the failure occurs, how far in advance it becomes detectable, and how difficult it is to detect the potential failure. Condition-directed maintenance (CDM) consists of checking for potential failures so that action can be taken to prevent functional failures before they occur.
The warning period between the occurrence of a detectable potential failure and its decay into a total functional failure is known as the "P-F interval." It may be measured in hours, cycles, calendar months, or any other appropriate metric. In order to detect potential failures reliably before they become functional failures, CDM tasks must be performed at intervals that are shorter than the P-F interval. In practice, it is usually sufficient to use a task interval equal to one-half of the P-F interval. For example, if the P-F interval is 100 hours, we need to inspect every 50 hours to ensure that we will detect a potential failure in plenty of time to avert a total failure.
Such condition monitoring is considered to be technically feasible if:
- It is possible to identify a well-defined and reliably detectable potential failure condition;
- The P-F interval is reasonably consistent and predictable; and
- It is practical to inspect or monitor the item at an interval approximately one-half the P-F interval.
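The one-half rule is simple enough to express in code. The sketch below uses hypothetical function names and assumes that an inspection always catches a potential failure once it has passed point P. It computes the inspection interval and the worst-case warning time, which occurs when the potential failure first becomes detectable just after an inspection and so is not found until the next one.

```python
def inspection_interval(pf_interval, fraction=0.5):
    """Inspection interval as a fraction of the P-F interval (same units)."""
    if pf_interval <= 0:
        raise ValueError("pf_interval must be positive")
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return pf_interval * fraction


def worst_case_warning(pf_interval, interval):
    """Shortest time left before functional failure when the problem is found.

    Assumes every inspection detects a potential failure that is past point P.
    """
    if interval >= pf_interval:
        raise ValueError("interval must be shorter than the P-F interval")
    return pf_interval - interval


interval = inspection_interval(100)           # 50.0 hours
warning = worst_case_warning(100, interval)   # 50.0 hours
print(interval, warning)
```

With a 100-hour P-F interval and inspections every 50 hours, at least 50 hours remain to take corrective action after the potential failure is found. Inspecting less often than this shrinks that margin, and inspecting at or beyond the P-F interval can miss the warning entirely.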
Let's Get Real
So much for the philosophy and theory of reliability-centered maintenance. Next month, I will explore how we can apply these principles to the maintenance of our piston-powered general aviation airplanes, with special emphasis on our piston aircraft engines.
Meantime, if you'd like to learn more about the technical aspects of RCM, I recommend you obtain a copy of John Moubray's book, Reliability-Centered Maintenance (Second edition. 1997. ISBN 0-8311-3078-4) from Amazon or Barnes & Noble. For a more aviation-oriented discussion, the original Nowlan & Heap report (which makes fascinating reading) is available from the National Technical Information Service, a division of the Department of Commerce. The NTIS document number is ADA066579.
See you next month.