The Savvy Aviator #48: Reliability-Centered Maintenance (Part 2)

For three decades, the airlines and military have been using the principles of reliability-centered maintenance to achieve drastic reductions in maintenance cost while actually improving reliability (discussed in last month's column). The lion's share of this improvement in maintenance cost-effectiveness has come from a major shift away from fixed overhaul, replacement or retirement intervals towards on-condition maintenance protocols.

Unfortunately, such RCM-inspired maintenance practices do not seem to have trickled down to the low end of the aviation food chain: piston-powered airplanes. Most aircraft owners dutifully overhaul or replace their piston aircraft engines when they reach TCM's or Lycoming's recommended TBO (at $20,000 to $35,000 a pop), even if the engines are running just fine with no signs of mechanical problems. They overhaul their propellers every five or six years because that's what Hartzell or McCauley recommends. Some operators overhaul or replace various airframe components at fixed intervals, regardless of condition, because that's what the aircraft service manual suggests. Some prophylactically replace vacuum pumps and alternators every 500 hours because some A&P told them it was a good idea.

Do such maintenance practices make sense? After analyzing reams of operational data from a number of major air carriers, RCM researchers concluded that fixed-interval overhaul or replacement rarely makes sense, and often makes things worse by increasing cost while decreasing reliability.

When Does TBO Make Sense?

For fixed-interval overhaul or replacement of a component to make sense, the component needs to have a failure pattern that looks something like the graph below, where the component can be expected to operate reliably for some fixed useful life, beyond which the probability of failure starts to increase rapidly to unacceptable levels.

Fixed-interval overhaul or replacement makes sense for components that have failure patterns that look like this.

Do our piston aircraft engines exhibit this kind of failure pattern? No, they do not. It's easy to demonstrate that these engines suffer the highest risk of catastrophic failure not when they pass TBO, but rather when they're fresh out of the TCM or Lycoming factory or a field overhaul shop. Take a look at these histograms derived from NTSB data on 180 engine-failure accidents for the five-year period from 2001 through 2005.

Small piston airplane accidents in 2001 through 2005 attributed by the NTSB to engine failure, by hours (top) and years (bottom) since engine overhaul. (Thanks to Dr. Nathan Ulrich for these graphs.)

This NTSB data can't tell us much about the risk of engine failure beyond TBO, because relatively few piston aircraft engines are allowed to remain in service beyond TBO (and we don't even have good data on how many are). What it does show quite clearly, however, is that engines fail with disturbing frequency during their first few years and few hundred hours in service after manufacture, rebuild or overhaul. The conventional wisdom is that our piston aircraft engines have a failure pattern that looks more like the following.

Piston aircraft engines are thought to have a failure pattern that looks like this.

This is the well-known "bathtub curve" in which the component exhibits a high risk of failure when first placed into service (so-called "infant mortality"), after which the probability of failure drops to a low level for the remainder of the component's useful life, then starts to rise again as the component is continued in service into the wear-out zone.
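One common way to picture a bathtub curve is as the sum of three failure-rate terms: one that falls with age, one that stays flat, and one that rises with age. The Python sketch below does this with Weibull hazard functions. It is illustrative only: the choice of Weibull terms and every shape and scale value are assumptions made up for this example, not measurements from any engine.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull distribution at age t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Sum of three hazard terms; all parameters are invented for illustration."""
    infant = weibull_hazard(t, shape=0.5, scale=20000.0)    # falls with age
    random_ = weibull_hazard(t, shape=1.0, scale=50000.0)   # constant with age
    wear_out = weibull_hazard(t, shape=5.0, scale=4000.0)   # rises with age
    return infant + random_ + wear_out


# Failure rate per hour at an early, a middle, and a late point in service.
for hours in (10, 1000, 3500):
    print(hours, bathtub_hazard(hours))
```

With these made-up parameters the rate is high at 10 hours, lower at 1,000 hours, and higher again at 3,500 hours, which is the bathtub shape described above.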

Does fixed-interval overhaul or replacement make sense for an engine with a bathtub-curve failure pattern? That's a tricky question, because engine overhaul at TBO becomes a two-edged sword. On one hand, it keeps us out of the presumptive wear-out zone where the probability of engine failure is thought to begin increasing toward unacceptable levels (although we don't have much data to support the contention that such an age-related wear-out zone actually exists). On the other hand, it puts us right back inside the infant mortality window where the data tells us clearly that the probability of engine failure is disturbingly high.
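The two-edged sword can be made concrete with a toy calculation. The sketch below is a hypothetical illustration, not an analysis of real engines: it assumes each failure mode follows a Weibull distribution, that an overhaul resets the engine to age zero (infant mortality included), and every parameter value is invented. It compares total risk exposure (cumulative hazard) over 4,000 hours with and without an overhaul at 2,000 hours.

```python
def cumulative_hazard(t, shape, scale):
    """Weibull cumulative hazard accumulated from age 0 to age t."""
    return (t / scale) ** shape


def exposure(total_hours, overhaul_every, modes):
    """Total cumulative hazard over total_hours, resetting age at each overhaul.

    modes is a list of (shape, scale) pairs, one per failure mode.
    """
    full_cycles, remainder = divmod(total_hours, overhaul_every)
    per_cycle = sum(cumulative_hazard(overhaul_every, k, s) for k, s in modes)
    tail = sum(cumulative_hazard(remainder, k, s) for k, s in modes)
    return full_cycles * per_cycle + tail


# Invented parameters: the same infant-mortality mode, two wear-out strengths.
INFANT = (0.5, 20000.0)
STRONG_WEAROUT = [INFANT, (5.0, 4000.0)]
WEAK_WEAROUT = [INFANT, (5.0, 8000.0)]

for name, modes in (("strong", STRONG_WEAROUT), ("weak", WEAK_WEAROUT)):
    print(name,
          "overhaul at 2000:", exposure(4000, 2000, modes),
          "no overhaul:", exposure(4000, 4000, modes))
```

With these invented numbers, the overhaul lowers total exposure when the wear-out mode is strong and raises it when wear-out is weak. Which case describes real piston engines depends on exactly the post-TBO data that is in short supply.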

Would you be comfortable taking your family up in your piston-powered single-engine airplane with an engine at five hours since major overhaul (SMOH)? At night? Over rough terrain or water? In IMC? How about at 10 hours SMOH? Or 25 hours SMOH? (These are not easy questions.)

Does overhauling an apparently healthy engine at some fixed TBO help reliability more than it hurts? We can't be sure because there's so little data about the reliability of piston aircraft engines when they are operated beyond TBO (because so few of them are). But the evidence I've seen strongly suggests to me that fixed-interval overhaul for these engines doesn't make sense.

The fact that we have so little data about engines operated beyond TBO illustrates a major obstacle to the adoption of RCM-inspired on-condition maintenance in areas where fixed-interval overhaul has been the norm. RCM researchers call this "The Resnikoff Conundrum," which simply states that in order to collect failure data, there must be equipment failures. But failures of critical items such as engines are considered unacceptable because such failures can cause injury and death. This means that the maintenance program for a critical item must be designed without the benefit of data on the very failures that the program is meant to prevent.

(The FAA's decades-long opposition to rescinding "the age 60 rule" for airline pilots is a perfect example of The Resnikoff Conundrum. Experts in aviation medicine have long been unanimous that there's no scientific basis for the FAA's venerable policy of forcing airline pilots to retire at age 60. The FAA's longstanding argument has been that it has no safety data showing that allowing airline pilots to continue flying beyond age 60 is safe. Well, duh!)

We do know without doubt that fixed-interval overhaul is counterproductive for turbine aircraft engines, because the airlines and military started phasing out such fixed-interval overhauls in favor of on-condition maintenance decades ago. So we have tons of data about high-time turbine engines (some astonishingly high-time), and analysis of that data makes it crystal clear that fixed-interval overhaul hurts reliability more than it helps, not to mention that it greatly increases maintenance cost and downtime.

My experience and intuition lead me to believe that the same is true for piston aircraft engines, but we simply don't have enough operational data on very high-time piston engines to prove it.

Is This The Right Question?

Upon further reflection, I would argue that this isn't even the right question for us to be asking. That's because a piston aircraft engine isn't a single component with a single dominant failure mode and a well-defined failure pattern (like the bathtub curve). The NTSB reports reveal that engine failures occur for lots of different reasons. A piston engine is a complex system made up of hundreds of diverse components -- crankcase, crankshaft, camshaft, connecting rods, pistons, piston rings, cylinder barrels, cylinder heads, valves, valve guides, rocker arms, pushrods, gears, bearings, through-bolts, magnetos, spark plugs, etc. -- each of which has its own unique failure modes and patterns. An engine failure can be caused by the failure of any of these parts, and each of these parts has distinctively different failure characteristics.

To gain any real insight into how, when, and how often engines fail -- and how best to maintain the engine to prevent those failures -- we really need to analyze the failure modes and patterns of each of the engine's critical component parts, rather than try to lump them all into a single failure pattern for the engine as a whole.

Exhaust valves don't always survive to TBO. If we fail to catch a potential valve failure via compression test, borescope inspection, or engine monitor data, we risk a total failure ("swallowed valve").

Consider exhaust valves, for example. We know from experience that exhaust valves often don't survive to TBO. When they begin to fail, sometimes we're lucky enough to catch the potential failure at annual inspection before complete functional failure occurs (i.e., a "swallowed valve") by means of a compression test or borescope inspection. If the aircraft is equipped with a digital engine monitor and if the pilot knows how to interpret it, sometimes we can catch a potential exhaust valve failure before the valve fails completely. But if we're not lucky and the valve fails in-flight, it's usually a mayday situation and sometimes results in an off-airport landing or worse.

Does this mean that we should reduce engine TBO to something less than typical exhaust valve life? Should we be overhauling our engines every 500 or 1000 hours to prevent exhaust valve failures? Of course not!

Why not? For one thing, repairing a failing exhaust valve doesn't necessitate removing the engine from the airplane and tearing it down; the repair can be done simply by removing a cylinder. For another thing, we've got excellent tools (like borescopes and digital engine monitors) that allow us to reliably detect potential exhaust valve failures before complete failure occurs -- provided those tools are used properly and often enough. (I would estimate that the time between being able to detect a failing exhaust valve and it actually failing -- the P-F interval I mentioned last month -- is something on the order of 100 hours, which suggests that perhaps we should be inspecting the valves with a borescope every 50 hours -- especially in airplanes that are not equipped with a digital engine monitor.)
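The arithmetic behind that 50-hour suggestion can be sketched in a few lines of Python. This is only an illustration: the 100-hour P-F interval is the rough estimate given above rather than measured data, and the function names are made up for this example.

```python
def worst_case_warning_hours(pf_interval, inspection_interval):
    """Hours left before functional failure at the first inspection after the
    defect becomes detectable, in the worst case (defect appears just after
    the previous inspection). A negative result means the part can fail
    before anyone looks at it."""
    return pf_interval - inspection_interval


def guaranteed_looks(pf_interval, inspection_interval):
    """Minimum number of inspections that fall inside any P-F window."""
    return int(pf_interval // inspection_interval)


PF_INTERVAL = 100  # hours; rough estimate from the text, not measured data

for interval in (50, 100, 200):
    print(interval,
          guaranteed_looks(PF_INTERVAL, interval),
          worst_case_warning_hours(PF_INTERVAL, interval))
```

Inspecting at half the P-F interval guarantees two looks at a failing valve and leaves at least 50 hours of margin after the first one. Inspecting at an interval longer than the P-F interval guarantees nothing.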

Failure Analysis

With this in mind, let's examine some of the most critical components of our piston aircraft engine, think about how those components fail and what consequences their failures have on engine operation, and what sort of maintenance actions we might take to deal with those failures in a manner that is both feasible and worth doing.

Crankshaft

It's hard to think of a more serious piston engine failure mode than a crankshaft failure. The engine has only one crankshaft, and if it fails the engine stops producing power instantly and totally. Crankshaft failures result in safety consequences that we simply cannot tolerate.

Nevertheless, crankshafts are not normally replaced even at engine major overhaul. In fact, they are retired very rarely. Lycoming claims that their crankshafts typically remain in service for 14,000 hours and well over 50 calendar years! According to Lycoming, a crankshaft typically remains in service for 7,000 hours until it fails to pass dimensional tests at overhaul, whereupon the crankshaft is machined to approved undersize dimensions and continues in service for another 7,000 hours until it flunks dimensional tests at overhaul a second time. TCM hasn't published this sort of data about their crankshafts, but I would guess that TCM crankshafts experience a very similar life cycle.

There are three kinds of crankshaft failure: (1) infant-mortality failures due to improper material or manufacture; (2) failures following unreported prop strikes; and (3) failures secondary to oil starvation and/or bearing failure.

We've seen a rash of the first type of failures in recent years. Both TCM and Lycoming have had major recalls of crankshafts that were either forged from bad steel or were physically damaged during manufacture. These failures invariably occur within the first 200 hours after a newly manufactured crankshaft goes into service -- a classic infant-mortality failure. History shows that if a crankshaft survives the first 200 hours, we can be confident that it was manufactured correctly and should perform reliably for many engine TBOs.

The type 2 failures seem to be getting rare because owners and mechanics are becoming smarter about the high risk of operating an engine after a prop strike. Both TCM and Lycoming state that any incident that damages the propeller enough that it has to be removed for repair warrants an engine teardown inspection, including both magnetic particle and ultrasonic inspection of the crankshaft for surface and subsurface cracks. This applies even to prop damage that occurs when the engine isn't running. Insurance will pay for the teardown and any necessary repairs, no questions asked, so there's no reason for an owner to hesitate to do it, and there is a risk of severe consequences if he does not.

That leaves us with type 3 failures due to oil starvation and/or bearing failure. We'll talk about these when we look at oil pumps and bearings.

Crankcase

Like crankshafts, crankcases are not usually replaced at major overhaul and often provide reliable service for many TBOs. However, if the crankcase remains in service long enough, it will eventually develop cracks. Some small cracks in low-stress areas of the crankcase are acceptable, but most crankcase cracks are considered airworthiness items that require the engine to be torn down.

The good news is that crankcase cracks tend to propagate quite slowly, so a detailed visual inspection once a year is generally considered to be sufficient to detect such cracks before they pose a threat to safety. Catastrophic engine failures caused by undetected crankcase cracks are extremely rare.

Although crankcases are not normally replaced at overhaul, they do go through a reconditioning process that involves honing the parting surfaces until they're perfectly flat and smooth, and then align-boring the crankshaft and camshaft journals. This process can only be repeated a finite number of times before the crankcase no longer meets dimensional specifications. So crankcases that do not develop fatigue cracks will eventually be retired for dimensional reasons. But their typical useful life is many TBOs.

Camshaft and Lifters

A severely spalled cam lobe. This camshaft failure originated from corrosion pitting during an eight-month period of engine disuse.

The interface between the cam lobes and lifter faces endures more pressure and friction than any other moving parts in the engine. The cam lobes and lifter faces must be extremely hard and perfectly smooth in order to function and survive. Even the slightest defects in these surfaces (such as tiny corrosion pits caused by periods of disuse or acid build-up in the oil) can lead to rapid destruction (spalling) of the cam and lifters, and the need for a premature teardown. Cam and lifter spalling is one of the most common reasons that engines fail to make TBO. This problem affects primarily owner-flown airplanes because they tend to be flown irregularly and to sit unflown for significant periods of time.

The good news is that camshaft and lifter problems seldom cause catastrophic engine failures. The engine will usually continue to make power even with severely spalled cam lobes and lifter faces that have lost quite a lot of metal, although there will be some loss of performance. Typically, the problem is discovered at an oil change when the oil filter is cut open and found to contain excessive amounts of ferrous metal shed by the disintegrating cam and lifters.

If the oil filter isn't cut open and inspected on a regular basis, the cam and lifter failure may progress undetected to the point that ferrous metal circulates through the oil system and contaminates the engine's bearings. In rare cases, this can cause catastrophic engine failure. The best way to prevent such failures is regular inspection of the oil filter (at least every 50 hours). Regular laboratory oil analysis may also be helpful in detecting such problems early.

If the engine is flown regularly so that the cam and lifters do not corrode or scuff, the cam and lifters can remain in pristine condition for thousands of hours. Many big-name overhaul shops routinely replace the cam and lifters with new ones at major overhaul, although some shops use reground cams and lifters.

Gears

The engine has lots of gears: crankshaft and camshaft gears, oil pump and fuel pump drive gears, magneto and accessory drive gears, prop governor drive gears and sometimes alternator drive gears. These gears typically have a very long useful life and are not usually replaced at major overhaul unless obvious damage is found. Gears almost never cause catastrophic engine failures.

Oil Pump

Failure of the oil pump is very occasionally responsible for catastrophic engine failures. If oil pressure is lost, the engine will seize quite quickly. The oil pump is a very simple gear pump, with one driven gear and one idler gear housed inside a close-tolerance housing. It is usually trouble-free, but when trouble does occur it usually starts making metal long before complete failure occurs. Regular oil filter inspection and oil analysis will normally detect oil pump problems long before they reach the failure point.

Bearings

Bearing failure is responsible for a significant number of catastrophic engine failures. Under normal circumstances, bearings have a very long useful life. They are always replaced at major overhaul (it's required), but it's quite typical for bearings that are removed at overhaul to be in excellent condition with very little measurable wear. There are three reasons that bearings fail prematurely: (1) They become contaminated with metal from some other failure (e.g., cam/lifter spalling); (2) They become oil-starved when oil pressure is lost; or (3) They become oil-starved because they shift position and their oil supply holes become misaligned ("spun bearing").

Type 1 failures (contamination) can mostly be prevented by using a full-flow oil filter, and inspecting the filter for metal on a regular basis. So long as the filter is changed before its filtering capacity is exceeded, particles of wear metals will be caught by the filter and won't contaminate the bearings. If a significant amount of metal is found in the filter, the engine should not be operated until the source of the metal is found and corrected, and checks performed to make sure the bearings haven't become contaminated.

Type 2 failures (loss of oil pressure) are fairly rare. Pilots tend to be well-trained to respond to loss of oil pressure by reducing power and landing at the first opportunity. Bearings will continue to function properly even with fairly low oil pressure (e.g., 10 psi).

Type 3 failures (spun bearings) are usually infant mortality failures, either shortly after an engine is overhauled (because the engine was not assembled properly), or shortly after cylinder replacement (because the crankshaft was rotated while the through-bolts were not torqued up, or the through-bolts were not properly re-torqued). They can also occur after a long period of crankcase fretting (which is typically detectable through oil filter inspection and oil analysis) or after extreme cold-starts without proper pre-heating.

Type 1 and 2 bearing failures are secondary to some other failure that contaminates or shuts off the bearing's oil supply. Type 3 bearing failures are primary, but usually unrelated to hours or years since overhaul.

Connecting Rods

Connecting rod failure is responsible for a significant number of catastrophic engine failures. When a connecting rod fails in flight, it often punches a hole in the crankcase and causes loss of engine oil and subsequent oil starvation. Rod failures have also been known to result in camshaft breakage. The result is invariably a rapid loss of engine power.

Connecting rods usually have a very long useful life, and are not normally replaced at major overhaul. (The connecting rod bearings, like all bearings, are always replaced at overhaul.) Many connecting rod failures are infant mortality failures caused by improper torquing of the rod cap bolts. Rod failures can also be caused by failure of the rod bearings (for the reasons discussed earlier under "bearings"), and these are usually unrelated to hours or years since overhaul.

Pistons and Rings

Piston and ring failures can cause catastrophic engine failures, usually involving only partial power loss but occasionally total power loss. Piston and ring failures are of two types: (1) infant mortality failures due to improper manufacture or installation; and (2) heat-distress failures caused by pre-ignition or destructive detonation events. Heat-distress failures can be caused by contaminated fuel or improper engine operation, but are unrelated to hours or years since overhaul. Use of a digital engine monitor can usually detect pre-ignition or destructive detonation episodes and allow the pilot to take corrective action before heat-distress damage occurs. Unless there's significant collateral damage, damaged pistons and rings can be replaced without engine removal or teardown.

Cylinders

Cylinder failures can cause catastrophic engine failures, usually involving only partial power loss but occasionally total power loss. Cylinders consist of a forged steel barrel mated to an aluminum alloy head. Cylinder barrels normally wear slowly, and excessive wear is detected at annual inspection by means of compression tests and borescope inspections. However, cylinder heads can suffer fatigue failures, and occasionally the head can separate from the barrel. Cylinders with worn barrels can be reconditioned using a number of processes (e.g., oversizing, plating, rebarreling) and kept in service for several engine TBOs. However, most big-name overhaul shops install new cylinders at major overhaul, with reconditioned cylinders used primarily for life extension. Cylinders can be repaired or replaced without engine removal or teardown.

Valves And Valve Guides

As discussed earlier, it is quite common for valves and guides (particularly exhaust valves and guides) to develop problems well short of TBO. Potential valve problems can usually be detected prior to complete failure by means of annual compression tests and borescope inspections, and continuously by means of a digital engine monitor (provided the pilot knows how to interpret the engine monitor data). If a valve fails completely, a significant power loss can occur that occasionally results in an off-airport landing. Failing valves and guides can be replaced without engine removal or teardown.

Rocker Arms And Pushrods

Rocker arms and pushrods (which operate the valves) typically have a very long useful life and are not routinely replaced at major overhaul. (Rocker arm bushings are always replaced at overhaul.) Rocker arm failure is quite rare. Pushrod failures are caused by stuck valves, and can almost always be avoided through repetitive valve inspections and digital engine monitor usage as discussed earlier.

Magnetos

Magneto failure is actually uncomfortably commonplace. Fortunately, aircraft engines are equipped with dual magnetos for redundancy, and the probability of both magnetos failing simultaneously is extremely remote. Mag checks during pre-flight runup can detect gross magneto failures, but in-flight mag checks are far better at detecting subtle or incipient failures. Digital engine monitors can reliably detect magneto failures in real time if the pilot knows how to interpret the data. Magnetos should be disassembled, inspected, and serviced every 500 hours -- doing so drastically reduces the likelihood of an in-flight magneto failure.
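The value of dual magnetos can be illustrated with a one-line probability calculation. The sketch below assumes the two magnetos fail independently of each other, and the per-flight failure probability is a hypothetical placeholder, not a real statistic.

```python
def both_fail_probability(p_first, p_second):
    """Chance that two independent units both fail during the same flight."""
    return p_first * p_second


# Hypothetical per-flight failure probability for one magneto (not real data).
P_SINGLE = 0.01

print(both_fail_probability(P_SINGLE, P_SINGLE))
```

The independence assumption matters: anything that affects both magnetos at once (for example, the same servicing error made on both) would make a double failure more likely than this simple product suggests.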

The Bottom Line

After performing failure analysis of each critical engine component that has a history of contributing to catastrophic engine failure, I cannot help but be struck by some fundamental observations.

The "bottom-end" components of these engines -- crankcase, crankshaft, camshaft, bearings, gears, oil pump, etc. -- are very robust. They normally exhibit very long useful lives that are many times as long as recommended TBOs. Most of these bottom-end components (with the notable exception of bearings) are reused at major overhaul, and not replaced on a routine basis. When these items do fail prematurely, the failures are mostly infant-mortality failures that occur shortly after overhaul, or random failures that are unrelated to hours or years since overhaul. The vast majority of random failures can be detected long before they get bad enough to cause catastrophic engine failure simply by means of routine oil filter inspection and laboratory oil analysis. There seems to be no evidence that these bottom-end components exhibit any sort of well-defined wear-out zone that would justify fixed-interval overhaul or replacement at TBO.

The "top-end" components (or "hot section" components if you prefer) -- pistons, cylinders, valves, etc. -- are considerably less robust, and it is not unusual for them to fail prior to TBO. However, most of these failures can be prevented by regular inspections (compression tests, borescopy, etc.) and by use of digital engine monitors (by pilots who have been taught how to interpret the data). Furthermore, when potential failures are detected, the top-end components can be repaired or replaced quite easily without the need for engine removal or teardown. Once again, the failures are mostly infant-mortality failures or random failures that do not correlate with hours or years since overhaul.

The bottom line is that a detailed failure analysis of piston aircraft engines using RCM principles strongly suggests that what the airlines and military found to be true about turbine aircraft engines is also true of piston aircraft engines: the traditional practice of fixed-interval overhaul or replacement is counterproductive. A conscientiously applied program of on-condition maintenance that includes regular oil filter inspections, oil analysis, compression tests, borescope inspections and in-flight digital engine monitor usage can be expected to yield improved reliability and much reduced maintenance expense and downtime.

The only exception seems to be magnetos. They really need to go through a fixed-interval major maintenance cycle (not a major overhaul, but close) every 500 hours, because we have no effective means of detecting potential magneto failures without disassembly inspection. The good news is that we have two of them on each engine for redundancy.

Don't They Get It?

Why don't our airframe and engine manufacturers recommend RCM-based on-condition maintenance instead of very costly, fixed-interval major overhauls? Well, for one thing, our TCM and Lycoming engines were certified under the old CAR 7 regulations that were developed long before RCM was invented. The same applies to the overwhelming majority of today's piston-powered GA airplanes, which were certified under the old CAR 3. Even for newer airplanes like the Cirrus, Columbia and Diamond, the manufacturers mostly specify time-directed maintenance (TDM) rather than condition-directed maintenance (CDM) because, unfortunately, there's no tradition of using RCM maintenance principles for small airplanes.

It would take a lot of engineering work for TCM and Lycoming to develop new RCM-inspired maintenance programs for our piston engines, and frankly they have very little incentive to do this work. Even if they did, it would probably be quite a struggle for them to get them approved by the FAA. After all, there's very little operational data about piston aircraft engines that are operated beyond current TBO recommendations (because so few of them are). The Resnikoff Conundrum remains alive and well in piston GA.

Fortunately, we Part 91 operators are under no regulatory obligation to overhaul our engines at the manufacturer's recommended TBO. There's nothing to prevent us from implementing our own RCM-inspired maintenance protocols, and to maintain our engines and airframes strictly on-condition rather than on-time. I've been doing this for decades with my own piston airplanes, and have achieved absolutely outstanding dispatch reliability coupled with drastically reduced maintenance expense. In 20 years of flying and maintaining my current airplane -- a turbocharged piston twin -- I've actually logged more post-TBO engine time than pre-TBO engine time.

Sometimes it's hard to persuade mechanics that it's safe, sensible and prudent to continue an apparently healthy engine in service well beyond recommended TBO. A good friend of mine recently had a shop refuse to sign off the annual inspection on his Cessna T310R (the same kind of airplane I fly) because his right engine was more than 300 hours past TBO. (I consider that engine a relative spring chicken, since both of my engines are -- as of September 2007 -- more than 800 hours past TBO and running great.) The shop even refused to help my friend obtain a ferry permit from the FSDO so he could fly the airplane to another shop to have the engine inspected! The result was both emotionally and financially stressful for the aircraft owner. Ultimately, his engine was torn down by a big-name overhaul shop, and nothing was found to suggest that the engine couldn't have operated safely for many more years and hundreds of hours.

There's an important lesson here: If you believe strongly in condition-directed maintenance (as I do) and your engine is "mature" (as mine are), you'd be wise to explore the subject of condition-directed maintenance and past-TBO operation with the chief inspector at your maintenance shop before you authorize him to commence an annual or 100-hour inspection. If you discover that his maintenance philosophy differs from yours, you might be wise to choose another shop to do the inspection.

See you next month.


Want to read more from Mike Busch? Check out the rest of his Savvy Aviator columns.