The Maintenance Execution Framework

May 5

This is a real story, from a real plant that I’ve seen with my own eyes – and I’ve seen variants of it at many other plants. Millwright crew goes to conduct a time-based teardown inspection of a small pump. Pull it, put it in the back of the buggy, drive down dusty plant roads to the shop, put it on a not-very-clean bench and disassemble. They visually inspect the internals, take some measurements – everything looks pretty good.

They spray-cleaned the parts, put them back down on the table, then set about reassembling it. They measured end play – it took some sleight of hand to get it in limits. Next, the procedure called for measuring installed runout at the coupling hub – it was clearly beyond the limits on the sheet. They talked to the supervisor: “Measure it closer to the wet end.” They kept marching the indicator down until they got what they needed. Acceptance criteria achieved, but that runout would play hell with everything on the shaft.

With that Pyrrhic victory, they carried on with the reassembly – the highlight of which was when they torqued the mechanical seal bolts by giving the wrench (not a torque wrench, to be clear) a healthy whack with their palm. Each bolt got a whack.

They headed out to the field with everything they needed – pump, hardware, and gaskets in the back of the buggy again; tools in the usual jumble; oil in an unlabeled transfer container that ultimately came from one of the drums sitting outside with rainwater collected around the bung – all bouncing along through the dust to the unit.

They got set to install the pump – dirty reused shims sitting there ready for reinstall, short bolt uncorrected. They knew that the spiral wound gasket had been ordered without a BOM part number, so they worked on getting that in first. It was just a millimeter or so too large. Not ones to shy from a challenge, they tried tapping the gasket into place using their channel locks. You could see the dents being made in the gasket… so it was probably a good thing it sprung. They tried and sprung two more with the same method, then their day was mostly over.

They went to see the planner who had ordered the gasket and we all went back out to stare at the equipment. The planner brought a straightedge ruler to measure the orifice – because surely that would work better this time.

They finally got the pump reinstalled the next week when a variety of gaskets came in and they found the right one. It ran for 3 months before tearing up a bearing.

This is not fiction. This is not a composite of different errors from different jobs. This is one true job, from a major chemical manufacturer that has a lot of reasonable paper on file globally and locally.

This is the sieve. It is a system problem we call the Maintenance Execution Gap. You can’t fix this with RCM, training, or culture. It needs a system solution.

This is the central confusion, and it has cost the industry forty years. The conditions required to do the work (the seven elements of the Maintenance Execution Framework, or MEF) keep getting pitched as the optional finishing layer, the “precision maintenance” tier, the “world-class” level you graduate to once your culture has matured and your strategy has been studied. That is the architecture exactly upside-down. The MEF is not the polish; it is the non-negotiable foundation. The studies, the dashboards, and the strategy are downstream artifacts of an organization that has already enabled execution, and an organization that has not enabled execution does not have anything worth studying. It is like tape sessions for a football team that can’t block or tackle. Pouring more analysis on top of a disabled execution layer is filling the bucket from the top while the bottom drops out.

Industry has spent four decades and on the order of hundreds of billions of dollars on the upper layers of the maintenance system: failure mode analysis, criticality assessments, APM platforms, condition monitoring contracts, training curricula, and culture programs. The investment is real and the returns are not. Heavy industry’s maintenance cost as a fraction of replacement asset value has been flat for thirty years; process-plant availability has been flat or declining for twenty; and the catastrophic loss-frequency curve has refused to bend even as the analytical apparatus around it has multiplied year over year.

The reason the spend doesn’t reach the equipment is that it pours into the top of a sieve, and the sieve has seven sets of gaping holes — the same seven, in almost every plant.

The Seven Elements of the Maintenance Execution Framework

1. Execution Reference

The technician needs to know what good looks like at the equipment, in front of him, at the moment his hands are on it: not what was covered in a training course in March, not what is in the binder on the planner’s shelf, and not what is buried in section 7.3 of the OEM manual. He needs the acceptance criteria; the conditional logic (if the seal weep accumulates more than a measured threshold, write a notification); the visual reference distinguishing milky from clear oil; and the asset’s last four findings primed at the top of the page, before he is asked to look. Aviation’s task cards do this; surgical checklists do this; the U.S. military’s interactive electronic technical manuals, which replaced thirty thousand pages of paper, do this. Industrial maintenance is the holdout.

Building the reference is compilation work, not analytical work, a distinction the consulting world has muddied for decades by selling the reverse. The reliability engineer takes one equipment class at a time; pulls the OEM tolerances, the relevant API and ISO limits, the last twenty-four months of plant findings on that class, and the failure modes from the last three failure investigations; and assembles the result into a three-column inspection reference (Check | Acceptable | Action Required) for the PM. For repair work, reliability engineers work with planners to create a phased, sequential reference with explicit hold points. The planner or a reliability specialist adds an asset-specific annotation layer that carries the last four findings on each tag forward automatically.
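As a rough illustration of how little machinery the reference needs, here is a minimal sketch of the three-column table plus the annotation layer that carries the last four findings forward. The class names, the tag "P-EXAMPLE", the equipment class label, and the row wording are all assumptions made for the example; real limits come from the OEM, the API and ISO documents, and plant history.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CheckRow:
    check: str            # what to look at or measure
    acceptable: str       # what good looks like
    action_required: str  # what to do when the check falls outside "acceptable"


@dataclass
class AssetAnnotation:
    tag: str
    # maxlen=4 drops the oldest entry automatically, so only the last
    # four findings travel forward with the tag.
    findings: deque = field(default_factory=lambda: deque(maxlen=4))


def render(equipment_class: str, rows: list, asset: AssetAnnotation) -> str:
    """Lay out one page: prior findings first, then the three-column table."""
    lines = [f"{asset.tag} ({equipment_class})", "Last findings:"]
    lines += [f"  - {f}" for f in asset.findings] or ["  - none recorded"]
    lines.append("Check | Acceptable | Action Required")
    lines += [f"{r.check} | {r.acceptable} | {r.action_required}" for r in rows]
    return "\n".join(lines)


# Illustrative rows only (hypothetical wording, no real limits).
pump_rows = [
    CheckRow("Oil appearance", "Clear", "Milky: write a notification"),
    CheckRow("Seal weep", "Below the measured threshold",
             "Above threshold: write a notification"),
]
asset = AssetAnnotation("P-EXAMPLE")
for note in ["finding 1", "finding 2", "finding 3", "finding 4", "finding 5"]:
    asset.findings.append(note)  # the fifth entry pushes "finding 1" out

page = render("Small centrifugal pump", pump_rows, asset)
print(page)
```

The class-level rows are written once and reused for every tag in that service; only the annotation changes per asset, which is what keeps the planner configuring rather than authoring.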

The single class reference covers every PM, every inspection, and every replacement for the entire population in that service. The planner’s job becomes configuration from a maintained library rather than authorship from a blank page, which is the right division of labor and not the one most plants currently have. Deployment moves through a defined progression — a managed library on SharePoint or a document management system, then linkage of references to work orders so the document travels with the job, and eventually mobile delivery with measurement points captured at the equipment in the same interface that delivers the reference — and each stage builds the muscle for the next. Most plants produce reports instead, and watch their PMs return with “PM complete, no issues” — the only honest closeout a technician can write when the work order said “PM pump. Check oil.”

2. Proper Tools

Bearing installation requires an induction heater, bearing oven, and/or hydraulic press; alignment requires a laser system; fasteners require torque wrenches with current calibration. None of this is exotic, all of it has been industry standard for longer than most of us have been alive… but most plants do not have the right tool within reach of the technician on the day the work is scheduled.

The work to fix this starts with a ninety-minute walk through every shop and tool crib, with the Execution References in hand and a single question for each precision specification the references call for: is the tool present, in serviceable condition, available to the technician on shift, and within current calibration? The gap list from that walk is the work plan.
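The walk itself can be recorded in something as simple as the sketch below, which asks the four questions for each tool and keeps only the gaps. The field names, the sample tools, and the dates are assumptions for illustration; treating a missing calibration record as a gap is also an assumption that a plant may want to relax for tools that do not require calibration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ToolCheck:
    specification: str               # precision spec taken from an Execution Reference
    tool: str
    present: bool
    serviceable: bool
    available_on_shift: bool
    calibration_due: Optional[date]  # None means no calibration record was found


def gaps(check: ToolCheck, today: date) -> list:
    """Answer the four walk-through questions for one tool."""
    if not check.present:
        return ["not present"]  # the other questions are moot without the tool
    found = []
    if not check.serviceable:
        found.append("not serviceable")
    if not check.available_on_shift:
        found.append("not available on shift")
    if check.calibration_due is None or check.calibration_due < today:
        found.append("calibration not current")
    return found


def gap_list(checks: list, today: date) -> dict:
    """Keep only the tools with at least one gap; this is the work plan."""
    result = {}
    for c in checks:
        found = gaps(c, today)
        if found:
            result[c.tool] = found
    return result


# Hypothetical walk results; the dates are placeholders.
walk = [
    ToolCheck("Fastener torque", "Torque wrench", True, True, True, date(2020, 1, 1)),
    ToolCheck("Shaft alignment", "Laser alignment system", False, False, False, None),
    ToolCheck("Bearing installation", "Induction heater", True, True, True, date(2999, 1, 1)),
]
plan = gap_list(walk, today=date(2024, 6, 1))
print(plan)
```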

Proper tools properly used pay for themselves in a handful of failures prevented. Spending on maintenance or reliability consulting before these basics are on hand should be seen as lunacy. This is a ninety-day project for a competent reliability and maintenance organization, not a multi-year capability build. The capital required is small. Availability and use of the proper tools should be as automatic as basic PPE.

3. On-Spec Materials and Fluids

The new bearing arrives wrapped in oiled paper, in a sealed box, with a preservative film the manufacturer rates at three to five years inside a controlled storeroom — which sounds generous until you consider that the Gulf Coast warehouse where it will actually sit has no climate control, runs 95°F by mid-afternoon in July, and drops twenty degrees overnight, cycling humid air in and out of the packaging every single day until someone needs to put it in a critical pump. The preservative is not failing because the storeroom is dirty; it's failing because the packaging was designed for a stable environment and is being asked to survive a daily pressure-and-humidity pump it was never engineered for.

The five-gallon pail of ISO VG 46 turbine oil came off a transport that was already below cleanliness standard, was poured through a funnel that had not been wiped since 2019, and reached the gearbox at ISO 4406 cleanliness code 22/20/17, roughly a hundred times dirtier than the OEM specification. The mechanical seal face was unwrapped during a kit walk-down by a planner who needed to verify the part number, and re-bagged with a fingerprint on the lapped surface.
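For readers who do not work with ISO 4406 codes daily: the scale numbers are roughly logarithmic, with each step up corresponding to about a doubling of the particle count in that size channel. The sketch below applies that rule of thumb. The target code "16/14/11" is a made-up placeholder, not the OEM specification, and the function names are assumptions for the example.

```python
import math


def parse_iso4406(code: str) -> tuple:
    """Split a code such as '22/20/17' into its three scale numbers."""
    return tuple(int(part) for part in code.split("/"))


def approx_count_ratio(measured: str, target: str) -> tuple:
    """Approximate particle-count ratio in each size channel, using the
    rule of thumb that one scale-number step doubles the count."""
    pairs = zip(parse_iso4406(measured), parse_iso4406(target))
    return tuple(2 ** (m - t) for m, t in pairs)


def steps_for_ratio(ratio: float) -> float:
    """Scale-number steps that correspond to a given count ratio."""
    return math.log2(ratio)


# Hypothetical target code, chosen only to show the arithmetic.
ratios = approx_count_ratio("22/20/17", "16/14/11")
print(ratios)
print(round(steps_for_ratio(100), 1))
```

Under this approximation, "a hundred times dirtier" means a gap of between six and seven scale numbers in each channel, which is why a code that looks only a little worse on paper can describe oil that is far outside specification.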

None of these are training failures and none are motivation failures; they are storeroom design failures and procurement specification failures, and most reliability programs do not address them at all because they sit on the wrong side of the maintenance/non-maintenance organizational line.

Closing this hole requires a designed interface between the storeroom and the work front. Procurement contracts get rewritten to specify cleanliness levels for new oil at delivery — ISO 4406 codes are routine and enforceable — along with balance grades for rebuilt motors and packaging standards for moisture-sensitive components. Receiving inspection becomes a real protocol for criticality A and B parts rather than a stamp on a packing slip, with named inspectors, documented acceptance criteria, and actual authority to reject a shipment. The storeroom layout gets rebuilt to put electrical, elastomer, and precision components into climate-controlled enclosures, and stored motors go onto a periodic shaft-rotation program to prevent false brinelling from forklift vibration traveling through the warehouse floor. Lubricant transfer happens through filtered carts with dedicated couplings, so no five-gallon pail and no general-purpose funnel ever touches a critical reservoir. A pre-work material verification step at the same checkpoint as tool verification confirms that the bearing in the kit has been received-inspected, that the oil has been sampled within the last week, and that the seal face is still in its original packaging.

The defect was introduced before the wrench came out of the bag — and that is the part of the problem the maintenance organization has the standing to fix, but only if it owns the gap formally rather than waving at it across the org chart.

4. Equipment Access

The equipment must be in the correct state when the work begins: isolated, decontaminated, depressurized, locked out, drained, vented, and at the temperature the work specifies. The technician arrives at a job that operations was supposed to release four hours ago, the energy isolation is partial, the line is still under residual pressure, and the unit is not yet cool enough to touch — and the system has disabled the work before any tool comes out of the bag. Equipment access is not an administrative formality; it is the prerequisite that determines whether the rest of the system has a chance to function on a given day.

The fix is a documented operations-to-maintenance handoff at the unit level that names the equipment state required by the work, lists the isolation steps, specifies the cooling or decontamination time before the work can begin, and is signed by both sides before the lockout is hung — not as paperwork after the fact, but as the gate that releases the work. Plants that have solved equipment access have built operations and maintenance into a single execution discipline at the unit level: planning that builds the cooling time into the schedule rather than treating it as buffer to be compressed; a stop-the-work authority that names a single supervisor empowered to halt the job when the conditions are not present, and that is exercised rather than displayed in a poster; and a feedback loop that flags chronic access failures by equipment tag, so that the third time the same crude column reboiler arrives at the work front in the wrong state, the planning process changes rather than the technician absorbing the cost. The plants that have not solved this lose hours every week to equipment that arrives at the work front in the wrong state, and lose far more than hours when the work is forced through anyway because the schedule said today and the unit superintendent had a number to make.
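The gate and the feedback loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed design: the class names, the example tags, and the threshold of three failed handoffs (taken from "the third time" in the text) are all choices made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Handoff:
    tag: str
    required_state: dict              # condition name -> confirmed at the equipment
    operations_signed: bool = False
    maintenance_signed: bool = False

    def released(self) -> bool:
        """Work is released only when every named condition is confirmed
        and both sides have signed."""
        return (bool(self.required_state)
                and all(self.required_state.values())
                and self.operations_signed
                and self.maintenance_signed)


class AccessLog:
    """Counts failed handoffs per tag so chronic offenders reach planning."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = Counter()

    def record_attempt(self, handoff: Handoff) -> bool:
        if handoff.released():
            return True
        self.failures[handoff.tag] += 1
        return False

    def chronic_tags(self) -> list:
        return sorted(t for t, n in self.failures.items() if n >= self.threshold)


log = AccessLog()
not_ready = Handoff("E-EXAMPLE", {"isolated": True, "depressurized": False},
                    operations_signed=True, maintenance_signed=True)
for _ in range(3):
    log.record_attempt(not_ready)  # three arrivals in the wrong state
print(log.chronic_tags())
```

The point of the counter is that the third failure on the same tag produces a planning action, so the cost stops landing on the technician.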

5. Continuity

A technician interrupted mid-task — pulled to a breakdown, called to a meeting, told to put down the alignment job to sign off on an emergency permit — does not pick the task back up where she left it; the mental model is gone, the setup has to be rebuilt, and the risk of skipping a step has just doubled because the steps already done are no longer fresh in working memory and the temptation to assume they were done well is enormous. Schedules built on theoretical wrench-time, with no buffer for the disruption that arrives every shift, have already chosen throughput over quality before anyone touches the equipment.

Capacity-based scheduling is what fixes this, and it is more an organizational decision than a planning technique. The schedule is built from honest wrench-time accounting — typically fifty to sixty-five percent of the nominal forty-hour week once meetings, walk-time, parts retrieval, and the realistic interruption rate are subtracted from theoretical capacity — rather than from the wishful arithmetic that produces a schedule loaded to a hundred and ten percent of theoretical hours and then degrades quietly as the week goes on. Each protected job carries a conditional action buffer for findings discovered during execution, because a real PM produces real findings and a schedule that does not allow the technician to act on them has trained him to stop noticing them. The schedule for the week names which jobs are protected and which are flexible, the protected jobs are not interrupted except by a defined and short list of conditions, and a named owner has the authority to refuse the interruption — which is the only way the protection actually holds. The daily standup is the place where the protection holds or fails, and the place where the supervisor either says no to the production unit asking for a tech, or quietly hands the alignment job over and watches it become rework on a future schedule. Aviation does not schedule maintenance windows beyond what its workforce can finish without interruption; healthcare does not insert an unrelated abdominal procedure into the middle of a cardiac surgery; industrial maintenance is the outlier, and the cost shows up in a rework rate that nobody tracks. Continuity is what protects every other condition the system has built, and it is the easiest one to corrode under production pressure — which is why it disappears first.
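The arithmetic behind the honest schedule is short enough to show. The sketch below uses the fifty to sixty-five percent range and the hundred-and-ten-percent loading from the text; the crew size of ten and the function names are assumptions for illustration.

```python
def honest_capacity(technicians: int, wrench_fraction: float,
                    nominal_week_hours: float = 40.0) -> float:
    """Hours per week that can actually be scheduled against equipment."""
    return technicians * nominal_week_hours * wrench_fraction


def overload(scheduled_hours: float, capacity_hours: float) -> float:
    """Scheduled hours as a multiple of honest capacity (1.0 = fully loaded)."""
    return scheduled_hours / capacity_hours


crew = 10                              # hypothetical crew size
theoretical = crew * 40.0              # nominal hours for the week
wishful_schedule = 1.10 * theoretical  # loaded to 110% of theoretical hours

low = honest_capacity(crew, 0.50)      # lower end of the range in the text
high = honest_capacity(crew, 0.65)     # upper end of the range in the text

print(f"Honest capacity: {low:.0f} to {high:.0f} h")
print(f"Wishful schedule: {wishful_schedule:.0f} h")
print(f"Overload: {overload(wishful_schedule, high):.2f}x "
      f"to {overload(wishful_schedule, low):.2f}x")
```

For this hypothetical crew, the wishful schedule carries roughly 1.7 to 2.2 times the hours the crew can deliver, before any buffer for findings is set aside.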

6. Safe Restart and Operation

The most carefully executed repair is one improper startup from failure. P-1105A’s seal failed in January because the field operator started the pump against a closed discharge valve; its bearing failed in August because the rebuild was returned to service without a verified valve lineup; a gearbox failed three weeks after an overhaul because the lube system was charged in the wrong sequence and the oil pump cavitated for the first ninety seconds of operation. Each of these was called a maintenance failure in the post-mortem. Each was actually a restart failure, and each erased every minute of skilled work that came before.

A verified restart sequence belongs inside the Execution Reference itself, as its closing phase — a documented valve-lineup check, an operations sign-off before the equipment is started, and a parameter monitoring window during the first thermal cycle (typically four to twenty-four hours, depending on the equipment class) where someone named is watching bearing temperature, vibration, seal condition, and flow against the operating envelope. The handoff from maintenance back to operations becomes the gate, governed by the same checklist discipline that opened the work, and the work order does not move to “complete” until the equipment has run safely through the monitored interval. Failure investigations carry “restart” as a categorized cause so the pattern surfaces in the monthly reliability review rather than getting absorbed into “maintenance error” and lost. Planning routinely reserves a craft hour for the restart observation rather than treating it as an afterthought. The operating procedures that govern the restart sequence are governed under the same library and revision discipline as the Execution References themselves, so that what the operator does at the moment of startup is connected to what the technician did at the moment of repair, in a single chain of custody that names everyone in the loop. The work isn’t done when the wrenches go back in the bag. It is done when the equipment has run safely through its first thermal cycle and someone has signed for that.
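One way to picture the gate is as a small record that refuses to let the work order close early. The sketch below is an assumption-laden illustration: the field names, the four-hour window, and the parameter limits are placeholders, and a real system would take them from the Execution Reference for the equipment class.

```python
from dataclasses import dataclass, field


@dataclass
class RestartRecord:
    required_monitoring_hours: float   # set per equipment class
    valve_lineup_verified: bool = False
    operations_signed_off: bool = False
    monitored_hours: float = 0.0
    excursions: list = field(default_factory=list)

    def may_start(self) -> bool:
        """Equipment is not started until lineup and sign-off are both done."""
        return self.valve_lineup_verified and self.operations_signed_off

    def log_reading(self, hours_since_start: float, name: str, value: float,
                    low: float, high: float) -> None:
        """Record one monitored parameter against its operating envelope."""
        self.monitored_hours = max(self.monitored_hours, hours_since_start)
        if not low <= value <= high:
            self.excursions.append(f"{name}={value} at {hours_since_start} h")

    def may_complete(self) -> bool:
        """The work order closes only after a clean monitored interval."""
        return (self.may_start()
                and self.monitored_hours >= self.required_monitoring_hours
                and not self.excursions)


# Hypothetical window and limits, for illustration only.
record = RestartRecord(required_monitoring_hours=4.0)
record.valve_lineup_verified = True
record.operations_signed_off = True
record.log_reading(1.0, "bearing_temp_F", 150.0, low=0.0, high=180.0)
print(record.may_complete())  # window not yet run out
record.log_reading(4.0, "bearing_temp_F", 155.0, low=0.0, high=180.0)
print(record.may_complete())
```

Any excursion during the window blocks closure in this sketch, which forces a decision by a named person instead of a quiet close-out.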

If defects in maintenance are introduced by poor procedure, inadequate materials, and unsupervised execution, the same logic holds for the operator standing at the panel the next morning. The same discipline that governs how a technician handles a bearing before it goes into the housing governs how an operator brings the equipment online: verified lineup, documented sequence, named accountability. A reliability program that treats the maintenance-to-operations handoff as its terminal point has drawn its fence in the wrong place, because the operator’s first thirty minutes with a freshly rebuilt machine carry the same failure potential as the last thirty minutes the mechanic spent on it – and each operations transient is another defect risk, or opportunity.

7. Learning by Doing

The technician finds something — elevated bearing temperature, a cracked grout pad, milky oil — and writes it on the closeout. The finding either generates a follow-up work order or it does not. If it does not, she has just been taught that the documentation is theater, and the next finding will not be written; if it generates a work order that is then deferred for six weeks until the equipment fails, she has been taught the same lesson in slower form. A maintenance system that collects findings and does not act on them is a system that has trained its workforce to stop reporting; conversely, a system that closes the loop — finding to work order, work order to action, action back to the Execution Reference for the next interval — gets smarter from execution rather than from theoretical workshops, and the improvement comes from the floor instead of the conference room.

The architecture is two loops with a defined cadence on top. The execution loop is asset-specific: this PM finds something, the finding generates a work order or is acted on within the same cycle, and the Execution Reference for that asset class is gradually updated based on what was learned over time.

The enterprise loop is cross-asset: findings distributions across the pump population, the heat exchanger population, and the electrical distribution population are reviewed quarterly against expected patterns, with a named owner empowered to revise PM intervals, change detection methods, and update the references themselves.
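The two loops can be read as two queries over the same findings data. The sketch below is a simplified illustration: the record fields, tags, and work order number are hypothetical, and it treats a finding as closed only when a work order exists, which ignores findings acted on within the same cycle.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    tag: str
    equipment_class: str
    description: str
    work_order: Optional[str] = None  # None = nothing was raised for this finding


def open_loops(findings: list) -> list:
    """Execution loop: findings that never produced a work order."""
    return [f for f in findings if f.work_order is None]


def distribution(findings: list) -> Counter:
    """Enterprise loop: count findings by (equipment class, description)
    so the quarterly review can compare them with expected patterns."""
    return Counter((f.equipment_class, f.description) for f in findings)


# Hypothetical closeout data.
closeouts = [
    Finding("P-EXAMPLE-1", "pump", "milky oil", work_order="WO-EXAMPLE-1"),
    Finding("P-EXAMPLE-2", "pump", "milky oil"),
    Finding("P-EXAMPLE-3", "pump", "cracked grout pad"),
    Finding("E-EXAMPLE-1", "heat exchanger", "elevated bearing temperature"),
]
unanswered = [f.tag for f in open_loops(closeouts)]
by_class = distribution(closeouts)
print(unanswered)
print(by_class.most_common(1))
```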

Without these loops, every condition the system established at deployment will degrade quietly over the months that follow, and no one will see it happening until the equipment starts telling them — by which point the strategy will be blamed; the strategy was fine.

The Foundation

There is no alternative to the Maintenance Execution Framework. A maintenance organization that has not built the elements of the Framework is not behind on best practice – it is failing in the non-negotiable basics. This yawning Gap is the norm at most plants and it causes copious failures — mid-life bearing seizures, seal blowouts, premature overhauls, and the steady drip of unplanned downtime that no one connects to a specific cause — that are the predictable output of a system that disabled its own technicians before they walked to the equipment.

The work to build this Framework — the shop walk, the storeroom redesign, the procurement rewrite, the operations-maintenance handoff, the honest schedule, the restart sequence, the closed loop — is finite, well-documented, and within the reach of any maintenance organization that decides to do it. None of it requires AI, a new platform, or another six-month paid consulting engagement upstream of the work itself.

By Peter J. Munson
