Stop Being an IT Hero: 4 Unspoken Rules for Building Reliable Systems
1.0 Introduction: The Vicious Cycle of IT Firefighting
It’s a familiar story. The company’s biggest sales event of the year is live, and suddenly the website crashes. Panic ensues, technical teams scramble, and after hours of frantic work, the service is restored. Management praises the "heroes" who saved the day. But this isn't a story of heroism; it's a symptom of a deeper, predictable problem.
The central argument, drawn directly from Clause 8.4 of the international standard for IT service management (ISO/IEC 20000-1), is this: IT services don't fail randomly; they fail when demand exceeds supply or when that supply is uncontrolled. The constant chaos of emergencies and escalations isn't a sign of a dynamic business but of a system lacking proactive control.
This article distills four surprising but powerful principles from this global standard that can help any organization move from constant "firefighting" to a state of quiet, predictable reliability.
2.0 Principle 1: You Can't Outsource Responsibility
Simply signing a contract with a supplier (whether for cloud hosting, software, or managed services) is not the end of your responsibility. The standard makes it clear that organizations must actively control and monitor their suppliers to ensure service delivery. This starts with selecting suppliers based on their capability and the risks they pose, not just their cost. A common red flag for auditors is suppliers chosen solely on cost, without any risk evaluation.
This is often counter-intuitive. Many businesses see a signed contract as a successful hand-off of responsibility. But in today's interconnected service ecosystem, a supplier's failure is your failure. If their infrastructure goes down or their software has a vulnerability, it's your service that suffers the outage and your customers who lose trust. Effective service management demands continuous oversight.
As a core audit principle states, the rule is absolute and formal:
If a supplier failure can disrupt a service, that supplier must be under ITSMS control.
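One way to make this rule checkable is to map each service to the suppliers it depends on and flag any supplier that is not demonstrably under control. The sketch below is only an illustration: the `Supplier` fields, the three control criteria (agreed targets, monitoring, a recent review), and the 365-day review window are assumptions made for the example, not requirements quoted from the standard.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Supplier:
    # All fields are hypothetical; adapt them to your own supplier register.
    name: str
    has_agreed_targets: bool             # documented service targets exist
    is_monitored: bool                   # performance is measured against them
    last_review_days_ago: Optional[int]  # None means never reviewed


def is_under_control(supplier: Supplier, max_review_age_days: int = 365) -> bool:
    """Assumed definition of 'under control': targets, monitoring, recent review."""
    reviewed_recently = (
        supplier.last_review_days_ago is not None
        and supplier.last_review_days_ago <= max_review_age_days
    )
    return supplier.has_agreed_targets and supplier.is_monitored and reviewed_recently


def control_gaps(dependencies: Dict[str, List[Supplier]]) -> Dict[str, List[str]]:
    """Return, per service, the names of suppliers that are not under control."""
    gaps: Dict[str, List[str]] = {}
    for service, suppliers in dependencies.items():
        uncontrolled = [s.name for s in suppliers if not is_under_control(s)]
        if uncontrolled:
            gaps[service] = uncontrolled
    return gaps


if __name__ == "__main__":
    cloud = Supplier("CloudHost", True, True, 90)
    payments = Supplier("PayGate", True, False, None)
    print(control_gaps({"checkout": [cloud, payments], "email": [cloud]}))
```

Any service that appears in the result depends on a supplier whose failure could disrupt it without the organization having noticed the warning signs.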
3.0 Principle 2: Forecasting the Future Doesn't Require a Crystal Ball
Many capacity-related failures occur because an organization didn't anticipate a surge in demand. The key to preventing these failures is disciplined demand management—a systematic process designed specifically to avoid capacity-related service outages by understanding current usage and anticipating future needs. This isn't about guesswork; it's about using data and business intelligence to plan resources accordingly.
Key demand drivers that must be tracked include:
- Business growth or contraction
- New services or applications
- Seasonal or peak usage
- Regulatory or market changes
Many teams avoid forecasting because they fear it's too complex or that they'll get it wrong. However, the goal is not to be perfectly right, but to be prepared and systematic. An informed estimate that drives proactive planning is far better than a reactive scramble to add capacity during an outage.
An audit insight makes this practical goal clear:
Demand forecasting does not need to be perfect—but it must be systematic and informed.
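As a rough illustration of what "systematic and informed" can look like, the sketch below fits a straight-line trend to past monthly peak usage, applies a multiplier for a known seasonal peak, and compares the result with provisioned capacity minus a safety margin. The function names, the linear model, the seasonal factor, and the 20% headroom are all assumptions chosen for the example; a real capacity plan would use whatever model and margins fit the service.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Forecast:
    projected_peak: float  # expected peak demand for the target period
    capacity: float        # provisioned capacity, in the same unit
    headroom: float        # fraction of capacity held in reserve (assumed policy)

    @property
    def usable_capacity(self) -> float:
        return self.capacity * (1.0 - self.headroom)

    @property
    def at_risk(self) -> bool:
        return self.projected_peak > self.usable_capacity


def linear_trend(history: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept over equally spaced periods."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_peak(
    history: List[float],
    periods_ahead: int,
    capacity: float,
    seasonal_factor: float = 1.0,
    headroom: float = 0.2,
) -> Forecast:
    """Extrapolate the trend, then scale it for a known seasonal peak."""
    slope, intercept = linear_trend(history)
    target_x = len(history) - 1 + periods_ahead
    baseline = intercept + slope * target_x
    return Forecast(baseline * seasonal_factor, capacity, headroom)


if __name__ == "__main__":
    # Hypothetical monthly peak requests per second, oldest first.
    result = forecast_peak([100, 110, 120, 130], periods_ahead=2,
                           capacity=250, seasonal_factor=1.5)
    print(result.projected_peak, result.usable_capacity, result.at_risk)
```

The forecast will be wrong in its details, but it turns "the sales event might be busy" into a concrete number that can be compared with capacity weeks before the event rather than during the outage.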
4.0 Principle 3: Constant Firefighting Is a Sign of Failure, Not Heroism
Our culture often praises the "heroes" who work all night to solve a crisis. But from a service management perspective, a state of constant emergency is definitive proof of weak controls. A culture that relies on heroes is, by definition, a reactive one. While incidents will always happen, a pattern of recurring, preventable emergencies indicates a fundamental failure to manage supply and demand.
The organizational danger lies in rewarding this reactive behavior. When the heroes get all the recognition, proactive planning—the quiet, unglamorous work that prevents incidents in the first place—is devalued. This directly harms the business by degrading service availability, undermining cost control through expensive emergency fixes, and eroding customer satisfaction with every preventable outage. A well-run service management system should feel stable, predictable, and even a little "boring." This stability is the true sign of success, as it means risks are being managed before they become incidents.
The audit principle is blunt and unforgiving on this point:
Reactive firefighting is evidence of weak supply and demand control.
5.0 Principle 4: Tolerating Underperformance Is a Strategic Choice—A Bad One
One of the most significant risks in service delivery is tolerating a supplier's poor performance simply because switching seems too difficult or costly. This creates a painful business reality where an organization knowingly relies on an undependable partner. This choice leads directly to recurring service failures, chronic operational friction, and the slow erosion of customer trust.
Auditors identify this strategic trap as a "Major Audit Concern": supplier underperformance tolerated because "there is no alternative." The decision to accept a known risk has a predictable outcome. When auditors find evidence of recurring service failures caused by known supplier or capacity constraints, they treat it as a "Major Nonconformity Indicator." It signals that management is aware of a critical risk but has failed to address it, effectively choosing to let its services fail.
6.0 Conclusion: Are You Building a System of Control or a Culture of Crisis?
Ultimately, achieving service reliability isn't about having the best technicians; it's about having the best systems. True reliability comes from actively managing suppliers, systematically anticipating demand, prioritizing proactive planning over reactive firefighting, and refusing to tolerate underperformance from your partners.
The difference between a reliable service and a chaotic one lies in a commitment to proactive control. It's the core distinction auditors look for: evidence of proactivity, not a history of reaction.
Look at your own organization: Is your team rewarded for preventing fires or for putting them out?