What went wrong?
Actually, we know very well what went wrong in each
of the following horror stories…
14 steps to total Infrastructure Meltdown.
A large company had computer rooms, holding key
email and business information, in each of their London offices.
It was decided that they should be replaced by a single commercial co-location
facility.
Our advice was that, though the decision to concentrate was correct, the selected
facility was flawed and lacking in business continuity measures.
Our advice was ignored.
Here’s what happened next:
- A general power failure of the local electricity
supply.
- The resulting power surge blew a 600 Amp ceramic
fuse.
- The spare could not be found.
- The Uninterruptible Power Supply (UPS) was designed
to carry the load for 20 minutes.
- The SLA with the customer called for the generator
to start within 6 minutes.
- When it did, its exhaust flue ignited waste
material.
- The smoke was drawn into the generator enclosure,
triggering the fire detection system and shutting down the generator.
- After 20 minutes, the UPS shut down.
- All power was now lost to the data centre for
4 hours.
- The operators panicked and failed to
alert their customers.
- Complex operational systems and databases
crashed.
- To make matters worse, the move into the facility
had seen a lot of corners cut, to meet a tight deadline.
- The new systems had been created manually, without
backup disks.
- It took 4 days to restore most of the corporate
email system.
ECA had foreseen all these problems and had advised
the customer not to place their business-critical systems in jeopardy.
Here are a few of our comments:
- Reliance on a single generator is poor practice.
- Multiple generators should start as soon as they sense a mains power
failure and reach full design electrical load within seconds.
- Thereafter, the switchgear should manage load shedding, shutting
down spare generators, while maintaining full load.
- A UPS runtime of 20 minutes is inadequate if you only have
one generator.
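The arithmetic behind the last comment can be sketched in a few lines of Python. This is a minimal illustration, not a sizing method: the class name, the function names and the 60-minute repair allowance are assumptions made for the example. Only the 20-minute UPS figure, the 6-minute generator start and the single generator come from the incident above.

```python
from dataclasses import dataclass


@dataclass
class BackupPowerDesign:
    ups_runtime_min: float       # how long the UPS can carry the load
    generator_start_min: float   # worst-case time for a generator to take the load
    generator_count: int         # number of independent generators


def ride_through_margin(design: BackupPowerDesign) -> float:
    """Minutes of UPS runtime left once a generator is due to take the load."""
    return design.ups_runtime_min - design.generator_start_min


def review(design: BackupPowerDesign, repair_allowance_min: float) -> list[str]:
    """Return plain-language findings; an empty list means no finding was raised."""
    findings = []
    if design.generator_count < 2:
        findings.append("single generator: no standby if it fails or trips")
        # With one generator, the UPS must also cover the time needed to fix it.
        if ride_through_margin(design) < repair_allowance_min:
            findings.append("UPS runtime too short to cover a generator fault")
    elif ride_through_margin(design) <= 0:
        findings.append("UPS runtime does not cover generator start-up")
    return findings


# Incident figures: 20-minute UPS, generator due within 6 minutes, one generator.
# The 60-minute repair allowance is an assumed value for illustration only.
incident = BackupPowerDesign(ups_runtime_min=20, generator_start_min=6, generator_count=1)
for finding in review(incident, repair_allowance_min=60):
    print(finding)
```

With these figures the UPS has 14 minutes in hand once the generator is due, which is fine only while the generator keeps running. When it tripped, the remaining runtime was all that stood between the data centre and a total loss of power.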
Digger shuts down hospital.
When a completely even mains supply is crucial,
it can be ‘smoothed’ via a UPS, consisting of large
battery banks that also maintain power if the mains supply fails.
It is essential that the entire end-to-end supply (UPS, backup generator(s),
and associated fuel and controls) is fully tested on installation and after
maintenance.
Sometimes, the UPS is only designed to hold the electrical load for a very
short time, until the generator kicks in. This is a minimalist design, not
suited to essential installations like hospitals.
This is what happened in a recent incident:
- A contractor’s JCB dug up the electricity main.
- The UPS took over and – as designed – switched off
when the generator started up.
- Some time later, the generators ran out of fuel, because their
fuel pump depended on mains electricity!
- The hospital suffered a total
blackout.
- Fortunately, no lives were lost.
As can be seen, this design contained several single
points of failure.
- It should have been checked thoroughly, during design, installation
and commissioning.
- It should have been run under service load conditions, to prove
its operational effectiveness.
- It should have been regularly tested.
The fact that the generators start is no proof that
electrical supply will be maintained.
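A hidden dependency of this kind can be looked for on paper before it is found in service. The Python sketch below assumes a hand-written dependency map; the component names and both function names are hypothetical and chosen only to mirror the incident. It walks the map to see whether the backup supply relies, directly or indirectly, on the very mains supply it is meant to replace.

```python
# Hypothetical dependency map: each component lists what it needs in order to run.
DEPENDS_ON = {
    "hospital_load": ["generator"],
    "generator": ["fuel_pump"],
    "fuel_pump": ["mains"],  # the flaw in the incident
    "mains": [],
}


def transitive_dependencies(component, depends_on):
    """Return every component that `component` relies on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(component, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, []))
    return seen


def backup_depends_on_mains(backup, depends_on, mains="mains"):
    """True if the backup supply would stop working when the mains fails."""
    return mains in transitive_dependencies(backup, depends_on)


print(backup_depends_on_mains("generator", DEPENDS_ON))
```

For the map above the check reports that the generator depends on the mains. A paper check like this complements testing under service load; it does not replace it.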
Software shuts down at midnight.
This was an avoidable chain of events!
- The system was built in a hurry.
- The main software vendor installed the software – an early
implementer version – by download from their own website.
- They failed to supply discs at build stage.
- However, the system went live, settled down and operated well.
- Then, one night – at midnight sharp – it failed.
- And so did a similar system in Singapore.
- It transpired that for two weeks, the application software maintenance
people had been receiving ‘evaluation licence about to expire’ notices.
- They didn’t believe them.
- Having left a ‘logic time bomb’ in their software,
the vendor’s development staff had corrected it – but
they had forgotten to tell their own field staff.
- Results: damaged business; damaged reputations.
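The missing safeguard was a rule that turns an expiry notice on a live system into an escalation. A minimal Python sketch of such a rule follows. The function names, the 14-day warning window (chosen to match the two weeks of notices described above) and the dates are all assumptions for illustration; the source does not give the real expiry date or describe the vendor's licensing mechanism.

```python
from datetime import date, timedelta


def licence_status(expires_on: date, today: date, warn_days: int = 14) -> str:
    """Classify a licence as 'expired', 'expiring' or 'ok' on a given day."""
    if today >= expires_on:
        return "expired"
    if expires_on - today <= timedelta(days=warn_days):
        return "expiring"
    return "ok"


def must_escalate(status: str, is_production: bool) -> bool:
    """An expiry notice on a live system is an incident, not a nuisance message."""
    return is_production and status in ("expiring", "expired")


# Illustrative dates only; the actual expiry date is not known.
expiry = date(2024, 6, 1)
status = licence_status(expiry, today=date(2024, 5, 25))
print(status, must_escalate(status, is_production=True))
```

The design point is that the decision to escalate is made by a rule, not left to whether the maintenance staff happen to believe the message.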
To discuss the ways in which we could enhance
the Resilience and Security of your organisation,
simply ring +44 (0) 118 976 7544