An early Sunday morning: until the phone rings.
The local fire brigade called to notify us that the fire extinguishing system in one of our datacentres had been activated…
A quiet early Sunday morning, around half past seven. A typical ICT guy should be sound asleep for the next few hours. But this morning the phone wakes us. That week my wife was on the out-of-office-hours watch shift. The security services people informed her they had received a fire alarm. A few hours earlier(!) The local fire department had already responded to it, and because they could not find anything actually burning, they assumed it was not that big of a deal. So they waited for the next shift to follow up on the events and inform us. Their only concern was the door lock they had forced to gain access to the datacentre. Maybe we wanted to inspect the premises and get a locksmith to fit a new lock?
Fun fact: this datacentre is located on the premises of the fire department…
Arriving on scene, our first impression was: business as usual in the datacentre.
A quick glance at the server racks revealed no havoc. The storage systems showed all green lights, and the servers and switches appeared to be working normally. No red flags, no beeps.
So what was this alarm about then? Let’s have a look at the fire extinguishing system.
Hey, that’s not normal! The bottle pressure gauges indicated zero pressure in the system. On closer inspection, the system had performed a full fire extinguishing cycle, releasing all its inert gas into the datacentre.
However, we had no clue why it had done so, and the control panel did not indicate anything abnormal.
We decided to deal with the fire extinguishing system later; it was empty anyway. We started a check of all our systems. From the outside everything appeared to be running normally. No warning lights, all indicators green. So, back to bed?
Logging into some systems, however, we did see odd behaviour. What the heck? Every minute we got more puzzled by our systems. It soon became clear things were not normal, and we quickly suspected our storage system to be the culprit. The NetApp storage appeared normal at first glance, which was consistent with all the green lights on the drives. A more detailed search revealed the storage system to be severely broken: 56 disks in total had died!
So why did it not show the disaster in the drive bays?
It appeared the system had decided the situation was too severe to handle itself.
It did not have a clue what was going on and had simply stalled a few aggregates. I guess a loss of 56 disks at once was slightly above its design specs 😉
The joy of management not having a clue is that they leave you with nothing to work with. Storage costs money, so free space is seen as waste. The company had filled the system far beyond the recommended 70%. (Actually, the system was filled beyond 90%…) Besides giving horrible performance to all end users, this also leaves you no room to work with if something goes wrong.
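To make the headroom problem concrete, here is a minimal back-of-the-envelope sketch. The capacity figure is invented for illustration, not our actual system; only the 70% and 90% fill levels come from the story above:

```python
# Illustrative numbers: TOTAL_TB is an assumption, not our real capacity.
TOTAL_TB = 100.0          # usable capacity of the storage system
RECOMMENDED_FILL = 0.70   # vendor-recommended maximum fill level
ACTUAL_FILL = 0.90        # roughly what our system was filled to

headroom_recommended = TOTAL_TB * (1 - RECOMMENDED_FILL)
headroom_actual = TOTAL_TB * (1 - ACTUAL_FILL)

print(f"Headroom at 70% fill: {headroom_recommended:.0f} TB")
print(f"Headroom at 90% fill: {headroom_actual:.0f} TB")
# At 90% fill you have a third of the emergency space you would have had
# at 70% - exactly the room you need when aggregates start dying.
```

That missing two-thirds of headroom is why our first move was hunting for free space instead of moving data.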
After escalating the incident to our managers, we dedicated the next hours to stopping all test & development systems in an attempt to free up enough space to move all data from the broken system to the healthy one in the other datacentre. With 56 disks already failed, we worried more disks might fail in the next few hours, disabling even more aggregates. The next problem: who are you going to call for 56 NetApp spare disks on a Sunday morning? We tried, but even NetApp itself did not have that many spare disks available within a few hours. (Or days, for that matter.) This is not a disaster they see very often.
It became a busy Sunday for us, relocating storage and virtual services to get the business up and running for the next Monday morning. After some calling around in our network of ICT specialists, we found that disks failing after an inert gas fire extinguishing event is not as rare as you might think. In some cases even more disks died within hours or days after the event. It became clear the only safe way to get through the coming days was to completely move all data to the other datacentre and write off all spinning hard disks in the affected datacentre. We faced a few very busy weeks, on the one hand trying to keep as many systems running as possible on the remaining storage, and on the other trying to find a solution for the defective system.
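For anyone wondering why this ate the whole Sunday: even the raw copy time adds up. A rough estimate sketch, where the data volume, link speed, and efficiency factor are all assumptions for illustration, not our real figures:

```python
# Back-of-the-envelope estimate of bulk data-move time between datacentres.
# All figures below are illustrative assumptions, not measurements.
data_tb = 50.0       # amount of data to relocate
link_gbps = 10.0     # inter-datacentre link speed
efficiency = 0.6     # sustained throughput fraction (protocol + disk overhead)

effective_gbps = link_gbps * efficiency
seconds = (data_tb * 8 * 1000) / effective_gbps   # 1 TB = 1000 GB, 8 bits/byte
hours = seconds / 3600
print(f"Roughly {hours:.1f} hours of pure transfer time")
```

And that is before you add the time spent deciding what to move where, and rebalancing load on the surviving system.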
Another challenging task was to find and prove the root cause to our insurance company.
I will cover this in a next blog post; it is interesting material.
Maybe you are curious what happened that Sunday morning.
After investigation the electronics all seemed OK, and the log files showed no errors on the gas fire suppression system. One thing was certain: the pilot cylinder’s gas release valve was open. The gas valves on the main bottles are too large to be operated electrically; they are operated by a smaller pilot bottle, which is fitted with an electrically operated valve.
This particular datacentre is built as a prefabricated box, like a walk-in cooling cell.
It has a raised floor like most datacentres. On the floor, in one corner, sits the gas fire extinguishing system; next to it, against a wall, a large airco unit. As engineers we always had to get used to the loud bang of the airco switching on every now and then, which came with a vibration in the floor.
This constant beating by the airco eventually led to the release valve of the pilot bottle failing.
Then the rest of the system emptied promptly. The loud noise induced by the sudden release of the gas damaged the disks; see my next blog post for more on this topic.