Monday, August 20, 2007

Web Outages

We have had a lot of problems with web outages over the last couple of weeks. The outages stem from air conditioner problems at the co-location center that are being neither adequately monitored nor resolved.

I've been out there close to a dozen times in the last two weeks. On five of those visits the A/C was inoperative and the building interior was over temperature. Only one of those times were they aware of it when I called.

On Saturday, August 18, 2007, I replaced ultra1, which is the main web server machine and also acts as an NFS server for some other applications that require common disk space. Earlier I had replaced a failed CPU, and by Saturday four SIMMs had failed. Not knowing for sure whether it was the heat, a bad motherboard, or a bad power supply, I replaced the entire machine. On Saturday, they had just repaired the air conditioner before I arrived and the room was down to 74°F.

Monday, August 20, 2007 (today), around 11 AM, the web server crashed. This was the new, just-replaced web server.

Oops crash message on console

When I arrived, approximately 11:30AM, the co-location facility was 89°F indoors.

This is the thermostat on the wall inside the co-location facility. As you can see, it is set to 70°F but the actual temperature is 89°F.

I opened up my cabinet doors so that equipment could at least cool down to the 89°F room temperature, propped open the outside door in hopes of cooling the room off somewhat, and called the NOC.

Door propped open in an attempt to cool facility.

I called them at 15-minute intervals thereafter, and by 12:15 PM still nobody had been dispatched. I spoke to a supervisor, and by 12:41 PM someone was finally dispatched.

Other customers had apparently been there and left before I arrived, because one of them had removed the back from his cabinet and left it off and another had opened the door and left with it open.

Another customer removed the back of his cabinet and left with his equipment exposed.
A customer left his cabinet unlocked and open to keep it as cool as possible with the A/C failed.

Mark arrived about 1:20PM, by which time the room had cooled to 87°F with the door propped open. He called the A/C repair company, propped open an inside door between the co-location facility and their equipment room (where the A/C was working fine), and set a couple of floor fans in the doorway to blow cold air in.

By the time the A/C repair people were there, around 1:50 PM, the room was down to 83°F. The A/C guys determined that there was a problem with the unit on the roof. However, the tech who was there Saturday had taken the key home with him, so they couldn't get up on the roof immediately. We had to wait for that tech to arrive with the key. The second tech arrived about 2:10 PM or so.

Once they had access to the roof, the A/C repair folks found a bad card in the A/C unit there and replaced it, and the A/C came back on. I left when the room was still at 81°F but cold air was blowing out of the vents.

There are many aspects of this event that are particularly troubling. It was 63°F outside and rainy.

Outside temperature 63°F
Outside it's cold and rainy.

Earlier this year it was 101°F one day in Bellevue. What would have happened if the A/C had failed then?

The temperature alarm was set to 85°F.

High temp alarm set to 85°F.

With it set this high, an alarm isn't even generated until the room temperature has risen 15°F above the normal temperature (the thermostat is set to 70°F).

The problem is greater than just high temperatures. These cabinets have very little ventilation. The A/C didn't just stop cooling; as far as I could tell, there was no air coming out of the vents at all, warm or cold. In this situation, the temperature inside an equipment cabinet gets far hotter than the room outside it. When I opened mine up, it was like pulling bread out of an oven.

To my way of thinking, the high-temp alarm should be set just a few degrees above the normal temperature, so that when the A/C fails, an alarm will be generated before customers' equipment is already melting and flaming.
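To put numbers on that, here is a rough Python sketch comparing the existing 85°F alarm limit with one set a few degrees above the 70°F thermostat setting. The 5°F margin and the function names are my own assumptions for illustration; this is not how the facility's alarm system actually works, just the arithmetic behind the argument.

```python
SETPOINT_F = 70.0        # thermostat setting at the facility
CURRENT_ALARM_F = 85.0   # high-temp alarm limit as currently configured
ASSUMED_MARGIN_F = 5.0   # hypothetical "few degrees" margin (an assumption)


def alarm_threshold(setpoint_f: float, margin_f: float) -> float:
    """Return the temperature at or above which an alarm should be raised."""
    return setpoint_f + margin_f


def is_alarm(reading_f: float, threshold_f: float) -> bool:
    """True when a reading has reached the alarm threshold."""
    return reading_f >= threshold_f


proposed_alarm_f = alarm_threshold(SETPOINT_F, ASSUMED_MARGIN_F)

# Room temperatures observed during these visits, in degrees Fahrenheit.
readings_f = [74.0, 81.0, 83.0, 87.0, 89.0]

for reading in readings_f:
    print(
        f"{reading:.0f}F  "
        f"current limit alarms: {is_alarm(reading, CURRENT_ALARM_F)}  "
        f"proposed limit alarms: {is_alarm(reading, proposed_alarm_f)}"
    )
```

With a margin like this, a room at 81°F or 83°F would already be in alarm, whereas the current limit stays silent until 85°F.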

Even at the insanely high temperature the alarm is set to, with the room at 89°F an alarm had to have come in already, and yet the NOC was blissfully unaware that anything was amiss when I called. In addition to their being unaware, it took four calls and a conversation with a supervisor to get anyone dispatched. The previous four times I called in a high temperature condition, the NOC was also completely unaware. I have to conclude that either the high temperature alarm doesn't work, even when the limit is exceeded, or nobody actually monitors the alarms when they come in.

I have customers who are very dependent upon their websites for their income, as are we, because our main sources of advertising are the website and word of mouth. Failures like this do very negative things for both.

There are other customers out there who have fans on the back door of their cabinets. I have requested that the same be done for mine, so that in the event of a failure the inside of the cabinet doesn't get far above ambient. It isn't just the loss of A/C that bakes things, but the loss of air circulation, which causes the interior temperature to rise far above the already extreme ambient temperature.

I want the high-temp alarm issue addressed, with an alarm that works at a more reasonable temperature, and I want the systemic problems resolved: the alarms need to be monitored, and when a failure does occur, dispatch needs to happen in a timely manner. I am working on resolving these issues with this co-location provider.

1 Comment:

Blogger Dennis said...

Only one a/c unit? Sad state of affairs. I entered this field over 40 years ago working with mainframes. Back then the rule of thumb was to calculate your a/c load, divide by some number, and install that number +1 units of that size to cover failure and maintenance of any unit. Sometimes +2, depending on how critical your computer was to the life of the business. Now it sounds like a basically off-the-shelf a/c unit has become the single point of failure.

August 21, 2007 1:10 PM  
