hot server room = dead equipment?

Nybbles

Ok, so I need some opinions here...

I work for a mechanical contractor. We install commercial HVAC (heating and air conditioning) and plumbing systems. We installed an AC system in a server room for a client. Everything was going well until one day the unit shut down. (We're still trying to determine whether it was a bad unit or something specific they did, but anyway...) It's believed that the server room reached 95°F for at most (maybe less than) 2 hours.

Anyway, the client wants to bill us for some equipment that supposedly died because of this. They say they had to replace a JetStor III IDE RAID array with 8x160GB Seagate IDE HDs, as well as an Opteron proc, motherboard, and RAM.

I know 95°F is damn hot for a server room, but IN YOUR OPINION, how likely is it that less than 2 hours at this temp would kill all this equipment?
 
Originally posted by Nybbles
Ok, so I need some opinions here...

I work for a mechanical contractor. We install commercial HVAC (heating and air conditioning) and plumbing systems. We installed an AC system in a server room for a client. Everything was going well until one day the unit shut down. (We're still trying to determine whether it was a bad unit or something specific they did, but anyway...) It's believed that the server room reached 95°F for at most (maybe less than) 2 hours.

Anyway, the client wants to bill us for some equipment that supposedly died because of this. They say they had to replace a JetStor III IDE RAID array with 8x160GB Seagate IDE HDs, as well as an Opteron proc, motherboard, and RAM.

I know 95°F is damn hot for a server room, but IN YOUR OPINION, how likely is it that less than 2 hours at this temp would kill all this equipment?

Very. If it was 95°F ambient in the room, it was easily 150-200°F in the equipment housing, and the components were likely hotter still, pushing them over their operating limits.
 
Heh, 90°F can easily CRASH a computer from heat.
Server equipment is designed for 60-70 degrees anyway.
But then, there are people who run modern computers in 100-degree rooms during summer.
 
Servers generate plenty of heat on their own, and with the room at 95 degrees ambient on top of that, that's a really hot environment. Could it have made parts fail? It's possible, but I would think it's not your fault. They should be more aware of cooling unit failures, especially since they know a failure affects the other equipment in the same room.
 
The question should be, why didn't anything shut down? Thermal monitoring on the computer, at least, should have shut it down.
 
Definitely possible. 95 is extremely hot for a server room, although I have seen closets at remote sites (where the users use them for storage as well as a server closet) crammed full of gear with no ventilation and an ambient temp in the high 80s or 90s, and that stuff lived forever...
 
Someone said it above, why didn't they use software to monitor the temp to shut the hardware down??
 
Was it your job to determine how many units would be needed to cool the room, or did you just install what they told you to?
 
Originally posted by Roast
Someone said it above, why didn't they use software to monitor the temp to shut the hardware down??


All they would have to do is set the temp in the BIOS. lol. Usually on by default, no?
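For what it's worth, the same idea works from userspace too. Here's a minimal sketch of a poll-and-decide thermal watchdog; the thresholds and the hwmon sysfs path are illustrative assumptions, not anything from this thread:

```python
# Sketch of a userspace thermal watchdog. The thresholds and the hwmon
# path below are illustrative assumptions, not values from this thread.

def action_for_temp(temp_c, warn_c=60.0, crit_c=70.0):
    """Decide what to do for a component temperature in Celsius."""
    if temp_c >= crit_c:
        return "shutdown"  # a real script might run: shutdown -h now
    if temp_c >= warn_c:
        return "warn"      # a real script might email/page the admin
    return "ok"

def read_hwmon_temp(path="/sys/class/hwmon/hwmon0/temp1_input"):
    """Read a Linux hwmon sensor; the sysfs files report millidegrees C."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0

if __name__ == "__main__":
    for t in (45.0, 62.0, 75.0):
        print(f"{t:.0f}C -> {action_for_temp(t)}")
```

A cron job or daemon looping over `read_hwmon_temp` plus `action_for_temp` is all it really takes, assuming the box exposes a sensor at all.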
 
It is rather negligent on the part of the admins not to have a backup or plan for that type of event. Every place I've worked has lost A/C to the machine room at least once in the past 5 years, and the only thing that seems to get hit is the network equipment. The rest of the equipment will warn us when overheating, and we've got portable A/C units and box fans for those times when we can't shut down.

As for their choice of hard drives, serves them right for buying into IDE arrays. Those units are garbage compared to SCSI. Most of my own SCSI units have been operating at 170F+ internal temps for years, sometimes even 200F. The SCSI mfgs know those units need to tolerate high temps in tight cases and high spindle speeds, so they deliver. Two years ago one of the places I contract for lost A/C overnight and it was well into 110F when we came in the next morning. The servers were cherry, but awfully toasty. Burned my fingers when I popped the cover and touched the drive pack.

Lastly, I've run some of my own equipment in rooms without cooling just south of Sac. They're older machines that produce less heat, so I can't say how an Opteron would hold up, but they've survived Central Valley summers. I should probably get around to installing a timed utility fan to blast the air out at 3am or something.
 
Well, there is the issue of hardware monitoring and why it apparently did not shut the system down. (BTW, does anyone know if RAID arrays, NASes, or SANs typically have any sort of temperature monitoring?)

There are also other issues at play that I cannot divulge here, like other intentional sources of excess heat in nearby areas that could cause the AC unit to overheat, and blah blah blah...

There's also the issue of building controls. Almost any commercial HVAC unit that's installed anymore has an electronic controls system installed as well. Another company was contracted to install and configure the controls... so the question is, what sort of temperature monitoring and alarming had they set up for this temperature-critical area? On a tangent, the controls systems are pretty cool, with the newer ones being web-based. Very snazzy stuff.

But anyway, if anyone could tell me whether or not most arrays, SANs, and NASes have environmental monitoring, that'd be cool. In the meantime, I'm gonna see if I can pull up any specifics on this JetStor III that supposedly fried.

Also, many thanks to all those who've replied so far.
 
Definitely possible... but did all 8 HDs die? That doesn't seem that likely... I have a lot of IDE/SCSI disks in high-heat environments and they seem OK.

All of my servers and networking equipment is in my garage, and it can get into the 90s in there, and I've never had a problem.

Although all my servers do have full thermal monitoring, and auxiliary fans (read: lots of tornados) will turn on if things get too hot... if they still don't cool off, they send me an email and shut down.

BUT

since this is a datacenter, you have to look at the paperwork and see if the datacenter is responsible for the failure... which really shouldn't have happened... since datacenters really should have N+1 redundancy for everything anyway...

So I would tell your boss to tell them to f*off... they should have had a failover HVAC or two.
 
Originally posted by Nybbles
Well, there is the issue of hardware monitoring and why it apparently did not shut the system down. (BTW, does anyone know if RAID arrays, NASes, or SANs typically have any sort of temperature monitoring?)

Not typically, because shutting down storage arrays is a real good way to lose/corrupt data.
 
Originally posted by FLECOM
Definitely possible... but did all 8 HDs die? That doesn't seem that likely... I have a lot of IDE/SCSI disks in high-heat environments and they seem OK.

All of my servers and networking equipment is in my garage, and it can get into the 90s in there, and I've never had a problem.

Although all my servers do have full thermal monitoring, and auxiliary fans (read: lots of tornados) will turn on if things get too hot... if they still don't cool off, they send me an email and shut down.

BUT

since this is a datacenter, you have to look at the paperwork and see if the datacenter is responsible for the failure... which really shouldn't have happened... since datacenters really should have N+1 redundancy for everything anyway...

So I would tell your boss to tell them to f*off... they should have had a failover HVAC or two.

I found this on their web site for the IDE JetStor III:

50°F to 104°F operating

For the fun of it I called them and asked about monitoring; they do indeed watch internal ambient case temps, and will notify when the unit gets above 104°F.
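That published 50°F to 104°F envelope is easy to check a reading against. A quick sketch; the spec numbers are the ones quoted above, and everything else is just standard unit conversion:

```python
# Check readings against the JetStor III's published operating envelope
# (50°F to 104°F, as quoted above). Nothing here is vendor-specific
# beyond those two numbers.

SPEC_MIN_F, SPEC_MAX_F = 50.0, 104.0

def c_to_f(temp_c):
    """Standard Celsius-to-Fahrenheit conversion."""
    return temp_c * 9.0 / 5.0 + 32.0

def within_spec(temp_f):
    """True if a Fahrenheit reading falls inside the operating envelope."""
    return SPEC_MIN_F <= temp_f <= SPEC_MAX_F

# A 95°F room is inside the envelope; 40°C is exactly the 104°F limit.
assert within_spec(95.0)
assert within_spec(c_to_f(40.0))
assert not within_spec(110.0)
```

Which is exactly why the ambient-vs-internal question matters: a 95°F room passes this check only if the spec means room temperature.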
 
I don't know if the heat would kill a system or parts of one, but HDDs would be one of the things I would expect to last longer than other parts. And the fact that 8 of them went out... seems a bit fishy. I would ask to see the "dead" HDDs and test them (not saying they won't just do something to them and fubar them, but most physical damage could be seen anyway).
 
Flecom,

As to whether or not all 8 drives failed, I don't know. I'm just looking at a faxed invoice for the replacement equipment. It looks like they replaced the JetStor III and the drives as a whole. Perhaps they could have simply replaced any failed HDs instead of the entire unit AND drives.

And Zwitterion, I'm not an engineer. I don't design HVAC systems. I simply keep this company's network operational. However, the unit was designed sufficiently for the projected heat load. There was an issue of nearby areas being intentionally heated beyond the scope of the project that could have overloaded the unit.

So anyway, I guess I'll let the project manager know the general consensus of the board and let him work it out with the client! Even still, I say keep the thread going if anyone has any other relevant info to add.
 
We had the EXACT same thing happen at my old work. Of course, I told the boss way BEFOREHAND that the server "closet" should have independent AC from the general offices. They said no, and one hot weekend, to save some money, they turned off the AC without thinking.

Well, it was about 100 degrees in there first thing Monday morning, with a RAID alarm beeping away! What I "think" happened is that the RAID controller began to malfunction, and in its all-knowing state of trying to fix itself, it efficiently decided that 3 of the 5 hard drives were bad.

It turned out that only one drive had actually died, but since the RAID thought the other drives had died previously, it couldn't rebuild from the single failed drive. It was strange at any rate; we ended up having to wipe the whole system. Heat can do strange things!

At least the tape backup had failed 3 weeks prior (sarcasm), and they were waiting to replace it since it was SO expensive. Well, now they had an empty server and 24 people's data entry for three weeks gone. Was that tape drive so expensive now? I LOVED when the bosses would blow up at me and I had all my bases covered. "I told ya so" was about all I could say :)
 
Originally posted by Zardoz
I found this on their web site for the IDE JetStor III:

50°F to 104°F operating

For the fun of it I called them and asked about monitoring; they do indeed watch internal ambient case temps, and will notify when the unit gets above 104°F.

:eek: Wow Zardoz... talk about above and beyond!! ;)

Yeah, this is a VERY important fact that may save us thousands of dollars. You just beat me to the punch in finding it. ;)

Thanks a whole ton, and thanks for all the continued and VERY speedy replies.
 
Originally posted by Nybbles
:eek: Wow Zardoz... talk about above and beyond!! ;)

Yeah, this is a VERY important fact that may save us thousands of dollars. You just beat me to the punch in finding it. ;)

Thanks a whole ton, and thanks for all the continued and VERY speedy replies.

You might want to figure out whether that operating temp is room ambient or case internal. The two will differ wildly. If it's case internal (which is probably the case, since it has an internal(!) temp monitor), then there's no way it could remain below 104F internal with 95F ambient, and the case internal temp is directly impacted by the room ambient. Higher ambient forces higher case internal.
 
Originally posted by skritch
You might want to figure out whether that operating temp is room ambient or case internal. The two will differ wildly. If it's case internal (which is probably the case, since it has an internal(!) temp monitor), then there's no way it could remain below 104F internal with 95F ambient, and the case internal temp is directly impacted by the room ambient. Higher ambient forces higher case internal.

Well, if it says that that is the "environmental operating" temperature of the array, wouldn't that automatically be ambient? I mean, they know EXACTLY how much cooling their unit has and how many CFM it should push. Therefore, shouldn't they be able to state their recommended ambient temperature range?

Maybe I'll give the company a call tomorrow, though.
 
Oh, I forgot to mention: afterwards, they had some of our high-dollar engineers make a little device that was connected to the thermostat. If it went above a certain temp it sounded an alarm; if it went to a second, higher temperature it actually triggered the building alarm, and then the security company would call whoever was in charge and inform them. They did put in a standalone AC unit too.
 
Originally posted by Nybbles
Well, if it says that that is the "environmental operating" temperature of the array, wouldn't that automatically be ambient? I mean, they know EXACTLY how much cooling their unit has and how many CFM it should push. Therefore, shouldn't they be able to state their recommended ambient temperature range?

Maybe I'll give the company a call tomorrow, though.

If it says environmental operating temp, then yes, that'd be ambient. But Zardoz' post just said "operating", which is why I brought the distinction up. I've seen some equipment specify ambient, and some specify internal (usually companies that'd rather not have to replace equipment for free when fans fail).
 
Originally posted by Supchaka
Oh, I forgot to mention: afterwards, they had some of our high-dollar engineers make a little device that was connected to the thermostat. If it went above a certain temp it sounded an alarm; if it went to a second, higher temperature it actually triggered the building alarm, and then the security company would call whoever was in charge and inform them. They did put in a standalone AC unit too.

It's funny how things are not determined to be a high priority until AFTER a disaster strikes.

We had a situation in the office where I came in to one dead drive in our RAID 5 array. OK, 1 drive, no HUGE deal. So I called HP, as it was under warranty, and got it replaced. Well, I didn't get the drive the next day like I was supposed to. I come in the second day, and the second I walk in the door, someone asks, "Hey, is the server down??"

As soon as they said that, I got a bad feeling in my stomach and ran to the room. Now I had 2 red lights on the server.

You'd be surprised how much money I got for drives after that. We did take the opportunity to increase the storage capacity of the array, as well as upgrade from NT4 to Win2k. We also used the warrantied drive, bought more, installed a hot spare, and bought an extra to keep on hand so I can replace a drive AS SOON AS IT DIES.

Man, who would've thought.

But one good thing about working for someone who does commercial AC for a living is that I have a VERY NICE independent AC unit for my server room. It's also got a remote control that I play with when I'm bored waiting for stuff to install. :p
 
Originally posted by skritch
If it says environmental operating temp, then yes, that'd be ambient. But Zardoz' post just said "operating", which is why I brought the distinction up. I've seen some equipment specify ambient, and some specify internal (usually companies that'd rather not have to replace equipment for free when fans fail).

Good point. Yes, I looked up the official data from the manufacturer and it is indeed environmental operating temperature.
 
Just as a note: the mgr at my office from like 3 yrs ago bought a JetStor unit. It's a SCSI unit and I don't recall the exact model, but maybe it's a JetStor 2(??). In any case, the thing would die if you looked at it funny, much less if there were actually some extreme heat conditions or anything. AC&NC (the makers of the JetStor) produce total and utter crap equipment.

We got the unit and within 6 months lost 3 out of 4 HDs. Now, hardware failure is possible, but that's ridiculous. What is more ridiculous is the fact that when we returned a drive, they would RMA the drive and wait until THEY got the RMA back before reissuing a drive. ARE YOU SERIOUS???? It took over a month to get the first drive back. Subsequent drives came back slightly more expediently, but not by much.

Also, since we had so many problems with the unit, my coworkers and I have had extensive experience with their tech support on the phone. We have determined that there are only 3 dudes at the entire company (or there were at the time, anyway). They all have Russian accents and smoke a lot. One day, after several months of problems, they noted that they would be in DC for a NAS conference and said they would stop by. FANTASTIC! Their customer support up to this point had been utterly abysmal.

They said they would bring the null modem cable necessary to hook up to the unit directly and diagnose the problems we were having. Well, the day comes and they show up. They go into the server room, tell us they did not in fact bring a null modem cable, and they diagnose the issue by -- GET THIS -- all 3 simultaneously tilting their ears toward the unit.

Shocked that such highly tuned diagnostic equipment could not determine the problem, they took the most recently failed drive (it had failed 2 days prior and was the last of the 3 failed drives) and the drive sled with them and left. Again, after about a month we got a working drive back. Since we actually care about our data, we moved everything off that hunk of crap, and we use it for testing and as a paperweight to this day.

Moral of the story? NEVER BUY AC&NC. And Nybbles, tell those people you did them a favor by killing their JetStor. Now they can buy something that WORKS!
 
Originally posted by Nybbles
Well, there is the issue of hardware monitoring and why it apparently did not shut the system down. (BTW, does anyone know if RAID arrays, NASes, or SANs typically have any sort of temperature monitoring?)

Monitoring, maybe. Auto-shutdown, usually not, because auto-shutdown isn't always the right thing to do.

Many IDE drives let you get temperature readings via SMART. Not sure about SCSI, but I imagine that temperature reporting would be the domain of a SAF-TE module and not a drive. It's generally up to some extra software to take appropriate action during overheat conditions.
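For a concrete example, with smartmontools, `smartctl -A /dev/hda` dumps the ATA attribute table, and attribute 194 is commonly Temperature_Celsius. A rough parsing sketch; the sample line below is illustrative, and real column layouts vary by drive:

```python
# Pull a drive temperature out of `smartctl -A` output (smartmontools).
# Attribute 194 is commonly Temperature_Celsius on ATA drives; the
# sample below is illustrative, and column layouts vary by drive.

def parse_smart_temp(smartctl_output):
    """Return the raw Temperature_Celsius value, or None if absent."""
    for line in smartctl_output.splitlines():
        toks = line.split()
        if len(toks) >= 10 and toks[1] == "Temperature_Celsius":
            return int(toks[9])  # RAW_VALUE column
    return None

sample = (
    "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      "
    "UPDATED  WHEN_FAILED RAW_VALUE\n"
    "194 Temperature_Celsius     0x0022   064   053   000    Old_age   "
    "Always       -       36\n"
)
print(parse_smart_temp(sample))  # -> 36
```

A monitoring script would feed this the output of `subprocess` running smartctl for each drive, then alert or shut down on a threshold, which is the "extra software" part.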

I would say that unless your contract or warranty covers indirect or consequential damage to other equipment, you really should be off the hook. HVAC equipment fails or hiccups all the time, and people ought to know that by now. You can't expect a car to run forever without maintenance and never break down or stall, and by maintenance, I'm not just talking filling the tank. HVAC is the same way.

Either you set up some failover redundancy in your HVAC, or you expect an occasional glitch.

(FYI, I used to work in the family business as a semi-hermetic HVAC compressor mechanic. 99% of the time, when we got a failed unit back, it would in fact be some installation or maintenance error. Most of the time either some faulty wiring would burn out the motor, or the evaporator would go south with nobody checking the oil level. Once or twice somebody let water get in the system, so we'd have warped valves and blown gaskets. There's just too many things that can go wrong, and not enough maintenance monkeys willing to do more than kick the machine if it hiccups.)

Originally posted by big daddy fatsacks
Just as a note: the mgr at my office from like 3 yrs ago bought a JetStor unit. It's a SCSI unit and I don't recall the exact model, but maybe it's a JetStor 2(??). In any case, the thing would die if you looked at it funny, much less if there were actually some extreme heat conditions or anything. AC&NC (the makers of the JetStor) produce total and utter crap equipment.

We got the unit and within 6 months lost 3 out of 4 HDs. Now, hardware failure is possible, but that's ridiculous. What is more ridiculous is the fact that when we returned a drive, they would RMA the drive and wait until THEY got the RMA back before reissuing a drive. ARE YOU SERIOUS???? It took over a month to get the first drive back. Subsequent drives came back slightly more expediently, but not by much.

Now that IS just wrong. Any decent vendor should at least give you a temporary-deposit option for advance replacement. That someone selling this (advertised) class of equipment should refuse that is intolerable. :mad:
 
Pretty much all modern drives have temp probes and some protocol to get that data. Finding a program to read it over your particular controller is the challenge, as many controllers are still not quite right in the head. Industry groups can create standards until the cows take over the Earth, but those damn manufacturers just don't follow them. ATA devices are the worst for this.

As for RAID, I refuse to run anything in a professional environment that's not SCSI. IDE is just not there. In all the years I've been working with servers, I've seen exactly one SCSI RAID controller die: an Adaptec that was some 8 years old when it did, having a total of 6 hours of down/off time during those 8 years. It just up and decided that a couple of drives had failed, blew the array away on rebuild, and cost us a lot of money in recovery fees (the tape from the most recent backup was EOL, nearly 5 years old and used twice daily). As usual, the execs suddenly opened their wallets and a million pennies spilled forth to buy a new server.

Lastly, it occurred to me you might rejoice that they're not trying to jack you over for lost productivity, recovery fees, etc.
 
Originally posted by Kelledin
....
I would say that unless your contract or warranty covers indirect or consequential damage to other equipment, you really should be off the hook. HVAC equipment fails or hiccups all the time, and people ought to know that by now. You can't expect a car to run forever without maintenance and never break down or stall, and by maintenance, I'm not just talking filling the tank. HVAC is the same way.....

This is precisely what I was thinking. It depends 100% on your contract with them, and I doubt they'd be able to hold you responsible without any monitoring in place. The heat could have caused the failure, but it could have simply been coincidental as well.

How big of a company was this? We had an independent AC unit for our server room that had the compressor fail a few times, which meant that the room got warm, but we had a very simple temperature monitor that was wired into the alarm system, so if the room went below 60 or over 80, we were alerted and could go in and open the room up until they repaired the AC.

And it most certainly is amazing what kind of money is available when something does fail. Luckily, management generally conceded to the extra money spent on the service contracts: 4-hour replacement part service for our HP (Compaq) servers and 24/7 HVAC support :).
 
Kinda funny how the admins didn't know the AC unit failed. A smarter admin would have had the doors open and exhaust fans in action. They would have also gone the extra mile to prepare for machine shutdown or worse... not just sat there and watched the damage add up.

It is also surprising that they didn't call an emergency HVAC crew to fix that AC ASAP, or at least bring in a temporary alternative for the time being.

I don't think you should have to pay that bill. They made the choice to go with a single HVAC system (most places I've worked at had two or more... some had as many as 6, and they all had network monitoring units).

I don't know about you, but it smells to me like they wanted new equipment, so they let it fry
:eek:

Also, weird how they want an entire new array, but then they ask for specifics like an Opteron proc, motherboard, and RAM :confused: I don't know about you, but if I had proper servers, I would not want replacement parts (not to mention sitting there fixing a server like that)... that would be compromising the entire system (not to mention a big waste of time).

Sounds like they are trying to take you for a ride.
 
Originally posted by skritch
You might want to figure out whether that operating temp is room ambient or case internal. The two will differ wildly. If it's case internal (which is probably the case, since it has an internal(!) temp monitor), then there's no way it could remain below 104F internal with 95F ambient, and the case internal temp is directly impacted by the room ambient. Higher ambient forces higher case internal.

case internal...
 
Originally posted by SupaFly99
Kinda funny how the admins didn't know the AC unit failed. A smarter admin would have had the doors open and exhaust fans in action.

Having worked in a variety of environments, I can say that there are some circumstances under which it's impossible to leave the doors open, due to policy, security devices, or both (things like Halon fire suppression systems can't work in a non-sealed environment).

Also, there may be no reasonable alternate location into which they could vent the heat, even if they had fans. And it's not just a venting issue; they'd also need to get cool air into the server room from somewhere, or they'd just be moving the hot air around.
 
Originally posted by skritch
Having worked in a variety of environments, I can say that there are some circumstances under which it's impossible to leave the doors open, due to policy, security devices, or both (things like Halon fire suppression systems can't work in a non-sealed environment).

Also, there may be no reasonable alternate location into which they could vent the heat, even if they had fans. And it's not just a venting issue; they'd also need to get cool air into the server room from somewhere, or they'd just be moving the hot air around.

Yes. But then it is their fault, don't you think? If you can spoon out that kind of cash for a Halon system... then you'd be stupid not to have any kind of heat backup plan.

Also, the same goes for policies... if there is no way that you can cool that environment down other than having an AC unit... then you ought to install a redundant cooling system.

You can move hot air around with fans... at least the outer environment is cooler, and most of the time it's a much larger room. I guarantee you that the outer rooms are not going to be close to 95 degrees, or even 80.
 
Originally posted by SupaFly99
Yes. But then it is their fault, don't you think? If you can spoon out that kind of cash for a Halon system... then you'd be stupid not to have any kind of heat backup plan.

That's a risk management issue, and it's impossible to say whether or not it was stupid without knowing all the factors considered when making that decision.

There are situations in which that would be a sensible position to be in. Risk analysis is a fairly complex field.
 
I don't know if I completely agree with you there. There has to be a balance... either you cut corners and assume the risk, or you cover your rear by taking extra steps.

You can't blame your problems on an HVAC company. Unless it is their fault, of course (e.g., they forgot to service it, or there was a known defect that was purposely overlooked).

I've been in critical server rooms where $14 million passes through every hour with very sensitive data... when their AC system failed... trust me... policy or not... you're not going to risk that kind of loss. You'd rather swing those doors open and keep an eye on things.
 
Originally posted by SupaFly99
I don't know if I completely agree with you there. There has to be a balance... either you cut corners and assume the risk, or you cover your rear by taking extra steps.

You can't blame your problems on an HVAC company. Unless it is their fault, of course (e.g., they forgot to service it, or there was a known defect that was purposely overlooked).

I've been in critical server rooms where $14 million passes through every hour with very sensitive data... when their AC system failed... trust me... policy or not... you're not going to risk that kind of loss. You'd rather swing those doors open and keep an eye on things.

*shrug* and I've been in critical server rooms where that much money passes through the system every few seconds, with data so sensitive they were regularly audited by the Nuclear Regulatory Commission. And they didn't have a backup HVAC. And the doors were not allowed to be opened, period. National security reasons.
 
Originally posted by skritch
*shrug* and I've been in critical server rooms where that much money passes through the system every few seconds, with data so sensitive they were regularly audited by the Nuclear Regulatory Commission. And they didn't have a backup HVAC. And the doors were not allowed to be opened, period. National security reasons.

*shrug* Something tells me they won't have a cheap HVAC unit either; it's probably fully redundant.

But I do see your point too. I just don't think they could be held liable unless they had prior agreements/contracts.
 
Originally posted by Kelledin
Now that IS just wrong. Any decent vendor should at least give you a temporary-deposit option for advance replacement. That someone selling this (advertised) class of equipment should refuse that is intolerable. :mad:
The experience with AC&NC regarding their JetStor was THE WORST customer service experience I have ever witnessed or been a part of.
 