Facts:
A fileserver based on a Tyan Thunder K8SD Pro (S2882-D), equipped with a 3Ware 9500S-8 SATA RAID card and 6 Western Digital 250GB disks in RAID5, housed in 2 enclosures: a Proware MS-324A and a Proware MS-223A. The front door of the server case is locked with a key that the owner cannot find, and there are no buttons outside the door (no power/reset/etc. buttons). The server is running Gentoo Linux.
The day starts:
08:55 – Mobile phone rings. I wake up but I don’t pick it up since I am unable to speak on the phone due to sleepiness. I am thinking that there is absolutely NO way something good is ever going to come out of a phone ringing that early.
09:05 – I wake up, check the phone and CallerID says that it is one of the customers that I do tech support for. I call them and ask what the problem is. Conversation follows:
Me: Hello, good day, what’s the problem that has you calling me so early?
Customer: Oh sorry, did I wake you up?
Me: It’s ok, I was just about to wake up (HUGE LIE)…
Customer: The fileserver keeps beeping today as it did last night.
Me: Beeping? Why didn’t you tell me yesterday?
Customer: I didn’t think it was of any importance, so last night I pulled the plug to make it stop beeping.
(That’s when my lower jaw reached my desk. Remember that the front door is locked, he has no access to buttons and of course he has no Linux knowledge to ssh in and power it off. Why didn’t he call me last night to do it, though???)
Me: You did what? You pulled the plug? And today you put it back online? And it beeps again?
Customer: That’s right. Do you know what the problem is?
Me: No, but let me log in remotely to the machine and I’ll take a look. I’ll call you back soon to tell you what’s going on.
First checks:
I ssh to the machine and start checking /var/log/messages. After some searching I find this:
3w-9xxx: scsi0: AEN: WARNING (0x04:0x0042): Primary DCB read error occurred:port=2, error=0x208.
I google for it and at the same time log in to IRC to ask some friends if they know anything about that error. No one seems to have seen it before. Some websites say this error is of no importance. Some others say it is very important and that I should call the vendor. I go to 3ware’s site and start searching the knowledgebase. I find these pages:
a) http://www.3ware.com/KB/article.aspx?id=14335
b) http://www.3ware.com/KB/article.aspx?id=14687
c) http://www.3ware.com/KB/Article.aspx?id=12072
I also check the status of the array using tw_cli (3Ware Command Line Utility). It says that it is verifying the array, probably due to the plug pulling the customer did.
I call the customer and tell him that the array is being verified and that I will call him back as soon as it finishes.
11:30 – The verify process ends. All is fine with the array.
11:32 – I call the customer and ask him if the beeping has stopped. He tells me that it is still going.
11:34 – I reboot the server and check the messages again. I now get:
c0 [Fri Aug 31 09:42:53 2007] WARNING (0x04:0x0042): Primary DCB read error occurred: port=3, error=0x208
But no verify process starts. I manually start one while examining the various commands that tw_cli provides. I ask another friend on IRC and he suggests that some disks might be failing.
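The two warnings quoted above come in slightly different shapes (one from the kernel log, one from tw_cli), but both carry the same fields. As a rough sketch, not anything 3ware ships, a few lines of Python can pull the port and error code out of either form, which makes it easier to see that the port changed from 2 to 3 between boots. The regular expression and the function name `parse_aen` are my own, written only against the two lines shown in this post.

```python
import re

# Written against the two "Primary DCB read error" lines quoted in this post;
# other AEN messages may not follow this layout (assumption).
AEN_RE = re.compile(
    r"WARNING \((?P<code>0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+)\): "
    r"(?P<text>.+?):\s*port=(?P<port>\d+), error=(?P<error>0x[0-9A-Fa-f]+)"
)


def parse_aen(line):
    """Return the AEN code, message text, port and error code, or None."""
    match = AEN_RE.search(line)
    if match is None:
        return None
    return {
        "code": match.group("code"),
        "text": match.group("text"),
        "port": int(match.group("port")),
        "error": match.group("error"),
    }


if __name__ == "__main__":
    lines = [
        "3w-9xxx: scsi0: AEN: WARNING (0x04:0x0042): "
        "Primary DCB read error occurred:port=2, error=0x208.",
        "c0 [Fri Aug 31 09:42:53 2007] WARNING (0x04:0x0042): "
        "Primary DCB read error occurred: port=3, error=0x208",
    ]
    for line in lines:
        print(parse_aen(line))
```

Feeding it a whole log file line by line and counting hits per port would show whether one port keeps coming up or whether the warning wanders around.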
Time for some face to face contact:
11:50 – I call the customer again and tell him that I am taking a taxi to go there in order to take a “closer” look.
11:55 – I am waiting for a taxi.
12:10 – Still waiting
12:15 – A taxi comes, I argue with an old man who is trying to take my turn for the taxi. I tell him that I have to go to a hospital immediately so he steps back.
12:25 – I arrive at the customer’s. The beeping can be heard all over the place, and even though the server is in a separate closed room one can hear it from two rooms away.
I take a monitor and a keyboard from another PC and plug them into the fileserver, I reboot it and enter 3Ware’s BIOS. No alarms/no errors are shown. I reboot it and start checking the motherboard’s BIOS. PC Health Status looks fine (the room is air-conditioned with a stable temperature of 21 degrees Celsius). I boot into Linux again. No errors in /var/log/messages or through tw_cli but the server keeps beeping. I am by then totally puzzled. I go to 3Ware’s site to create a customer account and open a trouble ticket. I take the output of the tw_cli show diag command and the previous errors that I posted above, along with various data from the machine, to fill in the needed details. I know that I won’t have an answer for at least 4-5 hours due to the time difference with the US, so I start messing around with the controller through tw_cli trying to find any clues.
13:30 – Since it’s Friday and the RAID5 array has no spare drive, I decide to order one more drive like the others from an online shop. Even if no drive has a problem at the moment, it won’t hurt to have a spare for the future.
I am also trying to help people continue to do their jobs without the company’s fileserver while messing around with the controller. I run smartctl for every disk to check their SMART attributes using something like: smartctl -a -d 3ware,2 /dev/sda. No errors at all from any disks. Temperatures normal. Then “-t short” SMART tests, no errors.
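Typing the smartctl command once per disk gets old quickly, so here is a small sketch of how the per-disk checks above could be generated. It only builds and prints the commands; it does not run them. The device node /dev/sda and the `-d 3ware,N` form come from the command I used, while the port range 0-5 is an assumption (six disks on an 8-port card; the post itself only shows ports 2 and 3), and the helper names are made up for this example.

```python
import shlex

DEVICE = "/dev/sda"  # device node from the command quoted above
PORTS = range(6)     # assumption: the six disks sit on controller ports 0-5


def smartctl_cmd(port, options=("-a",), device=DEVICE):
    """Build the smartctl argument list for one disk behind the 3ware card."""
    return ["smartctl", *options, "-d", f"3ware,{port}", device]


def all_cmds(options=("-a",), ports=PORTS):
    """Build one smartctl command per controller port."""
    return [smartctl_cmd(port, options) for port in ports]


if __name__ == "__main__":
    # First the full attribute dump for every disk, then the short self-tests.
    for cmd in all_cmds() + all_cmds(("-t", "short")):
        print(shlex.join(cmd))
```

Each printed line can be pasted into a root shell, or the lists can be handed to subprocess.run if you would rather have the script execute them.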
A strange idea:
14:30 – People have started leaving the company for noon break. I stay.
14:40 – A strange idea comes to mind. What if I remove the 3ware card? Will the beeping stop?
14:45 – I unscrew the case and pull the 3Ware card out of it. No success. The beeping continues.
14:55 – I unplug the power connectors from the first enclosure, the Proware MS-324A. No success. The beeping continues.
15:00 – I unplug the power connectors from the second enclosure, the Proware MS-223A. THE BEEPING STOPS!
15:05 – I plug the MS-324A’s power connectors back in. NO BEEPING.
So I have found out whose fault the beeping is, right? I try to take the MS-223A out of the server box. The process is rather tricky due to faulty screws or screws improperly screwed in (don’t laugh!) by the company that assembled the server (not me! NOT ME!!). I finally manage to get the enclosure out of the box and blow the dust off it. While doing that I notice that one fan is not acting like the other 2 when I blow air at it. It doesn’t spin as fast as the others do. I reconnect power to the enclosure and start the machine again. The beeping starts, but what is clear is that one fan has a spinning problem, I guess due to dust. I try to find the manual of the MS-223A on the web. That’s where I notice this:
When a fan's rotation speed is lower than 1000rpm the buzzer will sound.
Trying to fix the problem:
I am now certain of who’s to blame. I try unplugging the fan from the enclosure and putting the enclosure back in the server box. It keeps beeping.
16:00 – I start searching for a spare 60mm fan with a 3-pin Molex connector. Of course I can’t find any at the customer’s place. I go out and search the neighborhood for a computer store. I am lucky (you can laugh here) and I see a guy just entering his computer store. I go inside and ask him if he has any of the fans that I want. He doesn’t.
16:30 – I am back at the customer’s place. I order three 60mm fans with 3-pin Molex connectors from the net. Having some spare fans in the future sounds very very good to me.
17:30 – The customer and his employees come back to the office and I explain to him what has happened. I am shocked to learn from other employees that they had often heard it beeping in the past but nobody cared to tell me.
18:00 – I read my emails and 3ware’s support has replied to my case. They suggest that I download another diagnostics tool and run some tests.
I am too tired to test the controller with the new diagnostics. Since it’s Friday and the company closes for the weekend, I will run the tests once I have replaced the 60mm fan. Until then (which could easily be tomorrow if the fans arrive), I’ve shut the server down, just in case something is wrong with the controller or one of the disks.
Conclusion:
I am almost sure that if it hadn’t been for the beeping sound I wouldn’t even have noticed 3ware’s “errors” which were probably caused by the pulling of the main plug of the PSU. It might sound a bit strange, but I don’t actually worry about the diagnostics test that 3ware’s customer support proposed. I am very impressed by 3ware’s customer support and responsiveness. I don’t know how all this will end yet, but I think it will all be fine by the time I replace the fan.
DAMN FAN! YOU RUINED MY DAY.
I still hear this “Beep! Beep! Beep! Beep! Beep! Beep!” sound inside my ears.