#1
Ok guys, this is what I was in the middle of writing up when Dave made some progress.

At 5:01 the Skyline server had uninterruptible processes and was unable to halt; even a forced shutdown failed to work (basically the server became unresponsive). We had the datacenter reboot the server and this was their response: "I tried to boot the system up using a LiveCD but it failed due to the error below. It appears the drive is unreadable and can't be mount via LiveCD. hpt37x: reading /dev/sda[input/output error]"

At this point we separated the two drives in the server array and put them in a brand new chassis one at a time to attempt recovery. We shortly discovered it would be impossible to recover either one of the drives. Hostgator, along with the guys we asked at the datacenter, have never seen an issue like this happen before. Your guess is as good as ours... Maybe the power supply acted up, causing both drives to fail simultaneously. We have no idea.

Where we stand now: we have a brand new top-of-the-line Conroe chassis that is going online in a few minutes with RAID 10 to replace the older Xeon that has failed. Once it is online we will begin doing a restore from backups that were taken on 4/15/07. There is no way to know how long this will take, but we are guessing it will be at least around 12 hours to have everyone fully restored. This is a fully automated process, so some of your sites might be up in a few minutes, others in 12 hours, give or take.

The newest plan is way beyond my technical knowledge, but from talking to our CTO Dave on the phone he said something like this: the BIOS wouldn't mount the drives at all; fdisk -l listed partitions then stopped; another command was mknod. I'm going to have him post something more official once he's done working his magic. But right now it looks like we'll be able to get everyone online, just with a few things disabled on the server, until we can rsync all the data to the new Conroe.

The problem is the drive is screwed up at the beginning, where var and some other files are, so if we access those partitions the IO spikes and it brings down the server. I know you guys will have more questions, and I have them too, so I'll keep you as updated as I can. We think this is going to work but we're not entirely sure yet.
__________________
Gators love marshmallows.

#2
Is there any chance that an email can be sent to those on the server once it's back up, or should we guess and check?
Also, now that we're all going to the new box, do our accounts get upgraded to the newer packages? I.e., when I signed up I think I had the Aluminum package with 5 gig of space and 50 of transfer; now the package is 12 gig of space and 125 of transfer.
Last edited by Demon; 04-19-2007 at 08:56 PM.

#3
Good news is Dave's jerry-rigging seems to be holding up so far.
The drive is completely shot, but we managed to get your sites online. MySQL, httpd, and DNS work. What doesn't work will be mail, FTP, and IMAP. We believe no mail will be lost, as in theory it should be stored in var until the rsync is done. We are taking the box down now to install a 2nd NIC so that it can rsync over to the new Conroe. We are hoping all this will work; if it doesn't, it means 12 hours of downtime or more for everyone while the restores are running. So while it isn't ideal, the options are what we have now or 12 hours of downtime. I think everyone here prefers the broken setup we're currently running to being completely offline. =) The box should be back up in 20 minutes with everything mentioned above taking place. I apologize for the bad grammar and disorganization. I'm keeping you updated as much as possible on what we're doing to get this up as soon as possible, and what all of our options have been up until now.
__________________
Gators love marshmallows.

#4
Once the restore is done we will be upgrading all of the reseller packages to the new plans.
__________________
Gators love marshmallows.

#5
Thanks for the updates.

#6
So is a backup being put in from 4-15? Or have you gotten data off the drive? I've done a bunch of work in the days since 4-15, and so have some clients. I need that data how it was before it went down.
Is this possible or not?

#7
Quote:
ETA: The communication has been great from Brent in this. I know it's hard for you all on that server but imagine what it would be like if you didn't know anything. Increased communication is something HG's needed for a while and this looks like a step in the right direction.
__________________
Follow me on Twitter! http://twitter.com/mrw

#8
Some sites are up, some are still down as of the time of writing this.

#9
As many of you know, the reseller server Skyline has had an emergency issue regarding its hard drives.

The health of Skyline's hard drives was fine yesterday; if it were not, I would have received an e-mail saying a drive was degraded, or more importantly S.M.A.R.T. errors from one of the drives. If I remember correctly, we first noticed an issue at around 5:00 PM: there were uninterruptible processes which we could not kill. We tried everything we could think of within a reasonable amount of time to stop the processes, such as clearing the semaphores of the running applications and listing the open files that the processes were using. During this time, we did notice the disk IO was abnormally high, sitting at about 60% to 100% usage.

We ended up rebooting the server. After the POST ran, it loaded the 3ware BIOS, which did not print out any errors at all, but the boot loader (GRUB) was not able to run at all. From there we had the data center break the array up and boot the server to a 'live' CD (Knoppix). During the boot, the server was unable to find any of the partition tables and was printing out endless IO errors. We ended up having to boot to a different live CD (Gentoo) using failsafe options. Once the system was booted I attempted to run "fdisk -l" to list the partitions, and the output was completely blank. I verified that the 3ware RAID module was loaded, and ran smartctl. SMART showed some major errors, all from today only. Because the SMART errors appeared nearly instantly, the hardware failure may have come from another component in the machine. This is also the assumed reason for the IO errors and disk IO usage previously encountered. The current hypothesis is that this failure might have been partly a power supply malfunction, but nothing can be confirmed yet.

Now since the partitions would not list, it looked pretty bleak. I ended up trying to re-install GRUB into the MBR, which actually allowed me to list the partitions for some odd reason...
I am guessing there was some major corruption in the first ~512 bytes of the drive. Since the corruption was so bad at the beginning of the drive, udev would not build the block device nodes for /dev/sda*. Building the device nodes by hand using mknod seems to have worked: all of the partitions were able to mount. At this point I was pretty amazed that not 1, not 2, but ALL of the partitions were mountable. I chrooted into the environment and just made sure everything was there, and remounted most of the partitions as read-only to prevent any more corruption.

The data is currently rsyncing from the old Xeon server to a Conroe box which is connected directly with a crossover cable. Once the data is properly transferred from the old machine to the new one, an accurate damage assessment can be completed. I'll stay up until this server is functioning perfectly. If needed, I have a stash of Red Bull ready to go!
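For anyone curious why damage to the first 512 bytes would leave "fdisk -l" blank: on an MBR-partitioned disk, sector 0 holds the boot code, a four-entry partition table starting at byte 446, and a 0x55 0xAA signature in the last two bytes. Below is a minimal Python sketch of how a saved copy of that sector could be inspected. It is only an illustration, not what was run on Skyline; the names `parse_mbr` and `build_sample_sector` and the sample partition values are made up for the example.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446        # partition table follows the boot code
ENTRY_SIZE = 16           # four 16-byte entries
SIGNATURE = b"\x55\xaa"   # last two bytes of a valid MBR sector


def parse_mbr(sector: bytes):
    """Return (signature_ok, partitions) for a 512-byte MBR sector.

    partitions lists only the non-empty table entries.
    """
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected exactly 512 bytes")
    signature_ok = sector[510:512] == SIGNATURE
    partitions = []
    for index in range(4):
        start = TABLE_OFFSET + index * ENTRY_SIZE
        entry = sector[start:start + ENTRY_SIZE]
        # Layout: status(1) CHS start(3) type(1) CHS end(3) LBA start(4) sector count(4)
        status, ptype, lba_start, sectors = struct.unpack("<B3xB3xII", entry)
        if ptype == 0:
            continue  # type 0 marks an unused entry
        partitions.append({
            "index": index + 1,
            "bootable": status == 0x80,
            "type": ptype,
            "lba_start": lba_start,
            "sectors": sectors,
        })
    return signature_ok, partitions


def build_sample_sector() -> bytes:
    """Hypothetical healthy sector: one bootable Linux (0x83) partition."""
    sector = bytearray(SECTOR_SIZE)
    entry = struct.pack("<B3xB3xII", 0x80, 0x83, 63, 204800)
    sector[TABLE_OFFSET:TABLE_OFFSET + ENTRY_SIZE] = entry
    sector[510:512] = SIGNATURE
    return bytes(sector)


if __name__ == "__main__":
    print(parse_mbr(build_sample_sector()))  # healthy sample sector
    print(parse_mbr(bytes(SECTOR_SIZE)))     # wiped sector: no signature, no partitions
```

A wiped or garbled sector comes back with no valid signature and no partitions, which would look like the empty listing described above. Reading the real sector would mean opening the disk device as root and reading 512 bytes, and on a failing drive that read can itself return an IO error.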
#10
Thanks so much for the dedication and communication during this server crisis. Your effort is not going unnoticed and is greatly appreciated. I think we're all pretty clear on HG's technical prowess; it's the way you deal with your customers that has been kind of a sore spot in the past. This particular initiative has done loads to restore my faith in HG's people skills. Again, thank you.

#11
It looks like the transfer is about 2/3 of the way done now. I should have the other server up and running at around 6:00-7:00 AM.
If there is any missing information, let support know, and we will be glad to restore whichever files are missing from your account, obviously free of charge.

#12
Ok the transfer just finished about 5 minutes ago. I am working on getting the network configured. The new server should be online in about 30 minutes.

#13
Ok the new server should be online now. If you see any issues, please submit a support ticket.

#14
Issues found so far:
- Cannot log in to WHM and Cpanel.
- Horde is down.
- Some e-mail accounts won't work.
- Mail sent to my addresses earlier today is not showing up.

#15
I'm also seeing that email is not working, I cannot get into cPanel or WHM, and 2 of my sites are missing files and are not usable.
I will put a ticket in for these issues. I really appreciate the hard work HG has gone to, to get the server back up and running.

#16
My sites are halfway up and halfway down. For example, going to 1lightmedia.com shows a directory listing and index files missing, etc.

#17
This is not my server so it's EASY for me to be impressed... but I would just like to throw in my 2 cents that this is first-rate crisis management and a particularly great flow of information. We all know "stuff" happens... it's what you do when it does that counts!
__________________
Cheers!! Larry D.

#18
I was wondering if cPanel/WHM was up yet? It finally loads a login prompt, but won't accept my passwords.
Maybe it's too soon?

#19
Quote:
Can you submit a support ticket with the subject "ATTN: DaveC"? Please include your login information and a basic description of the problem, and I would be glad to look into it.

#20
Quote:
Disk Space 5000 Megabytes
Bandwidth 50000 Megabytes
Are we going to get what Brent promised?

#21
Quote:
Yes, we can upgrade your account; please submit a ticket to Sales and they can upgrade it for you.

#22
Wow! I give my 2 thumbs up on this to Hostgator. Great communication.
Did GatorDaveC have a chance to catch some sleep? Let's get him a raise!
__________________
█ Jean Boudreau - SysAdmin WannaBe @ Host And Mail █ Shared, Reseller cPanel Hosting and Backup Solutions █ http://www.hostnmail.com/

#23
Yes, even the worst problems can look half as bad if people are kept informed.

#24
This may sound slightly geeky but it's quite interesting to read this error report.
__________________
A review directory of cheap web hosting providers
Read my Hostgator review before you sign up Listings of the cheapest web hosting providers

#25
The good communication that HG had during this issue would make people sign up, as they would know how they would be treated.