Skyline Major outage 4/19/07 server ip 67.19.58.194

#1
04-19-2007, 08:43 PM
GatorBrent
HostGator Staff
Join Date: Oct 2002
Location: houston, texas
Posts: 2,977

Ok guys this is what I was in the middle of writing up when Dave made some progress.


At 5:01 the Skyline server had uninterruptible processes and was unable to halt; even a forced shutdown failed to work. (Basically, the server became unresponsive.)


We had the datacenter reboot the server and this was their response...

"I tried to boot the system up using a LiveCD but it failed due to the error below. It appears the drive is unreadable and can't be mount via LiveCD.

hpt37x: reading /dev/sda[input/output error]"


At this point we separated the two drives in the server array and put them in a brand new chassis one at a time to attempt recovery. We shortly discovered it would be impossible to recover either one of the drives.

HostGator, along with the guys we asked at the datacenter, have never seen an issue like this happen before. Your guess is as good as ours.....
Maybe the power supply acted up, causing both drives to fail simultaneously. We have no idea.


Where we stand now.......

We have a brand new top of the line Conroe chassis that is going online in a few minutes with RAID 10 to replace this older Xeon that has failed. Once it is online we will begin doing a restore from backups that were taken on 4/15/07.


There is no way to know how long this will take, but we are guessing it will be around 12 hours to have everyone fully restored. This is a fully automated process, so some of your sites might be up in a few minutes, others in 12 hours, give or take.




The newest plan is way beyond my technical knowledge, but from talking to our CTO Dave on the phone, he said something like this....
The BIOS wouldn't mount the drives at all. fdisk -l listed partitions then stopped. Another command: mknod.

I'm going to have him post something more official once he's done working his magic. But right now it looks like we'll be able to get everyone online, just with a few things disabled on the server until we can rsync all the data to the new Conroe. The problem is the drive is screwed up at the beginning, where /var and some other files are, so if we access those partitions the I/O spikes and it brings down the server.


I know you guys will have more questions, and I have them too, so I'll keep you as updated as I can. We think this is going to work but we're not entirely sure yet.
__________________
Gators love marshmallows.

#2
04-19-2007, 08:51 PM
Demon
Hatchling Croc
Join Date: Sep 2004
Posts: 21

Is there any chance that an email can be sent to those on the server once it's back up, or should we guess and check?

Also, now that we're all going to the new box, do our accounts get upgraded to the newer packages? I.e., when I signed up I think I had the Aluminum package with 5 GB of space and 50 GB of transfer; now the package is 12 GB of space and 125 GB of transfer.

Last edited by Demon; 04-19-2007 at 08:56 PM.

#3
04-19-2007, 08:57 PM
GatorBrent
HostGator Staff
Join Date: Oct 2002
Location: houston, texas
Posts: 2,977

Good news is Dave's jerry-rigging seems to be holding up so far.

The drive is completely shot, but we managed to get your sites online. MySQL, httpd, and DNS all work.


What doesn't work is mail, FTP, and IMAP. We believe no mail will be lost, as in theory it should be stored in /var until the rsync is done. We are taking the box down now to install a second NIC so that it can rsync over to the new Conroe.


We are hoping all this will work; if it doesn't, it means 12 hours of downtime or more for everyone while the restores are running. So while it isn't ideal, the options are what we have now or 12 hours of downtime. I think everyone here prefers the broken setup we're currently running to being completely offline. =)

The box should be back up in 20 minutes with everything mentioned above in place. I apologize for the bad grammar and disorganization. I'm keeping you updated as much as possible on what we're doing to get this up as soon as possible, and on what all of our options have been up until now.

#4
04-19-2007, 08:58 PM
GatorBrent
HostGator Staff
Join Date: Oct 2002
Location: houston, texas
Posts: 2,977

Once the restore is done we will be upgrading all of the reseller packages to the new plans.

#5
04-19-2007, 09:17 PM
Demon
Hatchling Croc
Join Date: Sep 2004
Posts: 21

thanks for the updates

#6
04-19-2007, 09:33 PM
humanbn
Hatchling Croc
Join Date: Apr 2007
Posts: 3

So is a backup being put in from 4-15? Or have you gotten data off the drive? I've done a bunch of work in the days since 4-15, and so have some clients. I need that data how it was before it went down.

Is this possible or not?

#7
04-19-2007, 09:40 PM
slapshotw
Veteran Croc
Join Date: Jun 2006
Posts: 5,163

Quote:
Originally Posted by humanbn View Post
So is a backup being put in from 4-15? Or have you gotten data off the drive? I've done a bunch of work in the days since 4-15, and so have some clients. I need that data how it was before it went down.

Is this possible or not?

As far as I could tell they're not running a backup, they're running the current server with limited processes. This means all your data will be up to date.

ETA: The communication has been great from Brent in this. I know it's hard for you all on that server but imagine what it would be like if you didn't know anything. Increased communication is something HG's needed for a while and this looks like a step in the right direction.
__________________
Follow me on Twitter! http://twitter.com/mrw

#8
04-19-2007, 09:51 PM
humanbn
Hatchling Croc
Join Date: Apr 2007
Posts: 3

Some sites are up, some are down still as of time of writing this.

#9
04-20-2007, 12:15 AM
GatorDaveC
HostGator Staff
Join Date: Mar 2006
Location: Ontario, Canada
Posts: 2,147,483,696

As many of you know, the reseller server Skyline has had an emergency issue regarding its hard drives.

Skyline's hard drives were healthy yesterday. If they had not been, I would have received an e-mail saying a drive was degraded or, more importantly, S.M.A.R.T. errors from one of the drives.

If I remember correctly, we first noticed an issue at around 5:00 PM: there were uninterruptible processes which we could not kill. We tried everything we could think of within a reasonable amount of time to stop the processes, such as clearing the semaphores of the running applications and listing the open files that the processes were using. During this time we noticed that disk I/O was abnormally high, sitting at about 60% to 100% usage.
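
For anyone who wants to spot this kind of stuck process on their own box, here is a minimal sketch. It assumes a Linux `ps` that prints a STAT column, where a state beginning with `D` means uninterruptible sleep (usually waiting on disk I/O). The function name and the sample output are made up for illustration and are not from Skyline.

```python
def uninterruptible(ps_output: str) -> list[tuple[int, str]]:
    """Return (pid, command) for each process whose state starts with 'D'.

    Expects the text printed by `ps -eo pid,stat,comm`, header row included.
    """
    found = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)          # pid, stat, command
        if len(parts) == 3 and parts[1].startswith("D"):
            found.append((int(parts[0]), parts[2]))
    return found


# Hypothetical sample output, not captured from the real server.
SAMPLE = """\
  PID STAT COMMAND
    1 Ss   init
 4242 D    httpd
 4310 D+   exim
 5000 R    ps
"""

if __name__ == "__main__":
    for pid, command in uninterruptible(SAMPLE):
        print(pid, command)
```

Processes in this state ignore signals until the I/O they are waiting on completes, which is consistent with kill having no effect here.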

We ended up rebooting the server. After the POST ran, it loaded the 3ware BIOS, which did not print any errors at all, but the boot loader (GRUB) was not able to run. From there we had the data center break up the array and boot the server from a 'live' CD (Knoppix). During the boot, the server was unable to find any of the partition tables and printed endless I/O errors, so we ended up booting a different live CD (Gentoo) using failsafe options. Once the system was booted I ran "fdisk -l" to list the partitions, and the output was completely blank. I verified that the 3ware RAID module was loaded, and ran smartctl. SMART showed some major errors, all from today.

Because the SMART errors were presented nearly instantly, the hardware failure may have come from another component in the machine. This is also the assumed reason for the IO errors and disk IO usage previously encountered. The current hypothesis is that this failure might have been partly a power supply malfunction, but nothing can be confirmed yet.

Since the partitions would not list, it looked pretty bleak. I ended up trying to re-install GRUB into the MBR, which actually allowed me to list the partitions for some odd reason... I am guessing there was some major corruption in roughly the first 512 bytes of the drive. Because the corruption at the beginning of the drive was so bad, udev would not build the device nodes for /dev/sda*. Building the device nodes by hand using mknod seemed to work, and all of the partitions were able to mount. At this point I was pretty amazed that not 1, not 2, but ALL of the partitions were mountable. I chrooted into the environment, made sure everything was there, and remounted most of the partitions as read-only to prevent any more corruption.
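
To illustrate the mknod step, here is a rough sketch that only prints the commands and does not run them. It assumes the standard Linux numbering for sd disks: block major 8, with 16 minors per disk (minor 0 for the whole disk, 1 to 15 for partitions), so the nodes are created with type `b`. The helper name and the partition count are hypothetical, not what was actually on Skyline.

```python
SD_MAJOR = 8          # Linux block major number for sd* disks
MINORS_PER_DISK = 16  # whole disk plus up to 15 partitions


def mknod_commands(disk_index: int = 0, partitions: int = 4) -> list[str]:
    """Build the mknod command lines for one sd disk and its partitions."""
    if not 0 <= disk_index < 26:
        raise ValueError("this sketch only handles sda through sdz")
    if not 0 <= partitions < MINORS_PER_DISK:
        raise ValueError("an sd disk has at most 15 partition minors")
    letter = chr(ord("a") + disk_index)
    base_minor = disk_index * MINORS_PER_DISK
    # Whole-disk node first, then one node per partition.
    commands = [f"mknod /dev/sd{letter} b {SD_MAJOR} {base_minor}"]
    for part in range(1, partitions + 1):
        commands.append(
            f"mknod /dev/sd{letter}{part} b {SD_MAJOR} {base_minor + part}"
        )
    return commands


if __name__ == "__main__":
    for line in mknod_commands(disk_index=0, partitions=4):
        print(line)
```

Printing the commands first makes it easy to double-check the numbers before running anything as root on a damaged system.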

The data is currently rsyncing from the old Xeon server to a Conroe box which is connected directly with a crossover cable. Once the data is properly transferred from the old machine to the new one, an accurate damage assessment can be completed.
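
A transfer like this can be sketched roughly as below. This only builds an rsync command line; the mount point, IP address, and destination path are hypothetical placeholders, and the flags are one reasonable choice for a whole-server copy, not necessarily the ones used on Skyline.

```python
import shlex


def rsync_command(source: str, dest_host: str, dest_path: str) -> list[str]:
    """Build an rsync argument list for copying a directory tree to a new box."""
    return [
        "rsync",
        "-aH",                     # archive mode, and preserve hard links
        "--numeric-ids",           # keep raw uid/gid numbers, do not map by name
        f"{source.rstrip('/')}/",  # trailing slash: copy the directory's contents
        f"root@{dest_host}:{dest_path}",
    ]


if __name__ == "__main__":
    # Hypothetical source mount and crossover-link address.
    cmd = rsync_command("/mnt/old/home", "192.168.0.2", "/home/")
    print(shlex.join(cmd))
```

Keeping numeric IDs matters when the source is mounted from a live CD, since the live environment's user names will not match the accounts on the damaged system.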

I'll stay up until this server is functioning perfectly. If needed, I have a stash of Red Bull ready to go!

#10
04-20-2007, 12:35 AM
justG
Hatchling Croc
Join Date: Oct 2005
Location: LI, NY, US
Posts: 45

Thanks so much for the dedication and communication during this server crisis. Your effort is not going unnoticed and is greatly appreciated. I think we're all pretty clear on HG's technical prowess, it's the way you deal with your customers that has been kind of a sore spot in the past. This particular initiative has done loads to restore my faith in HG's people skills. Again, thank you.

#11
04-20-2007, 02:33 AM
GatorDaveC
HostGator Staff
Join Date: Mar 2006
Location: Ontario, Canada
Posts: 2,147,483,696

It looks like the transfer is about 2/3 of the way done now. I should have the other server up and running at around 6:00-7:00 AM.

If any information is missing, let support know, and we will be glad to restore whichever files are missing from your account, free of charge of course.

#12
04-20-2007, 05:08 AM
GatorDaveC
HostGator Staff
Join Date: Mar 2006
Location: Ontario, Canada
Posts: 2,147,483,696

Ok the transfer just finished about 5 minutes ago. I am working on getting the network configured. The new server should be online in about 30 minutes.

#13
04-20-2007, 05:35 AM
GatorDaveC
HostGator Staff
Join Date: Mar 2006
Location: Ontario, Canada
Posts: 2,147,483,696

Ok the new server should be online now. If you see any issues, please submit a support ticket.

#14
04-20-2007, 06:14 AM
barna
Hatchling Croc
Join Date: Mar 2005
Posts: 7

Issues found so far:

- Cannot log in to WHM and cPanel.
- Horde is down.
- Some e-mail accounts won't work.
- Mail sent to my addresses earlier today is not showing up.

#15
04-20-2007, 06:43 AM
Demon
Hatchling Croc
Join Date: Sep 2004
Posts: 21

I'm also seeing that email is not working. I cannot get into cPanel or WHM, and 2 of my sites are missing files and are not usable.

I will put a ticket in for these issues.

I really appreciate the hard work HG has put in to get the server back up and running.

#16
04-20-2007, 07:45 AM
humanbn
Hatchling Croc
Join Date: Apr 2007
Posts: 3

My sites are half up and half down. For example, going to 1lightmedia.com shows a directory listing with index files missing, etc.

#17
04-20-2007, 07:54 AM
ldearing
Royal Croc
Join Date: Dec 2006
Location: Atlanta, Ga.
Posts: 562

this is not my server so it's EASY for me to be impressed.... but i would just like to throw in my 2 cents that this is first rate crisis management and a particularly great flow of information. we all know "stuff" happens... it's what you do when it does that counts!
__________________
Cheers!!
Larry D.

#18
04-20-2007, 08:48 AM
benshar
Hatchling Croc
Join Date: Mar 2005
Posts: 2

I was wondering if cPanel/WHM was up yet? It finally loads a login prompt, but won't accept my passwords.

Maybe it's too soon?

#19
04-20-2007, 03:47 PM
GatorDaveC
HostGator Staff
Join Date: Mar 2006
Location: Ontario, Canada
Posts: 2,147,483,696

Quote:
Originally Posted by benshar View Post
I was wondering if cPanel/WHM was up yet? It finally loads a login prompt, but won't accept my passwords.

Maybe it's too soon?

Can you submit a support ticket with the subject "ATTN: DaveC"? Please include your login information and a basic description of the problem, and I would be glad to look into it.

#20
04-20-2007, 04:41 PM
barna
Hatchling Croc
Join Date: Mar 2005
Posts: 7

Quote:
Originally Posted by GatorBrent View Post
Once the restore is done we will be upgrading all of the reseller packages to the new plans.

My reseller account after restore is

Disk Space 5000 Megabytes
Bandwidth 50000 Megabytes

Are we going to get what Brent promised?

#21
04-20-2007, 06:04 PM
GatorDaveC
HostGator Staff
Join Date: Mar 2006
Location: Ontario, Canada
Posts: 2,147,483,696

Quote:
Originally Posted by barna View Post
My reseller account after restore is

Disk Space 5000 Megabytes
Bandwidth 50000 Megabytes

Are we going to get what Brent promised?

Yes, we can upgrade your account, please submit a ticket to Sales and they can upgrade it for you.

#22
04-20-2007, 06:30 PM
chaloupe
King Croc
Join Date: Nov 2004
Location: Moncton, New-Brunswick, Canada
Posts: 1,167

Wow! I give my 2 thumbs up on this to HostGator. Great communication.

Did GatorDaveC have a chance to catch some sleep? Let's get him a raise!
__________________
Chaloupe
www.jbwseries.com


#23
04-21-2007, 09:06 AM
Serra
Veteran Croc
Join Date: Feb 2005
Location: Orange Park, FL
Posts: 5,067

Yes, even the worst problems can look half as bad if people are kept informed.
__________________
Six stages of Dedi Ownership


#24
04-22-2007, 09:39 AM
Hostalot
Baby Croc
Join Date: Mar 2007
Posts: 92

This may sound slightly geeky but it's quite interesting to read this error report.
__________________
A review directory of cheap web hosting providers
Read my Hostgator review before you sign up
Listings of the cheapest web hosting providers

#25
04-22-2007, 10:44 AM
Sam
Emperor Croc
Join Date: Jan 2007
Location: /bin/false
Posts: 3,059

The good communication that HG had during this issue would make people sign up, as they would know how they would be treated.