  #1  
Old 01-04-2005, 07:36 PM
GatorBrent GatorBrent is offline
HostGator Staff
 
Join Date: Oct 2002
Location: houston, texas
Posts: 2,977
Default Supra events that took place and what to do.

The events that took place in a nutshell:

On 1/2/05 supra had a raid controller fail. When the raid controller was swapped, it was discovered that the main drive had also failed. This left us with one bad drive and what looked to be one good drive. We had to rebuild the raid1 array to be able to access the "good drive", which took about 18 hours. This is an automatic process that cannot be sped up by us or anyone else. At the time we believed it would take only about four hours for the drive to rebuild and for everyone to be fully operational again. It took over 18 hours, most likely because of the amount of corruption on the drive.

Once it was done we could not even access the shell, because the server's password file, which stores the root login as well as all the user logins, was corrupted. We found an old password file on the server and restored it, which gave us root access. Once we had root access we began to fix everything, including Apache, mysql, ftp, and pretty much every other system file there is. No user files were corrupted; it was all system and cpanel files.

Once we got a few major things fixed on the server, we found out that the raid controller had stopped writing to the drive about a month ago. This meant that all files on the server were a month old, and anything created within that month did not exist. This includes domain names, e-mails, and files (everything). We believed everything would be fine since we still had the weekly offsite backup, which had run one day prior. When we accessed the offsite backup server we found the backup was also corrupt. This is when all hope of restoring the server to normal was lost. Anything from the missing month of data cannot be recovered. There is nothing to recover it from.


Our current status:
We are left with a server that has many key cpanel files missing. One of them is the cpanel passwords file. If you are a reseller and you cannot log in, please e-mail support@hostgator.com with the password you would like your username reset to. To fix your cpanel users, you must log in to your WHM and modify their passwords from there.

We are currently at work fixing everything that is reported in http://forums.hostgator.com/showthread.php?t=2074
If you find something that has not already been mentioned, please post it here. Do not post about missing sites, missing user files, or passwords not working. The password fix is mentioned above, and missing files and sites cannot be restored.

We know how devastating this is for everyone, including us. We did everything that we could to recover your files and get your sites back online as fast as possible.

There was not much, if anything, that we could have done to prevent the hardware failures. We were actually in the server a few days prior looking for problems and found no signs of any hardware failure that had taken place or was about to take place.

We are extremely sorry for the events that took place. We have let everyone on the supra server down. I understand how much hate many of you have towards us. Please keep in mind we are suffering just as much as you are, if not more. Our reputation will be damaged for eternity; our monetary damages will exceed $10,000, and countless hours will be spent repairing the damages. We are all in this together.

No amount of money will bring back your files and repay you for the downtime that was caused. The following is not much, but this is what we can offer you:

1 month credit if you email requesting credit to sales@hostgator.com now.
2 month credit if you are patient and don’t request credit until 1/7/05
3 month credit if you are patient and don’t request credit until 1/14/05

We hope you can give us another chance to regain your trust and continue to serve you. To those who are patient and forgiving, we are forever in your debt.
__________________
Gators love marshmallows.
  #2  
Old 01-04-2005, 07:56 PM
Nutter Nutter is offline
Baby Croc
 
Join Date: May 2004
Location: Houston, Texas
Posts: 94
Default Re: Supra events that took place and what to do.

Brent, I of course cannot speak for anyone else, but hatred is not the correct word. Frustration maybe. But, for me at least, the frustration is aimed at the situation and not any person or company. It's bad for us, and bad for you. In my case, it's worse for you than it is for me. I didn't lose any money, aside from maybe a few pennies in AdSense.

It also helped the situation to have at least partial access. Having my WHM passwords reset and being able to login at least gave me hope that I would be back up soon. As it stands, except for being able to send email, it appears that I am up (fully understanding that I may be one of the lucky few).

I've said this several times in the other threads, but thank you and your staff for the work you're doing. Unlike others, I feel that this will make me more likely to stay with you. Not the server crashing obviously, but the way it has been handled since. Y'all have seemed very honest as to what has happened and is happening, and I really appreciate that. It is rare.

- Ryan
  #3  
Old 01-04-2005, 08:06 PM
Trinity31 Trinity31 is offline
Baby Croc
 
Join Date: Mar 2004
Location: Canada
Posts: 78
Positive Re: Supra events that took place and what to do.

I'm very pleased with HostGator actually, and always have been.

They kept us up to date, were honest, and told us exactly what was going on. I hope most of you will stay with them!
__________________
Regards,
Andrew Ellsworth
Trinity31.Net
  #4  
Old 01-04-2005, 08:16 PM
ztechpc ztechpc is offline
Hatchling Croc
 
Join Date: Jan 2005
Posts: 27
Default Re: Supra events that took place and what to do.

I still think a lot of this is bs.

You said the raid controller stopped writing a month ago. Raid controllers don't just stop writing - if the raid controller cannot access the drive the system will lock up.

Also, if you were in monitoring the system, you would have noticed a message from the raid controller software that something was not correct.

And if you did weekly backups and the most recent one was corrupt - then why not try the previous week's backup instead of leaving everyone on November's data?
  #5  
Old 01-04-2005, 08:26 PM
Nutter Nutter is offline
Baby Croc
 
Join Date: May 2004
Location: Houston, Texas
Posts: 94
Default Re: Supra events that took place and what to do.

Personally I would rather be back on line now with old data than back on line a couple days from now with data that's not quite as old. Either way, I'd have to upload my backups, and I have and am back up (minus sending email). I, for one, am glad they did it the way they did. Maybe they should offer to go back after newer old data for those that want it off the previous weekly backup set?
  #6  
Old 01-04-2005, 10:19 PM
GatorBrent GatorBrent is offline
HostGator Staff
 
Join Date: Oct 2002
Location: houston, texas
Posts: 2,977
Default Re: Supra events that took place and what to do.

Quote:
Originally Posted by ztechpc
I still think a lot of this is bs.

You said the raid controller stopped writing a month ago. Raid controllers don't just stop writing - if the raid controller cannot access the drive the system will lock up.

Also, if you were in monitoring the system, you would have noticed a message from the raid controller software that something was not correct.

And if you did weekly backups and the most recent one was corrupt - then why not try the previous week's backup instead of leaving everyone on November's data?
That's what I thought as well! But this is the only explanation, since the drive we are using is from the raid and all content is a month old. If it was working, it would be an exact mirror.


We had something checking the array on all the boxes not too long ago, but removed it from all servers after it was using a ton of resources.

The weekly backup is set to overwrite every week.
__________________
Gators love marshmallows.
  #7  
Old 01-04-2005, 10:29 PM
bigreed bigreed is offline
Hatchling Croc
 
Join Date: Dec 2003
Posts: 14
Default Re: Supra events that took place and what to do.

Quote:
Originally Posted by GatorBrent

The weekly backup is set to overwrite every week.
Is this still the case? What is being done to prevent this from occurring? It seems that your system was obviously flawed.
  #8  
Old 01-04-2005, 10:32 PM
cdsailer cdsailer is offline
Hatchling Croc
 
Join Date: Apr 2004
Posts: 27
Default Re: Supra events that took place and what to do.

Was syslog not reporting any errors? What about dmesg? You could have a cron job that e-mails this info to you on a nightly basis that would not impact performance at all.



Quote:
Originally Posted by GatorBrent
That's what I thought as well! But this is the only explanation, since the drive we are using is from the raid and all content is a month old. If it was working, it would be an exact mirror.


We had something checking the array on all the boxes not too long ago, but removed it from all servers after it was using a ton of resources.

The weekly backup is set to overwrite every week.
  #9  
Old 01-05-2005, 07:18 AM
Matrix0 Matrix0 is offline
Baby Croc
 
Join Date: Jan 2005
Posts: 74
Post Re: Supra events that took place and what to do.

Brent,

Here's what I would like to see... and I think it is all reasonable and important stuff:

a) First and foremost: my package looks to be back in some semblance of order. I need a 'green flag' to tell me that the server is now stable, and that there will be no more corruptions, and that no more outages or 403's are expected.

When this fiasco started I moved two business critical sites out to another of our hosts. I am wary to bring them back until you give that green light, and assure me that things are back to normal.

Can you give it yet? And if not, when can you give it?

b) I would like specific commitments in terms of what you are going to do to ensure this won't ever happen again (or anything like it).

c) It would be nice to have some guarantees on performance of this particular server. For example, that lost customers and those moving off this server won't be replaced with others. It's always been sluggish so I would also like to see more memory and less population on it.

This was really needed anyway, as performance was very poor. To set limits now to improve on what we had before would seem to be sensible and timely.

d) Set up specific monitoring of server performance and reliability over quite a prolonged period. This should include some sort of priority support for any issues raised against this server for that period.

e) One last thing on downtimes generally. An ETA is a critical piece of information for any business which depends on websites for survival. It enables decision making, such as whether to transfer business critical sites out on a temporary basis. On this occasion I felt that one COULD have been given, even if it was only ballpark. It wasn't. I would like to think that you could amend your procedures to give one in future. That sort of honesty, in the long run, would help you retain more business than you would otherwise lose through outage.


I hope that you will consider all these seriously. They are actually far more important than money to us... and some of them would make you a better host. None of them are at all unreasonable in the circumstances.

Perhaps you could let us have your response to them when you have a minute or two?
  #10  
Old 01-05-2005, 01:21 PM
ztechpc ztechpc is offline
Hatchling Croc
 
Join Date: Jan 2005
Posts: 27
Default Re: Supra events that took place and what to do.

Quote:
Originally Posted by GatorBrent
That's what I thought as well! But this is the only explanation, since the drive we are using is from the raid and all content is a month old. If it was working, it would be an exact mirror.


We had something checking the array on all the boxes not too long ago, but removed it from all servers after it was using a ton of resources.

The weekly backup is set to overwrite every week.
LOL WHAT?!?!?! You can't overwrite a backup every week, that's dumb, plain and simple. This is all part of disaster recovery. Same as removing the raid monitoring software. You simply CANNOT do that.

And maybe you need new techs because, like I said, it's impossible for a raid controller to stop writing to an array. If it cannot access the drives the system will lock up. And why were you running raid 1 and not raid 5? If you were running raid 1 you should have a hot spare at the least. Maybe cluster your servers for redundancy?



Oh, and what's this? I thought you didn't have monitoring software.

Quote:
Originally Posted by GatorBrent
3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #1.
3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #1.
3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #1.
3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #1.
3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #1.


We need to take down the server for a few minutes to swap out one of the servers drives and have the array rebuilt. We have already spoken with the datacenter and are planning to do this tonight.

This needs to be done before the drive completely crashes and causes damage. It also is causing the server to be unstable at the moment. The datacenter assured us they can rebuild the array while the box is online and the good drive will not be at risk. We have done this on a few occasions without any problems. If you have any questions, please let us know.

Last edited by ztechpc; 01-05-2005 at 01:57 PM.
  #11  
Old 01-05-2005, 02:52 PM
airkat airkat is offline
Baby Croc
 
Join Date: Sep 2004
Posts: 75
Default Re: Supra events that took place and what to do.

Interesting... I'd like to hear the response to that.
  #12  
Old 01-05-2005, 03:06 PM
cdsailer cdsailer is offline
Hatchling Croc
 
Join Date: Apr 2004
Posts: 27
Default Re: Supra events that took place and what to do.

Brent??????
  #13  
Old 01-05-2005, 03:44 PM
Denoha Denoha is offline
Hatchling Croc
 
Join Date: May 2004
Posts: 30
Default Re: Supra events that took place and what to do.

Not to detract from the super-sleuthing going on, but I'd just like to back up Matrix0's post, in particular getting some kind of "Official Green Light" when the server is back to its normal operating status and all functions have been restored.

I would also like some kind of confirmation that the backup process will be modified so that we are not all depending on a single backup in hopes of survival should something like this occur again.
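
As a rough sketch of what not depending on a single backup could look like: the function below writes a new dated archive on each run and only prunes the oldest ones once more than a set number exist. The function name, the paths, and the retention count are all assumptions for illustration, not HG's actual backup process.

```shell
#!/bin/bash
# Hypothetical sketch: keep several dated backup archives instead of
# overwriting a single one. All names and defaults here are assumptions.
rotate_backup() {
    local src_dir="$1"                      # directory to back up
    local dest_dir="$2"                     # where archives are kept (ideally off the server)
    local keep="${3:-4}"                    # number of archives to retain
    local stamp="${4:-$(date +%Y%m%d)}"     # date stamp, overridable for testing

    mkdir -p "$dest_dir" || return 1

    # Write a new dated archive; earlier archives are left untouched.
    tar -czf "$dest_dir/backup_${stamp}.tar.gz" -C "$src_dir" . || return 1

    # Prune the oldest archives beyond the retention count.
    # Names sort chronologically because the stamp is YYYYMMDD.
    local total excess
    total=$(ls -1 "$dest_dir"/backup_*.tar.gz | wc -l)
    excess=$(( total - keep ))
    if [ "$excess" -gt 0 ]; then
        ls -1 "$dest_dir"/backup_*.tar.gz | sort | head -n "$excess" |
            while read -r old; do rm -f -- "$old"; done
    fi
}
```

Called weekly as something like `rotate_backup /home /mnt/offsite/supra 4`, this would leave the four most recent weekly archives in place, so one corrupt archive does not wipe out every older copy.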

I appreciate HG's offers of credit and am willing to wait the allotted time, but that's really not going to do me any good if HG's business, and thus my business, is not dependable enough to provide what it advertises.

My cpanel, webhost manager, websites, sql, and email all appear to be working fairly normally. I am still not seeing Fantastico, and I hope that will return soon as I have a client eager to use some of the services that were included there.

Any official word would be appreciated.
  #14  
Old 01-05-2005, 04:25 PM
onsite onsite is offline
Hatchling Croc
 
Join Date: Apr 2004
Posts: 41
Default Re: Supra events that took place and what to do.

Brent, if the Raid WAS NOT WORKING then you would not have a working system. It would fail. Then. Not 2 months later. For a raid to work it has to have multiple drives. If one fails then the raid stops working. That's it. End of story. No more raid without 2 drives. That's one of the reasons for a raid! Also, why raid 1 and not 5?
Ztechpc said it all for me. Why would you remove your monitoring software for your raid controller? Why did your guy have to scale a 16 foot fence? Why no up-to-date backups that worked? There are too many whys and not enough good answers. Too many responses that don't make any sense. Too many lax procedures. To me it sounds like a couple of kids got together and started hosting and did not have enough real-world experience, and we are paying for their on-the-job training.

I'm not interested in a credit. I want a refund. I don't trust your company anymore. With the answers that you have been giving out I don't trust you either. Not because you are lying but because you don't know enough yet. Either way you have lost my confidence and my business.




Quote:
Originally Posted by GatorBrent
That's what I thought as well! But this is the only explanation, since the drive we are using is from the raid and all content is a month old. If it was working, it would be an exact mirror.


We had something checking the array on all the boxes not too long ago, but removed it from all servers after it was using a ton of resources.

The weekly backup is set to overwrite every week.
  #15  
Old 01-05-2005, 04:35 PM
gameutopia gameutopia is offline
Hatchling Croc
 
Join Date: May 2004
Posts: 17
Default Re: Supra events that took place and what to do.

This is complete bull...I'm about out of here too. You guys can sit back there and listen to their lines of crap all you want. All they want is for their customers to stay with them so they don't lose all their money or, worst-case scenario, lose everything and bankrupt themselves.

They claim all this quality service, yet look how long this is taking to resolve. Green light, you say? Good luck with that.

The data center and server are actually leased or rented from The Planet, which is out of Dallas, TX. Who knows where the server actually resides, whether it's in Dallas or somewhere else.

Hostgator claims to be in Florida, so they're halfway across the country and performing no hardware fixes themselves. If anyone climbed a 16 foot anything it was nobody from Hostgator, it was someone at The Planet. They are probably a couple of guys and gals familiar with linux web servers running this place from their home or some small office space.

If we are lucky, one of them somewhat knows what they are doing and hopefully they have a dedicated connection so they can work on this server. If not, hopefully their DSL, cable, or other form of connectivity doesn't fail them.


This server is somewhat normal? You've got to be kidding. Unless you are seeing something I'm not, besides the files, pages, and hours of work I lost. I have empty directories in all sites, whether I look in the cpanel file manager or upload with FTP client software.

Fantastico is gone. Postgresql is all messed up. Can't download the current backup in case they mess this up again. Can't upload a backup without getting cut off halfway through.

Packages, passwords, and contact email in WHM are gone.

And I have not even come close to checking everything.

I have requested that they delete my entire reseller account and move me to another server as basically a new customer/account. I'll cut my losses and lose even more, but to get off this server and start with a fresh un-messed-up server would be more than worth it to me.

No word on this, but I am also in the process of searching for a new host. If I can find a new one of decent quality and similar price before they get around to moving me...then see ya!!

Gameutopia
  #16  
Old 01-05-2005, 05:09 PM
cdsailer cdsailer is offline
Hatchling Croc
 
Join Date: Apr 2004
Posts: 27
Default Re: Supra events that took place and what to do.

Here is a simple script that you can run in a nightly cron that will at least give you some info. You can either parse the output yourself or write a script to parse it on another server. It will take up virtually no resources on your servers:

#!/bin/bash
#########################################################################
# Summary                                                               #
#########################################################################
# FILENAME:    system_report_linux
# CREATED:     10/19/2001
# AUTHOR:      csailer
# DESCRIPTION: Collect general system info on linux box and mail to
#              $MAIL
#########################################################################

#################################################################
# History                                                       #
#################################################################
# 10/19/2001 Created history section
# 10/22/2001 Added crontab -l to general info section.
#################################################################

STAMP=$(date +%Y%m%d)
REPORT_DIR="/var/adm/system_reports"
REPORT_FILE="${REPORT_DIR}/sys_report_${STAMP}"
HOST=$(uname -n)
DATE=$(date "+%b %d %Y")
MAIL="Add e-mail address here"

# Create the report directory if it does not exist yet
mkdir -p "$REPORT_DIR"

# Report header (this first write truncates any earlier report for today)
echo "System Report for $HOST " > "$REPORT_FILE"
date >> "$REPORT_FILE"
echo " " >> "$REPORT_FILE"
echo "---------------------------------------------------" >> "$REPORT_FILE"
echo " " >> "$REPORT_FILE"
echo "ROOT CRONTAB" >> "$REPORT_FILE"
crontab -l >> "$REPORT_FILE"
echo "***************************************************" >> "$REPORT_FILE"
echo "* General System/Usage Info *" >> "$REPORT_FILE"
echo "***************************************************" >> "$REPORT_FILE"
echo "---------------------------------------------------" >> "$REPORT_FILE"
echo "UNAME -A " >> "$REPORT_FILE"
uname -a >> "$REPORT_FILE"
echo " " >> "$REPORT_FILE"
echo "---------------------------------------------------" >> "$REPORT_FILE"
echo " " >> "$REPORT_FILE"
echo "HOST_ID " >> "$REPORT_FILE"
hostid >> "$REPORT_FILE"
echo " " >> "$REPORT_FILE"
echo "---------------------------------------------------" >> "$REPORT_FILE"
echo " " >> "$REPORT_FILE"
echo "IFCONFIG -A " >> "$REPORT_FILE"
/sbin/ifconfig -a >> "$REPORT_FILE"
echo " " >> "$REPORT_FILE"
echo "---------------------------------------------------" >> "$REPORT_FILE"
echo "#w " >> "$REPORT_FILE"
w >> "$REPORT_FILE"
echo " " >> "$REPORT_FILE"
echo "---------------------------------------------------" >> "$REPORT_FILE"
echo "#last | head -10 " >> "$REPORT_FILE"
last | head -10 >> "$REPORT_FILE"
echo " " >> "$REPORT_FILE"
echo "***************************************************" >> "$REPORT_FILE"
echo "* Disk Info *" >> "$REPORT_FILE"
echo "***************************************************" >> "$REPORT_FILE"
echo "DF -K" >> "$REPORT_FILE"
df -k >> "$REPORT_FILE"
echo " " >> "$REPORT_FILE"
echo "---------------------------------------------------" >> "$REPORT_FILE"
echo "FSTAB " >> "$REPORT_FILE"
cat /etc/fstab >> "$REPORT_FILE"
echo " " >> "$REPORT_FILE"
echo "***************************************************" >> "$REPORT_FILE"
echo "* Share Info *" >> "$REPORT_FILE"
echo "***************************************************" >> "$REPORT_FILE"
echo "EXPORTS" >> "$REPORT_FILE"
cat /etc/exports >> "$REPORT_FILE"
echo "***************************************************" >> "$REPORT_FILE"
echo "* Hardware info *" >> "$REPORT_FILE"
echo "***************************************************" >> "$REPORT_FILE"
# dmesg is where controller and drive warnings from the kernel show up
echo "DMESG" >> "$REPORT_FILE"
dmesg >> "$REPORT_FILE"
echo " " >> "$REPORT_FILE"
echo "***************************************************" >> "$REPORT_FILE"
echo "* Tail of messages *" >> "$REPORT_FILE"
echo "***************************************************" >> "$REPORT_FILE"
echo "tail -100 /var/log/messages" >> "$REPORT_FILE"
tail -100 /var/log/messages >> "$REPORT_FILE"
echo " " >> "$REPORT_FILE"

# Mail the finished report
mail -s "$HOST SYSTEM REPORT $DATE" "$MAIL" < "$REPORT_FILE"
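
For the parsing step, here is a minimal sketch. It assumes the report contains controller messages in the same "3w-xxxx ... AEN: WARNING" form quoted earlier in this thread. The function names are hypothetical, and the pattern would need adjusting for whatever your own controller driver logs.

```shell
#!/bin/bash
# Hypothetical helpers for scanning a system report for RAID controller
# warnings. The pattern is an assumption based on the 3w-xxxx AEN lines
# quoted in this thread; other controllers log different messages.

count_raid_warnings() {
    # Reads report text on stdin and prints the number of matching lines.
    # grep -c exits non-zero when the count is 0, so its status is ignored.
    grep -c '3w-.*AEN: WARNING' || true
}

check_report() {
    # Prints a one-line summary; returns 1 if any warnings were found.
    local report="$1"
    local hits
    hits=$(count_raid_warnings < "$report")
    if [ "$hits" -gt 0 ]; then
        echo "RAID controller logged $hits warning(s) in $report"
        return 1
    fi
    echo "No RAID controller warnings in $report"
}
```

Running `check_report` on each night's report file and only sending mail when it returns non-zero would keep the inbox quiet until something actually needs attention.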
  #17  
Old 01-05-2005, 05:53 PM
cdsailer cdsailer is offline
Hatchling Croc
 
Join Date: Apr 2004
Posts: 27
Default Re: Supra events that took place and what to do.

We need some feedback from HG. What is going on? It's been a long time since any decent info was posted.
  #18  
Old 01-05-2005, 06:38 PM
nodtveidt nodtveidt is offline
Hatchling Croc
 
Join Date: Jan 2005
Posts: 19
Cool My thoughts on this situation...

As far as I'm concerned, anyone who doesn't maintain the proper backups of their own sites deserves whatever misfortunes are brought upon them. I encourage every one of my clients to utilize backup techniques, be it local copies of their scripts/pages, nightly backups of databases, etc. and most all of them do it because these things do happen and when they do, you can either whine and lament about "oh I lost such-and-such data and I'm sooooo mad" or you can do "okay, well, I still have my data, it can still be restored, whether I get a new host or not is irrelevant because hey, I still have my data". I can never stress enough to my clients to keep their own backups.

I use HG as somewhat of a secondary server for sites with lower requirements than the big sites we host on dedicated servers elsewhere. While the events of this are certainly not favorable, having a backup plan is always in your best interests. That's strike #2 against everyone whining...you also had no backup plan in effect, so you were hit harder. As far as finding a new host...whenever you run into troubles like this, you should always have another host on standby. Have an extra account somewhere else, just in case of a monumental foulup like this one. So just pointing the finger at HG isn't going to change the fact that you also were not prepared for this disaster. Granted the mailserver thing is a little harder to work around, but a clever business owner has backup plans for that too.

However, strike #3 is against HG itself. While it's true that everyone who signs up for an account has agreed to the TOS which states that you are not responsible for people's data, part of customer service is going above and beyond what is expected and delivering a product and service better than your competitors. Keeping proper backups and doing more careful monitoring of your servers would count as such superior service in this case. I'm not going to pretend to be "know-it-all supersleuth" like these other guys, but if the story about the corrupted backups is true, you could have prevented such a loss by doing regular backup checks to make sure they weren't corrupted, and stored them somewhere other than the server itself, which it sounds like you were doing. Storing the backup on the server you're backing up in the first place makes very little sense.
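
A minimal sketch of that kind of regular backup check, assuming gzip-compressed tar archives and GNU `sha256sum` (both assumptions, as are the function names): record a checksum when the backup is written, then periodically confirm the archive still matches it and can be read end to end.

```shell
#!/bin/bash
# Hypothetical sketch of a backup integrity check. Assumes GNU coreutils
# (sha256sum) and gzip-compressed tar archives; names are illustrative.

record_checksum() {
    # Store the checksum next to the archive, e.g. backup.tar.gz.sha256
    local archive="$1"
    ( cd "$(dirname "$archive")" &&
      sha256sum "$(basename "$archive")" > "$(basename "$archive").sha256" )
}

verify_backup() {
    local archive="$1"
    # 1) Does the file still match the checksum recorded at backup time?
    ( cd "$(dirname "$archive")" &&
      sha256sum -c --status "$(basename "$archive").sha256" ) 2>/dev/null || return 1
    # 2) Can the whole archive be listed without read errors?
    tar -tzf "$archive" >/dev/null 2>&1 || return 1
}
```

This only catches an archive that goes bad after it was written. It cannot tell whether the data was already damaged on the source drive when the backup ran, so an occasional test restore is still worth doing.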

HG, I hope you learn a thing or two from this experience and use this incident to improve the level of service and quality you offer to customers. And people, I hope you also learn a thing or two from this experience and take your own safeguards and precautions from now on, no matter where you may end up. I'm not throwing harsh words against anyone here, but the hard facts are that everyone got their just desserts for not planning ahead.

Brent, I believe the people do deserve somewhat of an update on the status of supra, don't you think?
  #19  
Old 01-05-2005, 07:26 PM
fuzzfree fuzzfree is offline
Baby Croc
 
Join Date: Dec 2004
Location: Greece
Posts: 68
Default Re: Supra events that took place and what to do.

I agree about the matter that a reseller (and their customers) must have their own backups / good backup plan. And having a secondary host is essential.
The most critical thing for a business is for its email to be working, so with a second host you could change the DNS and email would work again within a few hours. No need to create ALL email accounts, just one, and point all undelivered messages to that address (the website is less critical - you could have an intro-only page with a notice 'site is down for the moment').

But for those who had not the second host option - the problem with supra is that even if ALL resellers/customers had their OWN backups then many of them could not restore them (& sites still are not functioning) since for a big period their Cpanel / WHM passwords were corrupted, dbs corrupted, email not working etc.
Essential HG data were lost from the server... and the server is still unstable..

I hope everything will be fixed for you guys - it is terrible what happened..
  #20  
Old 01-05-2005, 09:27 PM
dreamerhost dreamerhost is offline
Hatchling Croc
 
Join Date: Jan 2005
Posts: 2
Unhappy Re: Supra events that took place and what to do.

Feels like I was swept under by a huge tsunami and now I'm still trying to recover. I'm afraid to trust the place where the wave struck (HG)... and I'm looking for a safer place to do business.

Last edited by dreamerhost; 01-05-2005 at 09:35 PM. Reason: addition
  #21  
Old 01-05-2005, 09:42 PM
GatorBrent GatorBrent is offline
HostGator Staff
 
Join Date: Oct 2002
Location: houston, texas
Posts: 2,977
Question Re: Supra events that took place and what to do.

The server is stable, except for brief moments when we have to take services off-line to fix the problems people are reporting.
You have the green light; however, some minor issues still linger and we are fixing them one by one. Nothing major that should keep you from doing business.

Both drives are connected to the controller; it's not software that you load. It's the interface controller the drives use to READ/WRITE data.

If the controller is malfunctioning it can cause the drives to read and/or write data improperly, causing corruption. This is obviously the case with the drive that everyone is currently using. The drive itself is fine but has corruption that came from the raid, and a month of data was lost because it was never written to it. The server stayed online for a month and then crashed a few days ago with a raid controller that was not working and a hard drive that could not be salvaged. We can only guess what exactly took place, but the outcome is still the same. We have posted all the information we had on the failure, so feel free to make your own assumptions on how it went down. If you know for a fact what happened, please do tell...


Simply saying we are lying is not an acceptable answer. This is not a pretty situation; what good can lying do? The data is lost either way, and no lie is going to bring it back. Everything we have told you is the complete truth, based on what we know, the information that was provided, and my interpretation of what happened.


In all honesty it would have been cheaper and much easier to kick everyone on supra to the curb. We are doing all that we can with the cards we have been dealt. We aren't about to give up and have made a ton of progress. I'm sorry this is not enough. We feel horrible about what happened and are fixing all the remaining problems one by one.

We have a weekly offsite backup just in case something like this took place. The offsite backups were corrupted. We are not sure why. This is the same backup drive we restore multiple sites a day from, for those who accidentally delete files, are deleted for not paying their bill, etc. The one time we really needed it, it completely failed us. Why? We believe it had something to do with the state of the drive when everything was backed up (some form of corruption); the backups were done one day prior to the failure.

We caught acura's hard drive failure notices before it happened. Believe it or not, we weren't even looking for them. Acura was having problems, and it was one of the obvious things we found.


We spent much more time in supra because of the problems everyone was having in the days before. We saw no sign of what was to come. If it was just a hard drive going bad we most likely would have caught the warnings, as we did with Acura, and we would still have a good 2nd drive. If the raid caused the damage, both drives could be affected, which is what happened. Perhaps when supra first started having problems it was just with reading, and days later it was with writing, which caused the meltdown on the main drive.

We are looking into finding a reliable way to monitor raid arrays so we can detect this type of problem in the future. We are also going to overhaul the backups and test them more frequently. If I was in any of your shoes I'd be worried about it happening again as well. This is such a freak incident that the chances of it happening to begin with are close to none. We are still going to revamp the offsite backups.
__________________
Gators love marshmallows.
  #22  
Old 01-05-2005, 10:00 PM
nodtveidt nodtveidt is offline
Hatchling Croc
 
Join Date: Jan 2005
Posts: 19
Default Re: Supra events that took place and what to do.

Quote:
Originally Posted by Aesop
A miller and his son were taking their donkey to sell at market, when they passed a group of girls, who laughed at how foolish the miller was to have a donkey and yet be walking. So the miller put his son on the donkey. Further down the road they passed some old people who scolded the miller for allowing his young son to ride, when he should be riding himself. So the miller removed his son and mounted the donkey himself. Further along the road, they passed some travellers who said that if he wanted to sell the donkey the two of them should carry him or he'd be exhausted and worthless. So the miller and his son bound the donkey's legs to a pole and carried him. When they approached the town the people laughed at the sight of them, so loud that the noise frightened the donkey, who kicked out and fell off a bridge into the river and drowned. The embarrassed miller and son went home with nothing.
In other words...if you try to please everybody, you will eventually lose your ass.

You're doing fine, Brent.
  #23  
Old 01-05-2005, 10:11 PM
orbitant orbitant is offline
Hatchling Croc
 
Join Date: Apr 2004
Location: Singapore
Posts: 24
Default Re: Supra events that took place and what to do.

Hi Brent,

Everything seems to be going smoothly for me now ...

Thanks for the effort and updates, and hell it was a good experience to start a New Year.

Cheers to all @ HG.
  #24  
Old 01-06-2005, 02:36 AM
Matrix0 Matrix0 is offline
Baby Croc
 
Join Date: Jan 2005
Posts: 74
Default Re: Supra events that took place and what to do.

Quote:
Originally Posted by Matrix0
Brent,

Here's what I would like to see... and I think it is all reasonable and important stuff:

a) First and foremost: my package looks to be back in some semblance of order. I need a 'green flag' to tell me that the server is now stable, and that there will be no more corruptions, and that no more outages or 403's are expected.

When this fiasco started I moved two business critical sites out to another of our hosts. I am wary of bringing them back until you give that green light, and assure me that things are back to normal.

Can you give it yet? And if not, when can you give it?

b) I would like specific commitments in terms of what you are going to do to ensure this won't ever happen again (or anything like it).

c) It would be nice to have some guarantees on performance of this particular server. For example, that lost customers and those moving off this server won't be replaced with others. It's always been sluggish so I would also like to see more memory and less population on it.

This was really needed anyway, as performance was very poor. To set limits now to improve on what we had before would seem to be sensible and timely.

d) Set up specific monitoring of server performance and reliability over quite a prolonged period. This should include some sort of priority support for any issues raised against this server for that period.

e) One last thing on downtimes generally. An ETA is a critical piece of information for any business which depends on websites for survival. It enables decision making, such as whether to transfer business critical sites out on a temporary basis. On this occasion I felt that one COULD have been given, even if it was only ballpark. It wasn't. I would like to think that you could amend your procedures to give one in future. That sort of honesty, in the long run, would help you retain more business than you would otherwise lose through outage.


I hope that you will consider all these seriously. They are actually far more important than money to us... and some of them would make you a better host. None of them are at all unreasonable in the circumstances.

Perhaps you could let us have your response to them when you have a minute or two?
Brent: You covered the green light but nothing else on here.

Could you respond to the other points? I think all of them are reasonable, and it sounds as though plenty of other customers agree with me.
  #25  
Old 01-06-2005, 10:38 AM
nodtveidt nodtveidt is offline
Hatchling Croc
 
Join Date: Jan 2005
Posts: 19
Cool Re: Supra events that took place and what to do.

What I find to be extremely hypocritical are the people who whined that "Brent's just saying this to keep his customers". These same people are also "saying stuff to keep their customers" to their own customers, right? So they have no place to complain about it in that sense.

In any event, is there any word on our account sizes being corrected? That's pretty much the only thing I'm waiting for, everything else seems to work as expected and my clients also don't seem to notice any other irregularities. Explaining the /public_html issue was a breeze.