|
#26
|
|||
|
|||
|
My sites that were up and running are now lost... can't find server. I even uploaded current files for the one with old pages... it was OK for a minute... then gone!
My clients are calling... hope this is fixed quickly! |
|
#27
|
|||
|
|||
|
I'd say things are still in flux BGE ... I've been watching my sites and they're still going up and down. It will likely take some time before the dust is fully settled.
Brent, can you get back to us here and confirm whether the March files are the most recent backup? REALLY hope you can find a tape backup from May 6 as originally stated. I know this sucks for you guys too. Thanks for keeping us posted. |
|
#28
|
|||
|
|||
|
My sites are all back up again except one; I have submitted a ticket for that and they promised it will be fixed very soon.
I have to get out now on some belated errands... good luck guys! I also commend the Host Gator team! BGE |
|
#29
|
||||
|
||||
|
Quote:
I suggest your customer move to a hosting company that has their own servers. Perhaps, you could give me their name and I'll offer them a better quality of hosting... |
|
#30
|
||||
|
||||
|
Sorry for the lack of updates. I've been on the phone for over an hour with Tim G., who is in charge of all HostGator backups. He will be updating everyone shortly with what's going on. Like you, we were all shocked to find the home partition backup out of date by over two months. What I've had the hardest time understanding is how, with hundreds of servers being backed up and failures as rare as they are, it is only the mercedes server that is so out of date.
It's mind-boggling, but from what I've been able to find, all but a couple of boxes are up to date, so as usual it seems to be very bad luck. Of the few that aren't up to date, besides mercedes, the oldest backups are a few weeks old. We still have the failing drive, so Tim will be working on updating all out-of-date files and missing accounts from this server. The good news is that MySQL is completely up to date, and NO ONE will be missing their databases.
__________________
Gators love marshmallows. |
|
#31
|
||||
|
||||
|
Quote:
I just want to point out that keeping backups is really what the reseller's customers should be doing if they have mission-critical data. Trust no one, basically. Backups are never 100% safe, so don't trust HG to keep your critical data safe. Crap happens, and when it does, it's always at the worst time. Resellers, if your clients are keeping mission-critical data on their reseller account, you need to offer them a level of protection by keeping backups yourself. MySQL can easily be backed up. Files are not so easy, but it can be done. This is a value-added service you can offer your customers. Make sure your TOS indicates that YOU are not responsible for backups of their stuff. |
|
#32
|
||||
|
||||
|
I just got word all the missing accounts have been restored and are fully up to date. We are now working to update all those older files and even see if we can get mail restored.
__________________
Gators love marshmallows. |
|
#33
|
|||
|
|||
|
Some of my customers' websites are out of date. Should I be trying to fix these problems myself manually with backed-up data they might have, or is there still a possibility some of the files from the home partition will be replaced?
|
|
#34
|
|||
|
|||
|
Hello,
We are currently syncing data from the failing drive, which is mounted on a temporary server and IP running from a boot CD. As I type this, things are moving along very well. Of course, the drive could start heating up as it works to copy the data and create some issues, but overall it's significantly better and is looking positive. We are working on this as quickly and effectively as possible. We understand the frustration, anger and disappointment when data is missing, servers lose data and drives die. We have people in on their days off and working triple shifts to do whatever we can to remedy the problem.
Regarding the backups and dates, people's observations are correct. A good portion of them are from late March, a few weeks ago. Some data was backed up on May 6th, but that run did not complete (for as yet unknown reasons). The primary issue was that the backups hadn't been running successfully for a couple of weeks, more than the May 6th run failing to complete; the two combined left us with data we all agree is too old to be acceptable.
What happened is that our backup system works in a unique but secure way. Each weekly run syncs over only the data that is new since the previous run. That saves time and space and ensures everything is up to date at the time of the backup, without re-syncing the entire server each time and having it take days to complete. The issue came about when the server was added at some point with a virtual IP rather than the main eth0 IP of the system.
The backup systems are off site and have a high level of security. The Mercedes server was not allowed to connect because it had been added to the server list with a virtual IP instead of its main IP, and since it connects out from the main IP, it wasn't seen as a trusted source. That in itself wasn't the real problem; the problem was that I didn't see the notice of this particular failure until May 6th, after that cycle reported the error. The previous errors were missed. On the day it was seen, I ran the backups manually that morning, but at some point they appear to have failed, for an as yet unknown reason, so only some of the data was up to date as of the 6th.
You can be assured we take this very seriously. We have already discussed several methods to add to our backup systems to catch even more failures and to check for success; outlining them here is beyond the scope of this post. They will work in conjunction with our existing security and failure and success checks, and this should be quite effective in preventing any other such issue. We have a great number of servers and this just happened to be the one that had the backup issue. (It was one of two; both were caught and resolved, but unfortunately not in time before this one, Mercedes, experienced the failure of its existing data.)
So, we have data on the backups older than any of us like. While some have said that it's better than nothing and seem relatively okay with it, even if it's not recent, we are absolutely not comfortable or complacent with that. We are therefore immediately implementing various methods to ensure that, as rare as backup failures are, we never risk such a thing affecting our client base like this again.
On that note, and as Serra suggested, we have always encouraged users to make frequent backups of their data on their local systems just in case. Even though we do everything we can to treat your data as if you have no backups of your own and each file is worth a million dollars and is irreplaceable, we all know that no method or hardware is ever 100% safe from failure, and however rare a failure is, the rarity quickly becomes irrelevant when it's you personally who is affected.
Finally, we are still copying over data from the old drive and at this time we are not experiencing any errors. It is my hope that we can obtain all of the data from the old drive, which is even more desirable and current than the backups would have been had they completed on May 6th as they should have. We'll update you further on the progress and success of this data sync from the failing drive as we have it. Thank you. |
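For readers trying to picture the trusted-source problem described in the post above, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the server names, the addresses (taken from the documentation range 192.0.2.0/24), and the `is_trusted` helper are all hypothetical and are not HostGator's actual backup code.

```python
# Hypothetical allow-list kept on the backup side: server name -> registered IP.
# All names and addresses here are invented for illustration.
TRUSTED_SOURCES = {
    "server-a": "192.0.2.10",
    "mercedes": "192.0.2.55",  # registered with a virtual IP by mistake
}

# The address the server really connects out from (its main eth0 IP).
MERCEDES_MAIN_IP = "192.0.2.20"


def is_trusted(source_ip: str, trusted: dict) -> bool:
    """Return True if the connecting address is on the allow-list."""
    return source_ip in trusted.values()


# The backup job connects from the main IP, which was never registered,
# so the backup side rejects it even though the server itself is listed.
print(is_trusted(MERCEDES_MAIN_IP, TRUSTED_SOURCES))  # False
print(is_trusted("192.0.2.55", TRUSTED_SOURCES))      # True
```

The point of the sketch is that the check is made against the address a connection arrives from, so registering any other address for the same machine has the same effect as not registering it at all.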
|
#35
|
|||
|
|||
|
Quote:
We are safely syncing the data to the new server, but not in a dangerous manner that would overwrite any data. The reason for this is simple. Some clients would rather not wait or have data they'd prefer to just restore now themselves. We'd risk overwriting that, as well as any emails that have come in between the issue and the first account restore, so we'll restore to the last instance, other than for those that ask us not to touch anything at this time so we don't destroy any recent changes they've made since their accounts were back up. |
|
#36
|
|||
|
|||
|
This is a small update to say that as long as the data synced over doesn't suffer any major problems that would make us not want to use it, we'll go ahead and do the restores from that data, unless someone asks us specifically not to. The reason is that the data will be very new, as in right before the drive issue happened.
Additionally, we'll back up any data before we overwrite it, in case someone who didn't specifically ask us not to restore has data overwritten by the data from earlier today (from the failing drive) and wants to revert it. After all, we don't want to cause users any issues if they've already taken steps to restore their own data from another source of their own. So far, it's looking good enough that we might be able to pick up from right before the failure for ALL of the data (or otherwise most of it), which is more desirable than backups from almost a week ago anyway. (Since we're on the subject, we've already resolved the backup problem and are working on adding even more ultra-paranoid checks to prevent such an issue in the future.) More updates will follow shortly. Thank you. |
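The "back up before we overwrite" step mentioned above can be sketched roughly as follows. This is an assumption-laden illustration, not the procedure HostGator used: the function name `restore_with_backup` and the directory layout are hypothetical, and a real restore would work on whole account trees rather than single files.

```python
import os
import shutil
from typing import Optional


def restore_with_backup(recovered: str, live: str, backup_dir: str) -> Optional[str]:
    """Copy `recovered` over `live`, saving any existing `live` file first.

    Returns the path of the saved copy, or None if there was nothing to save.
    """
    saved = None
    if os.path.exists(live):
        # Keep whatever the user currently has so the restore can be reverted.
        os.makedirs(backup_dir, exist_ok=True)
        saved = os.path.join(backup_dir, os.path.basename(live))
        shutil.copy2(live, saved)
    os.makedirs(os.path.dirname(live) or ".", exist_ok=True)
    shutil.copy2(recovered, live)  # overwrite with the recovered data
    return saved
```

If a user later prefers the version they had uploaded themselves, the saved copy returned by the function can simply be copied back over the live file.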
|
#37
|
|||
|
|||
|
Some of my sites, 3 of them from my biggest clients, are not being found by DNS. They are: admart.mobi, karenspencer.com, mortgagelendinggrp.com, tdghome.com. Will these be coming up soon?
|
|
#38
|
|||
|
|||
|
I've synced all of these with the data just prior to the drive failure. They appear to be working at this time.
|
|
#39
|
|||
|
|||
|
If anyone has an account that remains to be restored, please post its username here or PM it to me and I'll manually sync it over in the meantime. There's no need to post the domain; the username is actually better, since we then don't have to look up which user owns which domain. We're not really giving anyone priority and the complete sync is going to take a while, so this could help expedite a specific restore.
|
|
#40
|
||||
|
||||
|
Quote:
The general rule is never update anything that is syncing. You can actually cause more problems than you fix. Wait until they say it is finished, then if you feel your backups are better, apply them. I understand that is a very passive approach, but when you are dealing with a big machine like the shared boxes, it's all very slow. It is often difficult for Hostgator to exclude people from the sync, so if you are requesting that, it may make everything slower. Best to just wait it out and then apply any needed changes. |
|
#41
|
|||
|
|||
|
The usernames of the accounts that remain to be restored are: acerent, allsport, aimminte, aaronsca. These are sites that have no pages or email accounts at this time; all gone. All other sites are as of the March backups.
|
|
#42
|
|||
|
|||
|
I was able to restore these. One of the accounts had three files fail to sync over from the failing drive. I will PM you those; hopefully they are easily replaced, or, if they are not very new, they can be restored from the March backup.
|
|
#43
|
|||
|
|||
|
Usernames of accounts that remain to be restored: spl1tsec, biz888
Last edited by Mgtek; 05-11-2007 at 10:53 PM. |
|
#44
|
|||
|
|||
|
I have synced over what I could for the spl1tsec account. Hopefully it synced up all of the data you needed. I have synced the biz888 account now as well.
|
|
#45
|
|||
|
|||
|
The drive is producing a great many errors now. However, the main rsync of data is still running strong and getting a lot of data, which is good news. Some files are failing here and there, though. I think a good portion of the data will be restored without issue, especially in combination with the older backups for data that might not have changed since then, but we are seeing some errors at this time.
|
|
#46
|
|||
|
|||
|
Thanks Guys. Great Job
|
|
#47
|
|||
|
|||
|
Just an update to say that, because of the way rsync works and because most files across accounts were unchanged from the older backup, the sync was set to run directly against the accounts themselves. This is the best solution to save time and speed up the restore of the most recent data. Only files that have changed since the older backup (which was already restored) will be updated with the data from the drive that failed this morning (or yesterday for those of you in different time zones). The sync will also bring over the more recent email data, etc., which is very desirable. This will take some time to complete, but we're at about 20%. It is hard to give an ETA, since it can take more or less time depending on how many files were modified. It should be complete in the coming hours, but if you need it expedited (before the sync reaches an account you have), please let us know. Thank you.
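To illustrate why only changed files are transferred, here is a rough Python sketch of the idea behind rsync's default "quick check", which skips a file when the destination copy already has the same size and modification time. It is a simplified stand-in rather than rsync itself: `needs_update` and `sync_tree` are hypothetical helper names, and real rsync additionally handles deletions, permissions, partial transfers and much more.

```python
import os
import shutil


def needs_update(src: str, dst: str) -> bool:
    """Quick check: copy if dst is missing or differs in size or mtime."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def sync_tree(src_root: str, dst_root: str) -> list:
    """Copy only new or changed files; return their paths relative to src_root."""
    copied = []
    for dirpath, _dirs, files in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.join(dst_root, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if needs_update(src, dst):
                shutil.copy2(src, dst)  # copy2 also preserves the mtime
                copied.append(os.path.normpath(os.path.join(rel, name)))
    return sorted(copied)
```

Run against an account that was already restored from the older backup, a second pass like this touches only the files that differ, which is why the time needed depends on how many files were modified.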
|
|
#48
|
|||
|
|||
|
Upon investigating and testing, it appears that the manual backup run on the morning of the 6th failed after a short time because something at the data center we back up to likely detected too many connections from our backup servers and treated them as, for example, an SSH flood. This explains why the data didn't update completely that morning after the earlier issue was resolved. At first I suspected I/O blocking via rsync. On testing, though, I found that all backup sources, including servers in the same netblock that had never connected, were no longer able to connect to the backup server (couldn't even ping it), while outside sources and other NAS (backup) servers on the same network as the backup server could both ping and connect. That rules out I/O blocking. I'm assuming at this time that something is monitoring connections into their network and was blocking requests and connections from our class C/netblock.
|
|
#49
|
|||
|
|||
|
Further investigation reveals a possible network issue between the provider we have our servers on and the provider (upstreams, etc.) where we have some of our backup servers. We will continue to look into the issue while the sync from the May 10th-11th failing drive data continues its progress.
|
|
#50
|
|||
|
|||
|
I have confirmed that the upstream was seeing our backup processes as an attack and was blocking all sources (that is, the servers backing up to the backup server via rsync over SSH). This is what caused the issue on this and a few other servers. This explains a lot and will allow us to work with them so that we are not blocked again. The upside is that we now have better sanity checks because of this, though we paid an unfortunate price with the failing drive and the poor timing. The good news is that we are still getting a lot of data off the failing drive, and having discovered this issue, we can prevent future problems of this nature.
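As one example of the kind of sanity check mentioned above, the sketch below flags any server whose last successful backup is older than a threshold, so a silently failing job is noticed even if its error notices are missed. Everything in it is assumed for illustration: the `stale_backups` function, the server names, the eight-day threshold, and the exact dates are hypothetical and do not describe HostGator's actual checks.

```python
from datetime import datetime, timedelta


def stale_backups(last_success: dict, now: datetime, max_age: timedelta) -> list:
    """Return the names of servers whose last good backup is older than max_age."""
    return sorted(name for name, ts in last_success.items() if now - ts > max_age)


# Hypothetical data: weekly backups, so anything older than eight days is suspect.
last_success = {
    "mercedes": datetime(2007, 3, 25),  # illustrative date only
    "server-a": datetime(2007, 5, 1),
}
print(stale_backups(last_success, datetime(2007, 5, 6), timedelta(days=8)))
# ['mercedes']
```

A check of this shape looks at the age of the last success rather than at error messages, so it still fires when a job never manages to connect and report anything at all.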
|
All times are GMT -6. The time now is 12:27 AM.










