#1
Hi,
As I'm sure many people know, there has been quite a bit of disruption for reseller hosting customers. This was extremely unprofessional of HostGator.

Where was the email to customers to tell them about this? OK, you posted it in a forum ON THE DAY of the interruption! Not only that, I had to go onto live chat just to be pointed to the forum post. You should be pushing this information to us, particularly when it affects many people's businesses, people's email, and people's sales!!! HELLO? Are you insane? You just take many sites offline at will without telling your customers, which then affects many more customers?

A forum post IS NOT telling your customers. That's an information-pull mechanism: you're expecting ALL customers to check the forums regularly for these kinds of updates? That's just not feasible! Come on....

Furthermore, we have no estimated time of completion. How does it take a few hours to upgrade a kernel? Were machines taken offline before they were being worked on? It doesn't make much sense that a machine takes hours to be updated.

Due to your lack of foresight and consideration, we, your customers, have then had to deal with our own customers complaining about downtime and wanting answers! This could have been avoided by sending out an email. You certainly have the means to do so through your billing panel at the very least. There is no vulnerability out there so imminent that you cannot contact us.

I am hugely disappointed in your lack of professionalism. Up until now, I had confidence in HostGator. As of today, I do not have anywhere near the confidence I once had. Enough so that I'm going to reconsider where I host future services.

Regards,
Dave.
#2
Hi Dave,
First of all, I apologize for the inconvenience. We hate downtime as much as you do. And obviously, when our servers are down, it's not any better for us than it is for you.

Our normal policy is to provide as much notice as possible (typically 24-72 hours) about planned upgrades and downtime. Providing more notification through additional channels is something we've been working on a lot over the past 6 months or so, and we have gotten a lot better at it (and we're still improving). However, that in no way implies that where we are currently is perfect.

This particular situation was unique because the kernel upgrade was the result of a security concern (see this post for more details). We try to err on the side of caution when it comes to the safety and security of our servers. In the long run, it is much easier to be down for a little while as the result of a kernel upgrade than it is to deal with an entire site (or server) being exploited.

In the hosting industry, things like this do happen, and we've obviously learned that we need to improve the communication processes around situations like this. That's something we'll certainly work on.

With that in mind, we're more than happy to do whatever we need to do to ensure that your site is up and running fine and that we honor our uptime guarantees for you. I'm also going to ask someone more familiar with the technical aspects of these upgrades to come by and post additional information.

Again, please accept my apologies. Don't hesitate to let me know if you have any other questions or concerns.
__________________
Douglas
Customer Service Manager
HostGator.com LLC
1-866-96-GATOR
#3
Dave,
Doug asked me to update this ticket with some further technical details about what happened last night. Let me begin by apologizing for what happened. It was not normal or planned, and it should not happen.

Kernel upgrades are a routine action and generally require 5-10 minutes of downtime per server. The new kernel is installed and then you have to reboot into it. If the server reboots correctly, you are down for 5 minutes. If a service fails to start, we have admins actively monitoring who can start it.

As discussed in the thread last night, new root-level exploits were released that affected the socket.c file. All of the published exploits affect protocols that we don't use, but we are wary of someone using the same vulnerability to exploit protocols we do use and wipe an entire server. So we decided to do the upgrades right away.

Because we have to actively monitor and sometimes actively correct issues, we don't do every server all at once. That is why we try to roll these out over a period of a few hours and why we don't normally give a specific 15-30 minute window.

In this case we had an issue with the kernel that was compiled: it caused some processes to hang, using 100% of the CPU and refusing to be killed. The only way to correct this was to reboot the servers. It is rare to get these full-on zombie processes; on our normally humming farm we might get one server with the issue every 3 to 4 days. After a few batches of servers had been done, I started to get reports that this was happening a lot more often. I investigated and found that the servers we had upgraded were seeing it happen far more frequently. So I made the decision to revert the change. This required changing the boot sequence and doing a second reboot back into a known good kernel. I was more aggressive with the pace of the second round of reboots, as the kernel was causing problems and I wanted to get back to a stable one faster.
However, the same bug that was causing userspace processes to hang also caused the shutdown process to hang on some servers. On these servers all services were stopped and all the drives unmounted, but the machine never actually rebooted. All of these servers had to be manually rebooted by a datacenter technician.

Normally we do rolling reboots on about 50 boxes at once. Of those, normally 1 or 2 will have an issue with a service, and we can get on it right away and get things resolved. In this case we did slightly larger batches, and the rate of servers failing to restart normally was far higher. This created a backlog of reboot requests for our datacenter, which resulted in further, longer-than-expected periods of downtime for some of the affected boxes.

So those are the technical details of exactly what went wrong. As Doug said, we are reviewing this and will be improving how we communicate in the future.

When you are aware of a risk and deciding how to respond, there is a careful balance that must be found. If we had tested these patches for 24 hours on 100 machines, we might have found the issue and prevented it from affecting as many people. If during those 24 hours 3 servers had been hacked and all the data lost, when we had a patch that could have prevented it from happening, people would be justifiably upset as well.

When I came into work at 11:30 last night, we had become aware of the vulnerability 3 hours earlier and were testing our server images and getting the patch ready. We like to give 24-72 hours' notice; in this case, giving that level of notice would have alerted hackers that we were vulnerable and would have been a delay we were not willing to risk.

I understand that this was a painful experience and that it would prompt you to ask hard questions. We will do our best to answer all of them as directly, honestly, and transparently as we can.

Beyond features and technical measurements, web hosting is fundamentally about trust. You have to trust that we will manage systems you depend on every day. I hope that you will continue to give us the opportunity to rebuild that trust over the months ahead.
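For readers who want to picture the batched rollout described above, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not HostGator's actual tooling: the function name `rolling_reboot`, the `reboot` callback, and the `max_failures_per_batch` cutoff are all hypothetical. The default batch size of 50 and the tolerance of 2 failures are taken from the "about 50 boxes" and "1 or 2 will have an issue" figures in the post.

```python
from typing import Callable, List, Tuple


def rolling_reboot(
    servers: List[str],
    reboot: Callable[[str], bool],
    batch_size: int = 50,
    max_failures_per_batch: int = 2,
) -> Tuple[List[str], List[str], List[str]]:
    """Reboot servers one batch at a time and halt if a batch fails too often.

    `reboot` is a hypothetical callback that returns True when the server
    came back up cleanly and False when it needs manual attention.
    Returns (succeeded, failed, not_attempted).
    """
    succeeded: List[str] = []
    failed: List[str] = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        batch_failed = [s for s in batch if not reboot(s)]
        failed.extend(batch_failed)
        succeeded.extend(s for s in batch if s not in batch_failed)
        if len(batch_failed) > max_failures_per_batch:
            # More failures than the usual background rate: stop here so
            # the remaining servers stay on the known good kernel.
            return succeeded, failed, servers[start + batch_size:]
    return succeeded, failed, []
```

The point of the cutoff is that an unusually bad batch stops the rollout before the backlog of manual reboots grows. Where exactly to set it is a judgment call that depends on how urgent the patch is.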









