SolarVPS :: WinVPS35 Hardware failure/outage

Posted by JanusMahzon, 02-07-2010, 12:46 AM
A VPS server had a RAID array failure and the host has taken over a day to get it back up (it still is not up). They said it may need a bit-by-bit restore that will take several hours. Is this legit?
Posted by htbsales, 02-07-2010, 01:21 AM
It depends on whether your service provider has direct access to the servers. There may be a delay in communicating progress to you, since they have to get the information from their provider(s). Based on the length of time, it sounds like a RAID 5+ failure, and the provider may have to recover some of the information from backups or with data recovery tools, depending on the situation. In our experience, this process can take many hours. However, each situation is unique, and restore times vary by RAID level, amount of data, and server hardware specifications.
Posted by IGXHost, 02-07-2010, 01:39 AM
RAID-5 always scares me especially when it's for a designated hosting environment. It is probably legit. I believe if it was RAID-10 it may have been much quicker to fix.
Posted by TonyB, 02-07-2010, 01:59 AM
Well, when I hear RAID failure I think the entire array was lost, so they're using backups now to fix it. It does not make sense otherwise, as with servers you're typically either swapping the bad drive on the fly or shutting the server down, replacing the drive, then bringing it back up. In both cases it will rebuild in the background, whether it's RAID 1, 5, 6, 10, etc. So I'm leaning towards them having to use backups, which could take quite a while depending on the amount of data and the type of backup systems they have in place.
Posted by borgdrone7, 02-07-2010, 03:54 AM
Yeah, I guess somehow their complete RAID array failed, because otherwise they could just swap the failed HDD and rebuild the array. I guess in that case there is a possibility you will not get your most recent data.
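To illustrate why a single failed drive is normally recoverable, as the posts above describe, here is a minimal sketch of parity-based reconstruction in the style of RAID 5. It assumes one stripe of three data blocks plus one XOR parity block. The function names are hypothetical, and real controllers also handle striping, parity rotation, and much larger blocks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks):
    """Recover the one missing block of a stripe from all surviving blocks.

    Because parity = d0 ^ d1 ^ d2, XOR-ing the survivors (data + parity)
    yields the missing block. This only works when exactly one block is lost.
    """
    return xor_blocks(surviving_blocks)

# Hypothetical stripe: three data blocks and their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second drive and rebuilding its block.
recovered = rebuild_missing([data[0], data[2], parity])
```

The same arithmetic shows the limit of the scheme: once more than one block per stripe is lost or corrupted, the survivors no longer determine the missing data, which is consistent with a whole-array failure requiring backups or data extraction.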
Posted by SolarVPS|Justin, 02-07-2010, 04:02 AM
Greetings! I am guessing, based on the specific terminology used, that you're referring to our WinVPS35 RAID failure.

To clarify: this occurred on a 500GB+ RAID10 array. We certainly do wish that this was a simple single drive failure, as we'd probably have had it fixed within an hour or two of it failing, at most. Unfortunately, this escalated into corruption within the drives, so the problem encompasses the entire array. After several attempts to rebuild the partition tables and boot the node, we've decided that the only option we have is to extract the data from the array itself and rebuild from there.

Normally, we'd use our own disaster recovery backups to rectify the situation. However, this particular node is one of the very few nodes left that are not yet on our internal backup system, which is still in the process of being fully implemented (progress is estimated at about 95% completion on our entire network, which spans 7 facilities in 5 cities and several dozen VLANs). When finished, our backup system will take full disaster recovery backups of each individual container on a nightly basis, which are then stored on a dedicated backup node. Due to a few bugs in our backup system, however, full implementation has not yet been achieved. Thus, our only course of action is to extract the data from the drives the hard way.

We certainly don't wish to hide the fact that this is a major failure, and we're certainly not going to hand our customers veiled half-truths on the situation. In fact, I'd say that this is probably the most severe hardware failure we've experienced as a company.

We fully expect it to take several hours to get the data into a bootable state that is usable and stable. First, the data must be extracted from the failed drives and checked for further corruption. This process is highly time-consuming, and with somewhere between half and three quarters of a terabyte of data to process, the time needed adds up. Then, once we have the data, we need to restore the RAID10 array itself to bring the node back up to spec. Our CEO is at the facility right now working on this process, along with two of our remote technicians, who will be taking over and monitoring the data import.

We have not yet 100% verified the cause of this failure, though we suspect that it has to do with the RAID controller in the node itself. We'll be examining the situation once we have gotten everyone back online, as that is our number one priority. We are also very frustrated with the situation and can understand if any of our customers share that frustration.

Additionally, we would be more than happy to build you a new container on another node with the same IPs if you happen to have a backup of your data that can be imported. That would ultimately be the fastest means of getting you back online that we currently have.

We appreciate your understanding in the matter. It has been and will be a very long night for us. Good thing I have a sizeable stash of coffee. Thanks!

Last edited by SolarVPS|Justin; 02-07-2010 at 04:12 AM.
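For a rough sense of why extracting this much data takes hours, here is a back-of-the-envelope sketch. The 500 GB and 750 GB figures come from the post above. The 50 MB/s sustained read rate is purely an assumption for illustration; the actual throughput of the recovery was not stated, and reads from damaged drives are often far slower. The function name is hypothetical.

```python
def estimate_hours(size_gb, throughput_mb_per_s):
    """Estimate the hours needed to copy size_gb at a sustained throughput.

    Uses decimal units (1 GB = 1000 MB). Ignores verification passes,
    retries on bad sectors, and the later array rebuild.
    """
    seconds = (size_gb * 1000) / throughput_mb_per_s
    return seconds / 3600

ASSUMED_MB_PER_S = 50  # assumption, not a figure from the thread

low = estimate_hours(500, ASSUMED_MB_PER_S)   # roughly 2.8 hours
high = estimate_hours(750, ASSUMED_MB_PER_S)  # roughly 4.2 hours
```

Even under this optimistic assumption, the copy alone lands in the "several hours" range before any corruption checks or the RAID10 rebuild are counted.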
Posted by SolarVPS|Justin, 02-07-2010, 07:55 PM
Unfortunately, the data on this server has become a complete loss. The index tables on the three disk drives are corrupted beyond repair. As such, all semblance of data structure has been entirely lost. The below notification has been sent to all of the customers on WinVPS35:
