Due in part to the differing views and opinions on the use of hot spare disks expressed in response to our previous post, we’ve decided to add an update for clarification.

The Problematic Aspects of Using a Hot Spare Disk

In theory, using a hot spare disk with ZFS, Solaris FMA or any other data storage environment is a good solution: it reacts automatically to damage in a Redundant Array of Independent Disks (RAID), and it does help to minimize the time the array spends in a degraded state. That being said, the goal of creating a RAID is to continue operation and not lose data in the event of a disk failure, so anything that increases the risk of data loss could be a bad idea. Let’s have a look at some of the problematic aspects of hot spare disks.

Hot Spare Disks Add Stress to Vulnerable Systems

The main problem with hot spare disks is that they allow the rebuilding (resilvering) of a system that is still actively being used as a production server. This means that, while the resilvering process is taking place, the system is also still occupied with the usual production data reads and writes.

Resilvering needs a lot of server resources, so when it is executed while the server is still in use, it has to compete with the production loads. Since it’s a low-priority task, the entire resilvering process can take a very long time (even up to a few weeks). This results in the server working at its maximum achievable throughput for weeks, which can have dire consequences for the disks (especially HDDs).

With decades’ worth of experience, we’ve realized that the use of hot spare disks in complex enterprise systems increases the probability of additional disks failing, as the resilvering process puts more and more stress on the existing disks and on the system itself.

Problems in Overall Hot Spare Disk Design

The next flaw of a hot spare disk is that it degrades over time. From the moment it is connected to the system, it keeps on working. When the time eventually comes for it to replace a damaged disk, the hot spare disk itself might simply not be in a good enough state to do so.

Another problematic aspect of hot spare disks is that they are used automatically once a disk failure is detected, so the corrupted disk might still be connected to the system. It could still try to reconnect and start working again while the hot spare disk is trying to take over its role, adding even more stress to the system. This is yet another factor that can affect the system’s overall performance and could potentially lead to data loss.

Hot Spare Disks Create a Single Point of Failure

If you’re looking to create a system with no single point of failure, a hot spare disk will not give you much confidence, given that the process of automatically replacing a failed disk has been known to occasionally fail, either partially or fully, and result in data loss.

Having spent decades providing customers with data storage solutions, we’ve heard of many cases where a hot spare disk was the reason for an entire server failure and even data loss. Automation here is risky since it can start a domino effect, especially when the data storage infrastructure has been working for years and the hardware is worn out.

These problematic aspects are why we advise not relying on hot spare disks in complex data storage architectures and using other business continuity solutions instead, such as High Availability (HA) clusters, backups and On- & Off-site Data Protection (ideally all of the aforementioned).

With the ZFS file system, it’s much easier to monitor the system and create a proper backup, which gives you the ability to retrieve data from a damaged disk and write it onto a new one. In addition, when using an HA cluster, you have the option of manually switching production from the affected node to a second one so that you can perform maintenance on the affected node.

We’d advise following this procedure once the array shows a degraded state as a result of a disk failure:

1. Move resources to the second node in your HA cluster if possible.
2. Verify the backed-up data for consistency, and verify whether the data restore mechanism works.
3. Identify the problem source, i.e., find the erroneous hard disk.
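
To put rough numbers on the resilvering concern raised in this post, here is a minimal back-of-the-envelope sketch in Python. It is not a model of how ZFS actually schedules resilvering: the function name, the 10 TB of data, the 150 MB/s disk throughput and the share of throughput left over for resilvering are all illustrative assumptions, chosen only to show how competing with production load can stretch a rebuild from hours into weeks.

```python
def estimate_resilver_hours(data_bytes, disk_mb_per_s, resilver_share):
    """Rough time, in hours, to rewrite data_bytes onto a replacement disk.

    All inputs are assumptions for illustration:
      data_bytes      - amount of data that has to be rewritten
      disk_mb_per_s   - sustained write throughput of the new disk (MB/s)
      resilver_share  - fraction of that throughput left for resilvering
                        once production I/O has taken its share (0 < x <= 1)
    """
    if data_bytes < 0:
        raise ValueError("data_bytes must not be negative")
    if disk_mb_per_s <= 0:
        raise ValueError("disk_mb_per_s must be positive")
    if not 0 < resilver_share <= 1:
        raise ValueError("resilver_share must be in the range (0, 1]")

    effective_bytes_per_s = disk_mb_per_s * 1_000_000 * resilver_share
    return data_bytes / effective_bytes_per_s / 3600


if __name__ == "__main__":
    # Hypothetical scenario: 10 TB to rewrite onto a 150 MB/s disk.
    for share in (1.0, 0.25, 0.05):
        hours = estimate_resilver_hours(10e12, 150, share)
        print(f"share={share:.2f}: {hours:.1f} h ({hours / 24:.1f} days)")
```

Under these assumed figures, an idle system finishes in roughly 18.5 hours, while a resilver that only gets 5% of the disk throughput needs a little over 15 days. That is the kind of prolonged, full-load operation described above.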