It’s been a little while since I’ve written a new post, thanks to a combination of forces outside my control, one of which is the holiday season playing its part in suddenly filling up lots of my free time with “stuff.” Then I got an email from NetApp support late one Saturday night: our shared SolidFire cluster had a bad node and was running in a degraded state. I quickly opened a support tunnel so NetApp support could get in and work their magic to calm things down until they could remediate the problem and figure out the next step forward.
It turned out that one node had bad NVRAM. The disks themselves were recoverable for the time being, so my block storage was still available at full capacity. I had always wondered why my SolidFire GUI showed warning and error levels at about 75% of the total block storage, but now it made perfect sense: with 4 nodes, losing one leaves roughly three quarters of the cluster’s block capacity to absorb the rebuild. Had I provisioned storage above that 75% threshold and then lost 1 of 4 nodes, I’m sure things would not have been pretty.
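Just to put rough numbers on it (these are made-up capacities, not our actual cluster), the back-of-the-envelope math behind that threshold looks something like this:

```python
def safe_provisioning_ceiling(node_capacities_tb):
    """Rough ceiling for provisioned block storage if the cluster
    must be able to rebuild after losing its largest single node."""
    total = sum(node_capacities_tb)
    worst_case_loss = max(node_capacities_tb)
    return total - worst_case_loss

# Four identical nodes: the ceiling lands at 75% of total capacity.
nodes = [10.0, 10.0, 10.0, 10.0]  # TB per node (hypothetical numbers)
ceiling = safe_provisioning_ceiling(nodes)
print(f"{ceiling:.1f} TB of {sum(nodes):.1f} TB "
      f"({ceiling / sum(nodes):.0%})")  # -> 30.0 TB of 40.0 TB (75%)
```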
In this case, block storage wasn’t an issue, but without functional NVRAM, that node couldn’t perform the SolidFire special sauce that makes it such a powerful solution. I figured replacing the node wouldn’t be all that interesting and hadn’t even considered writing a post about it, until I thought about how rarely I actually have to deal with hardware replacements like this. A lot of vendors tout how easy it is to recover from hardware failures, but when push comes to shove it often ends up being much more work than expected. Things typically just hum along in the data center, and when a failure actually occurs, it can be hard to reflect back on the process because you just want to get things back to normal and keep that consistency going.
That being said, I was a bit apprehensive going into this node replacement. Thankfully, SolidFire support was great to work with and went out of their way to ensure we had all our bases covered. The basic overview of the process was as follows (I’ve also sketched out the cluster-side steps in code after the list):
- Remove the disks on the failed node from the cluster. This was done three at a time to keep performance optimal as the data on the disks was transferred to other nodes in the cluster. Each set of 3 disks took about 15 minutes.
- Remove the failed node from the cluster.
- Pull the failed node out of the rack and replace it with the new node. Move every disk from the failed node into the same slot in the new node.
- Power on the new node, connect a crash cart, and configure the 1 GbE management network.
- If necessary, flash the node to the proper software version (in our case we did have to, based on the version baked into the new node versus what we were running).
- Log into the SolidFire GUI and add the node back into the cluster.
- Add disks 1-9 from the new node into the cluster; once those complete, add disk 0.
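For the curious, here’s roughly what the cluster-side steps look like against the Element JSON-RPC API instead of the GUI. This is my own illustration, not the procedure support walked me through: the endpoint version, credentials, node ID, and the fixed sleep between drive batches are all assumptions, so double-check the method and parameter shapes against your cluster’s API reference before trying anything like it.

```python
"""Sketch of the cluster-side node replacement steps via the SolidFire
Element JSON-RPC API. Hostnames, credentials, and IDs are placeholders."""
import time
import requests

MVIP = "https://sf-cluster.example.com/json-rpc/9.0"  # hypothetical MVIP/version
AUTH = ("admin", "password")                          # hypothetical credentials
FAILED_NODE_ID = 4                                    # hypothetical node ID

def rpc(method, params=None):
    """Issue a single Element API JSON-RPC call and return its result."""
    resp = requests.post(MVIP, auth=AUTH, verify=False,
                         json={"method": method, "params": params or {}, "id": 1})
    resp.raise_for_status()
    body = resp.json()
    if "error" in body:
        raise RuntimeError(body["error"])
    return body["result"]

# 1. Find the failed node's active drives and remove them three at a time,
#    giving the cluster time to move their data onto the remaining nodes.
drives = [d for d in rpc("ListDrives")["drives"]
          if d["nodeID"] == FAILED_NODE_ID and d["status"] == "active"]
for i in range(0, len(drives), 3):
    batch = [d["driveID"] for d in drives[i:i + 3]]
    rpc("RemoveDrives", {"drives": batch})
    time.sleep(15 * 60)  # each batch of 3 took ~15 minutes in our case

# 2. Remove the failed node itself from the cluster.
rpc("RemoveNodes", {"nodes": [FAILED_NODE_ID]})

# ... physically swap the chassis, move the disks, configure management ...

# 3. Add the replacement node back in once it shows up as pending.
pending = rpc("ListPendingNodes")["pendingNodes"]
rpc("AddNodes", {"pendingNodes": [n["pendingNodeID"] for n in pending]})

# 4. Add drives 1-9 back into the cluster, then slot 0 last.
available = [d for d in rpc("ListDrives")["drives"] if d["status"] == "available"]
rest = [d for d in available if d["slot"] != 0]
slot0 = [d for d in available if d["slot"] == 0]
rpc("AddDrives", {"drives": [{"driveID": d["driveID"]} for d in rest]})
rpc("AddDrives", {"drives": [{"driveID": d["driveID"]} for d in slot0]})
```

In real life you’d poll the cluster for the data sync to finish rather than sleeping on a timer, but the GUI handles all of that pacing for you, which is what we actually used.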
And that was about it! Everything went exactly as expected, which was a welcome surprise. Typically when I make trips out to the data center, I hope for the best but usually end up running 1-2 hours behind my estimate. I definitely give two thumbs up to SolidFire support for the care I got during the replacement, as well as to the SolidFire team for truly delivering on what they promised in cluster resiliency.