This is my first post fighting the FUD for Nutanix. I recently saw an image on Twitter spreading FUD about disk failure and data loss when using Nutanix. At Nutanix, we have a lot of different hardware models, and for people who don’t keep up with the technology, we also OEM with Dell and Lenovo, who have specific models of their own.
Even with the different hardware models, our software handles various levels of failure with ease. The data consistency and resiliency of the software is not dependent on the hardware it runs on, be it a disk failure, a node failure, or a block failure. There are varying levels of redundancy that handle all of the above.
Let’s take the example of a disk failure or a node failure with any number of disks.
An excerpt from the Nutanix Bible – “A disk failure can be characterized as just that, a disk which has either been removed, had a HW failure, or is experiencing I/O errors and has been proactively removed.”
VM impact in the case of a disk failure (or multiple disk failures) is as follows:
HA event: No
Failed I/Os: No
Latency: No impact
Impact on Data Availability or Data Consistency: None
In the event of a disk failure, a Curator scan (MapReduce Framework) will occur immediately. It will scan the metadata (Cassandra) to find the data previously hosted on the failed disk and the nodes / disks hosting the replicas.
Once it has found the data that needs to be “re-replicated”, it will distribute the replication tasks to the nodes throughout the cluster. An important thing to keep in mind is that all the nodes in the cluster that hold replicas of the failed disk’s data participate in the re-replication. Also, the data on each node is distributed across multiple nodes and blocks, so the failure of one or more disks in the same node doesn’t change how the scenario above works.
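To make the fan-out concrete, here is a minimal sketch of how a Curator-style scan might distribute re-replication work so that every surviving replica owner shares the copy load. All names and data structures here are illustrative assumptions, not actual Nutanix internals.

```python
# Hypothetical sketch: after a disk failure, find every extent that lost
# a replica and assign the copy task to a surviving replica holder.
from collections import defaultdict

def plan_rereplication(extent_map, failed_disk, healthy_disks):
    """extent_map: {extent_id: [disks holding a replica]}.
    Returns {source_disk: [(extent_id, target_disk)]} so the copy work
    is spread across all disks that still hold affected data."""
    tasks = defaultdict(list)
    for extent, replicas in extent_map.items():
        if failed_disk not in replicas:
            continue  # this extent lost nothing
        survivors = [d for d in replicas if d != failed_disk]
        source = survivors[0]  # any surviving replica can serve as source
        # pick a target that does not already hold a replica of this extent
        target = next(d for d in healthy_disks if d not in replicas)
        tasks[source].append((extent, target))
    return dict(tasks)

extent_map = {
    "e1": ["diskA", "diskB"],
    "e2": ["diskB", "diskC"],
    "e3": ["diskA", "diskC"],
}
plan = plan_rereplication(extent_map, "diskB", ["diskA", "diskC", "diskD"])
```

Note how the two affected extents get different source disks, so no single node becomes a rebuild bottleneck.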
Node failure is similar to any hardware failure on any server vendor’s kit. It happens because it’s hardware, and hardware is not designed to be resilient. It’s the software that needs to be resilient, and Nutanix’s software is completely distributed. I should know: my Master’s major was Networking and Distributed Systems. This is why using commodity hardware makes perfect sense, and we use software to provide resiliency. A node failure results in an HA event for the VMs. Once restarted, the VMs will continue to perform their I/Os as usual, handled by the local CVMs.
Everything else happens exactly as it does in the event of a disk failure: a Curator scan finds the data and it is re-replicated. If the node is down for more than 30 minutes, its CVM will be removed from the metadata ring. Once the CVM is back and stable for more than 5 minutes, it will be added back into the ring.
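The ring membership rule above boils down to two timers. A minimal sketch, assuming a simple polling model (the function name and signature are mine, not Nutanix code):

```python
# Illustrative sketch of the metadata-ring membership rule described
# above: evict a CVM after 30 minutes down, re-admit it once it has
# been back and stable for 5 minutes.
DOWN_EVICT_SECS = 30 * 60
STABLE_READMIT_SECS = 5 * 60

def ring_action(in_ring: bool, seconds_down: int, seconds_stable: int) -> str:
    if in_ring and seconds_down > DOWN_EVICT_SECS:
        return "remove_from_ring"
    if not in_ring and seconds_stable > STABLE_READMIT_SECS:
        return "add_to_ring"
    return "no_change"
```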
If RF3 is enabled and the initial cluster size is five or more blocks, the same level of failure is protected against at the block layer. This is called block awareness, and it is turned on automatically: RF2/RF3 plus five or more blocks at cluster creation enables block awareness for Cassandra. If you want to read more about that, head on over to Steven Poitras’ “The Nutanix Bible“.
HA event: Yes
Failed I/Os: No
Latency: No impact
Data Consistency: No impact
So let’s come back to the FUD about not being able to sustain a 5-disk failure. Consider this: for anyone who doesn’t know our hardware platform, we have multiple nodes with multiple disks in each of them. Take our most storage-dense platform, the NX-8150, which has 24 disks in each node, or another of our best sellers, the NX-3460, which has 4 nodes in each block. With RF3, we can tolerate the failure of a total of 48 disks or 2 nodes in the NX-8150, or 8 nodes and 2 blocks with the NX-3460. 5 disks … PFFT! Let’s kick it up a notch: even after the failure of 2 nodes, because all the remaining nodes participate in the rebuild of the data, we will most probably have the data re-replicated before another failure happens. And there is an algorithm which places the data across blocks as the data protection factors are applied. So unless someone magically turns off three blocks and the specific nodes on which the data resides, there is no reason why a node, block, or 2-block failure should affect the availability of the data.
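The arithmetic behind those numbers is simple enough to check by hand. Using only the figures from this post (RF3 survives two concurrent replica losses; the NX-8150 packs 24 disks per node):

```python
# Back-of-the-envelope failure-tolerance arithmetic for the examples
# above; the per-node disk count comes from the post itself.
def tolerable_failures(rf: int) -> int:
    return rf - 1  # rf copies survive rf - 1 simultaneous losses

DISKS_PER_NODE_NX8150 = 24

nodes_tolerated = tolerable_failures(3)                      # RF3 -> 2 nodes
disks_tolerated = nodes_tolerated * DISKS_PER_NODE_NX8150    # 48 disks
```

A 5-disk failure is well inside that envelope, as long as the losses respect the replica placement, which the placement algorithm above is designed to ensure.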
Let me ask you this: how many times have you had more than 3 different UCS servers, HP servers, or any other servers fail on you at the same time? A data center meltdown or a fire in the data center — well, that’s what your DR plan is supposed to help you with. If you don’t have one, come speak to one of us at Nutanix and we’ll take care of it :).
Now that I’ve refuted the misinformation about node and block failure, let’s look at the other part of the story that was conjured up: data loss. There is a very big difference between data availability and data loss.
Data Availability vs Data Loss
Let’s for a moment assume that there has been a catastrophic event which has rendered 3 or more blocks out of a cluster of 5 unavailable. The cluster will now become unusable, i.e. the services which run the cluster are down and no I/O will be sent to or received from the cluster. This is when the data that’s held in the DSF is not accessible.
Is the data still robust? YES.
Is the data still consistent? YES.
Will the data be available when the nodes are restarted? YES.
Does this mean that the data inside the cluster is no longer available even after the nodes are brought up? BIG FAT no. The data inside the cluster is still available. As soon as you bring up a node or a block which was previously down, and as soon as it fulfils the cluster startup criteria, the data starts replicating to protect against further failures. Want further proof? Head on over to Cameron Stockwell’s site and check this out. This shows how resilient our infrastructure is.
Data loss is a scenario where data is corrupted or misaligned when the nodes or servers go down. Almost all the data (hot and cold) resides in persistent media: SSD and HDD. So even when the nodes are not available, the data inside the nodes is still intact and not corrupted. Regardless of the number of nodes that are unavailable, or whether the cluster services are down, the data inside the nodes remains intact. Hence the claims of “data loss” are completely incorrect.
The only scenario where there is data loss is if all the hard disks and SSDs are fried due to electrical fluctuations, or if the disks in the nodes have been tampered with and the positioning of the hard drives has been changed manually. Does this happen often? I don’t think so.
Capacity Optimization or Space Availability.
The Acropolis platform has a very unique set of capacity optimisation techniques which operate on the Distributed Storage Fabric and can be used individually or in combination.
All of these techniques are implemented in the software layer and don’t require any specific hardware components. Neither is there a vast and extensive HCL to compare and lose sleep over. Since the techniques are implemented in the software layer, all the improvements made to these are provided to the clients both new and existing.
At Nutanix, the following capacity optimizations are available:
Compression – Both inline and post process
Deduplication – Both SSD and HDD layers
Erasure Coding – Applied on cold data to increase space optimisation.
For a detailed description of how Nutanix has implemented these, head on over to the Nutanix Bible, where Steven Poitras has explained it better than I ever can. But let me provide a brief summary of each topic.
Inline Compression: Inline compression will compress sequential streams of data or large I/O sizes in memory before it is written to disk, while offline compression will initially write the data as normal (in an un-compressed state) and then leverage the Curator framework to compress the data cluster wide. When inline compression is enabled but the I/Os are random in nature, the data will be written un-compressed in the OpLog, coalesced, and then compressed in memory before being written to the Extent Store. The Google Snappy compression library is leveraged which provides good compression ratios with minimal computational overhead and extremely fast compression / decompression rates.
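The decision described above — compress sequential or large writes in memory, send random small writes to the OpLog uncompressed — can be sketched in a few lines. Nutanix uses Google Snappy; zlib stands in here purely so the example is self-contained, and the size threshold is an illustrative assumption, not a real tunable.

```python
# Minimal sketch of the inline-compression write path described above.
# zlib substitutes for Snappy; INLINE_MIN_IO is a made-up threshold.
import zlib

INLINE_MIN_IO = 64 * 1024  # illustrative cutoff for "large I/O"

def write_path(data: bytes, sequential: bool):
    if sequential or len(data) >= INLINE_MIN_IO:
        # sequential stream or large I/O: compress in memory, then persist
        return ("extent_store", zlib.compress(data))
    # random small write: land in the OpLog uncompressed; it will be
    # coalesced and compressed later before reaching the Extent Store
    return ("oplog", data)

tier, payload = write_path(b"A" * 128 * 1024, sequential=True)
```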
Post Process Compression: For offline compression, all new write I/O is written in an un-compressed state and follows the normal DSF I/O path. After the compression delay (configurable) is met and the data has become cold (down-migrated to the HDD tier via ILM), the data is eligible to become compressed. Offline compression uses the Curator MapReduce framework and all nodes will perform compression tasks. Compression tasks will be throttled by Chronos.
By using all the available nodes to perform the compression, it is not only done efficiently, but the data path for new I/O is never touched, so all the hot data is still written to and read from the SSD tier.
Deduplication: The Elastic Dedupe Engine is a software-based feature of DSF which allows for data deduplication in the capacity (HDD) and performance (SSD/Memory) tiers. Streams of data are fingerprinted during ingest using a SHA-1 hash at a 16K granularity. This fingerprint is only done on data ingest and is then stored persistently as part of the written block’s metadata. NOTE: Initially a 4K granularity was used for fingerprinting, however after testing 16K offered the best blend of deduplication with reduced metadata overhead. Deduplicated data is pulled into the unified cache at a 4K granularity.
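The ingest-time fingerprinting above is easy to illustrate: SHA-1 over 16K chunks, with the fingerprint acting as the dedupe index key. This is a sketch of the idea only — the function and index layout are mine, not the Elastic Dedupe Engine’s actual structures.

```python
# Sketch of ingest-time fingerprinting: SHA-1 at 16K granularity, as
# described above. A chunk whose fingerprint is already known stores
# only a metadata reference instead of a second copy of the data.
import hashlib

CHUNK = 16 * 1024

def ingest(stream: bytes, index: dict) -> int:
    """Fingerprint each 16K chunk; return how many were deduplicated."""
    deduped = 0
    for off in range(0, len(stream), CHUNK):
        chunk = stream[off:off + CHUNK]
        fp = hashlib.sha1(chunk).hexdigest()
        if fp in index:
            deduped += 1       # duplicate: reference existing data
        else:
            index[fp] = off    # first copy: store data + fingerprint
    return deduped

index = {}
first = ingest(b"x" * CHUNK * 4, index)  # 4 identical chunks: 3 deduped
```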
Erasure Coding: The Nutanix platform leverages the replication factor (RF) for data protection and availability. This method provides the highest degree of availability because it does not require reading from more than one storage location or data re-computation on failure. However, this comes at the cost of storage resources, as full copies are required.
To provide a balance between availability while reducing the amount of storage required, DSF provides the ability to encode data using erasure codes (EC).
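The trade-off is easy to quantify. A back-of-the-envelope comparison of raw capacity used per unit of logical data, assuming a hypothetical 4/1 erasure-coded strip (the strip size is an illustrative choice, not a statement about DSF defaults):

```python
# Storage overhead: full replicas (RF) vs an erasure-coded strip of
# N data blocks + K parity blocks. Figures are illustrative.
def rf_overhead(rf: int) -> float:
    return float(rf)  # rf full copies per unit of data

def ec_overhead(data_blocks: int, parity_blocks: int) -> float:
    return (data_blocks + parity_blocks) / data_blocks

rf2 = rf_overhead(2)        # 2.0x raw capacity used
ec41 = ec_overhead(4, 1)    # 1.25x raw capacity used
savings = 1 - ec41 / rf2    # fraction of raw space saved vs RF2
```

With these assumptions, a 4/1 strip uses 1.25x raw capacity against RF2’s 2.0x — roughly 37.5% less raw space for cold data, at the cost of parity re-computation on failure.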
The next part of this blog, coming up soon, will look at the actual usable space in a Nutanix environment when these space optimisation techniques are used.
Steven Poitras’ (@stevenpoitras) Nutanix Bible for the diagrams and text explaining Compression, Deduplication and EC-X.
Josh Odgers (@josh_odgers) for providing his input into this post & Jason Yeo (@jasonyzs888) for proof reading it.