VMworld 2020 is upon us and there are a ton of announcements related to VMware Cloud on AWS (VMC). I hope to review those here soon, but wanted to highlight another change that came around earlier this summer for SDDC versions 1.10+.
VMC provides a full VMware Software Defined Data Center (SDDC) within a specific AWS region. The storage component of the SDDC is VMware vSAN, which is configurable via Storage Policy Based Management (SPBM). When it comes to vSAN, the main way to utilize SPBM is to define the number of failures that vSAN can tolerate. This setting affects the way that VM disk data is distributed across a vSAN cluster. Each option has a different effect on space utilization and performance.
Since VMC is essentially VMware as-a-service, there are Service Level Agreements (SLAs) that come into play. One of the SLA requirements of VMC related to vSAN is the number of Failures To Tolerate (FTT), which depends on the size of a VMC cluster. To meet the SLA requirements, any VMC cluster with 5 or fewer hosts must have a minimum of FTT=1, and clusters with 6+ hosts must have a minimum of FTT=2.
Depending on your VMC use case, you definitely need to keep an eye on your cluster design, especially as you migrate VMs into VMC. Prior to SDDC version 1.10, the default vSAN policy within VMC was 1 failure – RAID-1 (Mirroring). If your cluster expanded beyond 5 hosts, it was up to an administrator to ensure that the cluster policy was manually changed to 2 failures – RAID-6 (Erasure Coding) to help maintain the VMC SLA.
Thanks to a new feature that was released mid-2020, for SDDC versions 1.10+ the cluster default policy will automatically be set to either RAID-1 or RAID-6 depending on the cluster size. If a cluster scales beyond 5 hosts, either manually or via Elastic DRS, or you create a 6+ host SDDC from scratch, the default cluster vSAN policy is automatically configured for RAID-6. This is a great feature, as these settings can very easily get overlooked in the middle of a large-scale migration.
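To make the threshold concrete, here is a minimal Python sketch of the rule as described above. The function name and the returned dictionary are my own illustration, not a VMC API; the only facts encoded are the ones from this post (5 hosts or fewer gets 1 failure with RAID-1, 6+ hosts gets 2 failures with RAID-6).

```python
def default_vsan_policy(host_count: int) -> dict:
    """Return the default cluster policy this post describes for a cluster size.

    Hypothetical helper for illustration only; it does not talk to vCenter.
    """
    if host_count < 1:
        raise ValueError("host_count must be at least 1")
    if host_count >= 6:
        # 6+ hosts: the SLA minimum is FTT=2, default becomes RAID-6
        return {"ftt": 2, "raid": "RAID-6 (Erasure Coding)"}
    # 5 hosts or fewer: the SLA minimum is FTT=1, default stays RAID-1
    return {"ftt": 1, "raid": "RAID-1 (Mirroring)"}


if __name__ == "__main__":
    for hosts in (3, 5, 6, 8):
        print(hosts, default_vsan_policy(hosts))
```

The interesting case is the jump from 5 to 6 hosts, since that is where an Elastic DRS scale-out would flip the default.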
VMs use SPBM too…
One very important thing to keep in mind is that a change to the cluster default vSAN policy does not automatically take effect for VMs that already run on that cluster. If a VMC cluster scales up or down across a threshold that changes the vSAN policy, existing VMs will still need manual intervention to address VM-level protection. You can change each VM’s policy manually, or simply migrate the VM to a new cluster and choose to use the datastore default policy.
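If you keep even a rough inventory of which FTT level each VM is using, spotting the stragglers after a scale-out is a simple comparison. The sketch below is plain Python over made-up data: `VmPolicy`, `sla_min_ftt` and `vms_below_sla` are hypothetical names, and the inventory list would have to come from whatever export or tooling you already use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VmPolicy:
    name: str  # VM display name (sample data below is invented)
    ftt: int   # failures to tolerate in the VM's current storage policy


def sla_min_ftt(host_count: int) -> int:
    """Minimum FTT from the SLA rule in this post: 6+ hosts need FTT=2."""
    return 2 if host_count >= 6 else 1


def vms_below_sla(host_count: int, vms: List[VmPolicy]) -> List[str]:
    """Names of VMs whose policy tolerates fewer failures than the SLA minimum."""
    required = sla_min_ftt(host_count)
    return [vm.name for vm in vms if vm.ftt < required]


if __name__ == "__main__":
    inventory = [
        VmPolicy("app01", ftt=1),  # created while the cluster had 4 hosts
        VmPolicy("db01", ftt=2),
        VmPolicy("web01", ftt=1),
    ]
    # The cluster has since scaled to 6 hosts, so FTT=1 VMs need attention.
    print(vms_below_sla(6, inventory))
```

Run against a 5-host cluster, the same inventory returns an empty list, which matches the SLA rule: FTT=1 is enough until the sixth host shows up.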
It would be really nice if there were guardrails built in for this, or at the very least some kind of SLA warning that alerts you when you are out of SLA “compliance.” In the meantime, it is easy enough to right-click a VM, select Edit VM Storage Policies… and change the setting there.
Once a VM’s vSAN policy is changed, vCenter will begin the process of non-disruptively resyncing the VM disk objects under the hood. Depending on the RAID / FTT level the VM is moving to, there could be some space savings or some additional consumption. You always want to be aware of how much slack space is available prior to changing the policy, as well as what the outcome of the change will be. The Edit VM Storage Policies dialog shows the projected vSAN consumption before you commit the change. Once you hit OK, you can check the status of a resync by going to Cluster Name -> Monitor -> vSAN -> Resyncing Objects.
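For a quick back-of-the-napkin estimate before you open the dialog, the nominal vSAN capacity multipliers are enough: RAID-1 at FTT=1 stores 2x the data, RAID-5 at FTT=1 about 1.33x, RAID-6 at FTT=2 1.5x, and RAID-1 at FTT=2 3x. The sketch below uses only those multipliers. It is an approximation that ignores witness components, swap objects, deduplication and compression, and the temporary space a resync needs while it runs. The names `OVERHEAD` and `consumption_change_gb` are mine.

```python
# Nominal raw-capacity multipliers per (RAID level, FTT) for vSAN policies.
OVERHEAD = {
    ("RAID-1", 1): 2.0,    # two full mirrors
    ("RAID-5", 1): 4 / 3,  # 3 data + 1 parity
    ("RAID-6", 2): 1.5,    # 4 data + 2 parity
    ("RAID-1", 2): 3.0,    # three full mirrors
}


def consumption_change_gb(used_gb: float, old: tuple, new: tuple) -> float:
    """Estimated change in raw vSAN consumption when moving between policies.

    Negative means space is freed, positive means extra space is consumed.
    """
    return used_gb * OVERHEAD[new] - used_gb * OVERHEAD[old]


if __name__ == "__main__":
    # 100 GB of VM data moving from the old default to the 6+ host default.
    delta = consumption_change_gb(100, ("RAID-1", 1), ("RAID-6", 2))
    print(f"Estimated change: {delta:+.0f} GB")  # prints "Estimated change: -50 GB"
```

So going from 1 failure – RAID-1 to 2 failures – RAID-6 should actually give space back once the resync finishes, while staying on mirroring at FTT=2 would cost an extra 100 GB for the same VM.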
While objects are resyncing, you will probably notice a vSAN warning. If you click into the warning or browse to vSAN Object Health, you will see that any VM disks that are resyncing show Reduced Availability. This warning is a normal part of the process and will go away once the resync is complete. I have even seen errors when the system can only queue up so many disks at once; in that case, some VMs may show availability errors until other disks are able to complete.
VMC isn’t one size fits all, and there are many design considerations to take into account depending on a whole slew of factors. Use case, network connectivity and workload size are just a few of the heavy hitters that can make VMC deployments fairly complex. Storage policies and how they relate to SLAs are one topic that could be handled with better automation as it pertains to this service. In the meantime, while there have been improvements and VMC is constantly iterating, remember to keep an eye on your workloads and SPBM.