vSAN Fault Domains | Some Design Thoughts

Recently I have been working on a number of projects that use VCF and native vSAN with rack awareness requirements.

Differing Fault Domain approaches using multiple VCF Workload domains

The vSAN fault domain feature is extremely useful to ensure that data component placement considers the physical rack architecture of the datacentre.

vSAN Fault Domain & rack mapping considerations

However, as with all features, there are design impacts and operational processes to consider.

Some questions I find useful to think about when using vSAN fault domains are:

Do you need fault domains at all? Does it solve your business requirement?

What disaster event do you need to protect against? Consider all areas of the infrastructure: additional features can add complexity and, in some cases, reduce flexibility. Depending on the requirements and the physical platform, vSAN fault domains will not mask reduced redundancy at other layers of the datacentre (i.e. network or power links and their diversity).

Are you planning for object availability with automatic rebuild?

When the vSAN fault domain feature is enabled and the domains are mapped within a cluster, the default of one fault domain per ESXi host is replaced by a rack mapping. Depending on the FTT (failures to tolerate) value, a minimum number of fault domains is required. When planning the use of this feature, ensure that the impact on vSAN capacity and availability following a component failure is considered.

Do you need additional hosts for a rebuild, or are you relying on administrator intervention?
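As a rough planning aid, the relationship between FTT and fault domain count can be sketched in a few lines of Python. This is a minimal sketch, not an official sizing tool: the function names are my own, and the figures assume the commonly documented layouts for the vSAN Original Storage Architecture (2 x FTT + 1 for mirroring, 4 fault domains for RAID-5, 6 for RAID-6). Check the documentation for your vSAN release before relying on them.

```python
def min_fault_domains(ftt: int, erasure_coding: bool = False) -> int:
    """Minimum fault domains needed to place an object compliantly.

    Assumption: vSAN OSA layouts. Mirroring needs 2*FTT + 1 fault domains;
    RAID-5 (FTT=1) needs 4 and RAID-6 (FTT=2) needs 6.
    """
    if ftt < 1:
        raise ValueError("FTT must be at least 1")
    if erasure_coding:
        if ftt not in (1, 2):
            raise ValueError("erasure coding assumes FTT of 1 or 2")
        return 2 * ftt + 2
    return 2 * ftt + 1


def can_rebuild_automatically(fault_domains: int, ftt: int,
                              erasure_coding: bool = False) -> bool:
    """True if one whole fault domain can fail and vSAN still has
    somewhere to rebuild, i.e. at least one spare fault domain exists."""
    return fault_domains >= min_fault_domains(ftt, erasure_coding) + 1


# Three racks with FTT=1 mirroring is compliant, but a failed rack
# cannot be rebuilt elsewhere until it returns or an administrator acts.
print(min_fault_domains(1), can_rebuild_automatically(3, 1))
```

A fourth rack in the same example is what turns "wait for intervention" into "rebuild automatically", provided the remaining racks also have the free capacity.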

Why would a rack fail in a platform? Are there rack interdependencies? Are multiple rack failures likely, either one after another or both at the same time?

Do all the project workloads have the same platform requirements?

Can different approaches be used? Would the use of implicit fault domains (placing hosts thinly across many racks, rather than lots of servers in a smaller number of racks) be more effective from a management or cost perspective?

Consider the addition of rebuild capacity and slack space. If using vSAN 7.0 U1, review the use of the new reserved capacity controls. Some features cannot be combined with fault domains.

Each approach has value and could help with capacity and the complexity of operations. However, rack-to-rack networking and clustered cache sizing should also be considered.
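To make the "thin versus dense" trade-off concrete, here is a small sketch that estimates the worst-case share of raw capacity lost when a single rack fails. It assumes every host contributes the same raw capacity, which is a simplification, and the function name is purely illustrative.

```python
def worst_case_capacity_loss(hosts_per_rack: list[int]) -> float:
    """Fraction of raw capacity lost if the most populated rack fails.

    Assumption: all hosts contribute identical raw capacity.
    """
    total_hosts = sum(hosts_per_rack)
    if total_hosts == 0:
        raise ValueError("at least one host is required")
    return max(hosts_per_rack) / total_hosts


# The same 12 hosts, placed densely in 3 racks or thinly across 6 racks.
dense = worst_case_capacity_loss([4, 4, 4])
thin = worst_case_capacity_loss([2, 2, 2, 2, 2, 2])
print(f"dense: {dense:.0%} lost, thin: {thin:.0%} lost")
```

The thinner layout loses less capacity per rack failure and so needs less rebuild headroom, but it pushes more vSAN traffic between racks, which is where the networking consideration above comes in.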

What is the scale of the deployment?

Fault domains require physical mapping and planning. Configuring them is a task carried out after the VCF workload domain has been deployed and, if incorrectly configured or maintained, the feature could considerably impact capacity and availability.

Create a strategy/process for rack scaling following a capacity growth trigger. Consider using scripting/automation to maintain the physical mapping.
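One simple form of that automation is a drift check that compares the intended host-to-rack mapping with the fault domains actually configured. The sketch below is illustrative only: both dictionaries, the host names and the function name are hypothetical, and in practice the inputs would come from your own inventory source and from whatever tooling you use to query the cluster.

```python
def mapping_drift(intended: dict[str, str],
                  configured: dict[str, str]) -> dict[str, tuple]:
    """Return hosts whose configured fault domain differs from the
    intended rack, as {host: (intended, configured)}.

    A value of None means the host is missing from that side.
    """
    drift = {}
    for host in sorted(set(intended) | set(configured)):
        want = intended.get(host)
        have = configured.get(host)
        if want != have:
            drift[host] = (want, have)
    return drift


# Hypothetical data: esx03 was racked in rack-b but mapped to rack-a,
# and esx04 was added to the cluster without any fault domain.
intended = {"esx01": "rack-a", "esx02": "rack-a",
            "esx03": "rack-b", "esx04": "rack-b"}
configured = {"esx01": "rack-a", "esx02": "rack-a", "esx03": "rack-a"}

for host, (want, have) in mapping_drift(intended, configured).items():
    print(f"{host}: intended={want} configured={have}")
```

Run on a schedule, or as part of the host-add process, a check like this can catch mapping mistakes before they affect component placement.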

What is the impact on normal day 2 operations?

How many ESXi hosts per rack can be placed into maintenance mode with each vSAN option? What is the risk associated with the selected approach? Ensure this is well understood.

I have summarised these and other considerations, with links to useful documentation references, in the mind map below.

My vSAN Fault Domain Consideration Summary Mind Map