vSAN Fault Domains | Some Design Thoughts

Recently I have been working on a number of projects using VCF and native vSAN with rack-awareness requirements.

Differing Fault Domain approaches using multiple VCF Workload domains

The vSAN fault domain feature is extremely useful to ensure that data component placement considers the physical rack architecture of the datacentre.

vSAN Fault Domain & rack mapping considerations

However, as with all features, there are design impacts and operational processes to consider.

Some useful questions to think about when using vSAN fault domains are:

Do you need fault domains at all? Does it solve your business requirement?

What disaster event do you need to protect against? Consider all areas of the infrastructure; additional features can add complexity and in some cases reduce flexibility. Depending on the requirements and the physical platform, vSAN fault domains will not mask reduced redundancy at other layers of the datacentre (i.e. network/power links and diversity).

Are you planning for object availability with automatic rebuild?

When the vSAN fault domain feature is enabled and the domains are mapped within a cluster, the default of one fault domain per ESXi host is changed to a rack mapping. Depending on the FTT value, there is a minimum number of fault domains required. When planning the use of this feature, ensure the impact on vSAN capacity/availability following a component failure is considered.
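As a rough illustration of these minimums, the sketch below is my own back-of-envelope helper, not a VMware tool. It encodes the commonly documented vSAN rules: RAID-1 mirroring needs 2 × FTT + 1 fault domains, while RAID-5 and RAID-6 erasure coding need 4 and 6 respectively. Always verify against the vSAN documentation for your specific version.

```python
# Back-of-envelope sketch (assumptions mine, not an official VMware tool):
# estimate the minimum number of vSAN fault domains (racks) for a policy.
# Assumed rules: RAID-1 mirroring needs 2*FTT + 1 domains (data copies
# plus witness); RAID-5 needs 4 (FTT=1 only); RAID-6 needs 6 (FTT=2 only).

def min_fault_domains(ftt: int, raid: str) -> int:
    if raid == "RAID-1":
        return 2 * ftt + 1
    if raid == "RAID-5":
        if ftt != 1:
            raise ValueError("RAID-5 erasure coding implies FTT=1")
        return 4
    if raid == "RAID-6":
        if ftt != 2:
            raise ValueError("RAID-6 erasure coding implies FTT=2")
        return 6
    raise ValueError(f"unknown protection scheme: {raid}")

def recommended_fault_domains(ftt: int, raid: str) -> int:
    # One spare domain above the minimum gives somewhere to rebuild
    # components automatically after losing an entire rack.
    return min_fault_domains(ftt, raid) + 1

print(min_fault_domains(1, "RAID-1"))          # 3
print(recommended_fault_domains(2, "RAID-1"))  # 6
```

The "+1" in the recommended figure is exactly the automatic-rebuild question above: with only the bare minimum of domains, a rack failure leaves nowhere to re-protect the data.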

Do you need additional hosts for a rebuild, or are you relying on administrator intervention?

Why would a rack fail in a platform? Are there rack interdependencies? Are multiple rack failures likely, either one after another or both at the same time?

Do all the project workloads have the same platform requirements?

Can different approaches be used? Would the use of implicit fault domains, spreading hosts thinly across many racks rather than concentrating many servers in fewer racks, be more effective from a management or cost perspective?

Consider the addition of rebuild capacity and slack space. If using vSAN 7 U1, review the new reserved capacity controls. Note that some features cannot be combined with fault domains.
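For a feel of the numbers, here is a hedged back-of-envelope sketch. The 10% operations reserve and the one-host rebuild reserve are my own assumptions, loosely modelled on the vSAN 7 U1 reserved capacity concepts; use the official sizing tools for real designs.

```python
# Rough capacity sketch (my own back-of-envelope, not an official sizer):
# estimate usable raw capacity after setting aside rebuild/slack space.
# Assumptions: host rebuild reserve ~= one host's share of raw capacity
# (enough to re-protect after a single host failure), plus an operations
# reserve fraction for transient overhead (resyncs, rebalancing).

def usable_capacity_tb(raw_tb: float, hosts: int,
                       ops_reserve: float = 0.10) -> float:
    host_rebuild_reserve = raw_tb / hosts   # capacity of one host
    return raw_tb - host_rebuild_reserve - raw_tb * ops_reserve

# Example: 10 hosts of 20 TB raw each
print(round(usable_capacity_tb(200.0, 10), 1))  # 160.0
```

Note this is raw capacity before FTT/RAID overhead; mirroring or erasure coding reduces the figure further.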

Each approach has value and could help with capacity and operational complexity; however, rack-to-rack networking and cluster cache sizing should also be considered.

What is the scale of the deployment?

Fault domains require physical mapping and planning. Configuration is a post-deployment task for a VCF workload domain, and if incorrectly configured or maintained, the feature could impact capacity and availability considerably.

Create a strategy/process for rack scaling following a capacity growth trigger. Consider using scripting/automation to maintain physical mapping.

What is the impact to the normal day 2 operations?

How many ESXi hosts per rack can be placed into maintenance with each vSAN option? What is the risk associated with the selected approach? Ensure this is well understood.
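One simple way to reason about this is to check whether the policy's minimum fault-domain count still holds when whole racks are effectively out of service; the helper below is a hypothetical sketch of that check, not an official calculation.

```python
# Hypothetical sketch (assumptions mine): with explicit fault domains,
# taking hosts in a rack offline reduces that whole rack's ability to
# hold components. This checks whether a policy's minimum fault-domain
# count is still met when some racks are unavailable for placement.

def tolerates_rack_outage(total_racks: int, min_domains: int,
                          racks_in_maintenance: int = 1) -> bool:
    return total_racks - racks_in_maintenance >= min_domains

print(tolerates_rack_outage(5, 3))  # True: 4 racks remain for FTT=1 mirroring
print(tolerates_rack_outage(3, 3))  # False: no spare rack during maintenance
```

The second case is the risk to understand: at the bare minimum domain count, any rack in maintenance leaves new writes and rebuilds without a compliant placement target.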

I have summarised these and other considerations, with links to useful documentation references, in the mind map below.

My vSAN Fault Domain Consideration Summary Mind Map

Cleveland VMUG Webinar – Operational HCI & Thank you

With the new HCI releases in the VMware stack, I have been busy updating my architectural and operational content for HCI.

I recently had the pleasure to try out some of my updated material with a focus on VVD and VCF at the Cleveland VMUG. 

Although it had to be over a webinar, it was great fun to chat with everyone.

Thank you to Andy Bidlen and Richard Henry for the great opportunity.

I hope to visit soon!


HCI Technical comfort using vSAN, VVD, and VCF at my VMworld 2019 US Sessions.

It has been exactly a year this week since I joined the Advanced Customer Engagement team at VMware. It's been a fast-moving and fun experience.

A lot of my time has been spent working with customers on their larger projects, with a focus on vSAN and HCI deployments, growth, design considerations, etc. I have also had the opportunity to present a fair few workshops on operationalizing vSAN/HCI throughout Europe.

When the VMworld US Call for Papers came along, I thought I would base my submissions on this experience, namely achieving technical comfort and creating predictable infrastructure when using VMware HCI products.

I am excited to say I have three sessions at the US event this year: two solo breakouts and a third with a colleague of mine from the vSAN GSS team.

I’m looking forward to discussing everything HCI at the event. If you would like to register for any of the sessions, the info is below, and you can confirm in the schedule builder using this link.




NSX-T enablement mind map, & useful links

Following the amazing release of NSX-T 2.4 and the recent VVD/VCF enhancements that now support the network virtualization platform, I have spent some time reviewing and self-enabling more deeply on NSX-T.

NSX-T Architecture Mind Map

From an architect's perspective, the maturity of the network and security features within NSX-T across platform architectures makes it a requirement to be well versed in the product.

With HCX supported from 2.4.1, allowing the mass onboarding of workloads from vSphere and NSX-V, its use within larger-scale HCI platforms is going to become more common.

Below is a list of my recommended links and my architecture overview mind map.

There is also a new VMware Education course released for the 2.4 version.

Due to the dynamic and fast-moving nature of NSX releases, the content below is useful for understanding; however, features change and improve between releases. It will need to be reviewed against the NSX-T release notes to ensure the approaches and facts are valid for the specific build being used.


NSX-T use case & architecture – great overviews of components & data flow.
Feature deep dives, packet walks, & good practices.
Operational considerations – upgrades, automation & monitoring.
Design thoughts – reference architecture discussions.
NSX-T 2.4 update videos & demos.
Useful related articles

VVD 5.0 release, mind map and Cloud Builder impact.

Version 5.0 of the VMware Validated Design (VVD) approach to the SDDC was just released this week.  The amount of work that has gone into this product from the team is very impressive.
As an architect, meeting business requirements is always the primary concern; however, ease of implementation and care for operationalizing is often the rate-determining factor for new technology (e.g. experience with vSAN, NSX, and the SDDC product stack).

Also, it is common in larger enterprises for the project architect not to be the person who configures the final platform for business use. The architect will often create a design and assist via discussions with network, security, and compliance teams, etc. This can increase project elapsed time and increase the risk of incorrect deployment.

Within the new VVD 5.0 release, there are the expected improvements, such as an updated bill of materials using the latest software suite (vSphere 6.7 U1, vSAN, vROps 7, etc.). However, another improvement which I feel can really help other architects deploy VMware SDDC solutions faster and more consistently for a variety of use cases is the new automated deployment approach.

This new process (similar to the VCF approach) not only helps reduce the time to deploy the actual VMware software components, but also allows an architect to make key decisions, such as OEM choice, sizing, and local integration standards, and to provide an implementation spreadsheet with a single ISO file of the VVD software bundle to be used for implementation.

These spreadsheets can be used during an accelerated design-process review, using VVD material, for multi-team understanding of essential prerequisites (i.e. network, DNS, authentication, security).

Once ready to deploy, the process uses the new Cloud Builder appliance, which is deployed by the implementation team and configured for the automation.

The configuration spreadsheets are converted by the Cloud Builder appliance into a JSON format, and once validated, the platform can be deployed using a web portal (there are a few post-deployment clean-up tasks, such as updates to host profiles and passwords).

High level VVD 5.0 Deployment process flow.

The end result is a known, consistent platform based on a fully supported design using VVD documentation, and a set of reference endpoint information from the spreadsheets.

I have been working with the product in my lab today and while deploying I created an updated mind map for the latest VVD 5.0 release. I have also included my previous VVD post and recommended reference material in the links below.

VVD 5.0 Mind map

Useful Links