Following on from the previous post, this post will concentrate on finding the solution to the symptoms mentioned below:
Symptom 1: Cross site vMotion doesn’t happen on Metro Cluster.
Symptom 2: DRS doesn’t balance the load on either site (8 hosts at each site). Host utilisation varies from 50% to 90%.
Symptom 3: Management Network access is really slow.
Symptom 4: VM access is fine from RDP but console access from vSphere Client is unavailable or slow.
Symptom 5: DRS migrations have either failed or been waiting for services to be available.
Symptom 6: Manual vMotion is really slow too.
Symptom 7: All Management access is lost (last one).
When looking at a problem, you have to cross off whatever doesn’t match the criteria. Let’s take the pillars of virtualisation: Compute, Network, Storage & Cluster Config.
Based on the symptoms above, it doesn’t look like the compute layer is the problem. Let’s narrow it down further. We’ve confirmed the following:
- VMs don’t appear to be hung (RDP access confirms it).
- No performance impact because of the symptoms above.
- VMs are responsive when connected to RDP. There appears to be no loss of connectivity to the VMs due to performance issues.
- The 50% memory and CPU failover capacity (stretched metro cluster) has not been reduced.
- No HA events, failover events, network isolation events.
Based on the symptoms, storage doesn’t look like the culprit either. We’ve confirmed the following to rule out storage:
- Distributed datastores are presented by VPLEX, so there is no single point of failure on the storage end.
- VPLEX ISL intact. VPLEX Management intact (Separate VLAN to the ESX Management Network).
- VPLEX Cluster status available
- VMAX datastores are available and not under stress (plenty of space / IOPS available).
If there are no issues with the network either, we have to go back to the drawing board. So let’s look at the networking in detail.
Here is a logical configuration of the network design (how it should be across all the hosts):
VIC 1240 CNA configured for 4 NICs and 2 HBAs
2 NICs configured for Standard vSwitch – Running Management and Nexus Traffic
2 NICs configured for Nexus 1000v – Running VM Networks and vMotion.
The Management NICs are connected to the dedicated management uplink switch (Catalyst 3750) via Nexus 5k.
We grabbed a configuration backup that had been taken using RVTools and compared it against the design. Some hosts had the correct configuration as above, while others had the following mistakes:
- vMotion and Management were selected to go over the same vmknic.
- There was no QoS configured for Management or vMotion traffic at UCS / Nexus Layer.
- Some hosts had a single uplink configured for each vSwitch (Standard and Nexus 1000v). Yep, you read that correctly. SINGLE UPLINK.
- No redundancy configured at the Fabric level for NIC Fabric failover.
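A check like this can be scripted rather than eyeballed host by host. The following is a minimal sketch only: the `Host`, `VSwitch` and `VmkNic` records, the host names and the sample data are all made up for illustration and are not the RVTools export format. It flags the two host-level mistakes from the list above (a single uplink on a vSwitch, and Management and vMotion sharing one vmknic); QoS and fabric failover sit at the UCS / Nexus layer and are not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VSwitch:
    name: str
    uplinks: tuple  # physical NIC names, e.g. ("vmnic0", "vmnic1")


@dataclass(frozen=True)
class VmkNic:
    name: str
    services: frozenset  # services enabled on this vmknic


@dataclass(frozen=True)
class Host:
    name: str
    vswitches: tuple
    vmknics: tuple


def audit_host(host):
    """Return a list of findings for one host (empty list means clean)."""
    findings = []
    # Every vSwitch in the intended design has two uplinks.
    for vsw in host.vswitches:
        if len(vsw.uplinks) < 2:
            findings.append(
                f"{host.name}: {vsw.name} has {len(vsw.uplinks)} uplink(s), expected 2"
            )
    # Management and vMotion should not share a vmknic.
    for nic in host.vmknics:
        if {"management", "vmotion"} <= nic.services:
            findings.append(
                f"{host.name}: {nic.name} carries both management and vMotion"
            )
    return findings


# Hypothetical sample data: one host matching the design, one that does not.
good = Host(
    "esx01",
    (VSwitch("vSwitch0", ("vmnic0", "vmnic1")), VSwitch("n1kv", ("vmnic2", "vmnic3"))),
    (VmkNic("vmk0", frozenset({"management"})), VmkNic("vmk1", frozenset({"vmotion"}))),
)
bad = Host(
    "esx02",
    (VSwitch("vSwitch0", ("vmnic0",)), VSwitch("n1kv", ("vmnic2",))),
    (VmkNic("vmk0", frozenset({"management", "vmotion"})),),
)

for host in (good, bad):
    for finding in audit_host(host):
        print(finding)
```

With the sample data above, only the second host produces findings. In practice the records would be built from whatever inventory export is to hand.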
So we remediated the issues above, which resulted in all the symptoms being rectified except one. Symptom 1 – Cross-site VM migration (automated to adhere to site bias) was still off... way off.
So off we go to look at the cluster config.
On checking the cluster config, we found the relevant host and VM groups for each site. The DRS rules were configured using a “Should” rule to keep the “VMs at Site1” on “Hosts at Site1” and vice versa for Site2.
We manually ran DRS and it migrated a few VMs, but still not all of them to the correct sites. Delving deeper into the VM Groups and Host Groups, we found that not all the VMs were part of the VM Groups. This explains it. So we “modified” the groups to have the correct VMs in the VM Groups.
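The membership gap is easy to spot with a set difference. A minimal sketch, assuming the cluster inventory and the VM Group memberships have already been exported into plain Python collections (the VM names below are invented; the group names are the ones used in this post):

```python
def vms_missing_from_groups(all_vms, vm_groups):
    """Return the VMs that are not a member of any VM Group, sorted by name."""
    grouped = set().union(*vm_groups.values())
    return sorted(set(all_vms) - grouped)


# Hypothetical inventory and group membership.
inventory = ["app01", "app02", "db01", "db02", "web01"]
groups = {
    "VMs at Site1": {"app01", "db01"},
    "VMs at Site2": {"app02", "db02"},
}

print(vms_missing_from_groups(inventory, groups))  # ['web01']
```

Any VM this returns is not covered by a site “Should” rule, so DRS is free to place it on either site.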
vMotion traffic, which was incorrectly configured to go over the management network vmknic, flooded the network, thereby reducing the bandwidth available for “Management Traffic”. And with no QoS or CoS setting in place, nothing stopped vMotion or Management traffic from flooding the link, which contributed to the slow management network connection.
Incorrectly configured Cluster Settings also had a role to play in this. Once all the symptoms were dealt with, there were no more issues in the environment.
Disclaimer: Though this is an actual scenario, a few details have been modified and the client’s relevant details (if any) were replaced with fictitious ones.