Tuesday 14 May 2013

Problems Clustering Virtual Machines on Windows Server 2012 Hyper-V

I was re-building our lab environment at work the other week in preparation for our big Summit13 event (that, and the lab had been trashed over the last year...).

As part of the re-build I had decided to implement a couple of virtual machine clusters, one for a scale-out file server and one as a SQL cluster.

I'd deployed the virtual machines for the cluster nodes using Service Templates in SCVMM, and as part of that template I'd chosen to use an availability set to ensure the VMs were separated across hosts (a cluster doesn't provide much high availability if all its nodes reside on the same host that's failed!).

When I started to create the cluster I ran straight into a problem, with Failover Cluster Manager reporting timeouts while creating the cluster.

Creating a single-node cluster worked fine, but it failed again when I tried to add another node.
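(For anyone who prefers PowerShell over Failover Cluster Manager, the same steps can be driven with the Failover Clustering cmdlets; the cluster and node names below are made up for illustration:

New-Cluster -Name SOFSCluster -Node SOFSNode1
Add-ClusterNode -Cluster SOFSCluster -Name SOFSNode2

In my case it was the Add-ClusterNode equivalent that fell over with the timeouts.)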

I happened to put one of the Hyper-V hosts into maintenance mode for something, which migrated all the VMs onto the same host, at which point creating the cluster worked flawlessly, yay!

However, when the Hyper-V host came back out of maintenance mode and the availability sets kicked in during optimisation, forcing a VM node back onto a separate physical host, the clusters broke again. Not yay :(

So after some Googling (sorry, Binging) about and a shout on Twitter (thanks @hvredevoort and @WorkingHardInIT), an issue with Broadcom NICs was brought to my attention and I came across this TechNet Forum post describing the same issue.

Sophia_whx suggested running Disable-NetAdapterChecksumOffload against the NICs to help with the issue.

So, first things first: I used Get-NetAdapterChecksumOffload to see just what the configuration was, and sure enough Checksum Offload was enabled for just about all services across the majority of the NICs.
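If you want that as a quick at-a-glance table rather than the default output, something like this works (the property names come from the cmdlet's output objects):

Get-NetAdapterChecksumOffload | Format-Table Name, IpIPv4Enabled, TcpIPv4Enabled, TcpIPv6Enabled, UdpIPv4Enabled, UdpIPv6Enabled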

I then ran Disable-NetAdapterChecksumOffload * -TcpIPv4 to turn off the TCP checksum offload for IPv4 on every adapter.

A reboot later, I performed the same on the second host and whoa....
For some reason, the virtual switch really didn't like having that done to it.
I wish I had some screenshots, but I went into "get it fixed fast" mode.
Basically, PowerShell was showing the switch as up, the NIC Teaming GUI was showing it down with all the bound adapters failed, and SCVMM had lost all configuration for the switch altogether.
Deleting the switch from SCVMM didn't delete it from the host; instead it brought the switch back to life on the host while it remained missing from SCVMM. SCVMM then wouldn't redetect it or let me build it again, as apparently it was still there???
I had to manually remove the team from a remote NIC Teaming GUI (I could have PowerShell'd it, I know!) and then recreate the switch via SCVMM.
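(The PowerShell route I skipped would look roughly like this, using the NetLbfo cmdlets remotely; the host and team names here are hypothetical:

Invoke-Command -ComputerName HyperVHost01 -ScriptBlock { Remove-NetLbfoTeam -Name "ConvergedTeam" -Confirm:$false }

Get-NetLbfoTeam on the host first will tell you the actual team name.)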
Anyway... at first this looked to have fixed the clustering-within-virtual-machines issue, but it only delayed the symptoms, i.e. it took longer to evict nodes, and it randomly brought them back online.
So next I tried disabling Checksum Offload for all services, being careful not to touch the Virtual Switch this time.
Rather than going adapter by adapter I used the following command:
Get-NetAdapter | Where-Object {$_.Name -notlike "Converged*"} | Disable-NetAdapterChecksumOffload
This resulted in Checksum Offload being disabled for the various services on every adapter except my virtual switch.

After doing this on the other host and giving them a reboot, my clustered virtual machines appear to be nice and stable when split across physical hosts. Yay! Problem fixed.
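(A quick way to keep an eye on that stability from the guest side is the Failover Clustering module again; the cluster name below is made up:

Get-ClusterNode -Cluster SOFSCluster | Format-Table Name, State

All nodes showing as Up, with no more random evictions, is what you're after.)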

Just as another side note about Broadcom adapters, there have also been reports of performance issues when the Virtual Machine Queue (VMQ) setting is enabled, despite it being a recommended setting.

A quick check of my hosts showed it was enabled.
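The check itself is just the Get- counterpart of the disable command further down, e.g.:

Get-NetAdapterVmq -InterfaceDescription Broad* | Format-Table Name, InterfaceDescription, Enabled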

Another quick PowerShell line later and it wasn't:

Get-NetAdapterVmq -InterfaceDescription Broad* | Disable-NetAdapterVmq