Long-distance vMotion – now supported over network latencies of up to 150 ms – supports geo distances
In vSphere 5.5 the limit was 10 ms latency
Standard vMotion guarantee – 1 sec execution switchover time
Batched RPCs, TCP congestion window handoff
vMotion – had to find a way to bypass TCP delays
- Uses the control channel to send control content; the data channel is used to send the VM data
Very large bandwidth-delay product – standard TCP performs poorly (rough numbers sketched after this list)
- Revised the TCP algorithm – the default congestion control algorithm is not suited for a large bandwidth-delay product
- Changed the congestion control algorithm to HSTCP
Packet loss was another concern
- Avoided packet drops within the ESX host by adding mechanisms such as flow control
- Helped reduce the target packet loss rate to 0.01%
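As a back-of-the-envelope sketch (numbers assumed for illustration, not from the session): the bandwidth-delay product at 150 ms on a 10 Gb/s link shows why an untuned TCP window cannot keep the pipe full.

```python
# Bandwidth-delay product (BDP) at long-distance vMotion latencies.
# Link speed and window size below are assumptions for illustration only.

link_gbps = 10          # assumed vMotion link speed in Gb/s
rtt_ms = 150            # maximum supported round-trip latency in vSphere 6

bdp_bytes = (link_gbps * 1e9 / 8) * (rtt_ms / 1000)    # bytes "in flight" needed to keep the pipe full
print(f"BDP: {bdp_bytes / 1e6:.0f} MB in flight")       # ~188 MB

# A classic 64 KB TCP window can only keep this much of the pipe busy:
window_bytes = 64 * 1024
achievable_gbps = window_bytes / (rtt_ms / 1000) * 8 / 1e9
print(f"Throughput with a 64 KB window: {achievable_gbps * 1000:.1f} Mb/s")   # ~3.5 Mb/s
```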
What if the ESX hosts are in different subnets and you want to do long-distance vMotion?
vMotion across an L3 network
- Uses ESX TCP/IP network stack virtualization (an ESX host can have multiple default gateways, independent of the management network's default gateway) – a small conceptual sketch follows below
- vMotion gets its own network stack instance (separate from the management network)
- Uses the vMotion stack's own default gateway to go across L3 (and across subnets)
VM network still requires L2 network stretching
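A minimal conceptual sketch (not an ESX API) of why per-stack default gateways matter: each TCP/IP stack instance carries its own routing table, so vMotion traffic can take a different gateway than management traffic. Stack names and addresses are made up for illustration.

```python
# Conceptual model of ESX TCP/IP stack virtualization: each netstack instance
# has its own routing table and therefore its own default gateway.
# Interface names and addresses are illustrative only.

netstacks = {
    "defaultTcpipStack": {"interface": "vmk0", "default_gateway": "10.10.1.1"},   # management
    "vmotion":           {"interface": "vmk1", "default_gateway": "10.20.1.1"},   # vMotion-only stack
}

def next_hop(stack_name: str) -> str:
    """Return the gateway a given traffic type would use for an off-subnet destination."""
    stack = netstacks[stack_name]
    return f"{stack['interface']} via {stack['default_gateway']}"

print("management ->", next_hop("defaultTcpipStack"))
print("vMotion    ->", next_hop("vmotion"))   # crosses L3 without touching the management gateway
```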
DRS cluster requirements
- All ESX hosts on L2 or all on L3 vMotion network
- Mixed mode of L2 and L3 not supported
L2 adjacency limitations
- Fault tolerance not supported
- DPM using Wake-on-LAN requires subnet-directed broadcast (when the hosts are in different subnets, routers filter out ordinary broadcast traffic).
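To make the subnet-directed broadcast point concrete, here is a small sketch that builds a standard Wake-on-LAN magic packet and sends it to a subnet's directed broadcast address; the MAC address, broadcast address, and port are placeholders.

```python
import socket

def wake_on_lan(mac: str, broadcast_addr: str, port: int = 9) -> None:
    """Send a Wake-on-LAN magic packet (6x 0xFF followed by the MAC repeated 16 times)."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    magic = b"\xff" * 6 + mac_bytes * 16
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        # A subnet-directed broadcast (e.g. 192.168.10.255 for a /24) must be routed,
        # and many routers drop directed broadcasts unless explicitly configured.
        s.sendto(magic, (broadcast_addr, port))

# Placeholder MAC and directed-broadcast address for the remote ESX host's subnet.
wake_on_lan("00:50:56:aa:bb:cc", "192.168.10.255")
```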
What if the ESX hosts have different virtual switches?
- vSphere Standard Switch (VSS)
- vSphere Distributed Switch (VDS)
Ability to go from one VDS to another VDS – transfers all port properties (e.g. Network I/O Control settings, bandwidth restrictions, etc.)
Downgrade from a VDS to a standard switch is not supported, since the standard switch does not have VDS-type capabilities
What if the ESX hosts are managed by different vCenters?
In vSphere 6 there is the ability to vMotion across vCenters
- Simultaneously change compute, storage, networks, and vCenter servers
- Leverage vMotion without shared storage
Works with local, metro, and long distance vMotion
Preserve the instance UUID and BIOS UUID (VM UUID)
Preserve VM historical data
- Events, alarms, and task history (pulled from the original vCenter – not moved to the new vCenter)
- Preserve all HA and DRS properties, affinity/anti-affinity rules
SSO domain support
- vSphere Web Client requires the same SSO domain
- API support across SSO domains
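As a sketch of the API support mentioned above: in vSphere 6.0 the relocate spec gained a service locator that points at the destination vCenter. The pyVmomi fragment below shows the general shape, assuming connections and target objects (destination host, pool, datastore, and the destination vCenter's URL, instance UUID, and SSL thumbprint) have already been looked up; treat it as a hedged sketch, not a complete workflow.

```python
from pyVmomi import vim

def cross_vcenter_relocate_spec(dest_host, dest_pool, dest_datastore,
                                dest_vc_url, dest_vc_instance_uuid,
                                dest_vc_thumbprint, username, password):
    """Build a RelocateSpec targeting a host under a *different* vCenter (vSphere 6.0+)."""
    service = vim.ServiceLocator(
        url=dest_vc_url,                     # e.g. "https://dest-vc.example.com"
        instanceUuid=dest_vc_instance_uuid,  # destination vCenter instance UUID
        sslThumbprint=dest_vc_thumbprint,
        credential=vim.ServiceLocatorNamePassword(username=username, password=password),
    )
    return vim.vm.RelocateSpec(
        host=dest_host,
        pool=dest_pool,
        datastore=dest_datastore,
        service=service,                     # this is what makes it a cross-vCenter vMotion
    )

# task = vm.RelocateVM_Task(spec=spec)   # issue on the source vCenter, then wait for the task.
# Note: if the destination portgroup differs, the spec typically also needs deviceChange
# entries to re-point the VM's NIC backings.
```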
What if you can take advantage of storage replication to the other site?
Replication assisted vMotion
vSphere 6 supports Active/Active Async storage replication
- Disk copy takes majority of migration time
- Use replication to avoid disk copy
- Leverage virtual volume (VVOL) technology
VVOLs are the primary unit of data management going forward – the storage array knows the VVOL-to-VM mapping
Secondary site storage array promotes LUN containing replicated data
Active/active async replication flow – switch the replication mode to sync for the duration of the switchover:
- Migration start
- Prepare the destination ESX host for VVOL binding
- Switch from async to sync replication
- Migration end
- Complete VVOL binding on the ESX hosts
- Switch back to async replication
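A pseudocode-style sketch of the ordering described above; every function here is a placeholder standing in for an array/VASA/ESX operation, not a real vSphere API call.

```python
# Placeholder steps for the replication-assisted vMotion flow described above.
def start_migration(vm, dst_host): ...
def prepare_vvol_binding(dst_host, vm): ...
def set_replication_mode(array, mode): ...
def copy_memory_and_switch_over(vm, dst_host): ...
def complete_vvol_binding(dst_host, vm): ...

def replication_assisted_vmotion(vm, array, dst_host):
    """Ordering of the steps from the notes above."""
    start_migration(vm, dst_host)               # migration start
    prepare_vvol_binding(dst_host, vm)          # prepare the destination ESX host for VVOL binding
    set_replication_mode(array, "sync")         # switch from async to sync for the switchover window
    copy_memory_and_switch_over(vm, dst_host)   # migration end; disk copy avoided via replication
    complete_vvol_binding(dst_host, vm)         # complete VVOL binding on the ESX hosts
    set_replication_mode(array, "async")        # switch back to async replication
```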
vSphere 6.0 vMotion features interop
What if you have 40GbE NICs for the vMotion network? (vMotion performance and scalability improvements)
vMotion scalability
- Rearchitect vMotion to saturate a 40GbE NIC
- Zero copy transmission
- New threading model for better CPU utilization on the receive side
- Reduce locking
For example, in maintenance mode we may have to move 400 GB of memory from one host to another; the entire maintenance-mode evacuation takes only about 5 minutes in a 40GbE environment.
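A quick sanity check on that figure (assuming the link can actually be saturated, with an assumed overhead factor, and ignoring re-transmission of dirtied pages):

```python
# Rough wire-time estimate for evacuating 400 GB of VM memory over a saturated 40GbE link.
memory_gb = 400
link_gbps = 40
efficiency = 0.9   # assumed protocol/header overhead factor

seconds = memory_gb * 8 / (link_gbps * efficiency)
print(f"~{seconds:.0f} s of raw transfer time")   # ~89 s, leaving headroom inside a ~5 minute window
```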
Reduce vMotion execution switchover time (improved)
- Constant time VM memory metadata updates
- Not a function of VM memory size
Reduce stack overhead
- Improve VM power-on time (the power-on path has been optimized)
Performance and Debugging
How to Gauge vMotion performance
- Migration time (memory, disk, total)
- Switchover time
- Impact on guest applications
- Application latency and throughput during vMotion
- Time to return to the normal level of performance after migration
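One simple way to capture the application-impact metrics above is to sample request latency against the guest while the migration runs; the sketch below just times TCP connects to a placeholder guest address once per second, so the switchover blip and the recovery time show up in the samples.

```python
import socket
import time

def sample_latency(host: str, port: int, duration_s: int = 120, interval_s: float = 1.0):
    """Time TCP connects to a service in the guest, once per interval, across the migration window."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        start = time.time()
        try:
            with socket.create_connection((host, port), timeout=2):
                samples.append((start, time.time() - start))   # (timestamp, connect latency)
        except OSError:
            samples.append((start, None))                      # blip, e.g. during switchover
        time.sleep(interval_s)
    return samples

# Placeholder guest address; run this while the vMotion is in progress, then look for the
# latency spike around switchover and how long it takes to return to the baseline.
# samples = sample_latency("vm-under-test.example.com", 80)
```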
Monster VM vMotion performance
- In vSphere 5.5 each vMotion was by default assigned 2 helper threads running on 2 cores – could only reach about 20 Gb/s
- Increasing the number of helper threads (a tuned ESX) increased throughput slightly
- In a 60 Gb scenario, adding more helper threads does not help further
- In vSphere 6.0 the locking has been removed; vMotion dynamically creates the appropriate number of TCP channels, and helper threads are created automatically. Performance therefore improves significantly without any tuning
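To put rough numbers on why a fixed pair of helper threads tops out: assuming roughly 10 Gb/s per stream (an assumption, not a figure from the talk), several streams are needed to fill a 40 or 60 Gb/s pipe, which is what the dynamic channel creation in 6.0 provides without tuning.

```python
import math

# Assumed per-stream throughput; the real per-thread rate depends on CPU and NIC.
per_stream_gbps = 10

for link_gbps in (20, 40, 60):
    streams = math.ceil(link_gbps / per_stream_gbps)
    print(f"{link_gbps} Gb/s link -> ~{streams} TCP channels / helper threads to saturate")
```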
Debugging tips
Each vMotion has a unique id associated with it called Migration ID
Grep for that migration ID – it is the same unique value on both the source and destination hosts
From web client – select VM – and go to tasks
See the high-level details in the task info – VMware is adding the migration ID there
VPXD – find operation id of vMotion (in VPXD logs)
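Since the migration ID appears in the logs on both hosts, a small helper like the one below pulls the matching lines together; the log paths and the sample ID are placeholders to adjust for your environment.

```python
import gzip
from pathlib import Path

def grep_migration_id(migration_id: str, log_paths):
    """Collect every log line mentioning a vMotion migration ID across the given files."""
    hits = []
    for path in map(Path, log_paths):
        opener = gzip.open if path.suffix == ".gz" else open
        try:
            with opener(path, "rt", errors="replace") as f:
                hits.extend(f"{path}: {line.rstrip()}" for line in f if migration_id in line)
        except OSError:
            pass   # file not present on this host; skip
    return hits

# Typical locations: /var/log/vmkernel.log and /var/log/hostd.log on each ESX host,
# plus vpxd.log on the vCenter server (paths may differ in your environment).
for line in grep_migration_id("1234567890123456", ["/var/log/vmkernel.log", "/var/log/hostd.log"]):
    print(line)
```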
What’s next for vMotion
Cross cloud vMotion
- vMotion between vCloud Air (vCA) and the on-premises datacenter
- No vendor lock-in; vMotion to vCA and from vCA back to on-premises
- Support for vSphere 5.5 as the on-premises version (will be backwards compatible)
Non-Volatile Memory (NVM) – disks, SSDs
- NVM resides in a Dual Inline Memory Module (DIMM)
- Exposed as memory and virtual disks to VMs (persistent memory, and disks through a SCSI controller)
- Enable vMotion for VMs and NVM
- Explore NVM to improve vMotion performance and scalability
Active/Passive Storage Replication
- Leverage broad partner ecosystem to optimize disk copy
- VVOL required to reverse replication direction after vMotion
- vMotion support for RDMA (the copy does not have to use the CPU; RDMA can be used instead for a performance improvement)
Conclusions
vSphere 6 vMotion is a big step towards vMotion anywhere
- Cross geo boundaries
- Cross management boundaries
- Cross cloud vMotion