One of the things on my must-do list is attending the lab sessions at VMworld 2014. I liked this one lab in particular because it was more detailed (about 300 slides) and offered real-world scenarios and practical troubleshooting solutions.
The lab includes the following modules:
- Module 1 – Basic vSphere Performance Concepts and Troubleshooting (60 minutes)
- Module 2 – Performance Features in vSphere (vSphere Flash Read Cache) (45 minutes)
- Module 3 – Understanding the New Latency Sensitivity Feature in vSphere (30 minutes)
- Module 4 – vBenchmark: Free Tool for Measurement and Peer Benchmarking of Datacenter’s Operational Statistics (15 minutes)
- Module 5 – StatsFeeder: Scalable Statistics Collection for vSphere (20 minutes)
- Module 6 – Using esxtop (60 minutes plus 20 minute bonus section)
I went through Module 1 yesterday, plan to finish up today, and want to blog in depth about this particular session, which, by the way, was also trending in the top four labs at VMworld. To really enjoy the depth of this troubleshooting lab, it is worth not rushing through it; read the material carefully.
In Module 1, there is content on CPU, Memory, and Storage optimization.
- High Ready time – you may have issues if Ready time is above 10%. It is measured in milliseconds, but I'll list the formula below to convert it to a percentage
- High CoStop (CSTP) time – you may have allocated more vCPUs than necessary.
- CPU limits – a limit setting may impact performance because the limit prevents the VM from using more CPU even when it needs it
- Host CPU saturation – consistent CPU usage over 85% leads to vSphere host saturation
- Guest CPU saturation – the VM is using 90% or more of its assigned CPU resources, so no additional CPU is available to the application
- Incorrect SMP usage – large SMP VMs can cause extra overhead. Not all apps support multiprocessing, so be careful when selecting SMP for an application that is single-threaded
- Low guest usage – the application is not configured correctly or is starved by memory or I/O
A VM has four CPU states:
- Wait – the VM's guest OS is idle or waiting on vSphere tasks. Also called VMWAIT.
- Ready – the VM is ready to run but cannot, because the vSphere scheduler is unable to find physical host CPU resources for it. Ready time caused by a CPU limit is reported as MLMTD (max limited).
- CoStop (CSTP) – time the vCPUs of a multi-way VM spent waiting to be co-started; an indicator of co-scheduling overhead.
- Run – time the VM was running on a physical processor.
To get percentage values of Ready time:
CPU Ready (%) = (CPU Ready value in ms / total time of the sample period in ms) × 100, where the default real-time sample period in vCenter is 20,000 ms
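To make the conversion concrete, here is a minimal Python sketch of that formula (the function name and example values are mine; the 20,000 ms default is the vCenter real-time chart interval):

```python
def cpu_ready_percent(ready_ms, sample_period_ms=20000):
    """Convert a vCenter CPU Ready summation value (in ms) to a percentage.

    The default sample period of 20,000 ms matches vCenter's real-time
    charts; pass the actual interval for historical chart levels.
    """
    return ready_ms / sample_period_ms * 100

# 2,000 ms of Ready time in a 20-second sample is 10% --
# right at the rule-of-thumb threshold mentioned above.
print(cpu_ready_percent(2000))  # 10.0
```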
vNUMA – ideally, configuring a VM with multiple vCPUs versus multiple cores per socket should not change performance (note that when you add CPU capacity you are offered two choices – virtual sockets or cores per socket). But for larger VMs with more than 8 vCPUs this is not always true. Unused vCPUs still consume timer interrupts in some guest operating systems, and maintaining a consistent memory view across multiple vCPUs can consume additional resources. A hardware-assisted MMU can reduce this CPU taxation.
Most guest OSes execute an idle loop during periods of inactivity. Older operating systems, e.g. Windows 2000 or Solaris 8 and 9, consume more resources in this loop.
The memory portion of the module covers:
- Active vs. consumed memory usage
- Types of swapping, when they kick in, and their impact
- Memory metrics to detect potential memory issues
Transparent page sharing (TPS) – redundant copies of memory pages are eliminated. TPS is always running by default. On recent hardware, vSphere backs guest physical pages with large host physical pages (2 MB contiguous memory regions instead of 4 KB regular pages). See kb.vmware.com/kb/2017642 for more clarity.
There are four memory states, and each state determines which overcommit (reclamation) techniques are active:
- High (no memory pressure) – transparent page sharing (TPS)
- Soft (less than minFree memory available) – TPS, ballooning
- Hard (less than 2/3 of minFree available) – TPS, ballooning, compression, host swapping. It is in the hard state that large memory pages are broken down into small pages so TPS can consolidate identical pages.
- Low (less than 1/3 of minFree available) – swapping; VMs are halted until memory pressure is relieved.
MinFree Memory is calculated by default on a sliding scale from 6% to 1% of physical host memory
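To make the sliding scale and the state thresholds concrete, here is a Python sketch. The tier breakpoints (6% of the first 4 GB, 4% of the next 8 GB, 2% of the next 16 GB, 1% of the rest) follow the commonly documented ESXi 5.x values rather than the lab slides, and the state boundaries use the fractions from the list above; function names are mine:

```python
def min_free_mb(host_memory_gb):
    """Approximate ESXi minFree (in MB) using the commonly documented
    sliding scale: 6% of the first 4 GB, 4% of the next 8 GB,
    2% of the next 16 GB, and 1% of everything above 28 GB."""
    tiers = [(4, 0.06), (8, 0.04), (16, 0.02)]  # (span in GB, rate)
    remaining, total_gb = host_memory_gb, 0.0
    for span, rate in tiers:
        portion = min(remaining, span)
        total_gb += portion * rate
        remaining -= portion
    total_gb += remaining * 0.01
    return total_gb * 1024

def memory_state(free_mb, min_free):
    """Classify host memory state using the thresholds listed above:
    soft below minFree, hard below 2/3 of minFree, low below 1/3."""
    if free_mb >= min_free:
        return "high"
    if free_mb >= min_free * 2 / 3:
        return "soft"
    if free_mb >= min_free / 3:
        return "hard"
    return "low"

# A 128 GB host: minFree ~= 0.24 + 0.32 + 0.32 + 1.00 GB ~= 1.88 GB
mf = min_free_mb(128)
print(round(mf), memory_state(1500, mf))  # 1925 soft
```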
vSphere 5.1 allows very large VMs with up to 64 vCPUs.
Avoid running a large VM on too small a platform.
Rules of thumb:
- 1 to 4 vCPUs on dual-socket hosts; 8+ vCPUs on quad-socket hosts
- Very busy workloads do not allow high consolidation ratios, whether memory- or CPU-bound
- Tier 1 apps demand more resources and stronger performance
Approximately 90% of performance issues in vSphere are related to storage. However, not all of those issues originate in the storage array, so we need to troubleshoot to detect where the problem lies, identify it correctly, and take remedial steps based on the issue.
Some things to remember:
- Payload (throughput) is fundamentally different from IOPS (commands per second)
- An IOPS-bound small-block workload will show lower throughput than a large-block sequential one
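The two are linked by block size (throughput = IOPS × block size), which is why a small-block workload with impressive IOPS numbers can still show modest MB/s. A quick illustration in Python (names and example sizes are mine):

```python
def throughput_mbps(iops, block_size_kb):
    """Throughput (MB/s) = IOPS * block size; the same disk budget
    looks very different depending on the I/O size."""
    return iops * block_size_kb / 1024

# 10,000 IOPS of 4 KB random I/O is only ~39 MB/s,
# while 400 IOPS of 256 KB sequential I/O is 100 MB/s.
print(throughput_mbps(10000, 4))   # 39.0625
print(throughput_mbps(400, 256))   # 100.0
```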
A good rule of thumb on the total number of IOPS any given disk will provide:
- 7.2k rpm – 80 IOPS
- 10k rpm – 120 IOPS
- 15k rpm – 150 IOPS
- EFD/SSD – 5k-10k IOPS (max ≠ real world)
So, if you want to know how many IOPS you can achieve with a given number of disks:
- Total Raw IOPS = Disk IOPS * Number of disks
- Functional IOPS = ((Raw IOPS × Write%) / RAID penalty) + (Raw IOPS × Read%)
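Here is a quick calculator for those two formulas, combined with the per-disk rule-of-thumb numbers above. The RAID write penalties (2 for RAID 1/10, 4 for RAID 5, 6 for RAID 6) are the usual textbook values; everything else is my own naming:

```python
DISK_IOPS = {"7.2k": 80, "10k": 120, "15k": 150}  # rule-of-thumb values above
RAID_PENALTY = {"raid0": 1, "raid1": 2, "raid10": 2, "raid5": 4, "raid6": 6}

def functional_iops(disk_type, num_disks, write_pct, raid="raid5"):
    """Total Raw IOPS = Disk IOPS * Number of disks, then the write
    portion is divided by the RAID write penalty."""
    raw = DISK_IOPS[disk_type] * num_disks
    write = write_pct / 100.0
    return raw * write / RAID_PENALTY[raid] + raw * (1 - write)

# 16 x 15k spindles in RAID 5 with a 30% write workload:
# raw = 2400; functional = 2400*0.3/4 + 2400*0.7 = 180 + 1680
print(functional_iops("15k", 16, 30))  # 1860.0
```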
Use Iometer or the VMware I/O Analyzer fling to measure:
- Average I/O response time (long latencies)
- Total I/O per second
- Total MBPS (low throughput)
Disk I/O latency is derived from DAVG, QAVG, KAVG, and GAVG.
The value of KAVG is approximately equal to QAVG (kernel time is mostly queuing time), and in a well-configured system QAVG – and therefore KAVG – should be close to zero.
- GAVG – guest average latency
- DAVG – time spent in the device, from the driver/HBA down to the storage array
- KAVG – time spent in the ESXi kernel (a derived value)
From ESXi we see three main latencies reported in esxtop and vCenter. The topmost is GAVG, or guest average latency, which is the total amount of latency that ESXi can detect. KAVG is not measured directly but derived:
KAVG = total latency (GAVG) – DAVG
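As a small illustration, this Python sketch derives KAVG from the values esxtop reports and flags the warning signs discussed here (the 2 ms kernel-latency threshold comes from the list below; the ~25 ms device-latency figure is a commonly cited rule of thumb, not from the lab):

```python
def latency_breakdown(gavg_ms, davg_ms, qavg_ms=0.0):
    """Derive KAVG as GAVG - DAVG, per the relationship above, and flag
    the common warning signs."""
    kavg = gavg_ms - davg_ms
    notes = []
    if kavg > 2:
        notes.append(f"KAVG > 2 ms: check the storage queues (QAVG {qavg_ms} ms)")
    if davg_ms > 25:
        notes.append("high DAVG: investigate the array/fabric path")
    return kavg, notes

# Example esxtop reading: GAVG 30 ms, DAVG 24 ms -> KAVG 6 ms
print(latency_breakdown(30.0, 24.0, qavg_ms=5.5))
```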
Guidance: this shows the importance of sizing your storage correctly; when two storage-intensive sequential workloads share the same spindles, performance can be greatly impacted. Try to keep workloads separated – sequential workloads backed by different spindles/LUNs than random workloads.
Guidance: from a vSphere perspective, using one large datastore vs. many small datastores usually does not cause a performance impact. However, one large LUN vs. several LUNs is storage-array dependent, and most arrays perform better in a multi-LUN configuration than with a single large LUN.
Things to keep in mind with storage:
- Kernel latency greater than 2ms may indicate a storage performance issue.
- Use the paravirtualized SCSI (PVSCSI) device driver for the best storage performance and lower CPU utilization
- VMFS performs equally well compared to RDMs; in general, there are no performance reasons to use RDMs instead of VMFS
- vSphere has several storage queues, and these queues may become bottlenecks for storage-intensive applications. Check the VM, adapter, and device/LUN queues for bottlenecks.
For more details on these topics, see the Performance Best Practices and Troubleshooting Guides on the VMware website.