
STO2496 – vSphere Storage Best Practices: Next-Gen Storage Technologies

August 30, 2014 By asceticadmin

This was a panel-style session that wasn’t vendor specific but broadly gave pointers on newer storage technologies – vSAN, SDRS, VVOLs, flash arrays, datastore types, jumbo frame usage, and so on. It truly lived up to its name, not just in content but also in duration: the session ran over its scheduled 1 hour and actually finished in 1.5 hours, but no one was complaining since there was a lot of interesting material.

Presenters – Rawlinson Rivera (VMware), Chad Sakac (EMC),  Vaughn Stewart (Pure Storage)

The session kicked off by talking about enabling simplicity in the storage environment. Some key points discussed were –

1) Use large datastores

  • NFS 16 TB and VMFS 64 TB
  • Backup and restore times and objectives should be considered

2) Limit use of RDMs to when required for application support

3) Use datastore clusters and SDRS

  • Match service levels across all datastores in each datastore cluster
  • Disable the SDRS I/O metric on all-flash arrays and on arrays with storage tiering

4) Use automated storage array services

  • Auto tiering for performance
  • Auto grow/extend for datastores

5) Avoid Jumbo frames for iSCSI and NFS

  • Jumbo frames provide performance gains at the cost of added complexity, and improvements in storage technology mean they are no longer required

They spoke about the forms of Hybrid Storage and categorized them based on their key functionality –

  • Hybrid arrays – Nimble, Tintri, All modern arrays
  • Host Caches – PernixData, vFRC, SanDisk
  • Converged Infrastructure – Nutanix, vSAN, Simplivity

Benchmark Principles

Good benchmarking is NOT easy:

  • You need to benchmark over time – most arrays have some degree of behaviour variability over time
  • You need to look at lots of hosts and VMs – not a ‘single guest’ or ‘single datastore’
  • You need to benchmark mixed loads – in practice, all forms of I/O will be hitting the persistence layer at the same time
  • If you use good tools like SLDB or IOmeter, recognize that they still generate artificial workloads, and make sure to configure them to drive a variety of different workload profiles (a toy sketch of the idea follows this list)
  • With modern systems (particularly AFAs or all-flash hyper-converged), it’s really, REALLY hard to drive sufficient load to saturate the system. Use a lot of workload generators (generating more than 20K IOPS from a single host isn’t easy)
  • More often than not, absolute performance is not the only design consideration
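
To illustrate what a ‘mixed load’ means in practice, here is a toy Python sketch that interleaves random 4 KiB reads with sequential 64 KiB writes against a scratch file. It is purely illustrative – the file name, sizes, and 70/30 mix are arbitrary assumptions, and a real benchmark should use proper tools (IOmeter, I/O Analyzer), bypass the OS page cache, use many workers, and run over long periods, as the panel noted:

import os
import random
import time

PATH = "scratch.bin"                     # hypothetical scratch file on the datastore under test
FILE_SIZE = 256 * 1024 * 1024            # 256 MiB test file (arbitrary)
RUNTIME_S = 10                           # run for ~10 seconds (arbitrary)

with open(PATH, "wb") as f:
    f.truncate(FILE_SIZE)                # sparse pre-allocation of the test file

ops, write_offset = 0, 0
deadline = time.time() + RUNTIME_S
with open(PATH, "r+b") as f:
    while time.time() < deadline:
        if random.random() < 0.7:        # 70% random 4 KiB reads
            f.seek(random.randrange(0, FILE_SIZE - 4096, 4096))
            f.read(4096)
        else:                            # 30% sequential 64 KiB writes
            f.seek(write_offset)
            f.write(os.urandom(64 * 1024))
            write_offset = (write_offset + 64 * 1024) % FILE_SIZE
        ops += 1

print(f"~{ops / RUNTIME_S:.0f} ops/s (page-cache skewed; illustrative only)")
os.remove(PATH)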

[Slide: virtual disk format can be an I/O bottleneck]

 

Storage Networking Guidance

VMFS and NFS provide similar performance

  • FC, FCoE and NFS tend to provide slightly better performance than iSCSI

Always separate guest VM traffic from storage and VMkernel network

  • Converged infrastructures require similar separation as data is written to 1+ remote nodes

Recommendation: avoid jumbo frames, as the risk of human error outweighs any gain

  • The goal is to increase I/O while reducing host CPU overhead
  • Standard Ethernet uses a 1500 MTU
  • Jumbo frames are typically configured as 9000 MTU (9216)
  • FCoE auto-negotiates to ‘baby jumbo’ frames of 2112 MTU (2158)
  • Jumbo frames provide modest benefits in mixed-workload clouds
  • TOE adapters can produce issues uncommon in software stacks

[Slide: jumbo frame performance example]

 

Jumbo Frame summary – Is it worth it?

Large environments may derive the most benefit from jumbo frames, but they are also the hardest in which to maintain compliance

– All the MTU settings need to align – on every device in the path

Mismatched settings can severely hinder performance

– A simple human error will result in significant storage issues for a large environment

Isolate jumbo-frame iSCSI traffic (e.g. backup/replication) and apply CoS/QoS

Unless you have control over all host/network/storage settings, best practice is to use standard 1500 MTU
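
If you do run jumbo frames, one common validation is a don’t-fragment ping from the ESXi host at the jumbo payload size (9000 bytes minus 20 bytes of IP header and 8 bytes of ICMP header). A minimal sketch is below – the storage address 192.168.10.50 is a made-up example, and you can just as easily run the vmkping command by hand from the ESXi shell:

import subprocess

STORAGE_IP = "192.168.10.50"   # hypothetical iSCSI/NFS target on the storage network
PAYLOAD = 9000 - 28            # jumbo MTU minus 20-byte IP header and 8-byte ICMP header

# vmkping -d sets the don't-fragment bit, -s sets the payload size; if any device in the
# path is not configured for jumbo frames, the oversized packet is dropped and the ping fails.
result = subprocess.run(["vmkping", "-d", "-s", str(PAYLOAD), STORAGE_IP],
                        capture_output=True, text=True)
print("jumbo path OK" if result.returncode == 0 else "MTU mismatch somewhere in the path")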

The future – Path Maximum Transmission Unit Discovery (PMTUD). PMTUD operates at the IP layer (L3, routers), whereas jumbo frames are an L2 concern (switches).

It is part of the ICMP protocol (the same protocol behind ping, traceroute, etc.) and is available on all modern operating systems.

The speakers then got into data reduction technologies – these are the new norm (especially deduplication in arrays).

Deduplication is generally good at reducing VM binaries (OS and application files). Deduplication block sizes vary by vendor, and their effectiveness can be impacted by guest OS file system fragmentation:

  • 512B – Pure Storage
  • 4KB – NetApp FAS
  • 4KB – XtremIO
  • 16KB – HP 3Par
There is a major operational difference between inline deduplication (Pure Storage, XtremIO) and post-process deduplication (NetApp FAS, EMC VNX).
– The advice they provided: try it yourself or talk to another customer (use VMUGs) – don’t take vendor claims at face value.
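
In the spirit of ‘try it yourself’, a quick way to get a feel for block-size sensitivity is to chunk a sample file (for example a copy of a VMDK) at the sizes quoted above and count unique chunks. This is only a toy estimate – fixed-size chunking with SHA-1 hashes is a simplification of what real arrays do:

import hashlib
import sys

def dedupe_ratio(path: str, chunk_size: int) -> float:
    """Rough estimate: logical bytes read divided by (unique chunks x chunk size)."""
    unique, total = set(), 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            unique.add(hashlib.sha1(chunk).digest())
            total += len(chunk)
    return total / (len(unique) * chunk_size) if unique else 1.0

# Usage: python dedupe_sketch.py <sample-file>
for size in (512, 4 * 1024, 16 * 1024):          # 512 B, 4 KB, 16 KB – sizes quoted in the session
    print(f"{size:>6} B chunks: ~{dedupe_ratio(sys.argv[1], size):.2f}:1")
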
Compression is generally good at reducing the storage footprint of application data.
– Inline compression tends to provide moderate savings (2:1 is common), but there are CPU/latency trade-offs
– Post-process compression tends to provide additional savings (3:1 is common)

Data reduction in virtual disks

Thin, thick, and EZ-Thick VMDKs all reduce to the same size.
– Differences exist between array vendors but not between the various disk types
T10 UNMAP is still not in vSphere 5.5 in the way people ‘expect’. UNMAP is a SCSI command that allows space to be reclaimed from blocks that have been deleted by a virtual machine.
– It is one of the rare cases where Windows is still ahead – but only in Windows Server 2012 R2
– A manual ‘vmkfstools -k’ option is available for vSphere 5.1; see Cormac Hogan’s blog post for the details
– The manual ‘esxcli storage vmfs unmap’ command in vSphere 5.5 can handle volumes larger than 2 TB (a diagram depicting an UNMAP of 15 TB over 2 hours was displayed)
– Not all guest OSes zero deleted blocks properly, which means you may not reclaim space fully via UNMAP

There is an entire set of Horizon-specific and Citrix-specific best practices to follow (vSphere config and guest OS config).

Rawlinson, who had stepped away from the stage while Chad and Vaughn covered the storage topics above, then came back on to talk about VMware vSAN best practices.

Network Connectivity
– 10GbE is the preferred speed (1Gb connectivity used to be good enough, but vSAN works best with 10GbE, specifically because of the volume of data that travels over the network)
– Leverage vSphere Distributed Switches (vDS). NIOC is not commonly used in most organizations, but it acts like SIOC for the network: it applies QoS to network traffic and throttles it to offer the best performance. The vDS offers the best flexibility and control over network performance, with the feature set required in enterprise environments.

Storage Controller Queue Depth – The queue depth setting should no longer be changed manually unless you are observing performance issues. VMware has reviewed it and officially set it at 256. In some environments you may have a requirement to change it; just don’t change it for the sake of tuning something manually, without first monitoring the default values during normal operation.
– Queue depth support of 256 or higher is recommended
– A higher storage controller queue depth will improve:
  • Performance
  • Resynchronization
  • Rebuild operations
– Pass-through mode is preferred
Disks and Disk Groups
  • Don’t mix disk types in a cluster for predictable performance
  • More disk groups are better than one

The session finally concluded at 6:30 pm, and after a few handshakes everyone was on their way. It was completely worthwhile and goes to show why attending VMworld offers insights you cannot get from a four-day course. The structure and content of these sessions are not limited in any way.

 

Filed Under: General Tagged With: 2014, anil sedha, benchmark, best practices, chad sakac, clusters, compression, converged, data reduction, deduplication, Discovery, EMCElect, host cache, hybrid arrays, infrastructure, inline, iometer, jumbo frames, NFS, pass through, Path Maximum, performance, PMTUD, Queue Depth, Rawlinson Rivera, RDM, reclaim disk, SDRS, storage, Transmission Unit, UNMAP, Vaughn Stewart, vDS, vExpert, VMFS, vmworld, vsphere

HOL-SDC-1404 – Optimize vSphere Performance

August 25, 2014 By asceticadmin

One of the things on my must-do list is to attend the lab sessions at VMworld 2014. I liked this lab in particular because it was more detailed (about 300 slides) and offered real-world, practical scenarios and solutions for troubleshooting.

The lab includes the following modules:

  • Module 1 – Basic vSphere Performance Concepts and Troubleshooting (60 minutes)
  • Module 2 – Performance Features in vSphere (vSphere Flash Read Cache) (45 minutes)
  • Module 3 – Understanding the New Latency Sensitivity Feature in vSphere (30 minutes)
  • Module 4 – vBenchmark: Free Tool for Measurement and Peer Benchmarking of Datacenter’s Operational Statistics (15 minutes)
  • Module 5 – StatsFeeder: Scalable Statistics Collection for vSphere (20 minutes)
  • Module 6 – Using esxtop (60 minutes plus 20 minute bonus section)

I went through Module 1 yesterday and plan to finish up today, and I want to blog in depth on this particular lab, which, by the way, was also trending among the top four labs at VMworld. To enjoy the depth of this troubleshooting lab, it is worth not rushing through it and instead reading the material carefully.

In Module 1, there is content on CPU, Memory, and Storage optimization.

CPU

  • High Ready time – you may have issues if Ready time is above 10%. It is measured in milliseconds, but I’ll list the formula below to convert it to a percentage
  • High CoStop (CSTP) time – you may have allocated more vCPUs than necessary
  • CPU limits – a limit setting may impact performance because the limit does not allow increased CPU usage when it is needed
  • Host CPU saturation – consistent CPU usage over 85% leads to vSphere host saturation
  • Guest CPU saturation – the VM is using 90% or more of its assigned resources, so no more CPU is available to the application
  • Incorrect SMP usage – large SMP VMs can cause extra overhead. Not all apps support multiprocessing, so be careful when selecting SMP for an application that is single-threaded
  • Low guest usage – the application is not configured correctly or is starved of memory or I/O

 

A VM has four CPU states

  1. Wait – the VM guest OS is idle or waiting on vSphere tasks. Also called VMWAIT
  2. Ready – the VM is ready to run but unable to do so because the vSphere scheduler cannot find physical host CPU resources; ready time lost to a CPU limit is reported as MLMTD (max limited)
  3. CoStop (CSTP) – time the vCPUs of a multi-vCPU VM spent waiting to be co-started. An indicator of co-scheduling overhead
  4. Run – time the VM was running on a physical processor

 

To get a percentage value for ready time:

CPU Ready (%) = (CPU Ready value in ms ÷ length of the sample period in ms) × 100, where the default real-time sample period in vCenter is 20,000 ms
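
As a quick sanity check, here is that conversion as a small Python sketch; the 20,000 ms sample period is vCenter’s real-time default, the 10% threshold is the rule of thumb from the CPU list above, and the 3,500 ms sample value is invented:

def cpu_ready_percent(ready_ms: float, sample_period_ms: float = 20_000) -> float:
    """Convert a CPU Ready value in milliseconds to a percentage of the sample period."""
    return (ready_ms / sample_period_ms) * 100

ready_pct = cpu_ready_percent(3_500)                 # e.g. 3,500 ms of ready time in one sample
print(f"CPU Ready: {ready_pct:.1f}%")                # -> 17.5%
print("Investigate" if ready_pct > 10 else "OK")     # above the ~10% rule of thumb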

 

vNUMA – Ideally, configuring a VM with multiple virtual sockets versus multiple cores per socket should not change performance (note that when you add a CPU you are offered two choices – vCPUs or cores). But for larger VMs with more than 8 vCPUs this will not always be true. Unused vCPUs still consume timer interrupts in some guest OSes. Maintaining a consistent memory view across multiple vCPUs can consume additional resources. A hardware-assisted MMU can reduce this CPU taxation.

Most guest OSes execute an idle loop during periods of inactivity. Older OSes (e.g. Windows 2000, Solaris 8 and 9) consume more resources while doing so.

 

Memory Management

  • Active vs Consumed Memory Usage
  • Types of swapping, when they kick in and impact
  • Memory metrics to detect potential memory issues

Transparent page sharing (TPS) – redundant copies of memory pages are eliminated. TPS is always running by default. On recent hardware, vSphere backs guest physical pages with large host physical pages (a 2 MB contiguous memory region instead of the regular 4 KB pages). See kb.vmware.com/kb/2017642 for more detail.

There are four main memory states, each triggering different reclamation techniques:

  1. High (no memory pressure) – transparent page sharing (TPS)
  2. Soft (less than min-free memory available) – TPS, ballooning
  3. Hard (less than 2/3 of min-free memory available) – TPS, ballooning, compression, host swapping. It is in the hard state that large memory pages are broken down into small pages so that TPS can consolidate identical pages
  4. Low (less than 1/3 of min-free memory available) – swapping; VMs are halted until the memory pressure is relieved

MinFree Memory is calculated by default on a sliding scale from 6% to 1% of physical host memory
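
The sliding scale applies a decreasing percentage to successive slices of host memory. The breakpoints in the sketch below (6% of the first 4 GB, 4% of the next 8 GB, 2% of the next 16 GB, 1% of the rest) are the values commonly documented for vSphere 5.x – treat them as an assumption and check the documentation for your version:

# Assumed vSphere 5.x minFree sliding scale – verify against your version's documentation.
TIERS_GB = [(4, 0.06), (8, 0.04), (16, 0.02), (float("inf"), 0.01)]

def min_free_gb(host_memory_gb: float) -> float:
    remaining, min_free = host_memory_gb, 0.0
    for tier_size, pct in TIERS_GB:
        portion = min(remaining, tier_size)
        min_free += portion * pct            # each successive slice contributes a smaller percentage
        remaining -= portion
        if remaining <= 0:
            break
    return min_free

for mem in (32, 96, 256):                    # example host sizes in GB
    print(f"{mem} GB host -> minFree ≈ {min_free_gb(mem):.2f} GB")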

vSphere 5.1 allows very large VMs with up to 64 vCPUs.

Avoid putting a large VM on too small a platform.

Rule of thumb

  • 1 to 4 vCPUs on dual-socket hosts, 8+ vCPUs on quad-socket hosts
  • Very busy workloads do not allow high consolidation ratios – for memory or CPU
  • Tier 1 apps demand more performance and more resources

 

Storage

Approximately 90% of performance issues in vSphere are related to storage. However, not all of them are caused by the storage array, so we need to troubleshoot to detect where the problem lies, identify it correctly, and take remedial steps based on the issue.

Some things to remember:

  • Payload (throughput, MB/s) is fundamentally different from IOPS (commands/s)
  • A workload that maximizes IOPS will not also maximize throughput, and vice versa (see the sketch after this list)
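
One way to see the distinction: throughput is simply IOPS multiplied by I/O size, so the same command rate looks very different depending on block size. A trivial sketch (the IOPS figure and block sizes are illustrative):

def throughput_mb_s(iops: float, io_size_kb: float) -> float:
    """Throughput (MB/s) = IOPS x I/O size."""
    return iops * io_size_kb / 1024

print(throughput_mb_s(5_000, 4))     # 5,000 IOPS at 4 KB  -> ~19.5 MB/s
print(throughput_mb_s(5_000, 64))    # 5,000 IOPS at 64 KB -> ~312.5 MB/s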

A good rule of thumb on the total number of IOPS any given disk will provide:

  • 7.2k rpm – 80 IOPS
  • 10k rpm – 120 IOPS
  • 15k rpm – 150 IOPS
  • EFD/SSD – 5k-10k IOPS (max ≠ real world)

So, if you want to know how many IOPS you can achieve with a given number of disks (a worked sketch follows the list):

  • Total raw IOPS = per-disk IOPS × number of disks
  • Functional IOPS = (raw IOPS × write %) ÷ RAID write penalty + (raw IOPS × read %)
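
Here is that arithmetic as a small Python sketch, using the per-disk figures from the list above; the 24-disk count, 70/30 read/write mix, and RAID 5 write penalty of 4 are illustrative assumptions, not numbers from the lab:

def functional_iops(disk_iops: int, disk_count: int, read_pct: float,
                    raid_write_penalty: int) -> float:
    """Estimate usable IOPS from raw disk IOPS, the read/write mix, and the RAID write penalty."""
    raw = disk_iops * disk_count                                   # total raw IOPS
    write_pct = 1.0 - read_pct
    return (raw * write_pct) / raid_write_penalty + (raw * read_pct)

# 24 x 10k rpm disks (~120 IOPS each), 70% read / 30% write, RAID 5 (write penalty of 4)
print(functional_iops(disk_iops=120, disk_count=24, read_pct=0.70, raid_write_penalty=4))
# -> 2232.0 functional IOPS, versus 2,880 raw IOPS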

Use IOmeter or the VMware fling I/O Analyzer to look at:

  • Average I/O response time (long latencies)
  • Total I/O per second
  • Total MBPS (low throughput)

Disk I/O latency is broken down into DAVG, QAVG, KAVG, and GAVG

The value of KAVG ~= QAVG

In a well configured system QAVG should be zero

  • GAVG = guest average latency
  • DAVG = time spent in the device, from the driver/HBA to the storage array
  • KAVG = time spent in the ESXi kernel (a derived value)

From ESXi we see 3 main latencies that are reported in esxtop and vCenter.

The topmost is GAVG, or guest average latency, which is the total amount of latency that ESXi can detect.

Total latency (GAVG) – DAVG = KAVG.
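
To make the relationship concrete, here is a toy sketch that derives KAVG from GAVG and DAVG and applies the rule-of-thumb threshold mentioned further down (the sample latencies are invented):

def kernel_latency_ms(gavg_ms: float, davg_ms: float) -> float:
    """KAVG is derived: total guest latency minus time spent in the device/array."""
    return gavg_ms - davg_ms

gavg, davg = 25.0, 20.0                      # invented esxtop readings in milliseconds
kavg = kernel_latency_ms(gavg, davg)
print(f"GAVG={gavg} ms, DAVG={davg} ms, KAVG={kavg} ms")
if kavg > 2.0:                               # kernel latency above ~2 ms suggests queuing issues
    print("Check the VM, adapter, and device/LUN queues")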

Guidance: this shows the importance of sizing your storage correctly; when two storage-intensive sequential workloads share the same spindles, performance can be greatly impacted. Try to keep workloads separated – sequential workloads (backed by different spindles/LUNs) separate from random workloads.

Guidance: From a vSphere perspective, the use of one large datastore vs. many small datastores usually does not cause a performance impact. However, the use of one large LUN vs. several LUNs is storage array dependent and most storage arrays perform better in a multi LUN configuration than a single large LUN configuration.

Things to keep in mind with storage are….

  • Kernel latency greater than 2 ms may indicate a storage performance issue
  • Use the paravirtualized SCSI (PVSCSI) driver for the best storage performance and lower CPU utilization
  • VMFS performs equally well compared to RDMs; in general, there are no performance reasons to use RDMs instead of VMFS
  • vSphere has several storage queues, and queues may cause bottlenecks for storage-intensive applications. Check the VM, adapter, and device/LUN queues for bottlenecks

For more details on these topics, see the Performance Best Practices and Troubleshooting Guides on the VMware website.

http://pubs.vmware.com/vsphere-51/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-51-storage-guide.pdf

http://communities.vmware.com/docs/DOC-19166

 

 

Filed Under: VMWorld Tagged With: anil sedha, ballooning, CPU ready, CPU wait, CSTP, davg, EMCElect, gavg, high cpu, high iops, hol-sdc-1404, hot swapping, i/o analyzer, iometer, kavg, multi smp, optimize, processing, qavg, rule of thumb, Run time, transparent page sharing, vExpert, vmware fling, vmworld, vNuma, vsphere performance
