Health Check Plugin in vSphere 6
Used for Virtual SAN in vSphere 6 – improved
Runs 30 health checks – explains the nature of each failure, and so on
vRealize Operations Management Pack for Storage Devices (MPSD) latest update released last week
vROps MPSD – Custom Dashboards
Build bespoke dashboards showing VSAN cluster info
Disk Group throughput
Log Insight – super tool for looking at logs generated by ESXi hosts and vSAN. Allows us to do analytics on logs – patterns, behaviours, intermittent issues
A VSAN content pack is available for Log Insight
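The kind of log analytics described above – spotting recurring patterns and intermittent issues – can be sketched in miniature. The log lines and the normalization rules here are invented for illustration, not real Log Insight behaviour or actual vmkernel messages:

```python
import re
from collections import Counter

def normalize(line):
    """Collapse variable parts (hex IDs, numbers) so similar messages group together."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<hex>", line)
    line = re.sub(r"\d+", "<n>", line)
    return line

def top_patterns(lines, n=3):
    """Return the n most frequent normalized message patterns with their counts."""
    return Counter(normalize(l) for l in lines).most_common(n)

# Hypothetical log lines for illustration only
logs = [
    "vmkernel: LSOM: disk 52 I/O latency 120 ms",
    "vmkernel: LSOM: disk 52 I/O latency 340 ms",
    "vmkernel: CMMDS: node 0x1a2b rejoined cluster",
]
print(top_patterns(logs))
```

Grouping by normalized pattern is what turns thousands of raw log lines into a short list of recurring behaviours worth investigating.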
Real Life troubleshooting scenarios
We have realized there are gaps in the management story around VSAN – VMware has been working to provide the right tools in the right places.
Low level – esxcli, RVC (Ruby vSphere Console), VSAN Observer
The vSphere Web Client and vRealize are needed as the standard way of consuming things; PowerCLI is also a tool in use
Things you probably want to check
- Components must be on the HCL
- Confirm network is good – e.g. Multicast
- Make sure VMs can be deployed successfully
- Test underlying storage components with a “stress test”
- Inject failures, ensuring that VMs remain available
- Test performance of Virtual SAN (VSAN)
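The checklist above lends itself to automation. A minimal sketch of a check runner – with placeholder check functions standing in for real HCL lookups, multicast tests, and VM deployments, none of which are actual VSAN API calls:

```python
# Placeholder checks: each would wrap a real validation in practice
def check_hcl():        # would compare controller/driver/firmware against the HCL file
    return True

def check_multicast():  # would run the proactive multicast network test
    return True

def check_vm_deploy():  # would deploy and then delete a small test VM
    return True

CHECKS = [
    ("HCL", check_hcl),
    ("Multicast", check_multicast),
    ("VM deploy", check_vm_deploy),
]

def run_checks():
    """Run every check and report which ones failed."""
    results = {name: fn() for name, fn in CHECKS}
    failed = [name for name, ok in results.items() if not ok]
    return results, failed
```

The point of the structure is that each check reports pass/fail independently, so a single run tells you where in the stack (hardware, network, provisioning) to start troubleshooting.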
Scenario #1
Assume HCL check fails.
Download the latest HCL file – maybe the device/driver/firmware support status has changed
Scenario #2
Run a storage performance test. This doubles as a stress test – VMware has released a tool called HCIbench to test the performance of the infrastructure
To test a bad drive, don't just remove it – run the special error injector that comes with the health check to simulate the drive going bad – see the POC guide
Scenario #3
- Injecting failures in Virtual SAN can be done quite simply
- Hosts (reboot or power off)
- Network (disconnect uplinks or disable VSAN traffic service)
- Disks (special error injector with health check – simulate a drive going bad – see POC guide)
- Use the health check to understand failures
When you pull out a drive, VSAN detects it and marks it Absent – the drive may come back, so VSAN waits 60 minutes before remediation
When a drive actually fails, VSAN detects that and puts it in Degraded mode – it will not recover, so VSAN performs immediate remediation
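The Absent-versus-Degraded behaviour above reduces to a simple decision, sketched below. The 60-minute figure is VSAN's default repair delay; the function and state names here are illustrative, not VSAN's internal API:

```python
REPAIR_DELAY_MIN = 60  # VSAN's default wait before rebuilding Absent components

def remediation_wait(state):
    """Minutes VSAN waits before rebuilding components, by component state.

    absent   = device removed / host down; it may come back, so wait
    degraded = device reported a failure; it won't recover, rebuild now
    """
    if state == "absent":
        return REPAIR_DELAY_MIN
    if state == "degraded":
        return 0
    raise ValueError(f"unknown state: {state}")

print(remediation_wait("absent"))    # 60
print(remediation_wait("degraded"))  # 0
```

This is why pulling a healthy drive is a poor failure test: you exercise the 60-minute Absent path, not the immediate Degraded path a real drive failure triggers – hence the error injector.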
VSAN requires a minimum of three nodes and can then tolerate the failure of one node. With four nodes we still tolerate one node failure, but after that failure there is a spare node to rebuild components onto – restoring protection against one more node failure
So it is preferred that you start with 4 nodes for VSAN.
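The node-count arithmetic behind the three-node minimum can be made explicit. For VSAN's default mirroring, tolerating n failures needs n+1 data replicas plus n witness components, each on its own host – hence 2n+1 hosts. The function name is my own:

```python
def min_hosts(ftt):
    """Minimum hosts for a VSAN cluster to tolerate `ftt` host failures
    with mirroring: (ftt + 1) data replicas + ftt witnesses = 2*ftt + 1 hosts."""
    return 2 * ftt + 1

print(min_hosts(1))  # 3 – the minimum cluster
# A 4th host adds no extra failure tolerance by itself, but after one
# failure it gives VSAN somewhere to rebuild, restoring FTT=1 protection.
```

With exactly three hosts and one failed, VSAN has nowhere to rebuild and runs unprotected until the host returns – which is the argument for starting at four.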
Multicast – Multicast misconfiguration is the most common issue. Proactive Tests includes a test that verifies whether multicast performance is acceptable for the VSAN cluster; when its alarm shows a Failed status, there is a multicast issue
Sneak peek of 'Performance Service' – a new tool, to be released hopefully in the near future
Stores historical VSAN performance statistics
Stored on VSAN itself
Always-on
Fully Integrated – no need to install anything
Exposed via vSphere Web Client (and API)
Distributed architecture, built directly into ESXi
- No network traffic going outside the cluster
- No CPU/Memory usage in VC
- Tiny impact on ESXi hosts
Benchmarked 50K IOPS (see image) using Performance Service
VSAN – when performing benchmark tests, note that the read cache in VSAN needs to warm up
Outstanding I/Os (OIO) – without enough outstanding I/O, a VM cannot push performance to its limits
You need enough VMs to perform a true performance test. We don't run one giant VM; we run multiple VMs to ensure enough parallel I/O – that is what VSAN is built for.
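The outstanding-I/O point follows from Little's Law: sustained IOPS equals outstanding I/Os divided by latency. A quick sketch (illustrative numbers, not measured VSAN figures) of why many VMs in parallel beat one giant VM:

```python
def iops(outstanding_io, latency_ms):
    """Little's Law: IOPS = outstanding I/Os / latency (latency in seconds)."""
    return outstanding_io / (latency_ms / 1000.0)

# One VM keeping 4 I/Os in flight at 2 ms latency:
print(iops(4, 2.0))       # 2000.0 IOPS
# Ten such VMs issuing in parallel scale the offered load:
print(10 * iops(4, 2.0))  # 20000.0 IOPS
```

A single VM's queue depth caps its offered load; a distributed system like VSAN only shows its aggregate performance when many VMs keep I/O in flight at once.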
Two new metrics in the performance graphs – Delayed I/O percentage and Delayed I/O average latency
- How many I/Os were delayed
- How many I/Os did not make it to the pipeline
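Conceptually, the two metrics are simple ratios. The counter names and formulas below are my own illustration of the idea, not the Performance Service's actual metric definitions:

```python
def delayed_io_percentage(delayed, total):
    """Share of all I/Os that were held back from the pipeline."""
    return 100.0 * delayed / total if total else 0.0

def delayed_io_avg_latency_ms(total_delay_ms, delayed):
    """Average extra latency experienced by just the delayed I/Os."""
    return total_delay_ms / delayed if delayed else 0.0

print(delayed_io_percentage(50, 1000))       # 5.0 (% of I/Os delayed)
print(delayed_io_avg_latency_ms(250.0, 50))  # 5.0 (ms per delayed I/O)
```

Read together, the pair tells you both how often I/Os are delayed and how badly the delayed ones suffer – one number without the other can be misleading.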