Key issues concerning performance:
- Bad performance negatively affects revenue
- Revenue impact negatively affects technology decisions
- Technology decisions as a result of bad performance affect IT architecture
It’s a cyclical situation that many of us find ourselves in from time to time. We run into a performance problem and, as a result, spend a significant amount of time resolving it. Meanwhile our users keep complaining and the business starts losing trust in the IT department. When you then come up with new technology, it takes a lot of time to convince the business and user community to move forward. Hence it is better to nip the problem in the bud before it flowers and starts affecting your work.
Performance tuning in a virtual environment has to be considered from various aspects. While the virtualization platform itself comes from VMware, components beyond its virtualization layer are also at play when considering performance issues. This topic is very close to my heart, so I am going to get into a few details here. For clarifications or further questions, feel free to comment at the end of this post. The components I am referring to are quite a few in number – VM environment design, hypervisor and VM setup, OS, system and I/O configuration, HA and DRS rules, network constraints, and so on.
VM Environment Design – Depending on the industry or line of business, an organization may settle on a design that suits certain requirements. Those that run mission-critical workloads and have the resources will devote a sufficient number of servers to process the workload. Their design might involve load balancers or traffic redirectors, which can affect performance. The selection of storage array and disk type is also part of the environment design. For example, whether your workload is mostly random or mostly sequential I/O, it is important that the correct infrastructure is in place to handle that type of workload. Usually, if the workload is unremarkable, many people don’t pay attention to this aspect. Office politics is another reason your environment design may not be optimal, and the only way to work through this is to review the entire design end to end. I would recommend starting with a Visio diagram of your architecture and then laying it out in an Excel sheet or a mind map. Then slowly build a before-and-after picture in which you can compare the current metrics against the desired end state. Once you understand the gaps in design or capabilities, redesign for future scale and growth. This final step is very important; otherwise you’ll run into performance or flexibility issues again in a short while (~1 to 2 years).
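To make the before-and-after comparison a little more concrete, here is a minimal Python sketch. The metric names, the numbers, and the `design_gaps` helper are all illustrative assumptions of mine, not output from any VMware tool. The idea is simply to compare what the current design delivers against the target end state plus a growth allowance.

```python
def design_gaps(current, target, growth=0.0):
    """Return the shortfall per metric after applying a growth allowance.

    current: what the existing design delivers today (hypothetical metrics)
    target:  what the end state needs to deliver
    growth:  fractional headroom for future scale, e.g. 0.25 for 25%
    """
    gaps = {}
    for metric, required in target.items():
        needed = required * (1 + growth)
        have = current.get(metric, 0)
        if have < needed:
            gaps[metric] = needed - have
    return gaps


# Made-up "before" and "after" pictures for one cluster
before = {"iops": 8000, "throughput_mbps": 400, "hosts": 4}
after = {"iops": 10000, "throughput_mbps": 350, "hosts": 4}

for metric, shortfall in design_gaps(before, after, growth=0.25).items():
    print(f"{metric}: short by {shortfall}")
```

Note that with a 25% growth allowance even the metrics that look fine today (throughput and host count in this made-up data) show up as gaps, which is exactly why the final redesign-for-growth step matters.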
Hypervisor and VM Setup – There is usually a simple answer to every question about setting up a VM, but there are intricacies that are often missed. When you set up a VM, decide what the optimum configuration for it is. Know what you can and cannot do: you cannot create more datastores than the platform allows, you should not use thin disk provisioning for performance-hungry applications, and so on. A VM setup is not just about the nuts and bolts of creating a virtual machine, assigning processor and memory, and finally protecting it with SRM. It also involves understanding the block size (luckily with ESXi 5.1 onwards that is not a concern), paravirtualized SCSI adapters, the bus type, Storage I/O Control, multipathing, DRS impacts due to the way you may have set it up, and a whole lot more. Don’t be concerned about having to understand all of these aspects upfront, because your application or build testing can guide you to the correct configuration. The problem is that not many administrators go back to the drawing board and review the configuration unless the testing itself failed. I once resolved long backup times and multiple failed backups by increasing the default CPU and memory allocation for the Service Console. Newer releases such as ESXi 5.1 have no Service Console, so this is not a worry there, but if you are running older releases, look at bumping up the Service Console CPU and memory reservations.
OS – The type of operating system you select and how frequently you apply updates both matter. The multi-threading capabilities of most Linux/Unix operating systems are well known and well proven. I am not saying that Windows Server 2008 and later do not perform well, but there are a lot of constraints around using Windows Server, specifically when you are troubleshooting issues. It is far easier to audit and establish a clear path of troubleshooting when working with Linux. When you work with Windows-specific databases and applications, it is important to set up some level of logging and auditing. Otherwise, during a performance problem it will be difficult to tell what is a resource problem and what is an application issue. Most performance problems get blamed on the infrastructure, but I have had the opportunity in the past to rebut every one of those claims (except one) and show that the infrastructure was not at fault. The one exception was related to disk realignment, and fixing it did improve performance. I will talk about it in another post since I want to cover some other details with it.
System and I/O configuration – I always ask application owners and vendors about their preferred hardware configuration, and then ask for their TPM and TPC requirements. Chances are that 90% of the time the application owner or vendor will have no clue. Then ask your hardware vendor the same thing, or read the tech specs to see what is supported. At the end of the day VMs run on physical hardware, and if you don’t know what your hardware is capable of, you are working with limited knowledge. Knowledge of TPM and TPC is usually required only in complex environments, but those are the environments where performance issues are the hardest to resolve. So every bit helps, and every bit makes the business more confident in the services you provide. BIOS upgrades, hard drive firmware upgrades, and hyperthreading support are the kind of items often overlooked by those who do not focus on design considerations, primarily due to lack of experience. If you are using iSCSI, think about how much throughput you are getting and how many ESXi hosts share it. Maybe you require more data ports, or the type of connection you use on the network side may not be sufficient. I once came across a server whose network adapter was set to ‘Auto’ for speed settings. I set it manually to full duplex and it resolved a major problem with backup performance. Jumbo frames for data transfer, disk alignment for faster I/O, multipathing, trunking for bandwidth aggregation, and a whole host of other things need to be reviewed. Run I/O stats inside the OS to see what you observe from within the VM, then run I/O stats on ESXi, and finally view the storage array stats. Everything should correlate, or else you have a problem at one of the layers. Memory leaks occur within the OS but are not clearly visible as performance-impacting factors. It is important to troubleshoot for memory leaks due to application code when you are not seeing anything at the infrastructure level.
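The layer-by-layer I/O comparison can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the latency readings are made up, the 5 ms threshold is arbitrary, and `find_latency_jumps` is a hypothetical helper. In practice the numbers might come from something like `iostat` in a Linux guest, `esxtop` on the host, and whatever statistics your storage array exposes.

```python
def find_latency_jumps(layers, threshold_ms=5.0):
    """Compare average I/O latency between adjacent layers.

    `layers` is ordered from the top of the stack (guest OS) down to the
    storage array, as (layer_name, avg_latency_ms) pairs. Returns a list of
    (upper_layer, lower_layer, added_ms) for every hop where the upper layer
    sees noticeably more latency than the layer beneath it.
    """
    jumps = []
    for (upper, upper_ms), (lower, lower_ms) in zip(layers, layers[1:]):
        added_ms = upper_ms - lower_ms
        if added_ms > threshold_ms:
            jumps.append((upper, lower, added_ms))
    return jumps


# Hypothetical average latencies collected over the same time window
readings = [("guest OS", 32.0), ("ESXi host", 30.0), ("storage array", 6.0)]

for upper, lower, added_ms in find_latency_jumps(readings):
    print(f"{added_ms:.1f} ms added between {lower} and {upper}")
```

With these made-up numbers the only flagged hop is between the ESXi host and the storage array, which would point you at the path between the two (network, multipathing, port count) rather than at the guest or the array itself.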
HA and DRS – How you have set up HA and DRS matters. Numerous peers have asked for my take on their DRS setup. One thing I always advise is to turn on HA and DRS in the early stages, because it lets you understand how the load patterns are changing and make tweaks to your setup accordingly. Those who don’t have that luxury because they are adopting DRS quite late should start with a smaller cluster and slowly expand it, understanding along the way what impact it has on the servers and VMs that are not part of the DRS resource pools. Understand how shares, reservations, and limits work, and then slowly make changes. Review the number of vMotions for each VM and see whether its resource allocation needs to be tweaked. Use vCOps to understand how your environment is performing, and then review whether overprovisioning or underprovisioning is occurring.
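To build some intuition for how shares, reservations, and limits interact, here is a deliberately simplified Python model. It is not the actual ESXi scheduler or DRS algorithm; the `VM` class, the `allocate` function, and all the numbers are assumptions of mine for illustration. In this model every VM first receives its reservation, and the spare capacity is then split in proportion to shares, never going above a VM's limit or its current demand.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    shares: int         # relative priority under contention (must be > 0)
    reservation: float  # MHz guaranteed to the VM
    limit: float        # MHz ceiling, float("inf") for unlimited
    demand: float       # MHz the VM is asking for right now


def allocate(vms, capacity):
    """Simplified proportional-share allocation of `capacity` MHz."""
    cap = {vm.name: min(vm.limit, vm.demand) for vm in vms}
    alloc = {vm.name: min(vm.reservation, cap[vm.name]) for vm in vms}
    remaining = capacity - sum(alloc.values())
    if remaining < 0:
        raise ValueError("reservations exceed capacity")

    active = [vm for vm in vms if alloc[vm.name] < cap[vm.name]]
    while remaining > 1e-9 and active:
        total_shares = sum(vm.shares for vm in active)
        handed_out = 0.0
        for vm in active:
            offer = remaining * vm.shares / total_shares
            take = min(offer, cap[vm.name] - alloc[vm.name])
            alloc[vm.name] += take
            handed_out += take
        remaining -= handed_out
        # VMs that hit their limit or demand drop out; the rest share what is left
        active = [vm for vm in active if cap[vm.name] - alloc[vm.name] > 1e-9]
    return alloc


vms = [
    VM("db", shares=2000, reservation=1000, limit=4000, demand=5000),
    VM("web", shares=1000, reservation=0, limit=float("inf"), demand=3000),
    VM("batch", shares=1000, reservation=0, limit=1000, demand=2000),
]
for name, mhz in allocate(vms, capacity=6000).items():
    print(f"{name}: {mhz:.0f} MHz")
```

Changing one knob at a time in a toy model like this (for example doubling the shares on `web`, or removing the limit on `batch`) is a cheap way to see why I suggest making such changes slowly in a real cluster.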
I could go on and on about various performance factors and criteria, but it is better to treat each aspect on its own in new blog posts, so look out for other content around this. If you have questions or would like to share your own experience or views, please leave a comment. I would be happy to hear from you.