Equallogic, VAAI and the Fear of Queues

Previously I posted on how using bigger VMFS volumes helps Equallogic reduce their scalability issues when it comes to total iSCSI connections. A commenter asked whether this means we can have a new best practice for VMFS size. I quickly said, “Yeah, make em big or go home.” I didn’t really say that, but something like it. The commenter has since replied with a long response from Equallogic saying VAAI only fixes SCSI locks and that all the other issues with bigger datastores still remain. ALL the other issues being “Queue Depth.”

Here is my order of potential IO problems with VMware on Equallogic:

  1. Being spindle bound. You have an awesome virtualized array that will send IO to every disk in the pool or group. Unlike some others you can take advantage of a lot of spindles. Even then, depending on the types of disks some IO workloads are going to use up all your potential IO.
    Solution(s): More spindles is always a good solution if you have unlimited budget. Not always practical. Put some planning into your deployment. Don’t just buy 17TB of SATA. Get some faster disk and break your Group into pools and separate the workloads into something better suited to the IO needs.
  2. Connection Limits. The next problem you will run into if you are not having IO problems is the total iSCSI connections. In an attempt to get all of the IO you can from your array you have multiple vmk ports using MPIO. This multiplies the connections very quickly. When you reach the limit, connections drop and bad things happen.
    Solution: The new 5.02 firmware increases the total maximum connections. Additionally, bigger datastores mean fewer connections. Do the math.
  3. Queue Depth. There are queues everywhere: the SAN ports have queues, each LUN has a queue, and the HBA has a queue. I will defer to this article by Frank Denneman (a much smarter guy than myself), which makes the case that balanced storage design is the best course of action.
    Solution(s): Refer to problem 1. Properly designed storage is going to give you the best solution for any potential (even though unlikely) queue problems. In your great storage design, make room for monitoring. Equallogic gives you SANHQ. USE IT!!! See how your front end queues are doing on all your ports. Use ESXTOP or RESXTOP to see how the queues look on the ESX host. Most of us will find that queues are not a problem when problem one is properly taken care of. If you still have a queuing problem, then go ahead and make a new datastore. I would also request that Equallogic (and others) release a Path Selection Policy plugin that uses a Least Queue Depth algorithm (or something smarter). That would help a lot.

So I will repeat my earlier statement that VAAI allows you to make bigger datastores and house more VMs per store. I will add a caveat: if you have a particular application with a high IO workload, give it its own datastore.

Gestalt IT – Tech Field Day

I am honored to be included in the upcoming Gestalt IT Field Day. Looks like a great group from the community will be in attendance. I am looking forward to the collection of presenters. With how busy I have been delivering solutions lately, it will be really good to dedicate some time to learning what is new and exciting. I plan to take good notes and share my thoughts here on the blog. For more information on the Field Day check it out right here: http://bit.ly/ITTFD4

How VAAI Helps Equallogic

I previously posted about the limits on iSCSI connections when using Equallogic arrays and MPIO. If you have lots of Datastores and lots of ESX hosts with multiple paths, the number of connections multiplies pretty quickly. Now with VAAI support in the Equallogic 5.02 firmware (hopefully no recalls this time), the number of Virtual Machines per Datastore is not important. Among other improvements, the entire VMFS volume will not lock. As I understand VAAI, only the blocks (or files maybe?) are locked when exclusive access is needed.

Let's look at the improvement when using fewer, larger EQ volumes:
Old way (with 500GB Datastores for example):
8 Hosts x 2 (vmkernel connections) x 10 (Datastores) = 160 connections (already too many for the smaller arrays, PS 4000).

VAAI (with 1.9 TB* Datastores):
8 Hosts x 2 (vmkernel connections) x 3 (Datastores) = 48 connections

The scalability for Equallogic is much better with VAAI when trying to stay under the connection limits.

*Limit for VMFS is 2TB minus 512B so 1.9TB works out nicely.
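
If you want to run the same math for your own environment, here is a minimal sketch in Python. The function name and parameters are my own for illustration, and it assumes every host opens one connection per vmkernel port to every datastore volume, which is the same simplification used in the numbers above. Check the result against the connection limit documented for your array model and firmware.

```python
def iscsi_connections(hosts: int, vmk_ports: int, datastores: int) -> int:
    """Estimate total iSCSI connections to the group.

    Assumption: each host opens one connection per vmkernel iSCSI port
    to each datastore volume (hosts x ports x volumes).
    """
    return hosts * vmk_ports * datastores


# Old way: ten 500GB datastores
old_way = iscsi_connections(hosts=8, vmk_ports=2, datastores=10)

# With VAAI: three 1.9TB datastores
vaai_way = iscsi_connections(hosts=8, vmk_ports=2, datastores=3)

# VMFS volume limit from the footnote: 2TB minus 512 bytes
vmfs_max_bytes = 2 * 1024**4 - 512

print(f"Old way:  {old_way} connections")
print(f"VAAI way: {vaai_way} connections")
print(f"VMFS max: {vmfs_max_bytes} bytes")
```

Change the host, port and datastore counts to match your own cluster. Each extra vmkernel port or datastore multiplies the total, which is why consolidating into fewer, larger volumes pays off so quickly.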

Update Manager Problem after 4.1 Upgrade

A quick note to hopefully publicize a problem I had which I see is discussed in the VMware Community Forums already.

After building a new vCenter Server and upgrading the vSphere 4.0 databases for vCenter and Update Manager, I noticed I could not scan hosts that were upgraded to 4.1. To be fair, by upgrading I mean rebuilt with a fresh install but with the exact same names and IP addresses. It seems the process I took to upgrade has some kind of weird effect on the Update Manager database. The scans fail almost immediately. I searched around the internet and found a couple of posts on the VMware Forums about the subject. One person was able to fix the problem by removing Update Manager and, when reinstalling, selecting the option to install a new database. I figured I didn’t have anything important in my UM database, so I gave it a try and it worked like a champ.

Right now there are no new patches for vSphere 4.1, but I have some Extension packages that need to be installed (Xsigo HCA Drivers). I wanted to note that I like the ability to upload extensions directly into Update Manager. For tracking and change control purposes, this is a much cleaner process than loading the patches via the vMA.

ESXi 4.1 pNics Hard Coded to 1000 Full

I have recently made the transition to using ESXi for all customer installs. One thing I noticed was that after installing from a couple of different types of media (ISO and PXE install), the servers come up with the NICs hard coded to 1000 Full. I have always made it a practice to keep Gigabit Ethernet at auto-configure. I was told by a wise Cisco engineer many years ago that GigE and Auto/Auto is the way to go. You can also check the Internet for articles and best practices around using auto-configure with gigabit ethernet. Even the VMware “Health Analyzer” recommends using auto. So it is perplexing to me that ESXi 4.1 would start to default to a hard-coded setting. Is it just me? Has anyone else noticed this behavior?

The only reason I make an issue of it is that I was ready to call VMware support a couple of weeks ago because nothing in a DRS/HA cluster just built with 4.1 would work. One vMotion would be successful, the next would fail. Editing settings on the hosts would fail miserably when done from the vSphere Client connected to vCenter. After changing all the pNics to auto (matching the switches), everything worked just fine.

Hit me up in the comments or on twitter if you have noticed this.