VMware ESXi link aggregation for the VM network?

Hi,

I'm wondering what the best practices are for increasing bandwidth on a VMware virtual machine network for connectivity between VMs (and possibly vMotion too). At present, our iSCSI traffic is nicely balanced, but our VM traffic is not.

I suppose our ultimate goal would be to have 4 NICs configured for the virtual machine network on each hypervisor. What we would then like to see is a VM on esxi1 with a 10Gbit virtual NIC being able to talk to a VM on esxi2, also with a 10Gbit virtual NIC, at a speed of 4Gbit (the 4 x 1Gbit NICs on the hypervisor). Is this possible?

Google searches have thrown up loads of articles, which I've been reading, but none of them has really explained this clearly.

Our setup is as follows.


OK, so we have multiple ESXi hypervisors, each with ten 1-gigabit NICs in them. These use Dell EqualLogic SANs for storage, with 4 iSCSI NICs.

We have set it all up with iSCSI MPIO as per the Dell best practices, and looking at the NIC statistics on the SAN we can see that traffic is evenly distributed across all the EqualLogic NICs, so we're confident we have 4Gb of iSCSI bandwidth available rather than 1Gb, and all is balanced nicely.

The question is, how do we balance the VM traffic as nicely? We obviously have 6 NIC ports left on each hypervisor. Right now, for example, if we copy a file from a VM on esxi1 to a VM on esxi2, the copy runs at 1Gb. But if we simultaneously copy another file from another VM on esxi1 to another VM on esxi2, each copy now only runs at 500Mbps, since both copies are going out on the same gigabit NIC on the hypervisor, which bottlenecks it.
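To illustrate what we're seeing, here's a rough toy model in Python, assuming the default "route based on originating virtual port ID" teaming policy, where each vNIC gets pinned to a single physical uplink. The uplink names and the round-robin assignment below are just placeholders, not ESXi's actual allocation algorithm; the point is simply that a single flow can never exceed one uplink's speed, and two vNICs pinned to the same uplink end up sharing it.

```python
# Toy model: each vNIC is pinned to exactly one 1Gbps uplink, so one copy
# tops out at 1Gbps, and two copies sharing an uplink get ~500Mbps each.
UPLINK_SPEED_MBPS = 1000
uplinks = ["vmnic4", "vmnic5", "vmnic6", "vmnic7"]  # hypothetical uplink names

def pin_vnics(vnic_ids, uplinks):
    """Pin each virtual port to one uplink (illustrative round-robin only)."""
    return {vnic: uplinks[i % len(uplinks)] for i, vnic in enumerate(vnic_ids)}

pinning = pin_vnics(["vm1.eth0", "vm2.eth0", "vm3.eth0", "vm4.eth0", "vm5.eth0"], uplinks)

# Two simultaneous copies: if their source vNICs happen to be pinned to the
# same uplink, each one only gets half the link.
active = ["vm1.eth0", "vm5.eth0"]
sharing = {}
for vnic in active:
    sharing.setdefault(pinning[vnic], []).append(vnic)

for uplink, vnics in sharing.items():
    print(f"{uplink}: {vnics} -> ~{UPLINK_SPEED_MBPS / len(vnics):.0f} Mbps per flow")
```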


I see that vSphere 5.1 supports LACP, and our switches are managed and will also support this, so am I best creating a LACP group for each hypervisor with, say, 4 NICs? Would this then give 4Gb of bandwidth for the virtual machine network between hypervisors?
 
We just go with standard failover on our NICs rather than aggregation, so I can't speak from experience, but this thread seems to have a few links that describe nicely how to do it.

Buuuuuut, in KB2034277 (linked in that thread) it says 'LACP only works with IP Hash load balancing and Link Status Network failover detection'. So it will still only send the data over the single uplink that your source and destination IPs hash it to.

I haven't been able to confirm it, but I think you're SOL, and the only way to make all that bandwidth actually available is to invest in 10Gb NICs and the supporting infrastructure.

Just found this article from January, so it's reasonably current. It says, 'It is clear that if you don't have any variations in the packet headers you won't get better distribution.'
So if it's a file server you're OK, but for a big point-to-point copy you gain nothing.
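To make that concrete, here's a rough Python sketch of the IP-hash idea. I can't vouch for the exact formula ESXi uses (the XOR-mod below is just an assumption); the property that matters is that the result is deterministic for a given source/destination pair, so one big copy between the same two IPs always lands on the same uplink and never exceeds a single 1Gbps link.

```python
import ipaddress

def ip_hash_uplink(src, dst, n_uplinks=4):
    # Assumed hash: XOR the two addresses and take the result modulo the
    # number of active uplinks. Deterministic per src/dst pair.
    return (int(ipaddress.ip_address(src)) ^ int(ipaddress.ip_address(dst))) % n_uplinks

# The same two VMs hash to the same uplink index every single time:
for _ in range(3):
    print(ip_hash_uplink("10.0.1.11", "10.0.2.22"))
```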
 
LACP is only good for lots of connections, not single IP to single IP, i.e. many-to-one will start to use all the links in the aggregate.

If you cannot run the VMs on the same host for resilience (or some other reason), you need to upgrade the core network speed.
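A quick illustrative sketch of the many-to-one case, using the same assumed XOR-mod hash as the sketch above (again, not necessarily ESXi's exact formula): with lots of distinct client IPs hitting one server, the per-pair hashes naturally spread the flows across all the uplinks in the aggregate.

```python
from collections import Counter
import ipaddress

def ip_hash_uplink(src, dst, n_uplinks=4):
    # Same assumed XOR-mod hash as the earlier sketch.
    return (int(ipaddress.ip_address(src)) ^ int(ipaddress.ip_address(dst))) % n_uplinks

server = "10.0.2.22"                                  # hypothetical server IP
clients = [f"10.0.1.{i}" for i in range(1, 101)]      # 100 hypothetical client IPs
spread = Counter(ip_hash_uplink(c, server) for c in clients)
print(spread)   # flows end up spread over all four uplink indices
```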
 
1. You can only do LACP with a distributed vSwitch
2. You still won't exceed the individual link speeds for a single flow

You actually don't need LACP (and thus no distributed vSwitch) to achieve the above: you can do exactly the same with vanilla static EtherChannel (called IP hashing in vSphere). All LACP is is a control protocol (the clue is in the name) which does smarter linking by negotiation rather than assumption.
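If you do go the IP-hash route, here's a rough pyVmomi sketch of flipping a standard vSwitch over to it. The vCenter hostname, credentials and the vSwitch name are placeholders, SSL/cert handling is skipped, and the attribute paths are from memory, so test it in a lab first. The physical switch ports must already be in a static EtherChannel before you change the policy, or you'll drop traffic.

```python
# Hedged sketch: set "route based on IP hash" (static EtherChannel style
# teaming) on a standard vSwitch via pyVmomi. No dvSwitch or LACP needed.
from pyVim.connect import SmartConnect, Disconnect

si = SmartConnect(host="vcenter.example.local",      # placeholder vCenter
                  user="administrator@vsphere.local",
                  pwd="********")
try:
    content = si.RetrieveContent()
    host = content.searchIndex.FindByDnsName(dnsName="esxi1.example.local",
                                             vmSearch=False)
    net_sys = host.configManager.networkSystem
    for vsw in net_sys.networkInfo.vswitch:
        if vsw.name == "vSwitch1":                          # assumed VM network vSwitch
            spec = vsw.spec
            spec.policy.nicTeaming.policy = "loadbalance_ip"    # IP hash
            net_sys.UpdateVirtualSwitch(vswitchName=vsw.name, spec=spec)
            print(f"{vsw.name} switched to IP-hash teaming")
finally:
    Disconnect(si)
```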

Check out the cost per port of jumping to 10GbE. If your inter-host traffic really does need >1Gbit of throughput, then the fact that it will cost less per server in switch ports and deliver better application performance should be a slam dunk...
 
As DRZ says.

We had a similar situation here, but it was a legacy setup (we've been running since ESX v3.0).

Make sure you're talking to whoever looks after the network infrastructure, and that you're heading towards 10Gb if you think you'll need that bandwidth; it makes life a whole lot easier.

We've already moved to 10Gb for VDI and have now put things in motion for 10Gb on the server vSphere infrastructure.
 
Thanks everyone - some very useful info here.
So basically what we are saying is that two VMs on different hypervisors talking to each other will see zero benefit from LACP or from adding multiple NICs to the VM network.

However, many VMs talking to many VMs will use the additional NICs. Upgrading to 10Gb will be hard, but not impossible: our servers have no spare PCIe slots, but having said that, I'm sure we can free some up.
 