Anyone using HP P4500 SAN + VMware?

We have one (a pair) of those at work; I don't manage them myself (they're at our CoLo, I do the office stuff) but I might be able to help/ask for you. What's your question? :)

akakjs
 
Snapshots are taking ages, like 12+ minutes for an 8GB VM, and we are losing heartbeats for maybe 10 seconds. This is only testing with 1 or 2 VMs, and we plan to put 150 on this thing.

I'm looking for a case study or some success stories of people hosting 100+ VMs on them. HP says our config is correct, yet VMware support says there's something wrong with our SAN - we are going around in circles and getting nowhere.

I just want proof that there are people out there using this combo and having decent performance.

cheers
 
Yes, we have deployed 2 of these solutions, but we don't use the LH snapshots - we use Veeam instead. Can I ask if these are G1s or G2s? The exact part number would be better :)
 
ahh, quite a hardcore question then :P We only run about 16 on ours, but I'll have a poke and see what kind of performance we get.

I should add that I'm strictly an enthusiastic amateur, it's not my day job (software developer by trade); so if you have something specific in mind you'd like me to check do say.

akakjs
 
VM snapshot or array snapshot? Flow control was the first thing I thought of, personally. How is the array connected to the network? A dedicated switch? What switches? Bandwidth, etc.?

Caveat - I've not touched the LeftHand stuff, but I have run IP storage with x000 VMs.
 
BQ888A is the model; we have lots of them - an 8-node cluster at each site, but not global clusters, just local Network RAID 10.

ESXi snapshots, not array snaps, are the ones taking time.

It's connected to a stack of 4 x Cisco 3750s that act as a single switch.

Flow control is enabled, as are jumbo frames (which we might disable, as they cause latency).

We are using BL460c G7 blades connected to Flex-10 VC modules, which have multiple 10G CX4 connections for iSCSI to the 3750s.

We were sold this as a solution that would work with 100-200 VMs, but I know of no other company using this combo with that many. Most people just have 2 nodes and 10-15 VMs in an SMB-type environment.
 
Right-o, as I have no idea about the P4500 I'll work backwards.

Is the stack dedicated for the P4500 or does it hold other traffic?

So 3750 non-X, so they would be XENPAK modules? I'd jump on the switch and have a look at the connected interfaces for any drops or flaps.

As well as flow control, stop BPDUs by disabling spanning tree against those ports (portfast is not enough).

Also, are the interfaces dedicated to iSCSI or are they part of a larger trunk? If part of a larger trunk, how many VLANs do you have in the environment? This would likely only be important if you've seen drops.
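For reference, this is the sort of thing I'd run on the 3750 stack - a sketch only, and the interface name is a placeholder for your actual iSCSI-facing ports:

```
! Look for errors, drops and link flaps on the iSCSI-facing ports
show interface GigabitEthernet1/0/1
show interfaces GigabitEthernet1/0/1 counters errors
show logging | include LINEPROTO

! Per-port: portfast plus BPDU filtering, and flow control
interface GigabitEthernet1/0/1
 spanning-tree portfast
 spanning-tree bpdufilter enable
 flowcontrol receive on
```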

That's the network part covered.

The next thing would be the VMware environment: are you using multiple vmks, one per vmnic, or just one and using LACP or something similar?
 
The stack is dedicated to iSCSI.
They are 3750-X, and there are no drops.
Flow control is on, spanning tree is disabled.
Dedicated iSCSI traffic, no VLANs.

The final design will have multiple vmks in each VMFS for each iSCSI gateway (VIP), so the 8-node P4500 cluster will have maybe 7 datastores, leaving one node for failover. 4-node ESXi cluster.

We are working with just a single datastore and a few VMs to test, though.

A simple test that brings it to its knees:

Migrate a VM from datastore1 to datastore2, and take a snapshot of a VM already on datastore2 at the same time.

Pinging the VM that you're snapshotting will result in lost packets for maybe 10 seconds, it'll get a red alert in vCenter and take maybe 12 minutes to finish, and all the while the pings will be terrible.

So in a production environment, a simple storage migration or snap will end up affecting all other VMs in that datastore.
 
So: a self-contained switch stack with ESXi and the array attached to it, all access ports, no spanning tree, with flow control.

Based on that I'd say your fundamental storage network is good.

Re: jumbo frames, just to check - have your array, your 3750 stack and your vSphere hosts all been changed to support the same MTU?
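A quick end-to-end check is a VMkernel ping with the don't-fragment bit set, run from the ESXi shell (the target IP below is a placeholder for your array VIP) - if any hop in the path isn't configured for jumbo frames, this will fail:

```
# 8972 = 9000-byte MTU minus 20 (IP header) and 8 (ICMP header)
# -d sets the don't-fragment bit; 10.0.0.10 stands in for the array VIP
vmkping -d -s 8972 10.0.0.10
```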

Creating a snapshot in VMware should be relatively pain-free with regard to performance. Are you taking memory snaps or quiescing the file system? With memory, you will get disk writes based on the amount of memory, which, depending on the size of the VM, could take a while.
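As a back-of-the-envelope illustration (the 100 MB/s figure is an assumption for a healthy array, not a measurement), the memory portion of a snapshot scales with RAM size - and if the whole 12 minutes reported above were memory writes, the implied throughput is alarmingly low:

```python
def memory_snapshot_seconds(mem_gb: float, write_mb_s: float) -> float:
    """Seconds to flush a VM's memory image to the datastore."""
    return mem_gb * 1024 / write_mb_s

# An 8GB VM at an assumed 100 MB/s sustained write: ~82 seconds
print(round(memory_snapshot_seconds(8, 100)))

# If the reported 12 minutes were all memory writes, the array
# would be sustaining only about 11 MB/s during the snapshot:
print(round(8 * 1024 / (12 * 60)))
```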

As a side interest - what is the disk config of the datastore you're testing against, and how much memory do the VMs you're testing with have configured?

The fact that everything is dying a death when you're doing svMotion is interesting.

What NICs are you using for iSCSI? Do they have an independent driver? Can you check esxtop while doing the svMotion to see if there is any random CPU usage, specifically on cpu0?
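For the esxtop check, something along these lines (the keys are the standard interactive views):

```
# On the ESXi host, while the svMotion is running:
esxtop    # 'c' = CPU view     - watch %USED on cpu0 and the iSCSI worlds
          # 'n' = network view - watch for dropped packets per vmnic
          # 'u' = disk device  - watch DAVG/KAVG latency per LUN
```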

If all this is fine, I'd hit the array next.
 
We took jumbo frames off the ESXi end and it improved; we are now taking jumbo off the storage nodes so there's nothing using it.

The VM we are using for testing has 8GB of RAM; the datastores are using 8MB blocks on a 750GB VMFS volume.

The blades have NC553i Flex-10 NICs, which have an HBA option inside Virtual Connect; we have tried hardware- and software-based initiators with the same problem. I think we'll end up just using software in the end, as I've read there are lots of issues with the HBAs.
 
Are they rebranded Broadcom or Intel chipsets? I've had many problems with Broadcoms - with some you literally cannot use jumbo frames in hardware initiator mode, and some of the 10Gbit ones have all sorts of issues with latency spikes etc.

Just out of curiosity, you say you're moving a VMDK and creating a snap at once, what sort of MPIO are you using and how is it configured?
 
MPIO is standard Round Robin on the VMware side, but we have tested with fixed paths and it's still the same.

VMware support now says we need to contact our storage vendor; we are currently working with HP on a solution, but not getting very far.
 
Did you get any further on this? I've got a similar setup of 6x HP P4500 nodes connected to 2x HP 6120G/XG blade switches in an HP C3000 enclosure.
Each NIC of the nodes is connected to each switch using ALB bonding on the nodes. We tested losing one of the switches and they lost quorum.
We have flow control and jumbo frames both enabled across the board.
 
We are still testing; we have upgraded everything to SAN/iQ 9.5 and the problem still exists.

Your ALB should fail over fine, though. Flow control AND jumbo frames together can cause problems on some cheaper switches; it's generally recommended to turn off jumbo as it offers little or no improvement.
 
it's generally recommended to turn off jumbo as it offers little or no improvement.

I'd beg to differ on that; I benched my SAN with and without jumbo frames, and throughput was considerably higher with them enabled. In some instances it was in the region of 25% higher.

Generally speaking if you up your frame size and don't see any improvement, it usually just means the bottleneck is further up the OSI model.
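That matches the arithmetic: the raw framing-efficiency gain from jumbo frames is only a few percent, so a 25% throughput gain mostly comes from the roughly 6x fewer frames the NICs, switches and CPUs have to process per byte moved. A rough sketch (ignoring preamble and inter-frame gap):

```python
def payload_efficiency(mtu: int) -> float:
    """Fraction of on-wire bytes that are TCP payload.

    Assumes 20B IP + 20B TCP headers inside the MTU, and
    18B of Ethernet header + FCS around it.
    """
    payload = mtu - 20 - 20
    return payload / (mtu + 18)

print(f"{payload_efficiency(1500):.3f}")  # standard frames -> 0.962
print(f"{payload_efficiency(9000):.3f}")  # jumbo frames    -> 0.994
```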
 