How much breaks when vCenter Server is not running?

Got any more little side benefits of it? :)

We've just started to look at the VSG add-on for our 1000Vs; it allows you to configure security profiles (as well as port profiles) in vCenter. For example, assuming you've got a front-end VLAN for your publicly accessible hosts, you can configure a profile that only lets servers of the same type talk "horizontally" within that VLAN, so your web servers can't talk to your SAP servers without having to involve an upstream firewall etc. The security profile automatically follows the VM when it gets vMotioned (we've got 350 VMs running in two DCs, so this is a big issue for us; at the moment a lot of our traffic between servers has to go via our FWSM blades).

As for things breaking when vCenter isn't running, don't forget backups and restores if you're using something like Veeam. It becomes a bit of an issue when your SAN craps out, corrupts your vCenter VM and you need to restore it :mad::mad:
 
As for things breaking when vCenter isn't running, don't forget backups and restores if you're using something like Veeam. It becomes a bit of an issue when your SAN craps out, corrupts your vCenter VM and you need to restore it :mad::mad:

This is something I've just been trying to plan for. Veeam is running on my vCenter server so, as you say, that won't work when it and/or the SAN are down. But I've got my backups going to a simple physical disk (RDM) that I can pull out and restore from. Although without having Veeam running that's going to be tricky... bugger :rolleyes:

Best workaround for that particular situation?
 
Veeam doesn't use a catalogue in the same way as, say, Backup Exec does. You can quickly install Veeam on a machine that isn't down, double-click on the .vbk and restore the VM from there.

I'm amazed people are running Veeam actually on the vCenter server though. I have Veeam running on an R720xd with 12 cores and 64GB RAM and still hit resource constraints (although I'm backing up and replicating VMs around the world from this box). It just doesn't seem suitable to run as a virtual machine to me. Factor in the whole point of doing backups in the first place and it seems even more silly to have your backups dependent on the thing you're backing up!

We use Dell-supplied QLogic 8242s and 8262s. The only difference I've noticed is that the 42s don't support LACP in Windows whereas the 62s do. We don't use many 42s now; we've retired all of the servers that were using them, so they're kept as spares. Never had any issues with them :)

Dogers, am I right in thinking that you have your NFS stores connected via a dVSwitch? You've given that impression!
 
Best workaround for that particular situation?

See below.

You can quickly install Veeam on a machine that isn't down, double-click on the .vbk and restore the VM from there.

I'm amazed people are running Veeam actually on the vCenter server though. I have Veeam running on an R720xd with 12 cores and 64GB RAM and still hit resource constraints (although I'm backing up and replicating VMs around the world from this box).

I should have clarified: we use Backup Exec for the actual backups, it just looks at the Veeam backup API on vCenter to back up whatever's on our ESXi hosts - hence it being a complete PITA when the vCenter server is unavailable.


We've ended up moving vCenter (plus a few other key systems such as our Backup Exec Master and SCOM platform) onto separate compute and storage from the systems they are looking after. We have an EMC SAN in both DCs, mirrored etc; unfortunately, last year one of them suffered a major failure that basically took the entire platform offline (bear in mind we boot from SAN too :eek:). As vCenter and Backup Exec were on the SAN that failed (the mirrors had fractured as a failsafe), we couldn't restore/move/do anything with our VMs.
 
We've just started to look at the VSG add-on for our 1000Vs; it allows you to configure security profiles (as well as port profiles) in vCenter. For example, assuming you've got a front-end VLAN for your publicly accessible hosts, you can configure a profile that only lets servers of the same type talk "horizontally" within that VLAN, so your web servers can't talk to your SAP servers without having to involve an upstream firewall etc. The security profile automatically follows the VM when it gets vMotioned (we've got 350 VMs running in two DCs, so this is a big issue for us; at the moment a lot of our traffic between servers has to go via our FWSM blades).
That looks interesting, although I suspect it is targeted primarily at multi-tenant situations (e.g. for a "Cloud" hosting provider). In your case it is just another firewall -- nothing wrong with that, but you are splitting your management and traffic flows, making it more difficult to troubleshoot and manage, no?
 
Got any more little side benefits of it? :)
The Networks guys love it, because they can manage it just like a physical network device. As far as I know, it supports most (if not all) of Cisco's proprietary stuff, which fits in very nicely with an all-Cisco shop (think VOIP, QoS, stretch VLANs, etc). It is basically like a dvSwitch Super Ultra Deluxe. Expensive though; licensed per ESXi host CPU socket.

And the Nexus stuff in general is really really nice -- I'm a big fan. A couple of Nexus 5000 as core switches (they are 1U each), and then multiple Nexus 2000 sprinkled through your environment as edge/top-of-rack switches, all 10Gbps, super powerful and fast, and very flexible.
 
This idea of system/private VLANs has just fried my brain. Any chance you could walk us through a real-world example?

For now I think I'll stick to putting the vCenter traffic on the management vSwitch.
PVLANs are something completely different from the Nexus 1000v "system VLANs". PVLANs appear to be for securing multi-tenant environments.
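
If it helps picture it, a system VLAN is just a line in the port-profile on the VSM. A rough sketch (profile name and VLAN ID are made up, not from a real config):

    ! sketch only - name and VLAN ID are examples
    port-profile type vethernet Management-vmk
      vmware port-group
      switchport mode access
      switchport access vlan 10
      no shutdown
      ! "system vlan" keeps this VLAN forwarding on the VEMs even if the VSM is down or unreachable
      system vlan 10
      state enabled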
 
The Networks guys love it, because they can manage it just like a physical network device. As far as I know, it supports most (if not all) of Cisco's proprietary stuff, which fits in very nicely with an all-Cisco shop (think VOIP, QoS, stretch VLANs, etc). It is basically like a dvSwitch Super Ultra Deluxe. Expensive though; licensed per ESXi host CPU socket.

And the Nexus stuff in general is really really nice -- I'm a big fan. A couple of Nexus 5000 as core switches (they are 1U each), and then multiple Nexus 2000 sprinkled through your environment as edge/top-of-rack switches, all 10Gbps, super powerful and fast, and very flexible.

Running the L3 daughtercards in a Nexus 5500 severely cuts the throughput they're capable of and limits them in a number of ways (although those limitations are getting better all the time). I have some 5Ks "out there" with the Layer 3 cards in, where the 10GbE requirements were small enough that no predictable scaling up will exceed the L3 limitations, on the proviso that should that ever happen, we'd swap the L3 cards for L2 cards and punt the routing up a level to a 4500 or similar.

Nexus 2Ks are good but you can't "sprinkle them about" without careful consideration because they are not switches and there are some very important considerations to take into account when planning that environment out.

Where 5Ks really shine is when you get FC/FCoE and FabricPath involved. There are some really poor failure scenarios with vPC but past that, the possibilities are really great. I couldn't be happier with my 7K/5K/2Ks :)
 
Veeam doesn't use a catalogue in the same way as, say, Backup Exec does. You can quickly install Veeam on a machine that isn't down, double-click on the .vbk and restore the VM from there.

I'm amazed people are running Veeam actually on the vCenter server though.
Ok that's good to know. Hopefully I'll never need to test that out but still, it's good to know it's doable.

For me I'm only working on a home lab so kit etc is limited.

PVLANs are something completely different from the Nexus 1000v "system VLANs". PVLANs appear to be for securing multi-tenant environments.

Ah my mistake, thanks for clarifying :)
 
Dogers, am I right in thinking that you have your NFS stores connected via a dVSwitch? You've given that impression!

We have indeed, yep! You sound like you're going to say that's bad?

Thinking back to these system VLANs though, I don't think they'd help us in the situation we've been in. I assume if you lose the supervisor you still can't move hosts as the ports aren't enabled/available on the new host? What would help though is being able to run two supervisors, potentially even FT'ed so there are really four (yeah yeah, I know that's not *quite* the same :) ) available.
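
(From what I've read, the 1000v does let you run the VSM as a primary/secondary HA pair, so losing one supervisor shouldn't take the control plane with it. I'm going from memory and haven't set it up myself, but it's along these lines:

    ! on the first VSM (sketch, from memory)
    system redundancy role primary
    ! on the second VSM
    system redundancy role secondary
    ! then verify with
    show system redundancy status

Happy to be corrected by someone who's actually deployed it.)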
 
Thinking it through quickly, I can't see an advantage of having your storage presented via a dVSwitch. If anything, it is massively constraining because it puts loads of dependencies elsewhere.

We don't use dVSwitches (we're just careful with our vSwitch configurations) but if we did, there's no way in hell I'd run my storage over one.

I have two VLANs specifically for NFS, one carries Citrix VDI NFS traffic and the other carries VMWare NFS traffic. Each one has its own 20GbE ifgrp on our NetApps (one leg of each into a pair of 2Ks->5Ks) so the traffic from one can never bother the traffic for another. Normally our disks would run out of puff first but with so many deduped blocks being loaded into FlashCache you can't base a design on that sort of thing any more.
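
For anyone curious, the filer side of that is only a few lines of 7-Mode config per controller -- roughly this sort of thing (interface names, VLAN ID and address are examples, not our real ones):

    # sketch, as it would appear in /etc/rc
    ifgrp create lacp ifgrp0 -b ip e1a e1b
    vlan create ifgrp0 200
    ifconfig ifgrp0-200 192.168.200.10 netmask 255.255.255.0 partner ifgrp0-200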

On the VMWare side it's just a vmk port for the relevant NFS VLAN on each host. Probably adds 2 minutes to each ESXi host build. Sounds like dVSwitches have cost you many times that time already...
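
For reference, the per-host bit really is just a portgroup and a vmk interface -- something along these lines with esxcli on 5.x (vSwitch/portgroup names, VLAN ID and address are examples):

    # sketch - names, VLAN ID and address are examples
    esxcli network vswitch standard portgroup add --portgroup-name=NFS-VMware --vswitch-name=vSwitch1
    esxcli network vswitch standard portgroup set --portgroup-name=NFS-VMware --vlan-id=200
    esxcli network ip interface add --interface-name=vmk2 --portgroup-name=NFS-VMware
    esxcli network ip interface ipv4 set --interface-name=vmk2 --ipv4=192.168.200.21 --netmask=255.255.255.0 --type=static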
 
Quite a bit of FUD in this thread... dvSwitches aren't going to solve world hunger, but they aren't the devil, either. A user has an issue with what to me sounds like incompatible/unsupported cards -- this has nothing to do with whether dvSwitches are suitable or not.

What is reinforced by the various points of view (some valid, some misguided) is that virtual infrastructure requires careful planning, and it is critical to use supported hardware and software.
 
On the VMWare side it's just a vmk port for the relevant NFS VLAN on each host. Probably adds 2 minutes to each ESXi host build. Sounds like dVSwitches have cost you many times that time already...
dvSwitches clearly have benefits to someone, or no-one would use them. I see them -- like a lot of other things in my work -- as incremental benefits to our lives as sysadmins that, designed, implemented and managed properly, are "a good thing". In my case, I value the simplicity of managing a single entity across many hosts, rather than having to manage each host individually.

The original issue appears to be faulty/incompatible/unsupported cards that are causing the entire environment to go belly-up -- how is that the fault of dvSwitches? I'm not defending the use of dvSwitches, but this particular environment needs urgent attention, and it's not the dvSwitches that need the attention.
 
Running the L3 daughtercards in a Nexus 5500 severely cuts the throughput they're capable of and limits them in a number of ways (although those limitations are getting better all the time). I have some 5Ks "out there" with the Layer 3 cards in, where the 10GbE requirements were small enough that no predictable scaling up will exceed the L3 limitations, on the proviso that should that ever happen, we'd swap the L3 cards for L2 cards and punt the routing up a level to a 4500 or similar.

Nexus 2Ks are good but you can't "sprinkle them about" without careful consideration because they are not switches and there are some very important considerations to take into account when planning that environment out.

Where 5Ks really shine is when you get FC/FCoE and FabricPath involved. There are some really poor failure scenarios with vPC but past that, the possibilities are really great. I couldn't be happier with my 7K/5K/2Ks :)
Interesting feedback. In the environment where I had 5k/2k/1kv I honestly don't know if the 5ks had L2 or L3 line cards, so I can't comment. I haven't worked there for a year, so can't ask the Networks guys that set it up. We did plenty of benchmarking, and were blown away by the performance (e.g. vMotions were completing in less than 10 seconds).

Not sure about your comments re. not being able to "sprinkle them about" because they're not switches. What do you mean by them not being switches? Traffic between two hosts on a single 2000 on the same VLAN won't go up to the 5000 and back down, will it? Careful consideration is a given in any case, but otherwise I don't follow.

My view, as a non-networking professional, is that the Nexus architecture is a massive improvement on the old Catalyst 6500 system of a centralised giant switch with Cat5e running everywhere. I much prefer the "top of rack" approach to networking, to keep the cabling within the datacentre to a minimum. Put another way, the Nexus system is like a distributed 6500, with the 2000s as the 6500 blades and the 5000s as the 6500 supervisor blades, but located near the equipment they are connecting.
 
Quite a bit of FUD in this thread... dvSwitches aren't going to solve world hunger, but they aren't the devil, either. A user has an issue with what to me sounds like incompatible/unsupported cards -- this has nothing to do with whether dvSwitches are suitable or not.

What is reinforced by the various points of view (some valid, some misguided) is that virtual infrastructure requires careful planning, and it is critical to use supported hardware and software.

Not sure you're aiming that at me, but my opinion is based on running enterprise-class environments. I don't want my vCenter going belly up (for any reason) to cripple anything. I could nuke my vCenter box right now and perhaps with the sole exception of Veeam, none of my production systems would be impacted. This guy can't say the same thing (sadly).

dvSwitches clearly have benefits to someone, or no-one would use them. I see them -- like a lot of other things in my work -- as incremental benefits to our lives as sysadmins that, designed, implemented and managed properly, are "a good thing". In my case, I value the simplicity of managing a single entity across many hosts, rather than having to manage each host individually.

The original issue appears to be faulty/incompatible/unsupported cards that are causing the entire environment to go belly-up -- how is that the fault of dvSwitches? I'm not defending the use of dvSwitches, but this particular environment needs urgent attention, and it's not the dvSwitches that need the attention.

How a CNA failure could cause such cluster-wide failure is the point. That failure should be confined to a single host (unless he is unlucky, of course) which HA should be able to tolerate.

I don't know how you've taken the meaning of my posts, but I'm not saying he (or anyone) should abandon dvSwitches; I'm saying that there are things you can do with them that you shouldn't. I'd categorise management access, vCenter and storage traffic as things you shouldn't put on them (at the moment).

Interesting feedback. In the environment where I had 5k/2k/1kv I honestly don't know if the 5ks had L2 or L3 line cards, so I can't comment. I haven't worked there for a year, so can't ask the Networks guys that set it up. We did plenty of benchmarking, and were blown away by the performance (e.g. vMotions were completing in less than 10 seconds).

Not sure about your comments re. not being able to "sprinkle them about" because they're not switches. What do you mean by them not being switches? Traffic between two hosts on a single 2000 on the same VLAN won't go up to the 5000 and back down, will it? Careful consideration is a given in any case, but otherwise I don't follow.

My view, as a non-networking professional, is that the Nexus architecture is a massive improvement on the old Catalyst 6500 system of a centralised giant switch with Cat5e running everywhere. I much prefer the "top of rack" approach to networking, to keep the cabling within the datacentre to a minimum. Put another way, the Nexus system is like a distributed 6500, with the 2000s as the 6500 blades and the 5000s as the 6500 supervisor blades, but located near the equipment they are connecting.

Traffic between two ports on a 2K absolutely goes via the parent 5K. You configure their ports on the 5K as if they were line cards, which is essentially what they are: remote line cards.
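
To give a flavour, hooking a 2K up to its parent 5K is roughly this (FEX number and interfaces are examples):

    ! sketch - FEX ID and interfaces are examples
    feature fex
    fex 101
      description rack-A-top
    interface ethernet 1/1
      switchport mode fex-fabric
      fex associate 101
    ! the 2K's host ports then appear on the 5K as ethernet 101/1/x, e.g.
    interface ethernet 101/1/1
      switchport access vlan 20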

The L2/L3 performance thing is on the Cisco site somewhere. I'll see if I can find it later. You're limited by how many 2Ks you can connect as well, dropping from 24 to 16 (used to be lower in previous software releases I think). There are other things like having to configure everything twice (which you can do with some config sharing stuff but I'm not entirely certain about that).

What you end up with is a massive bulk of fibre spanning your aisles, hooking back up to 5Ks, which in turn uplink to the 7Ks or whatever is doing your L3 stuff. Probably not a 6500 these days as they are in the Borderless Networks space, not the DC space. Regardless, the effective lack of in-unit switching is almost a crippling limitation of the 2Ks. It requires you to at least double the uplink bandwidth you were planning for your switches if your layout dictates a lot of in-rack traffic. It is this sort of thing that takes the Nexus range up a level in terms of planning and down a peg in terms of scale, especially if you're comparing it to end of row arrangements where backplane bandwidth figures are enormous.
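
To put rough numbers on that last point, taking a 2232PP as an example:

    32 x 10G host-facing ports = 320 Gbps
    8 x 10G fabric uplinks     =  80 Gbps (4:1 oversubscribed before you start)

...and because there's no local switching, traffic between two servers in the same rack crosses those uplinks twice (up to the 5K and back down), so heavy in-rack traffic eats into that 80G quickly.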

ToR vs EoR is a big old argument...
 
Thinking it through quickly, I can't see an advantage of having your storage presented via a dVSwitch. If anything, it is massively constraining because it puts loads of dependencies elsewhere.

Load balance by physical NIC load? NIOC? We've done it mainly because VMWare Network Best Practices say that's the way to go. dvSwitches have been out for a couple of versions now so should be fairly stable? :)

We've had an amazing run of bad luck with our VMWare project. This week an HP NIC overheated (literally - there was a note in the logs from the driver saying it had hit 100°C) and shut itself down :eek:

Not a massive outage fortunately, but we're now planning to have two different makes of card in each host so one card overheating doesn't kill the driver for the other... :(
 
From doing some more research last night, it does look like VMWare have revised dvSwitches for the better (and, admittedly, my opinion of them was formed when they were newer). I'm still not trusting them though, especially when the need to run a hybrid setup for management (which doesn't appear to be strictly true but I'd need serious convincing) means I'm running more cables to the hosts, which means more gigabit ports in the DC - and I've just removed almost all of them.

Best practices are great (and I do try and run best practice where possible) but in this case you seem to have been bitten by them, something I'm pretty confident wouldn't hurt my environment majorly.

NIOC is interesting, but IP hashing is doing a sterling job for me and I'm nowhere near the limits of my LAN performance per host at the moment. I guess if/when things get busier it might become the thing that forces my hand. We'll see.
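
For anyone following along, IP hash on a standard vSwitch is one esxcli command per host plus a static EtherChannel ("mode on", not LACP -- standard vSwitches don't speak LACP) on the physical switch. A sketch, with made-up names:

    # ESXi side
    esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --load-balancing=iphash

    ! Cisco side - static channel, not LACP
    interface port-channel 10
      switchport mode trunk
    interface ethernet 1/11
      channel-group 10 mode on
    interface ethernet 1/12
      channel-group 10 mode on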

I really hope your luck improves :)
 
We've had an amazing run of bad luck with our VMWare project. This week an HP NIC overheated (literally - there was a note in the logs from the driver saying it had hit 100°C) and shut itself down :eek:

Not a massive outage fortunately, but we're now planning to have two different makes of card in each host so one card overheating doesn't kill the driver for the other... :(
I work with hundreds of HP servers, and have never seen a NIC overheat, so I would suggest it is a one-off hardware malfunction (could be the thermal sensor is faulty for all you know). It is far better to operate with all HP components, as they are tested and supported together, e.g. you can boot off a single ISO and update every single firmware on the system, drivers are released in tested bundles, you will only need a single ESXi driver bundle (the one from HP), etc.
 
I work with hundreds of HP servers, and have never seen a NIC overheat, so I would suggest it is a one-off hardware malfunction (could be the thermal sensor is faulty for all you know). It is far better to operate with all HP components, as they are tested and supported together, e.g. you can boot off a single ISO and update every single firmware on the system, drivers are released in tested bundles, you will only need a single ESXi driver bundle (the one from HP), etc.

Not sure I've ever seen a NIC overheat no matter what the brand :eek:
 
Not sure you're aiming that at me, but my opinion is based on running enterprise-class environments. I don't want my vCenter going belly up (for any reason) to cripple anything. I could nuke my vCenter box right now and perhaps with the sole exception of Veeam, none of my production systems would be impacted. This guy can't say the same thing (sadly).
Definitely wasn't aimed at anyone in particular (I didn't pay attention to who the authors of the various posts were) -- it was just a reaction to what seemed like a bunch of negative posts towards dvSwitches, which in my experience (also enterprise) have been fine.

I'm a big fan of simplicity; multiple NICs (not even possible past a certain point with blades), multiple switches to configure and look after, more cabling to pay for and maintain, etc. are all targets for simplification. Not saying that I'm taking anything to the extreme, but in my experience (running storage over dedicated Fibre -- I've never used iSCSI outside of dev/test), dvSwitches have never been a problem.
 