Infiniband antics.

Associate
Joined
10 Nov 2004
Posts
2,237
Location
Expat in Singapore
As some of you who have been reading Shad's fibre channel setup thread may be aware, I have been looking at Infiniband rather than FC for a storage network.


There are a number of reasons for this:
  • Equipment is plentiful on the second-hand market.
  • Equipment is fairly cheap if you go a couple of generations back.
  • Switches are generally cheaper than 10GbE on the second-hand market.
  • Cables can be had for a decent price if you are lucky and hunt around.
Downsides:
  • Older InfiniHost cards need onboard RAM to work with Solaris (from what I have read).
  • Cables can be big and difficult to easily manage.
  • Drivers / standards are not universal across OS's.
My current setup:
4x Mellanox ConnectX DDR (20Gbps) infiniband HCAs (Used - MHGH28-XTC): US$75 each.
1x Mellanox ConnectX-2 QDR card for one of my Dell C6100 server nodes (Used - Mez card): US$199.
1x Flextronics F-X430046 24-Port 4x DDR Infiniband Switch (used): US$200 (from a friend).
4x 8mtr CX4 -> CX4 DDR Infiniband cables (new): US$15 (caught a fantastic deal as these are usually over US$200 new but only had 8mtr available).
1x 2mtr QSFP -> CX4 QDR Infiniband cable (new): US$105

Item Total (approx): US$864 + shipping (US to Singapore where I currently am).

The current setup looks like this...
20130411_200350.jpg


This is the network the Infiniband equipment will be part of (planned completed setup).
ProposedNetwork_zps9dc66395.jpg


RB
 
Associate
OP
Joined
10 Nov 2004
Posts
2,237
Location
Expat in Singapore
Infiniband.

Infiniband (IB) is primarily a storage fabric. It is designed to provide large bandwidth and low latency connectivity between SAN / NAS boxes and servers. It comes in a number of speed grades (the faster ones cost big chunks of cash) and uses 8b/10b encoding, so for every 10 bits sent, 8 bits are actual data.

SDR: Single data rate - 10Gbps (8Gbps data).
DDR: Double data rate - 20Gbps (16Gbps data).
QDR: Quad data rate - 40Gbps (32Gbps data).

My equipment is DDR, so it can manage an upper limit of 16Gbps, or 2GB/s (sender and receiver storage systems allowing).
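To put numbers on that figure: 20Gbps on the wire x 8/10 for the encoding = 16Gbps of actual data, and 16Gbps / 8 bits per byte = 2GB/s, before any protocol overhead on top.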

The Infiniband stack is laid out rather like the OSI / TCP/IP layers and consists of the following:

User Space
Application level: OpenSM(*1) / Applications making use of the User API or ULPs
User API Level: User level Verbs / uDAPL

Kernel Space
Upper Layer Protocols: IPoIB (CM/UD), EoIB, SDP, SRP, iSER, FCoIB, RDS, rNFS, Lustre LND
Mid Layer: Infiniband Transport Framework, SA Client, MAD, SMA, Communication Manager
Drivers Layer: (OS specific)
Hardware: ConnectX, ConnectX-2 etc.
(taken from an Oracle Solaris Infiniband Stack diagram - Will try to link but the original Oracle link is dead)

(*1): An Infiniband network requires a Subnet Manager to be running. This can be built into a switch (model dependent) or run on a node on the network. The subnet manager polls the network and maintains the routing tables. A free subnet manager, OpenSM, is included in the OpenFabrics Alliance (OFA) driver packages.

The Infiniband fabric is the connection between the HCAs (Host Channel Adapters), with or without one or more switches / routers etc. This creates the road for the traffic but needs to be paired with one or more protocols to enable data to flow. The protocols can be thought of as cars of different types, with the data as the passengers. More than one protocol can be used on the IB fabric at any one time.

By far the most written about is IPoIB, which layers the Internet Protocol over the IB fabric. This is fairly easy to configure and is reasonably well documented, but it is limited and tends not to be able to keep up with anything over SDR. A step up from that is SRP (SCSI RDMA Protocol), which allows one node to write directly to memory on another node and removes the overhead of running an IP stack. This was my chosen path.
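For anyone who does want to start with IPoIB on a Linux box, the basic bring-up is only a couple of commands. This is just a sketch; the interface name and addresses are examples and package / module availability will vary by distribution.

# Load the IPoIB module; the HCA then shows up as a normal network interface (ib0)
modprobe ib_ipoib
ifconfig ib0 192.168.10.1 netmask 255.255.255.0 up

# Optional: switch from datagram to connected mode to allow a much larger MTU
echo connected > /sys/class/net/ib0/mode
ifconfig ib0 mtu 65520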

There are also other protocols coming up including iSER (iSCSI Extensions for RDMA - RDMA is Remote Direct Memory Access), NFS over RDMA and RDMA over Converged Ethernet (RoCE).

I have chosen SRP as it is one of the older protocols and can provide better speeds than IPoIB, making the most of my DDR hardware. Mellanox, a leading provider of IB equipment and a partner in the OpenFabrics Alliance, also recently released an ESXi 5.1 driver that includes SRP.

Now on to my own experiences trying to get the theory working.
 
Associate
OP
Joined
10 Nov 2004
Posts
2,237
Location
Expat in Singapore
From the initial diagram, you can see that I am trying to connect 3 core operating systems.

Windows Server: Domain Controller, Backup Server, DNS / DHCP.
ESXi 5.1: Test servers.
Solaris 11.1 (or like systems): SAN with ZFS.

I found, to my dismay, there are a number of issues with the driver packages for the different OS's.

Windows 2012.
  • OpenSM: Yes
  • SRP: No
Windows 2008r2.
  • OpenSM: Yes
  • SRP: No (Mellanox), Yes (OFA)
VMware ESXi 5.1.
  • OpenSM: No
  • SRP: Yes
Solaris 11.1.
  • OpenSM: No
  • SRP: Yes
Linux (CentOS).
  • OpenSM: Yes (Redhat IB package)
  • SRP: Yes (unconfirmed)

Added to this, the OFA Windows 2008r2 SRP drivers seem not to like working with the Solaris SRP targets.

So, the bottom line is that you need either a switch with a built-in subnet manager or a Windows / Linux node running one. There are reports of someone having compiled a subnet manager for Solaris if you want to go down that route, but that is not a direction I am looking to take.
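One way to tick that box is OpenSM on a Linux node. Roughly, on CentOS 6 (package and service names are from the Red Hat "Infiniband Support" group, so check your own distribution):

# Install the Infiniband support packages plus OpenSM and the diagnostic tools
yum groupinstall "Infiniband Support"
yum install opensm infiniband-diags

# Start the subnet manager now and on every boot
service opensm start
chkconfig opensm on

# The local HCA port should move to the Active state once the SM has swept the fabric
ibstat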

For my storage box I chose Solaris 11.1 as I wanted to get more involved with Solaris from an admin point of view, and it has ZFS and Infiniband as standard. The Infiniband (SRP) target is set up via the COMSTAR framework and is very simple following the Solaris documentation.
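Very roughly, and working from memory of the Oracle docs rather than my exact commands (the pool / volume names below are just examples), presenting an SRP target from Solaris 11.1 looks something like this:

# Enable the COMSTAR framework and the SRP target service
svcadm enable stmf
svcadm enable -r ibsrp/target

# Create a zvol to use as backing storage (pool name and size are examples)
zfs create -V 500G tank/esxi-lun0

# Register the zvol as a SCSI logical unit and make it visible to initiators
stmfadm create-lu /dev/zvol/rdsk/tank/esxi-lun0
stmfadm add-view <GUID shown by stmfadm list-lu -v>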

Next I set up ESXi 5.1, and for this you need to download the Mellanox driver package from their drivers page here. The package needs to be sftp'd to the ESXi server and then you need to run the installer from the command line on the server. For this you need to enable SSH access (see here).

The command to install the vib package is...
esxcli software vib install -d "/vmfs/volumes/Datastore/DirectoryName/PatchName.zip"

It is also possible to install straight from a HTTP location using...
esxcli software vib install -v viburl

Viburl is the full http path to the vib file. I have not tried this method myself so do not know how well it works.
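Whichever method you use, it is worth checking the package actually registered; the driver generally needs a reboot before it loads.

esxcli software vib list | grep -i mellanox
reboot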

I used the first method but found that you need a local disk for this, as the ESXi system image does not have enough space for the patch. When you add that first datastore (for temporary patching), ESXi starts writing some of its log files to it, which makes it a pain to remove later. I had to set up my IB controller, mount the IB storage, copy the folders across to it and then change the log file path in the ESXi configuration to point to the new location before I could unmount the local drive.

Once the vib was installed, my Mellanox controller appeared under both network adapters and storage adapters. No storage appeared though, which I later tracked down to not having a subnet manager running. Once this was sorted out (on a Windows server with an IB card) the storage just appeared and could be used. I have a CentOS VM running on the IB storage and it seems much faster than running locally. I am not aware of any benchmarking tools that work between ESXi and Solaris, but I can try between CentOS (VM) and Solaris at some point.
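For anyone else hitting the same log file issue, the syslog directory can also be moved from the command line; something along these lines (the datastore path is just an example from my setup):

# Point the ESXi logs at a directory on the IB-backed datastore, then reload syslog
esxcli system syslog config set --logdir=/vmfs/volumes/IB_Datastore/logs
esxcli system syslog reload

The scratch location can be moved in a similar way via the ScratchConfig.ConfiguredScratchLocation advanced setting.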

Windows installs now seem to be pretty much limited to IPoIB if you want to connect to other OS's. The driver install is fairly easy and OpenSM installs as a service. There is an SRP miniport driver as part of the OFA OFED Windows 2008r2 driver package but, as yet, I have not been able to get it up and running with a Solaris SRP target.

Linux (CentOS) can run both Infiniband and OpenSM very easily, but the problems start when trying to get the SRP protocol and targets running. Even following fairly thorough guides like this one, you still need to rebuild the kernel, build various packages and then try to get everything running. I have even tried building the latest kernel on CentOS, which is listed as having ib_srp support, but still no go. ILO provided no more help either as it does not seem to have the ib_srpt module included.
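For what it is worth, this is roughly how I was checking what the kernel actually ships with; if modinfo cannot find ib_srpt then you are into kernel building territory.

# Is the SRP initiator / target code available in the running kernel?
modinfo ib_srp
modinfo ib_srpt

# Load the initiator side and confirm it is there
modprobe ib_srp
lsmod | grep srp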

So far, Solaris has been the easiest to set up and get working with another OS (ESXi). Windows seems to have no SRP drivers that work with Solaris SRP targets. ESXi is also super easy to get up and running, but neither Solaris nor ESXi has a subnet manager, which means you need a Windows or Linux machine with an IB HCA on the network. I did try passing the IB card through to my Windows 2012 Essentials server (VM) but it failed to start, reporting passthrough device issues. I have not patched that ESXi host though, so the passthrough issues may be resolved now. I will patch this week and try again.

I will continue to update as I learn more and this information is purely from my own reading, internet guides and experiences. If any of it is felt to be inaccurate then let me know and I shall amend.

RB
 
Associate
Joined
1 Dec 2005
Posts
803
Antics indeed! Nice write up though, thanks for providing the details. Hope you get it running how you want :)
 

DRZ

Soldato
Joined
2 Jun 2003
Posts
7,419
Location
In the top 1%
Out of interest, are you playing with this just out of pure geek curiosity or are you trying to learn storage to advance your career?

I only ask because I find it unlikely I'll be coming across IB in my career and while it is interesting, I'd rather be doing something that was both interesting and constructive.

Excellent write-up though! :)
 
Associate
OP
Joined
10 Nov 2004
Posts
2,237
Location
Expat in Singapore
Out of interest, are you playing with this just out of pure geek curiosity or are you trying to learn storage to advance your career?

I only ask because I find it unlikely I'll be coming across IB in my career and while it is interesting, I'd rather be doing something that was both interesting and constructive.

Excellent write-up though! :)

Interest and best bang for buck. I would tend to agree that IB has very select penetration in the enterprise environment and FC would probably be a better option for career advancement but, considering the second-hand market, IB can potentially give better bandwidth for a lower price. Looking at using IB as a server / cluster messaging medium could also be interesting but, again, that is quite a niche market. Picking up a bit of Solaris storage admin, ZFS etc. is just a bonus. The HCAs do have dual ports though, so multipathing is an option.

Paired with an IB switch and QDR IB mez cards, the Dell C6100 will provide an all-in-one, 2U, 3-node ESXi cluster with dedicated fast SAN storage. That is where I am heading with this experiment.

RB
 
Associate
OP
Joined
10 Nov 2004
Posts
2,237
Location
Expat in Singapore
Antics indeed! Nice write up though, thanks for providing the details. Hope you get it running how you want :)

[RXP]Andy said:
Wow. That's a pretty epic few posts, I will be following this thread with interest.

Thanks guys.

I had the info and was hoping to do a guide, but the software / driver issues across OS's mean I don't yet have a final solution, so I decided to put the info down regardless and continue from there.

RB
 
Associate
OP
Joined
10 Nov 2004
Posts
2,237
Location
Expat in Singapore
Trying to leverage branded server hardware can be an ugly experience :).

Turns out that two of my Infiniband cards are v.A1 and the other two are v.A2. The A2s would not flash for me using the instructions. Others have reported it working for them, so it may be hardware dependent (i.e. motherboard / PCIe slot etc.). It turns out that this may not be a bad thing.
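For anyone curious about what the flashing actually involves, it is done with the Mellanox firmware tools (MFT). A rough outline only; the device name and firmware image below are examples from my cards and will differ on yours.

# Start the Mellanox tools and find the device handle for the card
mst start
mst status

# Check the firmware currently on the card
flint -d /dev/mst/mt25418_pci_cr0 query

# Burn the new image (firmware file name is a placeholder)
flint -d /dev/mst/mt25418_pci_cr0 -i fw-25408-new.bin burn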

The cards are arranged like so (ports are the switch port numbers).
Port 1 - Solaris 11.1 - ConnectX-2 (SAN)
Port 2 - ESXi 5.1: ConnectX A1 (flashed)
Port 3 - ESXi 5.1: ConnectX A2 (unflashed)
Port 4 - CentOS 6.4 (OpenSM): ConnectX A2 (unflashed)
Port 5 - ESXi 5.1: ConnectX A1 (flashed)

The server on port 2 could see the targets presented by the server on port 1 but, after changing the card to an A1 (I suspect it was previously an A2), it cannot.

The server on port 5 has never been able to see the targets made available by the server on port 1. It is also the server where bare metal Windows was installed and I could not get the Windows SRP target working.

On investigation I happened to look at the OpenSM log files and it was reporting IB_Timeouts on port 2 (server on port 5 was turned off). The error stated there could be an issue with the mkeys on the HCA.

sm_mad_ctrl_send_err_cb: ERR 3120 Timeout while getting attribute 0x15 (PortInfo); Possible mis-set mkey?
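A couple of bits that helped with tracking this down, run from the OpenSM (CentOS) node; nothing clever, just the standard tools from the infiniband-diags package and the OpenSM log:

# Port and link state as seen from this node and across the fabric
ibstat
iblinkinfo

# OpenSM's own log, where the timeout / mkey errors above were appearing
less /var/log/opensm.log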

I then changed the cards around.
Port 1 - Solaris 11.1 - ConnectX-2 (SAN)
Port 2 - ESXi 5.1: ConnectX A2 (unflashed)
Port 3 - ESXi 5.1: none
Port 4 - CentOS 6.4 (OpenSM): ConnectX A1 (flashed)
Port 5 - ESXi 5.1: ConnectX A2 (unflashed)

Now both the servers on ports 2 and 5 can see the targets presented by the server on port 1 and there are no errors in the OpenSM logs. The fact it is running on an A1 card seems to make no difference. I suspect that if I tried to mount the targets it may well fail though.

The lesson is to make sure you get the A2 or newer revision of the MHGH28-XTC cards.

Interestingly, the port 5 server's ESXi install did not see the A2 card after swapping out the A1 card (which it did see), and I had to remove and re-install the Mellanox vib for it to appear, which is a bit of a pain. There may be an easier way to get it to 'refresh', but this seemed a fairly good bet so I went that way.
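The remove / re-install itself is straightforward at least; the only catch is that you need the exact vib name, and a reboot in between:

# Find the installed Mellanox vib name, remove it, then install the package again
esxcli software vib list | grep -i mellanox
esxcli software vib remove -n <vib name from the list above>
esxcli software vib install -d "/vmfs/volumes/Datastore/DirectoryName/PatchName.zip"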

The target appeared on the port 5 server and, more surprisingly, the datastore also appeared without needing to import it (it is also mounted on the port 2 server). Now I should have the ability to start VMs stored on that datastore from two servers (haven't tried it yet though ;) and trying it at the same time would probably be a bad thing :D).

I also found that the A1 cards will not work with ESXi 5.1 passthrough, even after applying the latest patch to ESXi. After doing all the passthrough, reboot and assign-to-VM steps, the VM errors on boot and won't start. No PSOD thankfully, but still not usable this way. I have not tried the A2 cards but expect the same issue.

I would need to reinstall Windows SBS 2011 Essentials bare metal on the server in order to try out the Windows SRP function. After all the Windows server reinstalls and domain leaving / rejoining I have inflicted on my wife and kids, I am fairly loath to go through it all again.

I may pick up another ConnectX-2 card or two as the Dell C6100 mez cards are going for around US$150 each now which is a steal as standard PCIe ConnectX-2 cards are around US$250+ on the second hand market. Cables could double the price though :(.

RB
 
Associate
OP
Joined
10 Nov 2004
Posts
2,237
Location
Expat in Singapore
Well some of the mystery concerning the timeouts seems to be resolved.

It seems the latest ConnectX firmware does not play nice with the ESXi drivers. The card is seen but the link cannot be brought up.

Reverting to the 2.7 firmware corrects the issue.

It is also worth noting that there seems to be no more development going on for the ConnectX cards as Mellanox is concentrating on the ConnectX-2 and -3 cards now.

I have a ConnectX-2 card on its way to me for my Windows server to see if I can get NFSoIB up and running. Firmware 3.0 should be out soon; I understand it is with various parties at the moment for user testing.

Just as a quick, very unscientific test, I ran hdparm -t on the following setup:
Source: Solaris 11.1 -> 4x1.5TB raidZ with Intel 520 120GB ZIL and OCZ Agility 3 L2ARC (server has 48GB ram).
Fabric: Infiniband DDR (20Gbps) SRP.
Destination: ESXi 5.1 (free hypervisor) with the Mellanox IB vib - 1x 3.7TB VMFS datastore containing 1x Linux (CentOS 6.4) VM and 1x Windows (2012 Ess) VM (both running at the same time from different VHosts, but not running any user-instigated tasks).

I created a new 10GB virtual hard drive (VMDK) and added it to the CentOS VM. I then ran hdparm -t /dev/sdb
Read results came out at around 950MB/s

Doing the same on the boot disk (/dev/sda) produced much worse results, but this really was more for fun than a serious benchmark.
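If I get the time, something a bit more repeatable from inside the VM would be fio (available from EPEL on CentOS) against the same virtual disk. A rough example, using the device from the test above, with direct I/O to keep the guest page cache out of the numbers:

# Sequential 1M reads against the second virtual disk for 60 seconds
fio --name=seqread --filename=/dev/sdb --rw=read --bs=1M --direct=1 --ioengine=libaio --iodepth=32 --runtime=60 --time_based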

If anyone has any decent ideas on how best to benchmark between ESXi and Solaris on a storage fabric (i.e. not a network benchmark) then please let me know.

I may try Crystalmark on my Win 2012 VM based on Shad's and Andy's results in the FC thread here.

RB
 
Associate
Joined
1 Dec 2005
Posts
803
Can't wait to see what IOPs that lot is capable of :)

I feel your pain though with the whole ConnectX thing and what is/isn't supported now or moving forward. They seem to move through hardware versions quite quickly. My 10GbE cards are ConnectX-2 thankfully and work well in ESXi, provided you only want to use one port...
 
Associate
OP
Joined
10 Nov 2004
Posts
2,237
Location
Expat in Singapore
Yeah, it has been a little confusing, what with the series (ConnectX-2, -3) not being retired but only the first-gen cards no longer being developed.


I have just been asked to put together an all in one 4 node solution around the Dell C6100 cloud server, CentOS, Infiniband and Hadoop. Oh I suspect I have interesting days ahead.


RB
 
Associate
OP
Joined
10 Nov 2004
Posts
2,237
Location
Expat in Singapore
Just a little update...

Someone has compiled OpenSM (the subnet manager required for Infiniband) for ESXi 5.1 and packaged it up as a vib. It seems to work very well so far.

Take a look here.

There were some ConnectX-2 cards (MHRH2A-XSR) going on US eBay for US$65 each and cables for around US$20.

So that is two cards and a cable for a basic setup, coming in at around US$150 + shipping, for up to 40Gbps Infiniband connectivity or 10GbE.

Splash out on the FDR cards going for around US$260 each and you can potentially get 56Gbps IB or 40Gbps Ethernet over IB.
 
Associate
Joined
8 Dec 2018
Posts
3
I hope RimBlock is still here. I know this is a super old post that I'm bringing back TTT!!! Please don't ban me for it, but it's my question and I have googled for days..

So I too am doing an InfiniBand setup, as I have been given the HW for free.. so why not.. hah..

Have you got any further with your setup? Did you get the ConnectX cards to work?

I'll start my own post...
 