10GbE poor performance in Windows

I've run some basic testing on the three machines that are going to be linked together via 10GbE and found some slightly strange performance issues.

This is the kit that is linked together:

Switch - Juniper EX3300 with Juniper SR transceivers, no VLANs or L3 features enabled, 1500 MTU
PC1 - Windows 10 1803 - Threadripper 1920X / Intel X710 with Intel SR transceiver, Tx/Rx buffers adjusted to 4096
PC2 - Windows 10 1803 - i7 4820K / Intel X710 with Intel SR transceiver, Tx/Rx buffers adjusted to 4096
NAS - Fedora 30 - i5 9600K / Solarflare SFN7122F with Avago SR transceiver, no driver adjustments

Running the iperf server on PC1 with the NAS as client nets around 7.5Gbps. Running the iperf server on the NAS with PC1 as client nets ~3Gbps. Running iperf between PC1 and PC2 results in ~1.5Gbps no matter what I try.
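For reference, the iperf runs themselves are nothing exotic; roughly the following, with the address being whichever machine is acting as server at the time (the IP here is just an example from my subnet):
Code:
# on the machine acting as server
iperf3 -s

# on the client, pointing at the server's address, 30 second run
iperf3 -c 192.168.0.160 -t 30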

I'm not seeing any dropped packets or errors in the switch monitoring for the 10gig ports. I've tried disabling flow control and interrupt moderation in the Intel driver and that makes zero difference. Task Manager doesn't show the CPU getting hammered either, so I'm not sure what's going on here.
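For anyone wondering, the driver tweaks were done in PowerShell along these lines. The DisplayName strings depend on the Intel driver version, so the exact wording may differ (check Get-NetAdapterAdvancedProperty first); the adapter name here matches mine:
Code:
# see which advanced properties the driver actually exposes
Get-NetAdapterAdvancedProperty -Name "10Gbe 1"

# Tx/Rx buffers up to 4096
Set-NetAdapterAdvancedProperty -Name "10Gbe 1" -DisplayName "Receive Buffers" -DisplayValue "4096"
Set-NetAdapterAdvancedProperty -Name "10Gbe 1" -DisplayName "Transmit Buffers" -DisplayValue "4096"

# flow control and interrupt moderation off
Set-NetAdapterAdvancedProperty -Name "10Gbe 1" -DisplayName "Flow Control" -DisplayValue "Disabled"
Set-NetAdapterAdvancedProperty -Name "10Gbe 1" -DisplayName "Interrupt Moderation" -DisplayValue "Disabled"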

Since there is a suggestion that iperf isn't well optimised for Windows, I gave NTttcp a go as well.

Code:
PS I:\Downloads\NTtcp> ./ntttcp.exe -s -m 8,*,192.168.0.160 -l 128k -a 2 -t 15
Copyright Version 5.33
Network activity progressing...


Thread  Time(s) Throughput(KB/s) Avg B / Compl
======  ======= ================ =============
     0   15.016        95863.612    131072.000
     1   15.454        34596.609    131072.000
     2   15.047        43018.276    131072.000
     3   14.750        13797.966    131072.000
     4   15.016        56984.550    131072.000
     5   15.016        55688.865    131072.000
     6   15.016        54248.269    131072.000
     7   15.016        56686.201    131072.000


#####  Totals:  #####


   Bytes(MEG)    realtime(s) Avg Frame Size Throughput(MB/s)
================ =========== ============== ================
     6037.750000      15.015       1460.054          402.115


Throughput(Buffers/s) Cycles/Byte       Buffers
===================== =========== =============
             3216.916       2.353     48302.000


DPCs(count/s) Pkts(num/DPC)   Intr(count/s) Pkts(num/intr)
============= ============= =============== ==============
    28524.875         0.627       52573.427          0.340


Packets Sent Packets Received Retransmits Errors Avg. CPU %
============ ================ =========== ====== ==========
     4336168           268517           0      0      1.184


In order to rule out the network infrastructure itself, I've also tried running iperf on the local machine against the loopback address, which sees an average of 4.5Gbit/s. Setting the TCP window size to 2048000 results in some heavy variability, but the best I've seen is 8.9Gbit/s and the average is around 6.5. This suggests to me that it's a Windows issue.
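The loopback test is just the usual client/server pair on the same box, something like this (the window size is given to iperf3 with a suffix rather than raw bytes):
Code:
# server and client on the same machine
iperf3 -s
iperf3 -c 127.0.0.1 -t 30

# same again with a larger TCP window
iperf3 -c 127.0.0.1 -t 30 -w 2M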


Doing the same thing on the 9600K Linux server results in something like what I'd expect.


Running similar tests with NTttcp gives the same results.
Code:
PS I:\Downloads\NTtcp> ./ntttcp.exe -r -m 8,*,192.168.0.2 -l 128k -a 2 -t 15
Copyright Version 5.33
Network activity progressing...


Thread  Time(s) Throughput(KB/s) Avg B / Compl
======  ======= ================ =============
     0   15.000        76458.280     48480.812
     1   15.000        76458.755     48927.487
     2   14.999        76464.043     53761.016
     3   14.999        76463.758     48379.198
     4   15.000        76459.896     54396.665
     5   15.001        76454.229     64123.136
     6   15.000        76458.660     55378.178
     7   15.000        76458.755     51974.087


#####  Totals:  #####


   Bytes(MEG)    realtime(s) Avg Frame Size Throughput(MB/s)
================ =========== ============== ================
     8960.028477      15.000       1359.733          597.335


Throughput(Buffers/s) Cycles/Byte       Buffers
===================== =========== =============
             4778.682       9.155     71680.228


DPCs(count/s) Pkts(num/DPC)   Intr(count/s) Pkts(num/intr)
============= ============= =============== ==============
     1103.533       417.425       30830.067         14.941


Packets Sent Packets Received Retransmits Errors Avg. CPU %
============ ================ =========== ====== ==========
     6909840          6909643           0      0      6.840
PS I:\Downloads\NTtcp> ./ntttcp.exe -r -m 24,*,192.168.0.2 -l 256k -a 2 -t 15
Copyright Version 5.33
Network activity progressing...


Thread  Time(s) Throughput(KB/s) Avg B / Compl
======  ======= ================ =============
     0   15.000        24166.232    112041.449
     1   15.001        24164.716    107655.099
     2   14.993        24161.063    104284.717
     3   15.001        24164.716    111003.224
     4   15.008        24169.690    111544.877
     5   14.991        24164.381    104051.108
     6   14.994        24158.976    102780.116
     7   15.002        24163.105    105483.029
     8   14.999        24168.698    108318.197
     9   15.001        24164.716    112688.154
    10   14.992        24162.865    108146.840
    11   14.999        24167.938    117802.215
    12   15.000        24166.327    110474.637
    13   15.000        24166.327    108251.613
    14   15.001        24164.811    105483.444
    15   14.993        24160.778    102017.701
    16   14.994        24160.973    107432.407
    17   15.001        24164.716    105393.180
    18   15.003        24161.495    103224.355
    19   15.001        24163.955    105901.027
    20   15.001        24164.716    108409.690
    21   14.993        24162.394    102930.405
    22   14.992        24162.770    106807.429
    23   15.001        24163.480    109459.098


#####  Totals:  #####


   Bytes(MEG)    realtime(s) Avg Frame Size Throughput(MB/s)
================ =========== ============== ================
     8494.292297      15.000       1309.137          566.286


Throughput(Buffers/s) Cycles/Byte       Buffers
===================== =========== =============
             2265.145       7.807     33977.169


DPCs(count/s) Pkts(num/DPC)   Intr(count/s) Pkts(num/intr)
============= ============= =============== ==============
     1336.133       339.470      404661.400          1.121


Packets Sent Packets Received Retransmits Errors Avg. CPU %
============ ================ =========== ====== ==========
     6803737          6803653           0      0      5.530
PS I:\Downloads\NTtcp> ./ntttcp.exe -r -m 12,*,192.168.0.2 -l 32M -a 2 -t 15
Copyright Version 5.33
Network activity progressing...


Thread  Time(s) Throughput(KB/s) Avg B / Compl
======  ======= ================ =============
     0   14.877        61672.682  22369633.333
     1   14.880        61660.248  22369633.333
     2   14.880        61660.248  22369633.333
     3   14.882        61651.962  22369633.333
     4   14.895        61598.153  18067780.769
     5   14.883        61647.819  22369633.333
     6   14.893        61606.425  22369633.333
     7   14.888        61627.115  22369633.333
     8   14.885        61639.536  22369633.333
     9   14.897        61589.883  22369633.333
    10   14.880        61660.248  22369633.333
    11   14.883        61647.819  22369633.333


#####  Totals:  #####


   Bytes(MEG)    realtime(s) Avg Frame Size Throughput(MB/s)
================ =========== ============== ================
    10752.005768      15.000       1381.202          716.800


Throughput(Buffers/s) Cycles/Byte       Buffers
===================== =========== =============
               22.400       5.567       336.000


Running the same test on the Threadripper system in a live boot of Fedora 30 and all is well, so it definitely seems to be a Windows issue.



The question is, what's wrong in Windows? I'm seeing the same problem on both my Threadripper system and my X79 system, but I have no trouble at all if I use Linux instead.
 
What baffles me is that Windows itself isn't necessarily the issue, as I've seen plenty of other people have no trouble achieving near 10Gbit/s transfers using Windows 10 with similar or even older hardware. The performance I'm seeing on my Threadripper system isn't actually too bad, but it's not great either. The X79 system, on the other hand, is frankly awful; that thing can barely push past gigabit in Windows. I'm suspecting the OS on that system might be "borked", as I have a feeling it was upgraded from an FX-8320/990FX build without reinstalling.

Showing the network adapter info in PowerShell makes me suspicious of the reported PCIe link width. This is the Threadripper system:
Code:
PS C:\WINDOWS\system32> Get-NetAdapterHardwareInfo

Name                           Segment Bus Device Function Slot NumaNode PcieLinkSpeed PcieLinkWidth Version
----                           ------- --- ------ -------- ---- -------- ------------- ------------- -------
WiFi                                 0   4      0        0    1        0      2.5 GT/s             1 1.1
10Gbe 1                              0   8      0        0             0      8.0 GT/s             2 1.1
Gigabit Lan                          0   5      0        0    1        0      2.5 GT/s             1 1.1
10Gbe 2                              0   8      0        1             0      8.0 GT/s             2 1.1


This is what I get for the X79 system. Both are definitely operating at PCIe 3.0, but the link width reported on the Threadripper system (x2 rather than x8) is clearly wrong.
Code:
PS C:\WINDOWS\system32> Get-NetAdapterHardwareInfo

Name                           Segment Bus Device Function Slot NumaNode PcieLinkSpeed PcieLinkWidth Version
----                           ------- --- ------ -------- ---- -------- ------------- ------------- -------
Ethernet 2                           0   5      0        1                    8.0 GT/s             8 1.1
Ethernet 3                           0   5      0        0                    8.0 GT/s             8 1.1
Onboard 1Gbe                         0   0     25        0                     Unknown
 
Windows simply isn't optimised for 10Gb and this has been discussed for several years. While I commend you on testing it, 30 seconds on Google would have told you the same. SNB has a long-running discussion on this with pretty much all the required tweaks covered, from memory.
Having done significant reading on this matter, that should only be true for Server 2008 / Vista and earlier in respect of the default TCP window size. The only tweaks that seem to be left open to the end user and worth tinkering with are RSS queues and RX/TX buffers. I've read through what I could find on SNB and pretty much covered all of it, except the MTU, which I'd prefer to keep at 1500 as jumbo frames shouldn't be necessary to achieve 10Gb/s. Either way, testing iperf against the loopback address should take the network out of the equation and show what the system bus is capable of handling, hence why I was expecting 30+Gb/s rates, which I do get on Linux but not on Windows.
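For what it's worth, the RSS tinkering looks something like this in PowerShell; the queue count and profile are just examples, and what's sensible depends on the NIC and CPU layout:
Code:
# check how RSS is currently configured on the 10G port
Get-NetAdapterRss -Name "10Gbe 1"

# example: spread receive processing over more queues
Set-NetAdapterRss -Name "10Gbe 1" -NumberOfReceiveQueues 8 -Profile NUMAStatic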

In respect of the PCIe link width issue I spotted, I've now fixed it by swapping the NIC and audio interface around, as both were in x8 slots. Get-NetAdapterHardwareInfo reports the correct link width and speed now, but it didn't make one iota of difference to the throughput. Something else I've just tested is running 3 separate server/client instances of iperf3, which is the suggested approach for 40G/100G network testing. At work on my old Z800 workstation (dual X5650), one instance on the loopback address tops out at 9.5Gb/s, and three instances net similar results per instance, meaning ~30Gb/s. I'll need to try this once I'm back at home and see how that fares.
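The multi-instance loopback test is just several iperf3 pairs on different ports, along these lines (port numbers are arbitrary):
Code:
# start three servers, one per port, each in its own window
5201..5203 | ForEach-Object { Start-Process iperf3 -ArgumentList "-s -p $_" }

# then run a client against each port on loopback
5201..5203 | ForEach-Object { Start-Process iperf3 -ArgumentList "-c 127.0.0.1 -p $_ -t 30" }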
 
I don't know why the guy's performance is low (although I would question why he feels the need to have 10Gbit connections in his house), but I don't think blaming Windows is the answer personally.
The main reason for upgrading is that I run out of bandwidth at 1Gbit to my server. If I run one copy from my PC, it saturates the link, which affects others in the house trying to use it. Rather than using multiple aggregated 1Gbit links, it seemed easier and only a little more expensive to go the whole hog and get 10Gbit. I don't expect to be able to saturate the 10Gbit link from either of my PCs, as their storage is limited to 6Gbit (my only NVMe drive is the OS drive). The server, on the other hand, should be able to saturate the link with all 12 disks in the array, as it already manages 700MB/s with 7 disks.
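Back-of-the-envelope, assuming throughput scales roughly linearly with disk count: 700MB/s over 7 disks is about 100MB/s per disk, so 12 disks should manage around 1200MB/s, which is roughly 9.6Gbit/s and close to 10GbE line rate.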
 
Just pushed the 1903 update onto my Threadripper machine and iperf performance is now fixed on this machine. Across 5 instances I saw just shy of 40Gbit/s, and a single instance gave just over 16Gbit/s. I'll look into doing the X79 system tomorrow and see if it also sees the same improvement.
 