[BPI-R4] SFP rss/lro performance

As the discussion on the [BPI-R4] and SFP thread is aimed at [X|G]PON problems, I am creating this thread to publish and discuss test results of the fiber 10G SFP+ interface with various rss/lro kernels on Debian 12. I am testing iperf3 and also NFS performance.

On BPI-R4 I am using:

  • SFP module: MM Go-Fibereasy 10G SFP+ AOC, detected by the kernel as OEM SFP-10G-AOC2M rev 1.0, used as the iperf client
  • M.2 NVMe 1TB KINGSTON SKC2500M81000G
  • mini-PCIe to 4x SATA3 controller: ASMedia Technology Inc. ASM1064 Serial ATA Controller (rev 02)
  • raid0 (3x 2.5" SATA3 240GB WDC WDS240G2G0A-00JH30)

On the “server” side is Ubuntu 24 with an Intel X520.

Kernel 6.12.32-bpi-r4-main, compiled from frank-w/BPI-Router-Linux on GitHub, default branch 6.12-main:

  • iperf3 --client
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  11.0 GBytes  9.42 Gbits/sec   33             sender
[  5]   0.00-10.00  sec  11.0 GBytes  9.41 Gbits/sec                  receiver
loadavg per second: 0.01 0.01 0.17 0.17 0.17 0.17 0.17 0.24 0.24 0.24 0.24
  • iperf3 --client --reverse
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  5.21 GBytes  4.47 Gbits/sec  117             sender
[  5]   0.00-10.00  sec  5.20 GBytes  4.47 Gbits/sec                  receiver
loadavg per second: 0.00 0.00 0.00 0.00 0.24 0.24 0.24 0.24 0.24 0.38 0.38 0.38
  • iperf3 --client --bidir
[ ID][Role] Interval           Transfer     Bitrate         Retr
[  5][TX-C]   0.00-10.00  sec  6.08 GBytes  5.22 Gbits/sec  223             sender
[  5][TX-C]   0.00-10.00  sec  6.08 GBytes  5.22 Gbits/sec                  receiver
[  7][RX-C]   0.00-10.00  sec  2.50 GBytes  2.15 Gbits/sec  761             sender
[  7][RX-C]   0.00-10.00  sec  2.50 GBytes  2.15 Gbits/sec                  receiver
loadavg per second: 0.03 0.03 0.03 0.18 0.18 0.18 0.18 0.18 0.33 0.33 0.33 0.33
  • nfs write to nvme on BPI-R4:
dd if=/dev/zero of=./test.img bs=10M count=1000 status=progress oflag=dsync
10454302720 bytes (10 GB, 9.7 GiB) copied, 45 s, 232 MB/s
loadavg per second: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.32 0.32 0.32 0.32 0.32 0.46 0.46 0.46 0.46 0.46 0.66 0.66 0.66 0.66 0.66 0.93 0.93 0.93 0.93 0.93 1.09 1.09 1.09 1.09 1.09 1.41 1.41 1.41 1.41 1.41 1.37 1.37 1.37 1.37 1.37 1.34 1.34
  • nfs read from nvme on BPI-R4:
dd if=./test.img of=/dev/null bs=10M status=progress iflag=direct
10297016320 bytes (10 GB, 9.6 GiB) copied, 13 s, 791 MB/s
loadavg per second: 0.02 0.02 0.18 0.18 0.18 0.18 0.18 0.81 0.81 0.81 0.81 0.81 1.39 1.39 1.39 1.39
  • nfs write to raid0 on BPI-R4:
dd if=/dev/zero of=./test.img bs=10M count=1000 status=progress oflag=dsync
10412359680 bytes (10 GB, 9.7 GiB) copied, 62 s, 168 MB/s
loadavg per second: 0.01 0.01 0.01 0.01 0.01 0.09 0.09 0.09 0.09 0.09 0.16 0.16 0.16 0.16 0.16 0.23 0.23 0.23 0.23 0.23 0.37 0.37 0.37 0.37 0.37 0.42 0.42 0.42 0.42 0.42 0.55 0.55 0.55 0.55 0.55 0.58 0.58 0.58 0.58 0.62 0.62 0.62 0.62 0.62 0.81 0.81 0.81 0.81 0.81 0.82 0.82 0.82 0.82 0.82 1.08 1.08 1.08 1.08 1.08 1.23 1.23 1.23 1.23 1.23 1.13 1.13
  • nfs read from raid0 on BPI-R4:
dd if=./test.img of=/dev/null bs=10M status=progress iflag=direct
10181672960 bytes (10 GB, 9.5 GiB) copied, 15 s, 678 MB/s
loadavg per second: 0.02 0.02 0.02 0.90 0.90 0.90 0.90 0.90 1.31 1.31 1.31 1.31 1.31 2.16 2.16 2.16 2.16
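The “loadavg per second” values above were sampled alongside each test; the original sampling method is not shown, but one way to capture them (a sketch, assuming the 1-minute field of /proc/loadavg is what was recorded) is a simple loop:

```shell
# Sample the 1-minute load average once per second for 10 seconds
# and print the values space-separated, like the lines above
for _ in $(seq 1 10); do
  cut -d' ' -f1 /proc/loadavg
  sleep 1
done | paste -sd' '
```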

I am compiling @frank-w's branch 6.16-rsslro right now. Test results will be added today. Any other results would be appreciated. :slight_smile:


In 6.16 I added a readme which points out some particularities.

  • iperf3 must not be the Debian default (that version had a bug with rss)
  • lro seems to need iperf2… not completely sure why iperf3 (newest version) did not reach the 9.4 Gbit/s

Btw, use code tags instead of pre, because pre is not scrollable on mobile devices.

Both need some userspace commands to be active/working.


Additional tests on the BPI-R4 with the 6.12 kernel, directly on the disks:

  • nvme read:
dd if=/dev/nvme0n1 of=/dev/null bs=10M count=1000 status=progress iflag=direct
10412359680 bytes (10 GB, 9.7 GiB) copied, 13 s, 800 MB/s
cpu load: 0.2
  • nvme write:
dd if=/dev/zero of=/mnt/x/test.img bs=10M count=1000 status=progress oflag=dsync
10297016320 bytes (10 GB, 9.6 GiB) copied, 34 s, 303 MB/s
cpu load: 0.6
  • raid0 read:
dd if=/dev/md/data of=/dev/null bs=10M count=1000 status=progress iflag=direct
10087301120 bytes (10 GB, 9.4 GiB) copied, 13 s, 776 MB/s
cpu load: 0.26
  • raid0 write:
dd if=/dev/zero of=/mnt/data/test.img bs=10M count=1000 status=progress oflag=dsync
10443816960 bytes (10 GB, 9.7 GiB) copied, 60 s, 174 MB/s
cpu load: 0.68

The BPI-R4’s SFP 10GbE interface only supports an MTU of 2000. In high-throughput environments (large file transfers, virtual machine migration, iSCSI storage), limiting the MTU to 2000 may lead to slightly higher CPU utilization and lower throughput compared to systems that support larger MTUs (e.g., 9000).
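The MTU ceiling the driver advertises can be checked from userspace; a sketch (assuming the SFP+ port shows up as eth2 — adjust the name for your setup):

```shell
# Show the maximum MTU the driver advertises for the interface
ip -d link show dev eth2 | grep -o 'maxmtu [0-9]*'

# Attempting to go above that ceiling is rejected by the kernel:
ip link set dev eth2 mtu 9000   # fails if maxmtu is below 9000
```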

Thanks for sharing!

I would expect that it is just a software limit, and therefore that it could be changed in the Linux kernel. But as I read it again, it looks like it is a limit of the hardware itself - is it?

As I had some complications with the 6.16 kernel (I am the cause here, to be honest :smiley:) I have not completed the tests. Based on information from @frank-w, rsslro should provide much better performance than 6.12.32. I hope I can publish the results here very soon, hopefully tomorrow…

Afaik mt7988 does support 9k frames, but the driver does not configure this yet. Rss splits CPU load across all cores if configured (smp-affinity with rps disabled), and lro aggregates packets for a target so not much CPU is used.

This patch tried to support it, but I was not able to get it working:

https://github.com/frank-w/BPI-Router-Linux/commit/e7abe158cf1c97062ea95ee6338e2b3b73cb5335

Afair they are based on older rss/lro patches, so not directly applicable.

Just compiled and successfully booted 6.16.0-rc1-bpi-r4-rsslro from @frank-w's frank-w/BPI-Router-Linux, branch 6.16-rsslro.

On the BPI-R4 I am using:

  • 2x mini-PCIe to 4x SATA controller: ASMedia Technology Inc. ASM1064 Serial ATA Controller (rev 02)
  • M.2 to 5x SATA controller: JMicron Technology Corp. JMB58x AHCI SATA controller
    • Because of this one I had to enable AHCI drivers in the kernel config

Now I swap the JMB58x for the NVMe drive and perform the same tests.

  • iperf3 --client
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.9 GBytes  9.40 Gbits/sec  153             sender
[  5]   0.00-10.00  sec  10.9 GBytes  9.40 Gbits/sec                  receiver
loadavg per second: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
  • iperf3 --reverse --client
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  5.65 GBytes  4.86 Gbits/sec  128             sender
[  5]   0.00-10.00  sec  5.65 GBytes  4.85 Gbits/sec                  receiver
loadavg per second: 0.00 0.00 0.00 0.16 0.16 0.16 0.16 0.16 0.31 0.31 0.31 0.31
  • iperf3 --bidir --client
[ ID][Role] Interval           Transfer     Bitrate         Retr
[  5][TX-C]   0.00-10.00  sec  11.0 GBytes  9.41 Gbits/sec  194             sender
[  5][TX-C]   0.00-10.00  sec  11.0 GBytes  9.41 Gbits/sec                  receiver
[  7][RX-C]   0.00-10.00  sec  1.79 GBytes  1.54 Gbits/sec   52             sender
[  7][RX-C]   0.00-10.00  sec  1.79 GBytes  1.54 Gbits/sec                  receiver
loadavg per second: 0.08 0.08 0.08 0.08 0.08 0.15 0.15 0.15 0.15 0.15 0.14
  • iperf --client
TCP window size: 16.0 KByte (default) 
(icwnd/mss/irtt=14/1448/195)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0051 sec  10.9 GBytes  9.40 Gbits/sec
loadavg per second: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
  • iperf --reverse --client
TCP window size: 16.0 KByte (default)
(reverse) (icwnd/mss/irtt=14/1448/417)
[ ID] Interval       Transfer     Bandwidth
[ *1] 0.0000-10.0065 sec  5.71 GBytes  4.90 Gbits/sec
loadavg per second: 0.00 0.00 0.00 0.00 0.08 0.08 0.08 0.08 0.08 0.23 0.23
  • iperf --full-duplex --client
TCP window size: 16.0 KByte (default)
(full-duplex) (icwnd/mss/irtt=14/1448/241)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0003 sec  5.30 GBytes  4.55 Gbits/sec
[ *1] 0.0000-10.0073 sec  4.63 GBytes  3.97 Gbits/sec
[SUM] 0.0000-10.0074 sec  9.92 GBytes  8.52 Gbits/sec
loadavg per second: 0.00 0.00 0.00 0.24 0.24 0.24 0.24 0.24 0.47 0.47 0.47 0.47

Have you executed the commands for rss and maybe lro? E.g. set smp affinity and disable rps for rss.

But you have to spread the affinity of the last 4 ethernet IRQs across all 4 CPUs.
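Disabling RPS, as mentioned above, could look like this (a sketch; eth2 is assumed to be the SFP interface, and writing the sysfs files needs root):

```shell
# RPS is controlled per RX queue via a CPU bitmask in sysfs;
# writing 0 disables software packet steering, so only the
# hardware RSS rings decide which CPU handles each flow
for q in /sys/class/net/eth2/queues/rx-*/rps_cpus; do
  echo 0 > "$q"
done
```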

Setup is done. To be able to enable lro, I dismissed the bridge and set up sfp1 as a standalone interface. It looks like there is better performance, but also much more CPU load.

  • iperf -P 4 -t 30 --client
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-30.0117 sec  32.9 GBytes  9.41 Gbits/sec

  • iperf -P 4 -t 30 --reverse --client
[ ID] Interval       Transfer     Bandwidth
[ *1] 0.0000-30.0031 sec  27.6 GBytes  7.89 Gbits/sec

  • iperf -P 4 -t 30 --full-duplex --client
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-30.0003 sec  24.6 GBytes  7.05 Gbits/sec
[ *1] 0.0000-30.0038 sec  18.9 GBytes  5.42 Gbits/sec
[SUM] 0.0000-30.0038 sec  43.5 GBytes  12.5 Gbits/sec

I am not sure what to do about

you have to set affinity for the last 4 ethernet irqs to all 4 cpu.

Should I set smp_affinity on more IRQs than mentioned in the readme.md? How can I figure out which IRQs are the right ones? Or is it OK to set it on all of them that have smp_affinity? And what value should be set? Should I continue the sequence 1, 2, 4, 8, 16, 32, 64, …?

Here are the hw_lro_stats of the --full-duplex test:

cat /proc/mtketh/hw_lro_stats
HW LRO statistic dump:
Cnt:   RING4 | RING5 | RING6 | RING7 Total
 0 :      0        0        0        0        0
 1 :      5910        0        0        0        5910
 2 :      15178        0        0        0        15178
 3 :      5929        0        0        0        5929
 4 :      9046        0        0        0        9046
 5 :      26270        0        0        0        26270
 6 :      4357        0        0        0        4357
 7 :      8935        0        0        0        8935
 8 :      1217872        0        0        0        1217872
 9 :      323        0        0        0        323
 10 :      0        0        0        0        0
 11 :      0        0        0        0        0
 12 :      0        0        0        0        0
 13 :      0        0        0        0        0
 14 :      0        0        0        0        0
 15 :      0        0        0        0        0
 16 :      0        0        0        0        0
 17 :      0        0        0        0        0
 18 :      0        0        0        0        0
 19 :      0        0        0        0        0
 20 :      0        0        0        0        0
 21 :      0        0        0        0        0
 22 :      0        0        0        0        0
 23 :      0        0        0        0        0
 24 :      0        0        0        0        0
 25 :      0        0        0        0        0
 26 :      0        0        0        0        0
 27 :      0        0        0        0        0
 28 :      0        0        0        0        0
 29 :      0        0        0        0        0
 30 :      0        0        0        0        0
 31 :      0        0        0        0        0
 32 :      0        0        0        0        0
 33 :      0        0        0        0        0
 34 :      0        0        0        0        0
 35 :      0        0        0        0        0
 36 :      0        0        0        0        0
 37 :      0        0        0        0        0
 38 :      0        0        0        0        0
 39 :      0        0        0        0        0
 40 :      0        0        0        0        0
 41 :      0        0        0        0        0
 42 :      0        0        0        0        0
 43 :      0        0        0        0        0
 44 :      0        0        0        0        0
 45 :      0        0        0        0        0
 46 :      0        0        0        0        0
 47 :      0        0        0        0        0
 48 :      0        0        0        0        0
 49 :      0        0        0        0        0
 50 :      0        0        0        0        0
 51 :      0        0        0        0        0
 52 :      0        0        0        0        0
 53 :      0        0        0        0        0
 54 :      0        0        0        0        0
 55 :      0        0        0        0        0
 56 :      0        0        0        0        0
 57 :      0        0        0        0        0
 58 :      0        0        0        0        0
 59 :      0        0        0        0        0
 60 :      0        0        0        0        0
 61 :      0        0        0        0        0
 62 :      0        0        0        0        0
 63 :      0        0        0        0        0
 64 :      0        0        0        0        0
Total agg:   RING4 | RING5 | RING6 | RING7 Total
                10056157      0      0      0      10056157
Total flush:   RING4 | RING5 | RING6 | RING7 Total
                1293820      0      0      0      1293820
Avg agg:   RING4 | RING5 | RING6 | RING7 Total
                7      0      0      0      7
HW LRO flush pkt len:
 Length  | RING4  | RING5  | RING6  | RING7 Total
0~5000: 27379      0      0      0      27379
5000~10000: 40003      0      0      0      40003
10000~15000: 1226438      0      0      0      1226438
15000~20000: 0      0      0      0      0
20000~25000: 0      0      0      0      0
25000~30000: 0      0      0      0      0
30000~35000: 0      0      0      0      0
35000~40000: 0      0      0      0      0
40000~45000: 0      0      0      0      0
45000~50000: 0      0      0      0      0
50000~55000: 0      0      0      0      0
55000~60000: 0      0      0      0      0
60000~65000: 0      0      0      0      0
65000~70000: 0      0      0      0      0
70000~75000: 0      0      0      0      0
Flush reason:   RING4 | RING5 | RING6 | RING7 Total
AGG timeout:      664      0      0      0      664
AGE timeout:      0      0      0      0      0
Not in-sequence:  1372      0      0      0      1372
Timestamp:        0      0      0      0      0
No LRO rule:      73879      0      0      0      73879

You have 6 IRQs in the system tagged with ethernet:

cat /proc/interrupts | grep eth

The last 4 can be used for rss… these are routed to the CPU defined by bits (1=cpu0, 2=cpu1, 4=cpu2, 8=cpu3). You do not need higher numbers as you only have 4 cores. You also see per-CPU counters for each IRQ.
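A sketch of that bit-to-CPU mapping (the IRQ numbers 104-107 below are placeholders, not from the source; take the real ones from cat /proc/interrupts | grep eth on your board):

```shell
# smp_affinity takes a hex CPU bitmask: bit n selects CPU n
# (1 = cpu0, 2 = cpu1, 4 = cpu2, 8 = cpu3)
for cpu in 0 1 2 3; do
  printf 'cpu%d -> mask %x\n' "$cpu" $((1 << cpu))
done

# Pinning the four RSS RX IRQs one per core would then be
# (root required; 104 is a placeholder base IRQ number):
# base=104
# for cpu in 0 1 2 3; do
#   printf '%x' $((1 << cpu)) > /proc/irq/$((base + cpu))/smp_affinity
# done
```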

Which bridge? You do not have to set up a bridge; only this should match the IP/ifname of your SFP interface:

ethtool -N eth2 flow-type tcp4 dst-ip 192.168.1.1 action 0 loc 0
ethtool -K eth2 lro on
ethtool -k eth2

At least rss needs 4+ streams (e.g. in iperf) to get full throughput.

OK, I have new results. But I did not set up affinity for ethernet TX; these results are with RX smp_affinity set up only.

  • iperf -P 4 -t 30 --client 192.168.253.130
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-30.0145 sec  32.9 GBytes  9.41 Gbits/sec

  • iperf -P 4 -t 30 --reverse --client 192.168.253.130
[ ID] Interval       Transfer     Bandwidth
[ *1] 0.0000-30.0043 sec  27.9 GBytes  8.00 Gbits/sec

  • iperf -P 4 -t 30 --full-duplex --client 192.168.253.130
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-30.0003 sec  22.4 GBytes  6.42 Gbits/sec
[ *1] 0.0000-30.0071 sec  20.8 GBytes  5.94 Gbits/sec
[SUM] 0.0000-30.0071 sec  43.2 GBytes  12.4 Gbits/sec

When I tried to set up TX, I did:

$ cat /proc/interrupts | grep "ethernet TX"
102:     391749          0      86191     343525    GICv3 229 Level     15100000.ethernet TX

and then:

echo 8 > /proc/irq/102/smp_affinity

OR

echo 4 > /proc/irq/102/smp_affinity

The iperf results were in both cases similar to:

$ iperf -P 4 -t 30 --full-duplex --client 192.168.253.130
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-30.0000 sec  30.4 GBytes  8.70 Gbits/sec
Barrier timeout per full duplex traffic
[ *1] 0.0000-41.5427 sec  4.12 GBytes   852 Mbits/sec
[SUM] 0.0000-41.5427 sec  34.5 GBytes  7.13 Gbits/sec
double free or corruption (out)
Aborted

And I am not able to figure out why.

About the bridge I mentioned: I had it set up before testing rsslro; we can forget it. I just learned that lro cannot be enabled on a bridged interface.

Do you know why not?

For rss, just search for ethernet and use the last 4 IRQs (you should see 6), and give each of them a different CPU.

You do not enable rss on the interface… it is done for the RX rings in the driver. Lro afaik is configured on the MAC, but possibly it does not work when the IP is assigned to a bridge. But lro is configured for the MAC, and if the IP is local (ending on the right MAC) it can work. I have not tested that, though.

Yes, I did it, and these 4 IRQs are ethernet RX.

I was trying to follow

moving tx frame-engine irq to different cpu (here 3rd)
echo 4 > /proc/irq/103/smp_affinity

from the readme: BPI-Router-Linux/README.md at branch 6.16-rsslro on GitHub.

No, I do not know why.

Ah, ok, this is an optional step, as the first CPU is mostly the one with the most load :wink: But it looks like you set it to cpu2 and later to cpu3 (or vice versa, as I see TX interrupts on cpu0, 2 and 3).

I have not attached statistics for the tests on the 6.16 kernel yet; I was trying to figure out what I should do with smp_affinity for the ethernet TX IRQ, which is not working as expected for me at this time. I do not understand why TX is able to use only 1 CPU core, as that limits TX to 8 Gbps.

Here are the test results with the IRQ stats, without setting smp_affinity for the ethernet TX IRQ:

  • iperf -P 4 -t 30 --full-duplex --client
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-30.0003 sec  22.5 GBytes  6.45 Gbits/sec
[ *1] 0.0000-30.0082 sec  20.4 GBytes  5.83 Gbits/sec
[SUM] 0.0000-30.0082 sec  42.9 GBytes  12.3 Gbits/sec

And here are the stats:

cat /proc/mtketh/hw_lro_stats
HW LRO statistic dump:
Cnt:   RING4 | RING5 | RING6 | RING7 Total
 0 :      0        0        0        0        0
 1 :      9321        0        0        0        9321
 2 :      22946        0        0        0        22946
 3 :      7955        0        0        0        7955
 4 :      15414        0        0        0        15414
 5 :      35881        0        0        0        35881
 6 :      6330        0        0        0        6330
 7 :      14123        0        0        0        14123
 8 :      1818869        0        0        0        1818869
 9 :      283        0        0        0        283
 10 :      0        0        0        0        0
 11 :      0        0        0        0        0
 12 :      0        0        0        0        0
 13 :      0        0        0        0        0
 14 :      0        0        0        0        0
 15 :      0        0        0        0        0
 16 :      0        0        0        0        0
 17 :      0        0        0        0        0
 18 :      0        0        0        0        0
 19 :      0        0        0        0        0
 20 :      0        0        0        0        0
 21 :      0        0        0        0        0
 22 :      0        0        0        0        0
 23 :      0        0        0        0        0
 24 :      0        0        0        0        0
 25 :      0        0        0        0        0
 26 :      0        0        0        0        0
 27 :      0        0        0        0        0
 28 :      0        0        0        0        0
 29 :      0        0        0        0        0
 30 :      0        0        0        0        0
 31 :      0        0        0        0        0
 32 :      0        0        0        0        0
 33 :      0        0        0        0        0
 34 :      0        0        0        0        0
 35 :      0        0        0        0        0
 36 :      0        0        0        0        0
 37 :      0        0        0        0        0
 38 :      0        0        0        0        0
 39 :      0        0        0        0        0
 40 :      0        0        0        0        0
 41 :      0        0        0        0        0
 42 :      0        0        0        0        0
 43 :      0        0        0        0        0
 44 :      0        0        0        0        0
 45 :      0        0        0        0        0
 46 :      0        0        0        0        0
 47 :      0        0        0        0        0
 48 :      0        0        0        0        0
 49 :      0        0        0        0        0
 50 :      0        0        0        0        0
 51 :      0        0        0        0        0
 52 :      0        0        0        0        0
 53 :      0        0        0        0        0
 54 :      0        0        0        0        0
 55 :      0        0        0        0        0
 56 :      0        0        0        0        0
 57 :      0        0        0        0        0
 58 :      0        0        0        0        0
 59 :      0        0        0        0        0
 60 :      0        0        0        0        0
 61 :      0        0        0        0        0
 62 :      0        0        0        0        0
 63 :      0        0        0        0        0
 64 :      0        0        0        0        0
Total agg:   RING4 | RING5 | RING6 | RING7 Total
                15010479      0      0      0      15010479
Total flush:   RING4 | RING5 | RING6 | RING7 Total
                1931122      0      0      0      1931122
Avg agg:   RING4 | RING5 | RING6 | RING7 Total
                7      0      0      0      7
HW LRO flush pkt len:
 Length  | RING4  | RING5  | RING6  | RING7 Total
0~5000: 40665      0      0      0      40665
5000~10000: 58007      0      0      0      58007
10000~15000: 1832450      0      0      0      1832450
15000~20000: 0      0      0      0      0
20000~25000: 0      0      0      0      0
25000~30000: 0      0      0      0      0
30000~35000: 0      0      0      0      0
35000~40000: 0      0      0      0      0
40000~45000: 0      0      0      0      0
45000~50000: 0      0      0      0      0
50000~55000: 0      0      0      0      0
55000~60000: 0      0      0      0      0
60000~65000: 0      0      0      0      0
65000~70000: 0      0      0      0      0
70000~75000: 0      0      0      0      0
Flush reason:   RING4 | RING5 | RING6 | RING7 Total
AGG timeout:      1252      0      0      0      1252
AGE timeout:      0      0      0      0      0
Not in-sequence:  2638      0      0      0      2638
Timestamp:        0      0      0      0      0
No LRO rule:      108465      0      0      0      108465

This configuration is the best one for me. Optimisation of TX would still be nice; there is about 1.4 Gbps unused, which is about 14%.