[BPI-R4] SFP rss/lro performance

Have you executed the commands for RSS and maybe LRO? E.g. set the SMP affinity and disable RPS for RSS.

But you have to set the affinity for the last 4 ethernet IRQs across all 4 CPUs.
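For the RPS part, a minimal sketch of what I mean (assuming your SFP port shows up as eth2 with a single RX queue; adjust the interface name to yours):

# empty CPU mask = RPS disabled for this queue (ifname eth2 is only an example)
echo 0 > /sys/class/net/eth2/queues/rx-0/rps_cpus
# read it back, should show 0
cat /sys/class/net/eth2/queues/rx-0/rps_cpus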

Setup is done. To be able to enable LRO, I removed the bridge and set up sfp1 as a standalone interface. It looks like there is better performance, but also much higher CPU load.

  • iperf -P 4 -t 30 --client
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-30.0117 sec  32.9 GBytes  9.41 Gbits/sec

  • iperf -P 4 -t 30 --reverse --client
[ ID] Interval       Transfer     Bandwidth
[ *1] 0.0000-30.0031 sec  27.6 GBytes  7.89 Gbits/sec

  • iperf -P 4 -t 30 --full-duplex --client
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-30.0003 sec  24.6 GBytes  7.05 Gbits/sec
[ *1] 0.0000-30.0038 sec  18.9 GBytes  5.42 Gbits/sec
[SUM] 0.0000-30.0038 sec  43.5 GBytes  12.5 Gbits/sec

I am not sure what you mean by:

you have to set affinity for the last 4 ethernet irqs to all 4 cpu.

Should I set smp_affinity on more IRQs than mentioned in the readme.md? How can I figure out which IRQs are the right ones? Or is it OK to set it on all of them that have smp_affinity? And what value should be set? Should I continue in the sequence 1, 2, 4, 8, 16, 32, 64, …?

Here are the hw_lro_stats of the --full-duplex test:

cat /proc/mtketh/hw_lro_stats
HW LRO statistic dump:
Cnt:   RING4 | RING5 | RING6 | RING7 Total
 0 :      0        0        0        0        0
 1 :      5910        0        0        0        5910
 2 :      15178        0        0        0        15178
 3 :      5929        0        0        0        5929
 4 :      9046        0        0        0        9046
 5 :      26270        0        0        0        26270
 6 :      4357        0        0        0        4357
 7 :      8935        0        0        0        8935
 8 :      1217872        0        0        0        1217872
 9 :      323        0        0        0        323
 10 :      0        0        0        0        0
 11 :      0        0        0        0        0
 12 :      0        0        0        0        0
 13 :      0        0        0        0        0
 14 :      0        0        0        0        0
 15 :      0        0        0        0        0
 16 :      0        0        0        0        0
 17 :      0        0        0        0        0
 18 :      0        0        0        0        0
 19 :      0        0        0        0        0
 20 :      0        0        0        0        0
 21 :      0        0        0        0        0
 22 :      0        0        0        0        0
 23 :      0        0        0        0        0
 24 :      0        0        0        0        0
 25 :      0        0        0        0        0
 26 :      0        0        0        0        0
 27 :      0        0        0        0        0
 28 :      0        0        0        0        0
 29 :      0        0        0        0        0
 30 :      0        0        0        0        0
 31 :      0        0        0        0        0
 32 :      0        0        0        0        0
 33 :      0        0        0        0        0
 34 :      0        0        0        0        0
 35 :      0        0        0        0        0
 36 :      0        0        0        0        0
 37 :      0        0        0        0        0
 38 :      0        0        0        0        0
 39 :      0        0        0        0        0
 40 :      0        0        0        0        0
 41 :      0        0        0        0        0
 42 :      0        0        0        0        0
 43 :      0        0        0        0        0
 44 :      0        0        0        0        0
 45 :      0        0        0        0        0
 46 :      0        0        0        0        0
 47 :      0        0        0        0        0
 48 :      0        0        0        0        0
 49 :      0        0        0        0        0
 50 :      0        0        0        0        0
 51 :      0        0        0        0        0
 52 :      0        0        0        0        0
 53 :      0        0        0        0        0
 54 :      0        0        0        0        0
 55 :      0        0        0        0        0
 56 :      0        0        0        0        0
 57 :      0        0        0        0        0
 58 :      0        0        0        0        0
 59 :      0        0        0        0        0
 60 :      0        0        0        0        0
 61 :      0        0        0        0        0
 62 :      0        0        0        0        0
 63 :      0        0        0        0        0
 64 :      0        0        0        0        0
Total agg:   RING4 | RING5 | RING6 | RING7 Total
                10056157      0      0      0      10056157
Total flush:   RING4 | RING5 | RING6 | RING7 Total
                1293820      0      0      0      1293820
Avg agg:   RING4 | RING5 | RING6 | RING7 Total
                7      0      0      0      7
HW LRO flush pkt len:
 Length  | RING4  | RING5  | RING6  | RING7 Total
0~5000: 27379      0      0      0      27379
5000~10000: 40003      0      0      0      40003
10000~15000: 1226438      0      0      0      1226438
15000~20000: 0      0      0      0      0
20000~25000: 0      0      0      0      0
25000~30000: 0      0      0      0      0
30000~35000: 0      0      0      0      0
35000~40000: 0      0      0      0      0
40000~45000: 0      0      0      0      0
45000~50000: 0      0      0      0      0
50000~55000: 0      0      0      0      0
55000~60000: 0      0      0      0      0
60000~65000: 0      0      0      0      0
65000~70000: 0      0      0      0      0
70000~75000: 0      0      0      0      0
Flush reason:   RING4 | RING5 | RING6 | RING7 Total
AGG timeout:      664      0      0      0      664
AGE timeout:      0      0      0      0      0
Not in-sequence:  1372      0      0      0      1372
Timestamp:        0      0      0      0      0
No LRO rule:      73879      0      0      0      73879

You have 6 IRQs in the system tagged with ethernet:

cat /proc/interrupts | grep eth

The last 4 can be used for RSS… these are routed to the CPU defined by a bitmask (1=cpu0, 2=cpu1, 4=cpu2, 8=cpu3). You do not need higher numbers as you only have 4 cores. You also see the per-CPU counters for each IRQ there.
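As a sketch, with placeholder IRQ numbers (take the last 4 ethernet lines from your own /proc/interrupts output, the numbers will differ):

cat /proc/interrupts | grep eth
# assuming the last 4 ethernet IRQs came out as 104-107 (placeholders), give each one its own core:
echo 1 > /proc/irq/104/smp_affinity   # cpu0
echo 2 > /proc/irq/105/smp_affinity   # cpu1
echo 4 > /proc/irq/106/smp_affinity   # cpu2
echo 8 > /proc/irq/107/smp_affinity   # cpu3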

Which bridge? You do not have to set up a bridge; only this should match the IP/ifname of your SFP interface:

ethtool -N eth2 flow-type tcp4 dst-ip 192.168.1.1 action 0 loc 0
ethtool -K eth2 lro on
ethtool -k eth2
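To check afterwards (eth2 again stands for whatever your SFP interface is called):

# offload flag should now show large-receive-offload: on
ethtool -k eth2 | grep large-receive-offload
# list the configured flow rules (should show the tcp4 rule from above)
ethtool -n eth2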

At least rss needs 4+ streams (e.g. in iperf) to get full throughput.

OK, I have new results. But I did not set up affinity for ethernet TX; these results are with RX smp_affinity set up only.

  • iperf -P 4 -t 30 --client 192.168.253.130
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-30.0145 sec  32.9 GBytes  9.41 Gbits/sec

  • iperf -P 4 -t 30 --reverse --client 192.168.253.130
[ ID] Interval       Transfer     Bandwidth
[ *1] 0.0000-30.0043 sec  27.9 GBytes  8.00 Gbits/sec

  • iperf -P 4 -t 30 --full-duplex --client 192.168.253.130
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-30.0003 sec  22.4 GBytes  6.42 Gbits/sec
[ *1] 0.0000-30.0071 sec  20.8 GBytes  5.94 Gbits/sec
[SUM] 0.0000-30.0071 sec  43.2 GBytes  12.4 Gbits/sec

When I tried to set up TX, I did:

$ cat /proc/interrupts | grep "ethernet TX"
102:     391749          0      86191     343525    GICv3 229 Level     15100000.ethernet TX

and then:

echo 8 > /proc/irq/102/smp_affinity

OR

echo 4 > /proc/irq/102/smp_affinity
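(The value can be read back to double-check which CPU the IRQ is allowed on; 102 is the TX IRQ from the grep above:)

cat /proc/irq/102/smp_affinity        # bitmask, e.g. 8 = cpu3
cat /proc/irq/102/smp_affinity_list   # same information as a CPU list, e.g. 3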

The iperf results were in both cases similar to:

$ iperf -P 4 -t 30 --full-duplex --client 192.168.253.130
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-30.0000 sec  30.4 GBytes  8.70 Gbits/sec
Barrier timeout per full duplex traffic
[ *1] 0.0000-41.5427 sec  4.12 GBytes   852 Mbits/sec
[SUM] 0.0000-41.5427 sec  34.5 GBytes  7.13 Gbits/sec
double free or corruption (out)
Aborted

And I am not able to figure out why.

About the bridge I mentioned: I had it set up before testing RSS/LRO, so we can forget it. I just learned that it cannot be enabled on a bridged interface.

Do you know why not?

For RSS, just search for ethernet and use the last 4 IRQs (you should see 6) and give each of them a different CPU.

You do not enable RSS on the interface… it is done for the RX rings in the driver. LRO afaik is configured on the MAC, but possibly it does not work when the IP is assigned to a bridge. But LRO is configured for the MAC, and if the IP is local (ending on the right MAC) it can work. I have not tested that, though.
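So as a minimal sketch, instead of assigning the IP to a bridge, it could sit directly on the SFP port (eth2 and the address are just the example values from above; the dst-ip of the flow rule has to match it):

ip addr add 192.168.1.1/24 dev eth2   # IP directly on the MAC that the LRO rule is bound to
ip link set eth2 up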

Yes, I did it, and these 4 IRQs are ethernet RX.

I was trying to follow

moving tx frame-engine irq to different cpu (here 3rd)
echo 4 > /proc/irq/103/smp_affinity

from the readme BPI-Router-Linux/README.md at 6.16-rsslro · frank-w/BPI-Router-Linux · GitHub

No, I do not know why.

Ah, OK, this is an optional step, as the first CPU is mostly the one with the most load :wink: But it looks like you set it to cpu2 and later to cpu3 (or vice versa, as I see TX interrupts on cpu0, 2 and 3).

I did not attach statistics for the tests on the 6.16 kernel yet; I was trying to figure out what I should do with the smp_affinity for the ethernet TX IRQ, which is not working as expected for me at this time. I do not understand why TX is able to use only 1 CPU core, as that limits TX to 8 Gbps.

Here are the test results with the IRQ stats, without setting smp_affinity for the ethernet TX IRQ:

  • iperf -P 4 -t 30 --full-duplex --client
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-30.0003 sec  22.5 GBytes  6.45 Gbits/sec
[ *1] 0.0000-30.0082 sec  20.4 GBytes  5.83 Gbits/sec
[SUM] 0.0000-30.0082 sec  42.9 GBytes  12.3 Gbits/sec

And here are the stats:

cat /proc/mtketh/hw_lro_stats
HW LRO statistic dump:
Cnt:   RING4 | RING5 | RING6 | RING7 Total
 0 :      0        0        0        0        0
 1 :      9321        0        0        0        9321
 2 :      22946        0        0        0        22946
 3 :      7955        0        0        0        7955
 4 :      15414        0        0        0        15414
 5 :      35881        0        0        0        35881
 6 :      6330        0        0        0        6330
 7 :      14123        0        0        0        14123
 8 :      1818869        0        0        0        1818869
 9 :      283        0        0        0        283
 10 :      0        0        0        0        0
 11 :      0        0        0        0        0
 12 :      0        0        0        0        0
 13 :      0        0        0        0        0
 14 :      0        0        0        0        0
 15 :      0        0        0        0        0
 16 :      0        0        0        0        0
 17 :      0        0        0        0        0
 18 :      0        0        0        0        0
 19 :      0        0        0        0        0
 20 :      0        0        0        0        0
 21 :      0        0        0        0        0
 22 :      0        0        0        0        0
 23 :      0        0        0        0        0
 24 :      0        0        0        0        0
 25 :      0        0        0        0        0
 26 :      0        0        0        0        0
 27 :      0        0        0        0        0
 28 :      0        0        0        0        0
 29 :      0        0        0        0        0
 30 :      0        0        0        0        0
 31 :      0        0        0        0        0
 32 :      0        0        0        0        0
 33 :      0        0        0        0        0
 34 :      0        0        0        0        0
 35 :      0        0        0        0        0
 36 :      0        0        0        0        0
 37 :      0        0        0        0        0
 38 :      0        0        0        0        0
 39 :      0        0        0        0        0
 40 :      0        0        0        0        0
 41 :      0        0        0        0        0
 42 :      0        0        0        0        0
 43 :      0        0        0        0        0
 44 :      0        0        0        0        0
 45 :      0        0        0        0        0
 46 :      0        0        0        0        0
 47 :      0        0        0        0        0
 48 :      0        0        0        0        0
 49 :      0        0        0        0        0
 50 :      0        0        0        0        0
 51 :      0        0        0        0        0
 52 :      0        0        0        0        0
 53 :      0        0        0        0        0
 54 :      0        0        0        0        0
 55 :      0        0        0        0        0
 56 :      0        0        0        0        0
 57 :      0        0        0        0        0
 58 :      0        0        0        0        0
 59 :      0        0        0        0        0
 60 :      0        0        0        0        0
 61 :      0        0        0        0        0
 62 :      0        0        0        0        0
 63 :      0        0        0        0        0
 64 :      0        0        0        0        0
Total agg:   RING4 | RING5 | RING6 | RING7 Total
                15010479      0      0      0      15010479
Total flush:   RING4 | RING5 | RING6 | RING7 Total
                1931122      0      0      0      1931122
Avg agg:   RING4 | RING5 | RING6 | RING7 Total
                7      0      0      0      7
HW LRO flush pkt len:
 Length  | RING4  | RING5  | RING6  | RING7 Total
0~5000: 40665      0      0      0      40665
5000~10000: 58007      0      0      0      58007
10000~15000: 1832450      0      0      0      1832450
15000~20000: 0      0      0      0      0
20000~25000: 0      0      0      0      0
25000~30000: 0      0      0      0      0
30000~35000: 0      0      0      0      0
35000~40000: 0      0      0      0      0
40000~45000: 0      0      0      0      0
45000~50000: 0      0      0      0      0
50000~55000: 0      0      0      0      0
55000~60000: 0      0      0      0      0
60000~65000: 0      0      0      0      0
65000~70000: 0      0      0      0      0
70000~75000: 0      0      0      0      0
Flush reason:   RING4 | RING5 | RING6 | RING7 Total
AGG timeout:      1252      0      0      0      1252
AGE timeout:      0      0      0      0      0
Not in-sequence:  2638      0      0      0      2638
Timestamp:        0      0      0      0      0
No LRO rule:      108465      0      0      0      108465

This configuration is the best one for me. Optimisation of TX would be nice; there is about 1.4 Gbps unused, which is about 14%.

Which iperf version do you have? I had limited throughput with the default from Debian. LRO needed iperf2 in my tests to reach the 9.4 Gbit/s (max).

$ iperf --version
iperf version 2.1.8 (12 August 2022) pthreads

After dealing with the nfs error:

$ journalctl -b -u nfs-server
Jul 22 06:14:45 nas systemd[1]: Dependency failed for nfs-server.service - NFS server and services.
Jul 22 06:14:45 nas systemd[1]: nfs-server.service: Job nfs-server.service/start failed with result 'dependency'.

I recompiled the kernel with NFS built in directly instead of as a module, and nfs-server started working.
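For reference, roughly the options involved, as a sketch (switched from =m to =y in the kernel .config; I am not 100% sure which exact dependency was the failing one):

# NFS server pieces built into the kernel instead of as modules (assumed symbols)
CONFIG_NFSD=y
CONFIG_NFSD_V4=y
CONFIG_NFS_FS=y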


  • dd if=/dev/zero of=/mnt/nvme/test.img bs=10M count=1000 oflag=dsync

10370416640 bytes (10 GB, 9.7 GiB) copied, 39 s, 266 MB/s

9961472000 bytes (10 GB, 9.3 GiB) copied, 14 s, 711 MB/s

  • dd if=/mnt/raid0/test.img of=/dev/null bs=10M iflag=direct

9856614400 bytes (9.9 GB, 9.2 GiB) copied, 13 s, 757 MB/s


The difference between the results on 6.12.32 without RSS/LRO and 6.16.0 with RSS/LRO is not as big as expected, but I think that in the case of using the BPI-R4 as a NAS server:

  • writing to the NAS means receiving data over ethernet/SFP, which utilises the RSS/LRO functionality, but it is probably limited by the drives:
    • writes to the NVMe and SATA drives are limited by the technology: these drives have to erase space before writing new data, which makes writing slower, and if I remember it correctly, they read the data back to verify a successful write.
  • reading from the NAS drives is much faster, as expected, and I am a little confused here, because I did not set up smp_affinity for TX frames.

It should make sense, but I have to ask: does this mean that the Linux kernel utilises all cores for TX frames by default?


edit: nfs-server error added

I did a test with a 3 GB RAM drive:

  • dd if=/dev/zero of=/mnt/ramdrive/test.img bs=10M count=300 oflag=dsync

2883584000 bytes (2.9 GB, 2.7 GiB) copied, 5 s, 575 MB/s

  • dd if=/mnt/ramdrive/test.img of=/dev/null bs=10M iflag=direct

2170552320 bytes (2.2 GB, 2.0 GiB) copied, 2 s, 1.1 GB/s


Why does writing reach “only” 50% of the maximum speed?

I assume that nobody even has an idea. I will perform similar tests on newly compiled kernels in the future.

Hello again @frank-w ,

I am wondering how to persist the RSS & LRO setup. I created a script to automate it, and I can create a “oneshot” service to execute it, but are there any better options?

When I was playing with it, I realized that in my performance tests dd with the parameter oflag=dsync lowers the performance a bit, so I will update these tests soon.

I don’t know of any specific setting via systemd or any other init. So I guess creating a script and adding it as post-up/pre-down into the networkd units is the best way.
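As one variant of the oneshot idea you mentioned, a minimal sketch of such a unit (unit name and script path are placeholders):

# /etc/systemd/system/rss-lro.service  (hypothetical name)
[Unit]
Description=Apply RSS/LRO tuning for the SFP interface
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/sbin/rss-lro-setup.sh

[Install]
WantedBy=multi-user.target

Then systemctl enable --now rss-lro.service; the script itself is the one you already wrote.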

ls /sys/class/net/sfp-lan/queues/
rx-0  tx-0  tx-1  tx-10  tx-11  tx-12  tx-13  tx-14  tx-15  tx-2  tx-3  tx-4  tx-5  tx-6  tx-7  tx-8  tx-9

One rx queue vs. 16 tx queues.
