Have you executed the commands for RSS and maybe LRO? E.g. set smp_affinity and disable RPS for RSS.
But you have to set the affinity for the last 4 ethernet IRQs, one to each of the 4 CPUs.
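For reference, the kind of commands meant here are roughly the following (the interface name eth2 and the RX queue index are assumptions; adjust to your setup):

# disable software RPS on the RX queue so the hardware RSS steering is not overridden
echo 0 > /sys/class/net/eth2/queues/rx-0/rps_cpus
# then pin each of the last 4 ethernet IRQs to its own core via /proc/irq/<nr>/smp_affinity (see the example further below)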
Set-up is done. To be able to enable LRO, I dismissed the bridge and set up sfp1 as a standalone interface. It looks like there is better performance, but also much more CPU load.
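For reference, the standalone setup was roughly the following (the exact bridge membership and the address are assumptions):

ip link set sfp1 nomaster      # take sfp1 out of the bridge
ip addr add 192.168.253.129/24 dev sfp1
ip link set sfp1 up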
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-30.0117 sec 32.9 GBytes 9.41 Gbits/sec
[ ID] Interval Transfer Bandwidth
[ *1] 0.0000-30.0031 sec 27.6 GBytes 7.89 Gbits/sec
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-30.0003 sec 24.6 GBytes 7.05 Gbits/sec
[ *1] 0.0000-30.0038 sec 18.9 GBytes 5.42 Gbits/sec
[SUM] 0.0000-30.0038 sec 43.5 GBytes 12.5 Gbits/sec
I am not sure about this part:
you have to set affinity for the last 4 ethernet irqs to all 4 cpu.
Should I set smp_affinity on more IRQs than mentioned in the README.md? How can I figure out which IRQs are the right ones? Or is it OK to set it on all of them that have smp_affinity? And what value should be set? Should I continue the sequence 1, 2, 4, 8, 16, 32, 64, …?
Here are the hw_lro_stats of the --full-duplex test:
cat /proc/mtketh/hw_lro_stats
HW LRO statistic dump:
Cnt: RING4 | RING5 | RING6 | RING7 Total
0 : 0 0 0 0 0
1 : 5910 0 0 0 5910
2 : 15178 0 0 0 15178
3 : 5929 0 0 0 5929
4 : 9046 0 0 0 9046
5 : 26270 0 0 0 26270
6 : 4357 0 0 0 4357
7 : 8935 0 0 0 8935
8 : 1217872 0 0 0 1217872
9 : 323 0 0 0 323
10 : 0 0 0 0 0
11 : 0 0 0 0 0
12 : 0 0 0 0 0
13 : 0 0 0 0 0
14 : 0 0 0 0 0
15 : 0 0 0 0 0
16 : 0 0 0 0 0
17 : 0 0 0 0 0
18 : 0 0 0 0 0
19 : 0 0 0 0 0
20 : 0 0 0 0 0
21 : 0 0 0 0 0
22 : 0 0 0 0 0
23 : 0 0 0 0 0
24 : 0 0 0 0 0
25 : 0 0 0 0 0
26 : 0 0 0 0 0
27 : 0 0 0 0 0
28 : 0 0 0 0 0
29 : 0 0 0 0 0
30 : 0 0 0 0 0
31 : 0 0 0 0 0
32 : 0 0 0 0 0
33 : 0 0 0 0 0
34 : 0 0 0 0 0
35 : 0 0 0 0 0
36 : 0 0 0 0 0
37 : 0 0 0 0 0
38 : 0 0 0 0 0
39 : 0 0 0 0 0
40 : 0 0 0 0 0
41 : 0 0 0 0 0
42 : 0 0 0 0 0
43 : 0 0 0 0 0
44 : 0 0 0 0 0
45 : 0 0 0 0 0
46 : 0 0 0 0 0
47 : 0 0 0 0 0
48 : 0 0 0 0 0
49 : 0 0 0 0 0
50 : 0 0 0 0 0
51 : 0 0 0 0 0
52 : 0 0 0 0 0
53 : 0 0 0 0 0
54 : 0 0 0 0 0
55 : 0 0 0 0 0
56 : 0 0 0 0 0
57 : 0 0 0 0 0
58 : 0 0 0 0 0
59 : 0 0 0 0 0
60 : 0 0 0 0 0
61 : 0 0 0 0 0
62 : 0 0 0 0 0
63 : 0 0 0 0 0
64 : 0 0 0 0 0
Total agg: RING4 | RING5 | RING6 | RING7 Total
10056157 0 0 0 10056157
Total flush: RING4 | RING5 | RING6 | RING7 Total
1293820 0 0 0 1293820
Avg agg: RING4 | RING5 | RING6 | RING7 Total
7 0 0 0 7
HW LRO flush pkt len:
Length | RING4 | RING5 | RING6 | RING7 Total
0~5000: 27379 0 0 0 27379
5000~10000: 40003 0 0 0 40003
10000~15000: 1226438 0 0 0 1226438
15000~20000: 0 0 0 0 0
20000~25000: 0 0 0 0 0
25000~30000: 0 0 0 0 0
30000~35000: 0 0 0 0 0
35000~40000: 0 0 0 0 0
40000~45000: 0 0 0 0 0
45000~50000: 0 0 0 0 0
50000~55000: 0 0 0 0 0
55000~60000: 0 0 0 0 0
60000~65000: 0 0 0 0 0
65000~70000: 0 0 0 0 0
70000~75000: 0 0 0 0 0
Flush reason: RING4 | RING5 | RING6 | RING7 Total
AGG timeout: 664 0 0 0 664
AGE timeout: 0 0 0 0 0
Not in-sequence: 1372 0 0 0 1372
Timestamp: 0 0 0 0 0
No LRO rule: 73879 0 0 0 73879
You have 6 IRQs in the system tagged with ethernet:
cat /proc/interrupts | grep eth
The last 4 can be used for RSS… these are routed to the CPU defined by bits (1=cpu0, 2=cpu1, 4=cpu2, 8=cpu3). You do not need higher numbers as you only have 4 cores. You also see the per-CPU counters for each IRQ there.
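For example, with hypothetical IRQ numbers 104–107 for the four ethernet RX rings, a one-core-per-IRQ assignment would look like this:

grep ethernet /proc/interrupts        # 6 ethernet IRQs; the last 4 are the RX rings
echo 1 > /proc/irq/104/smp_affinity   # -> cpu0
echo 2 > /proc/irq/105/smp_affinity   # -> cpu1
echo 4 > /proc/irq/106/smp_affinity   # -> cpu2
echo 8 > /proc/irq/107/smp_affinity   # -> cpu3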
Which bridge? You do not have to set up a bridge; only this should match the IP/ifname of your SFP interface:
ethtool -N eth2 flow-type tcp4 dst-ip 192.168.1.1 action 0 loc 0
ethtool -K eth2 lro on
ethtool -k eth2
At least rss needs 4+ streams (e.g. in iperf) to get full throughput.
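For example, with the iperf2 syntax used below:

iperf -c 192.168.253.130 -P 4 -t 30   # 4 parallel streams for 30 seconds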
OK, I have new results. But I did not set up affinity for ethernet TX; these results are with the RX smp_affinity set up only.
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-30.0145 sec 32.9 GBytes 9.41 Gbits/sec
[ ID] Interval Transfer Bandwidth
[ *1] 0.0000-30.0043 sec 27.9 GBytes 8.00 Gbits/sec
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-30.0003 sec 22.4 GBytes 6.42 Gbits/sec
[ *1] 0.0000-30.0071 sec 20.8 GBytes 5.94 Gbits/sec
[SUM] 0.0000-30.0071 sec 43.2 GBytes 12.4 Gbits/sec
When I tried to set up TX, I did:
$ cat /proc/interrupts | grep "ethernet TX"
102: 391749 0 86191 343525 GICv3 229 Level 15100000.ethernet TX
and then:
echo 8 > /proc/irq/102/smp_affinity
OR
echo 4 > /proc/irq/102/smp_affinity
The iperf results were in both cases similar to:
$ iperf -P 4 -t 30 --full-duplex --client 192.168.253.130
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-30.0000 sec 30.4 GBytes 8.70 Gbits/sec
Barrier timeout per full duplex traffic
[ *1] 0.0000-41.5427 sec 4.12 GBytes 852 Mbits/sec
[SUM] 0.0000-41.5427 sec 34.5 GBytes 7.13 Gbits/sec
double free or corruption (out)
Aborted
And I am not able to figure out why.
About the bridge I mentioned: I had it set up before testing rss/lro, so we can forget it. I just learned that it cannot be enabled on a bridged interface.
Do you know why not?
For RSS, just search for ethernet and use the last 4 IRQs (you should see 6) and give each of them a different CPU.
You do not enable RSS on the interface… it is done for the RX rings in the driver. LRO is afaik configured on the MAC, but possibly it does not work when the IP is assigned to a bridge. But LRO is configured for the MAC, and if the IP is local (ending on the right MAC) it can work. But I have not tested that.
Yes, I did it, and these 4 IRQs are ethernet RX.
I was trying to follow
moving tx frame-engine irq to different cpu (here 3rd)
echo 4 > /proc/irq/103/smp_affinity
from the readme BPI-Router-Linux/README.md at 6.16-rsslro · frank-w/BPI-Router-Linux · GitHub
No, I do not know why.
Ah, ok, this is an optional step, as the first CPU is mostly the one with the most load. But it looks like you set it to cpu2 and later to cpu3 (or vice versa, as I see interrupts on TX for cpu0, 2 and 3).
I did not attach statistics for the tests on the 6.16 kernel yet; I first wanted to figure out what I should do with the smp_affinity for the ethernet TX IRQ, which is not working as expected for me at this time. I do not understand why TX is able to use only 1 CPU core, as it limits TX to 8 Gbps.
Here are the test results with the IRQ stats, without setting smp_affinity for the ethernet TX IRQ:
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-30.0003 sec 22.5 GBytes 6.45 Gbits/sec
[ *1] 0.0000-30.0082 sec 20.4 GBytes 5.83 Gbits/sec
[SUM] 0.0000-30.0082 sec 42.9 GBytes 12.3 Gbits/sec
And here are the stats:
cat /proc/mtketh/hw_lro_stats
HW LRO statistic dump:
Cnt: RING4 | RING5 | RING6 | RING7 Total
0 : 0 0 0 0 0
1 : 9321 0 0 0 9321
2 : 22946 0 0 0 22946
3 : 7955 0 0 0 7955
4 : 15414 0 0 0 15414
5 : 35881 0 0 0 35881
6 : 6330 0 0 0 6330
7 : 14123 0 0 0 14123
8 : 1818869 0 0 0 1818869
9 : 283 0 0 0 283
10 : 0 0 0 0 0
11 : 0 0 0 0 0
12 : 0 0 0 0 0
13 : 0 0 0 0 0
14 : 0 0 0 0 0
15 : 0 0 0 0 0
16 : 0 0 0 0 0
17 : 0 0 0 0 0
18 : 0 0 0 0 0
19 : 0 0 0 0 0
20 : 0 0 0 0 0
21 : 0 0 0 0 0
22 : 0 0 0 0 0
23 : 0 0 0 0 0
24 : 0 0 0 0 0
25 : 0 0 0 0 0
26 : 0 0 0 0 0
27 : 0 0 0 0 0
28 : 0 0 0 0 0
29 : 0 0 0 0 0
30 : 0 0 0 0 0
31 : 0 0 0 0 0
32 : 0 0 0 0 0
33 : 0 0 0 0 0
34 : 0 0 0 0 0
35 : 0 0 0 0 0
36 : 0 0 0 0 0
37 : 0 0 0 0 0
38 : 0 0 0 0 0
39 : 0 0 0 0 0
40 : 0 0 0 0 0
41 : 0 0 0 0 0
42 : 0 0 0 0 0
43 : 0 0 0 0 0
44 : 0 0 0 0 0
45 : 0 0 0 0 0
46 : 0 0 0 0 0
47 : 0 0 0 0 0
48 : 0 0 0 0 0
49 : 0 0 0 0 0
50 : 0 0 0 0 0
51 : 0 0 0 0 0
52 : 0 0 0 0 0
53 : 0 0 0 0 0
54 : 0 0 0 0 0
55 : 0 0 0 0 0
56 : 0 0 0 0 0
57 : 0 0 0 0 0
58 : 0 0 0 0 0
59 : 0 0 0 0 0
60 : 0 0 0 0 0
61 : 0 0 0 0 0
62 : 0 0 0 0 0
63 : 0 0 0 0 0
64 : 0 0 0 0 0
Total agg: RING4 | RING5 | RING6 | RING7 Total
15010479 0 0 0 15010479
Total flush: RING4 | RING5 | RING6 | RING7 Total
1931122 0 0 0 1931122
Avg agg: RING4 | RING5 | RING6 | RING7 Total
7 0 0 0 7
HW LRO flush pkt len:
Length | RING4 | RING5 | RING6 | RING7 Total
0~5000: 40665 0 0 0 40665
5000~10000: 58007 0 0 0 58007
10000~15000: 1832450 0 0 0 1832450
15000~20000: 0 0 0 0 0
20000~25000: 0 0 0 0 0
25000~30000: 0 0 0 0 0
30000~35000: 0 0 0 0 0
35000~40000: 0 0 0 0 0
40000~45000: 0 0 0 0 0
45000~50000: 0 0 0 0 0
50000~55000: 0 0 0 0 0
55000~60000: 0 0 0 0 0
60000~65000: 0 0 0 0 0
65000~70000: 0 0 0 0 0
70000~75000: 0 0 0 0 0
Flush reason: RING4 | RING5 | RING6 | RING7 Total
AGG timeout: 1252 0 0 0 1252
AGE timeout: 0 0 0 0 0
Not in-sequence: 2638 0 0 0 2638
Timestamp: 0 0 0 0 0
No LRO rule: 108465 0 0 0 108465
This configuration is the best one for me. Optimisation of TX would be nice; there is about 1.4 Gbps unused, which is about 14%.
Which iperf version do you have? I had limited throughput with the default from Debian. LRO needed iperf2 in my tests to reach the 9.4 Gbit/s (max).
$ iperf --version:
iperf version 2.1.8 (12 August 2022) pthreads
After dealing with the nfs error:
$ journalctl -b -u nfs-server
Jul 22 06:14:45 nas systemd[1]: Dependency failed for nfs-server.service - NFS server and services.
Jul 22 06:14:45 nas systemd[1]: nfs-server.service: Job nfs-server.service/start failed with result 'dependency'.
I recompiled the kernel with NFS built in directly instead of as a module, and nfs-server started working.
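For reference, the change amounts to switching the relevant NFS options from module to built-in in the kernel .config, roughly (the exact option set depends on your configuration):

CONFIG_NFSD=y       # NFS server built in instead of =m
CONFIG_NFSD_V4=y
CONFIG_NFS_FS=y     # NFS client support, if needed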
10370416640 bytes (10 GB, 9.7 GiB) copied, 39 s, 266 MB/s
dd if=/dev/zero of=/mnt/raid0/test.img bs=10M count=1000 oflag=dsync
10349445120 bytes (10 GB, 9.6 GiB) copied, 61 s, 169 MB/s
dd if=/mnt/nvme/test.img of=/dev/null bs=10M iflag=direct
9961472000 bytes (10 GB, 9.3 GiB) copied, 14 s, 711 MB/s
9856614400 bytes (9.9 GB, 9.2 GiB) copied, 13 s, 757 MB/s
The difference in results between 6.12.32 without rss/lro and 6.16.0 with rss/lro does not look as big as expected, but I think that in the case of using the BPI-R4 as a NAS server it should make sense. But I have to ask: does this mean that the Linux kernel utilises all cores for TX frames by default?
edit: nfs-server error added
I did a test with a 3 GB RAM drive:
2883584000 bytes (2.9 GB, 2.7 GiB) copied, 5 s, 575 MB/s
2170552320 bytes (2.2 GB, 2.0 GiB) copied, 2 s, 1.1 GB/s
Why is writing “only” 50% of the maximum speed?
I assume that nobody has even an idea. I will perform similar tests on newly compiled kernels in the future.
Hello again @frank-w ,
I am wondering how to persist the RSS & LRO setup. I created a script to automate it and I can create a "one shot" service to execute it, but are there any better options?
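For reference, the "one shot" service I have in mind would be roughly this (the unit name and script path are placeholders):

# /etc/systemd/system/rss-lro.service
[Unit]
Description=Apply RSS/LRO tuning after the network is up
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/sbin/rss-lro-setup.sh

[Install]
WantedBy=multi-user.target

It would be enabled with systemctl enable --now rss-lro.service.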
When I was playing with it, I realized that in my performance tests dd with the parameter oflag=dsync lowers the performance a bit, therefore I will update these tests soon.
I don't know of any specific setting via systemd or any other init, so I guess creating a script and adding it as postup/predown into the networkd units is the best way.
ls /sys/class/net/sfp-lan/queues/
rx-0 tx-0 tx-1 tx-10 tx-11 tx-12 tx-13 tx-14 tx-15 tx-2 tx-3 tx-4 tx-5 tx-6 tx-7 tx-8 tx-9
One rx queue vs. 16 tx queues.
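As an aside, with multiple TX queues the transmit side can also be spread over the cores via XPS; a sketch, assuming the driver honours the per-queue xps_cpus masks:

# same bitmask convention as smp_affinity: 1=cpu0, 2=cpu1, 4=cpu2, 8=cpu3
echo 1 > /sys/class/net/sfp-lan/queues/tx-0/xps_cpus
echo 2 > /sys/class/net/sfp-lan/queues/tx-1/xps_cpus
echo 4 > /sys/class/net/sfp-lan/queues/tx-2/xps_cpus
echo 8 > /sys/class/net/sfp-lan/queues/tx-3/xps_cpus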