Setup is done. To be able to enable LRO, I dismissed the bridge and set up sfp1 as a standalone interface.
It looks like performance is better, but the CPU load is also much higher.
You have to set the affinity of the last 4 ethernet IRQs across all 4 CPUs.
Should I set smp_affinity for more IRQs than mentioned in the readme.md? How can I figure out which IRQs are the right ones? Or is it OK to set it on all of them that have smp_affinity?
And what value should be set? Should I continue with the sequence 1, 2, 4, 8, 16, 32, 64, …?
The last 4 can be used for RSS… these are routed to the CPU defined by the bitmask (1=cpu0, 2=cpu1, 4=cpu2, 8=cpu3). You do not need higher values as you only have 4 cores.
You also see per-CPU counters for each IRQ.
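For reference, a minimal way to inspect this (the IRQ number 105 is only a placeholder, take the real numbers from your board):

grep ethernet /proc/interrupts    # one row per ethernet IRQ, one counter column per CPU
cat /proc/irq/105/smp_affinity    # current CPU bitmask for that IRQ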
Which bridge? You do not have to set up a bridge; only this should match the IP/ifname of your SFP interface:
ethtool -N eth2 flow-type tcp4 dst-ip 192.168.1.1 action 0 loc 0
ethtool -K eth2 lro on
ethtool -k eth2
At least RSS needs 4+ parallel streams (e.g. in iperf) to get full throughput.
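A quick way to generate such streams, assuming an iperf3 server is listening on the R4's SFP address (192.168.1.1 here, matching the flow rule above):

iperf3 -s                     # on the R4 side
iperf3 -c 192.168.1.1 -P 4    # on the client: 4 parallel TCP streams so RSS can spread them over the RX queues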
About the bridge I mentioned: I had it set up before testing RSS/LRO, so we can forget about it. I just learned that LRO cannot be enabled on a bridged interface.
For RSS only: search for the ethernet IRQs (you should see 6), use the last 4, and give each of them a different CPU.
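A minimal sketch of that, assuming the last 4 ethernet IRQs are 103–106 (take the real numbers from /proc/interrupts):

echo 1 > /proc/irq/103/smp_affinity    # cpu0
echo 2 > /proc/irq/104/smp_affinity    # cpu1
echo 4 > /proc/irq/105/smp_affinity    # cpu2
echo 8 > /proc/irq/106/smp_affinity    # cpu3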
You do not enable RSS on the interface… it is done for the RX rings in the driver. LRO, afaik, is configured on the MAC, but possibly it does not work when the IP is assigned to a bridge. Since LRO is configured for the MAC, it can work if the IP terminates locally (ending on the right MAC), but I have not tested that.
Ah, OK, this is an optional step, as the first CPU is mostly the one with the most load, but it looks like you set it to cpu2 and later to cpu3 (or vice versa, as I see TX interrupts on cpu0, 2 and 3).
I have not attached the statistics for the tests on the 6.16 kernel yet; I was trying to figure out what to do with smp_affinity for the ethernet TX IRQ, which is not working as expected for me at the moment. I do not understand why TX can only use 1 CPU core, as it limits TX to 8 Gbps.
Here are the test results with the IRQ stats, without setting smp_affinity for the ethernet TX IRQ:
$ journalctl -b -u nfs-server
Jul 22 06:14:45 nas systemd[1]: Dependency failed for nfs-server.service - NFS server and services.
Jul 22 06:14:45 nas systemd[1]: nfs-server.service: Job nfs-server.service/start failed with result 'dependency'.
I recompiled the kernel with NFS built directly into the kernel instead of as a module, and nfs-server started working.
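Roughly, that means switching the NFS server options from modules to built-in; something like this in the kernel config (the exact set of options depends on your config):

CONFIG_NFSD=y       # in-kernel NFS server built in instead of =m
CONFIG_NFSD_V4=y    # NFSv4 support for the server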
The results between 6.12.32 without RSS/LRO and 6.16.0 with RSS/LRO do not differ as much as I expected, but I think that when using the BPI-R4 as a NAS server:
Writing to the NAS means receiving data over ethernet/SFP, which utilises the RSS/LRO functionality, but it is probably limited by the writes to the NVMe and SATA drives. These are limited by the technology itself: the drives have to clear space before writing new data, which makes writing slower, and if I remember correctly, they also read the data back to verify a successful write.
Reading from the NAS drives is much faster, as expected, and I am a little confused here, because I did not set up smp_affinity for TX frames.
It would make sense, but I have to ask: does this mean that the Linux kernel utilises all cores for TX frames by default?
I am wondering how to persist the RSS & LRO setup. I created a script to automate it and I can create a "one shot" service to execute it, but are there any better options?
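The oneshot variant I have in mind looks roughly like this (the script path /usr/local/sbin/rss-lro.sh is just my naming):

[Unit]
Description=Enable RSS/LRO on the SFP interface
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/rss-lro.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target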
While I was playing with it, I realised that in my performance tests dd with the oflag=dsync parameter lowers the performance a bit, so I will update these tests soon.
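For illustration, the kind of dd runs I mean (the target path and sizes are just examples):

dd if=/dev/zero of=/mnt/raid/testfile bs=1M count=4096 oflag=dsync       # sync after every block, lower numbers
dd if=/dev/zero of=/mnt/raid/testfile bs=1M count=4096 conv=fdatasync    # flush once at the end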
I don't know of any specific setting via systemd or any other init, so I guess creating a script and adding it as post-up/pre-down to the networkd units is the best way.
It is not related to RSS/LRO, but as I am testing everything on NVMe or RAID arrays, and these days the next tests run only on this RAID 6 array, I would like to share it with everyone interested in, or thinking about, using the BPI-R4 as their own NAS.
Here is the ongoing re-check performance of my RAID 6 array:
I have to say that (as I created the array more than 10 years ago) it probably does NOT have a well optimised stripe and stride, so the performance of re-sync and of working in degraded mode is lower. But I think the re-check is going quite fine.
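For anyone wanting to watch the same thing, the re-check progress and speed can be followed like this:

cat /proc/mdstat                           # progress bar, current check/resync speed and ETA
cat /proc/sys/dev/raid/speed_limit_max     # upper resync/check speed limit in KB/s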