Which iperf version do you have? I had limited throughput with the default one from Debian. In my tests, LRO needed iperf2 to reach the maximum of 9.4 Gbit/s.
$ iperf --version
iperf version 2.1.8 (12 August 2022) pthreads
After dealing with the nfs error:
$ journalctl -b -u nfs-server
Jul 22 06:14:45 nas systemd[1]: Dependency failed for nfs-server.service - NFS server and services.
Jul 22 06:14:45 nas systemd[1]: nfs-server.service: Job nfs-server.service/start failed with result 'dependency'.
I recompiled the kernel with NFS built in instead of as a module, and nfs-server started working.
- dd if=/dev/zero of=/mnt/nvme/test.img bs=10M count=1000 oflag=dsync
10370416640 bytes (10 GB, 9.7 GiB) copied, 39 s, 266 MB/s
- dd if=/dev/zero of=/mnt/raid0/test.img bs=10M count=1000 oflag=dsync
10349445120 bytes (10 GB, 9.6 GiB) copied, 61 s, 169 MB/s
- dd if=/mnt/nvme/test.img of=/dev/null bs=10M iflag=direct
9961472000 bytes (10 GB, 9.3 GiB) copied, 14 s, 711 MB/s
- dd if=/mnt/raid0/test.img of=/dev/null bs=10M iflag=direct
9856614400 bytes (9.9 GB, 9.2 GiB) copied, 13 s, 757 MB/s
The difference between 6.12.32 without RSS/LRO and 6.16.0 with RSS/LRO does not look as big as expected, but I think that when the BPI-R4 is used as a NAS server:
- writing to the NAS means receiving data over Ethernet/SFP, which uses the RSS/LRO functionality, but is probably limited by:
  - writes to the NVMe and SATA drives are limited by the technology: these drives have to clear the space before writing new data, which makes writing slower, and, if I remember correctly, they read the data back to verify a successful write.
- reading from the NAS drives is much faster, as expected, and I am a little confused here, because I did not set up smp_affinity for TX frames.
It should make sense, but I have to ask: does this mean that the Linux kernel uses all cores for TX frames by default?
edit: nfs-server error added
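On the TX question above: this does not answer it, but one thing that can be inspected is whether any explicit CPU-to-TX-queue mapping (XPS) is configured. A minimal sketch, assuming the kernel was built with CONFIG_XPS so that each TX queue directory has an `xps_cpus` file. The function name `show_xps` and the optional second argument (a sysfs root, only there for testing) are made up for this example. A mask of zero means no XPS mapping is set for that queue.

```shell
#!/bin/sh
# Print the XPS CPU mask of every TX queue of an interface.
# show_xps IFACE [SYSFS_ROOT]   (SYSFS_ROOT defaults to /sys/class/net)
show_xps() {
    iface=$1
    root=${2:-/sys/class/net}
    for q in "$root/$iface"/queues/tx-*; do
        [ -d "$q" ] || continue
        if [ -r "$q/xps_cpus" ]; then
            # Reading can fail on some setups, so fall back to a marker
            mask=$(cat "$q/xps_cpus" 2>/dev/null) || mask="unreadable"
            printf '%s %s\n' "${q##*/}" "$mask"
        else
            # File missing: the kernel was probably built without CONFIG_XPS
            printf '%s %s\n' "${q##*/}" "n/a"
        fi
    done
}
```

Usage on the NAS would be `show_xps sfp-lan`. Note that smp_affinity applies to IRQs, while `xps_cpus` is a per-queue setting, so the two are configured in different places.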
I ran a test with a 3 GB RAM drive:
- dd if=/dev/zero of=/mnt/ramdrive/test.img bs=10M count=300 oflag=dsync
2883584000 bytes (2.9 GB, 2.7 GiB) copied, 5 s, 575 MB/s
- dd if=/mnt/ramdrive/test.img of=/dev/null bs=10M iflag=direct
2170552320 bytes (2.2 GB, 2.0 GiB) copied, 2 s, 1.1 GB/s
Why is writing “only” 50% of the maximum speed?
I assume that nobody has any idea. I will run similar tests on newly compiled kernels in the future.
Hello again @frank-w,
I am wondering how to persist the RSS & LRO setup. I created a script to automate it and I can create a “one shot” service to execute it, but are there any better options?
While playing with it, I realized that in my performance tests, running dd with oflag=dsync lowers the performance a bit, so I will update these tests soon.
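The effect of `oflag=dsync` can be compared directly against a run that syncs only once at the end. A minimal sketch, assuming GNU dd: `oflag=dsync` makes every block a synchronous write, while `conv=fdatasync` flushes the data once before dd exits. The helper name `write_test` is made up for this example; the block size matches the tests above.

```shell
#!/bin/sh
# write_test FILE COUNT MODE
#   MODE=dsync     -> oflag=dsync: every 10M block is written synchronously
#   MODE=fdatasync -> conv=fdatasync: data is flushed once, before dd exits
write_test() {
    file=$1
    count=$2
    mode=$3
    case $mode in
        dsync)     opt="oflag=dsync" ;;
        fdatasync) opt="conv=fdatasync" ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    # dd prints its statistics on stderr; keep only the summary line
    dd if=/dev/zero of="$file" bs=10M count="$count" "$opt" 2>&1 | tail -n 1
}
```

For example, `write_test /mnt/nvme/test.img 1000 dsync` followed by `write_test /mnt/nvme/test.img 1000 fdatasync` gives two summary lines to compare. Both variants still include the time needed to get the data onto the device, so neither measures only the page cache.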
I don’t know of any specific setting via systemd or any other init. So I guess creating a script and adding it as postup/predown to the networkd units is the best way.
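For the “one shot” service variant mentioned in the question, a unit could look roughly like the one generated below. This is only a sketch: the unit name `rss-lro.service`, the helper `write_unit`, and the script path in the usage line are hypothetical, and ordering after `network-online.target` assumes the interface already exists at that point.

```shell
#!/bin/sh
# write_unit SCRIPT [UNIT_DIR]
# Writes a oneshot unit that runs SCRIPT once per boot.
# UNIT_DIR defaults to /etc/systemd/system (writing there needs root).
write_unit() {
    script=$1
    unit_dir=${2:-/etc/systemd/system}
    cat > "$unit_dir/rss-lro.service" <<EOF
[Unit]
Description=Apply RSS/LRO settings
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=$script
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
}
```

Usage would be something like `write_unit /usr/local/sbin/rss-lro.sh`, then `systemctl daemon-reload` and `systemctl enable rss-lro.service`. `RemainAfterExit=yes` keeps the unit shown as active after the script finishes, so it is easy to see whether it ran.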
$ ls /sys/class/net/sfp-lan/queues/
rx-0 tx-0 tx-1 tx-10 tx-11 tx-12 tx-13 tx-14 tx-15 tx-2 tx-3 tx-4 tx-5 tx-6 tx-7 tx-8 tx-9
One rx vs 16 tx.
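Since the listing shows tx-0 through tx-15, counting the entries avoids miscounting by eye. A small sketch; the helper name `count_queues` is made up for this example.

```shell
#!/bin/sh
# count_queues DIR KIND
# Counts entries named KIND-* (KIND is "rx" or "tx") in a queues directory.
count_queues() {
    dir=$1
    kind=$2
    n=0
    for q in "$dir/$kind"-*; do
        # The unexpanded glob is skipped when nothing matches
        if [ -e "$q" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

Usage: `count_queues /sys/class/net/sfp-lan/queues rx` and `count_queues /sys/class/net/sfp-lan/queues tx`.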
This is not related to RSS/LRO, but since I am testing everything on NVMe or RAID arrays, and from now on will run the next tests only on this RAID 6 array, I would like to share it with anyone who is thinking about using the BPI-R4 as their own NAS. Here is the ongoing check of my RAID 6 array:
$ cat /proc/mdstat
Personalities : [raid4] [raid5] [raid6]
md127 : active raid6 sde[4] sdf[6] sdd[2] sdb[5] sdc[0] sda[1] sdg[7]
14650675200 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/7] [UUUUUUU]
[==============>......] check = 74.8% (2193851400/2930135040) finish=117.3min speed=104553K/sec
bitmap: 0/22 pages [0KB], 65536KB chunk
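The `finish=` estimate in the progress line can be reproduced from the other two numbers: remaining blocks divided by the current speed. A rough sketch, assuming the positions are in 1K blocks and the speed is in K/sec as printed; the function name `md_eta_minutes` is made up for this example and it prints whole minutes only.

```shell
#!/bin/sh
# md_eta_minutes: reads /proc/mdstat-style text on stdin and prints the
# remaining time in whole minutes for every line that reports a speed.
md_eta_minutes() {
    awk '
        /speed=/ {
            speed = 0
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^\([0-9]+\/[0-9]+\)$/) {
                    pos = $i
                    gsub(/[()]/, "", pos)      # "(done/total)" -> "done/total"
                    split(pos, p, "/")
                }
                if ($i ~ /^speed=/) {
                    s = $i
                    sub(/^speed=/, "", s)
                    sub(/K\/sec$/, "", s)
                    speed = s + 0              # force numeric
                }
            }
            if (speed > 0) print int((p[2] - p[1]) / speed / 60)
        }'
}
```

Usage: `md_eta_minutes < /proc/mdstat`. For the line above this is (2930135040 - 2193851400) / 104553 ≈ 7042 seconds, which is consistent with the roughly 117 minutes the kernel reports.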
I have to say that, since I created the array more than 10 years ago, its stripe and stride are probably NOT well optimised, so the performance of a re-sync and of working in degraded mode is lower. But I think the check is going quite well.
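For reference, the stride and stripe width that would match this array can be worked out from the mdstat output. This is a sketch under assumptions that are not stated in the post: an ext-family filesystem with a 4 KiB block size. The usual rule is stride = chunk size / filesystem block size, and stripe width = stride × number of data disks (total disks minus two parity disks for RAID 6). The helper name `raid_ext_geometry` is made up for this example.

```shell
#!/bin/sh
# raid_ext_geometry CHUNK_KIB BLOCK_KIB DISKS PARITY_DISKS
# Prints the values in the form used by the mke2fs/tune2fs -E option.
raid_ext_geometry() {
    chunk_kib=$1
    block_kib=$2
    disks=$3
    parity=$4
    stride=$((chunk_kib / block_kib))
    stripe_width=$((stride * (disks - parity)))
    echo "stride=$stride,stripe_width=$stripe_width"
}
```

With the 512k chunk and 7 disks shown above, `raid_ext_geometry 512 4 7 2` prints `stride=128,stripe_width=640`. The current values of an existing filesystem can be compared against this with `tune2fs -l` on the array device.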
Loadavg is about 2:
And kernel info:
$ uname -a
Linux nas 6.16.0-rc1-bpi-r4-rsslro #1 SMP Wed Aug 13 11:33:51 CEST 2025 aarch64 GNU/Linux