I have a quick update on this. The problem went away in kernel 6.9, but unfortunately it resurfaced with kernels 6.10 and 6.11. I was able to do more testing recently and I now believe that the problem is related to checksum offload calculations on the BPI-R3. The iptables “fix” from above seems to have been just a lucky positive interaction.
My new way of addressing this issue is to disable tx offloading:
/sbin/ethtool -K eth0 tx off
I haven’t noticed any speed reduction (probably because the bandwidth bottleneck is caused by the PowerLan adapters), but if anyone encounters this problem in the future they might try this.
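For anyone who wants to script this workaround, here is a minimal sketch (not part of my original setup) of a helper that reads `ethtool -k` output and prints whether tx-checksumming is currently on or off. The function name `tx_checksum_state` is made up for illustration, and it assumes the usual `name: state` layout of the ethtool feature list.

```shell
#!/bin/sh
# Hypothetical helper: read `ethtool -k <iface>` output on stdin and print
# the state ("on" or "off") of the top-level tx-checksumming feature.
tx_checksum_state() {
    awk -F': *' '$1 == "tx-checksumming" { split($2, a, " "); print a[1] }'
}

# Intended use on the router (needs root and ethtool):
#   /sbin/ethtool -K eth0 tx off
#   /sbin/ethtool -k eth0 | tx_checksum_state
```

If the second command still prints "on", the change was not accepted by the driver for that interface.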
Yes, tso gets turned off by tx off, and cannot be turned on. With just tso off I still get the error. The following is the only config that works currently.
~ # ethtool -k eth0 | grep ": on"
rx-checksumming: on
scatter-gather: on
tx-scatter-gather: on
generic-segmentation-offload: on
generic-receive-offload: on
tx-vlan-offload: on
hw-tc-offload: on
I have only tried turning things off from the defaults. I should probably also have mentioned that eth0 is the generic switch (DSA) interface.
I have also had a weird issue with the BPI-R3 for months. I don’t know if it’s related to the issue discussed here.
If it’s related I would have a very quick way to reproduce the problem.
Basically my BPI-R3 is running Debian 12.7, configured as a firewall routing packets between different subnets.
I can reproduce my issue very quickly. It even happens when running only two ‘rsync’ processes across ‘lan0’ and ‘lan2’
with 3 machines involved. One machine must use a 100Mbit LAN interface. The others all have 1Gbit.
Packets between ‘lan0’ and ‘lan2’ are forwarded/firewalled with iptables rules.
Within a few seconds after starting the rsync test I get:
probably because it gets a tcp-reset.
I experimented a lot to work around the issue (with kernels 6.8 through 6.11), but nothing helped,
except replacing ‘lan0’ with an external USB-Ethernet adapter.
I tried to check if
/sbin/ethtool -K lan0 tx off
could help me. But the requested change is not applied:
# ethtool -K lan0 tx off
Actual changes:
tx-checksum-ipv4: on [requested off]
tx-checksum-ipv6: on [requested off]
Is this because the settings are all ‘fixed’?
# ethtool -k lan0 | less
Features for lan0:
rx-checksumming: on [fixed]
tx-checksumming: on
tx-checksum-ipv4: on [fixed]
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on [fixed]
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on [fixed]
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: on [fixed]
tx-tcp6-segmentation: on [fixed]
[...]
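ethtool marks a feature as [fixed] when it cannot be changed on that device, so a quick way to see what is locked on a port is to filter the feature list for that marker. This is only a rough sketch; the function name `fixed_features` is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: read `ethtool -k <iface>` output on stdin and print
# the names of all features that are marked [fixed].
fixed_features() {
    awk -F': *' '/\[fixed\]/ { sub(/^[ \t]+/, "", $1); print $1 }'
}

# Intended use (needs ethtool):
#   ethtool -k lan0 | fixed_features
```

Any feature that shows up in this list cannot be toggled with `ethtool -K` on that interface.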
I currently run a kernel (6.10.0-bpi-r3-main) built from:
git clone git@github.com:frank-w/BPI-Router-Linux.git
cd BPI-Router-Linux
git checkout 6.10-main
Both ‘lan0’ and ‘lan2’ are first connected to simple 1Gbit switches. So basically the BPI-R3 sees only 1Gbit ports.
‘Traffic between lan0 and lan2’ means: the BPI-R3 router forwards packets between ‘lan0’ and ‘lan2’, which are located on different subnets (namely 192.168.140.0/24 and 192.168.150.0/24).
There are no ‘dmesg’ messages at all on any machine involved at the time of the error. But I will try to disable autoneg.
To ‘turn off tx-checksumming’ would be much more interesting IMHO. It appears not to work on the current kernel/drivers (please see above). Do you know what to do to turn it off?
This is the same error I was having. You need to run it on the switch interface, i.e. the part after the @, like @eth0. Really curious to see if it fixes the issue for you too.
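If you are not sure which interface that is, the name after the @ can be pulled out of the one-line `ip` output. A small sketch, assuming the port shows up as something like `lan0@eth0` in `ip -o link show`; the function name `switch_iface_of` is just made up here.

```shell
#!/bin/sh
# Hypothetical helper: read one line of `ip -o link show <port>` output on
# stdin and print the interface name after the "@" (e.g. "eth0" for
# "lan0@eth0"). Prints nothing if the name has no "@" part.
switch_iface_of() {
    sed -n 's/^[0-9]*: *[^@:]*@\([^:]*\):.*/\1/p'
}

# Intended use on the router (ethtool needs root):
#   master=$(ip -o link show lan0 | switch_iface_of)
#   [ -n "$master" ] && /sbin/ethtool -K "$master" tx off
```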
Hmm. I can describe the setup that was failing the most ‘reliably’ for me. Connect two BPI-R3s with an ethernet cable. Use a PC/device to connect to one of the BPI-R3s over wlan. Do some intensive TCP data transfers between the PC and the BPI-R3 you are not directly connected to. I am using the unison tool to back up my files, which uses rsync over ssh. This would fail to finish 99% of the time, with the error @sparkie printed above.
I can reproduce my issue within seconds in my LAN environment. So I tried to toggle between ‘tx on’ and ‘tx off’ on the interface originally named ‘eth0’ (thanks @meehien for the hint). The actual test runs between the interfaces originally named ‘lan0’ and ‘lan2’ though.
My issue no longer appears with ‘tx off’, but it instantly reappears after switching back to ‘tx on’.
Setting ‘tx’ to ‘off’ impacts network performance, though.
With ‘tx on’ (the default), the ‘iftop’ utility shows a stunning ‘117MB’ when running a simple ‘netcat’ between ‘desktop A’ and ‘desktop B’. Excellent for a true routing/firewalling device.
Alas, with ‘tx off’ (the workaround), ‘iftop’ shows no more than about ‘94MB’ for the same test.
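To put a number on that, here is a tiny sketch that computes the relative drop from the two iftop readings above. The function name `throughput_drop_pct` is made up, and the inputs are simply the figures as iftop displayed them.

```shell
#!/bin/sh
# Hypothetical helper: percentage drop from the first rate to the second.
throughput_drop_pct() {
    awk -v on="$1" -v off="$2" 'BEGIN { printf "%.1f\n", (on - off) * 100 / on }'
}

# Readings from the netcat test: 117 with 'tx on', about 94 with 'tx off'.
throughput_drop_pct 117 94
```

So the workaround costs roughly a fifth of the routed throughput in this test.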
Thank you for the tip. I changed the ssh configuration on the client and both servers as suggested.
Unfortunately the error still strikes within 20-30 seconds after the start of the reproduction test.
I do not copy large files. The error mostly appears when both machines copy files of about 100kB in size. I run both rsyncs with the verbose flag so I can see the output flying by rapidly.
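In case someone wants to try to reproduce this with a similar workload, a rough sketch for generating a pile of roughly 100kB files to rsync across the router. The function name `make_small_files`, the directory layout and the file count are arbitrary choices for illustration, not what I actually used.

```shell
#!/bin/sh
# Hypothetical helper: create <count> files of 102400 random bytes each
# in <dir>, as test data for many-small-files rsync runs.
make_small_files() {
    dir=$1
    count=$2
    mkdir -p "$dir" || return 1
    i=1
    while [ "$i" -le "$count" ]; do
        head -c 102400 /dev/urandom > "$dir/file_$i.bin" || return 1
        i=$((i + 1))
    done
}

# Intended use (host name and paths below are placeholders):
#   make_small_files /tmp/rsync-src 500
#   rsync -av /tmp/rsync-src/ user@host-on-other-subnet:/tmp/rsync-dst/
```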