[BPI-R3] weird networking issue (and weirder "solution")

meehien · September 8, 2024, 3:59pm

Hi all

I have a quick update on this. The problem went away in kernel 6.9, but unfortunately it resurfaced with kernels 6.10 and 6.11. I was able to do more testing recently and I now believe that the problem is related to checksum offload calculations on the BPI-R3. The iptables “fix” from above seems to have been just a lucky coincidence.

My new way of addressing this issue is to disable tx offloading:

/sbin/ethtool -K eth0 tx off

I haven’t noticed any speed reduction (probably because the bandwidth bottleneck is caused by the PowerLan adapters), but if anyone encounters this problem in the future they might try this.

frank-w · September 8, 2024, 4:49pm

Did you have tso set to on? This seems to cause the watchdog tx queue timeouts… I guess this is also turned off when doing tx off

meehien · September 8, 2024, 6:09pm

Yes, tso gets turned off by tx off, and cannot be turned on. With just tso off I still get the error. The following is the only config that works currently.

 ~ # ethtool -k eth0 | grep ": on"
rx-checksumming: on
scatter-gather: on
   tx-scatter-gather: on
generic-segmentation-offload: on
generic-receive-offload: on
tx-vlan-offload: on
hw-tc-offload: on

I have only tried turning things off from default. Should have also probably mentioned that eth0 is the generic switch (dsa) interface.
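
For anyone who wants to see exactly which features a change like tx off drags along (tso, for instance), here is a minimal sketch that compares two saved `ethtool -k` snapshots. The function name `changed_features` and the file names are made up for illustration; it only uses plain `grep`.

```shell
# Sketch: list offload features whose state differs between two saved
# `ethtool -k` snapshots. Function and file names are placeholders.
changed_features() {
    # $1 = snapshot taken before the change, $2 = snapshot taken after.
    # Print every line of the "after" file that has no exact match in "before".
    grep -Fvx -f "$1" "$2"
}

# Intended use on the router (needs root and ethtool):
#   ethtool -k eth0 > /tmp/before.txt
#   ethtool -K eth0 tx off
#   ethtool -k eth0 > /tmp/after.txt
#   changed_features /tmp/before.txt /tmp/after.txt
```

The output is the set of feature lines in their new state, which makes it easier to report which combination was actually active when the error stopped.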

frank-w · September 8, 2024, 8:43pm

Do you have a way I can reproduce this behaviour?

sparkie · September 9, 2024, 5:28am

I have also had a weird issue for months with the BPI-R3. I don’t know if it’s related to the issue discussed here. If it is, I would have a very quick way to reproduce the problem.

Basically my BPI-R3 is running Debian 12.7 configured as a firewall routing packets between different subnets. I can reproduce my issue very quickly. It even happens when running only two ‘rsync’ transfers across ‘lan0’ and ‘lan2’ with 3 machines involved. One machine must use a 100Mbit LAN interface. The others all have 1Gbit. Packets between ‘lan0’ and ‘lan2’ are forwarded/firewalled with iptables rules. Within a few seconds after starting the rsync test I get:

client_loop: send disconnect: Broken pipe
rsync: [sender] write error: Broken pipe (32)
rsync error: error in socket IO (code 10) at io.c(823) [sender=3.2.3]

probably because it gets a TCP reset. I experimented a lot to work around the issue (with kernels from 6.8 to 6.11) but nothing helped, except replacing ‘lan0’ with an external USB-ethernet adapter.

I tried to check if

/sbin/ethtool -K lan0 tx off

could help me. But the change is not applied:

# ethtool -K lan0 tx off
Actual changes:
tx-checksum-ipv4: on [requested off]
tx-checksum-ipv6: on [requested off]

is this because the settings are all ‘fixed’?

# ethtool -k lan0 | less
Features for lan0:
rx-checksumming: on [fixed]
tx-checksumming: on
        tx-checksum-ipv4: on [fixed]
        tx-checksum-ip-generic: off [fixed]
        tx-checksum-ipv6: on [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on [fixed]
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on [fixed]
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: on [fixed]
        tx-tcp6-segmentation: on [fixed]
[...]
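
As a side note on the ‘fixed’ question: ethtool marks features the driver does not allow changing as [fixed], and in the listing above every tx-checksum sub-feature carries that mark, which would be consistent with the request being refused on lan0. A small sketch to list only the features that are not fixed (the function name `changeable_features` is hypothetical, the filter is plain awk):

```shell
# Sketch: read `ethtool -k` output on stdin and print the names of features
# that are NOT marked [fixed], i.e. the ones `ethtool -K` may be able to change.
changeable_features() {
    awk -F': ' '/: (on|off)/ && !/\[fixed\]/ {
        gsub(/^[ \t]+/, "", $1)   # drop the indentation of sub-features
        print $1
    }'
}

# Intended use on the router:
#   ethtool -k lan0 | changeable_features
```

Fed with the lines quoted above, this prints only the aggregate entries (tx-checksumming, scatter-gather, tcp-segmentation-offload), since all their sub-features are fixed.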

I currently run a kernel (6.10.0-bpi-r3-main) built after:

git clone git@github.com:frank-w/BPI-Router-Linux.git
cd BPI-Router-Linux
git checkout 6.10-main

Any ideas how to turn off tx-checksumming?

frank-w · September 9, 2024, 6:20am

I guess it is a different issue… there were some reports with 100Mbit devices for R3/R4

E.g. for r4: [BPI-R4] bad switch performance in upload

I tested with the speed set manually, which worked, but it could be another issue.

When you say traffic between lan0 and lan2 is forwarded, I expect they are in different LAN segments (different subnets), right?

Have you tried disabling autoneg and set speed manually?

Maybe it is an autoneg issue.

Similar to this: BPI-R4: 100Mbit broken

sparkie · September 9, 2024, 6:45am

both ‘lan0’ and ‘lan2’ first are connected to simple 1Gbit switches. So basically the BPI-R3 sees only 1Gbit ports.

‘traffic between lan0 and lan2’ means: the BPI-R3 router forwards packets between ‘lan0’ and ‘lan2’ that are located on different subnets (aka 192.168.140.0/24 and 192.168.150.0/24)

there are no ‘dmesg’ messages at all on any machine involved at the time of error. But I will try to disable autoneg.

To ‘turn off tx-checksumming’ would be much more interesting IMHO. It appears not to work on the current kernel/drivers (please see above). Do you know what to do to turn it off?

meehien · September 9, 2024, 7:19am

This is the same error I was having. You need to run it on the switch (DSA) interface, i.e. the part after the @ in the port name as shown by ip link, e.g. eth0 for lan0@eth0. Really curious to see if it fixes the issue for you too.
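
A minimal sketch of that lookup, assuming the port really is a DSA user port: it reads one line of `ip -o link show <port>` and prints whatever follows the @. The function name `conduit_of` is made up. Note that VLAN and veth interfaces also show an @ part, so check the result before using it.

```shell
# Sketch: read one line of `ip -o link show <port>` on stdin and print the
# name after the "@", which for a DSA user port is the switch interface
# (e.g. eth0). Prints nothing if the interface name has no "@" part.
conduit_of() {
    awk '{
        name = $2
        sub(/:$/, "", name)                       # strip the trailing colon
        if (split(name, part, "@") == 2) print part[2]
    }'
}

# Intended use on the router (needs root and ethtool):
#   ethtool -K "$(ip -o link show lan0 | conduit_of)" tx off
```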

meehien · September 9, 2024, 7:27am

Hmm. I can describe the setup that was failing the most ‘reliably’ for me. Connect two BPI-R3 with an ethernet cable. Use a PC/device to connect to one of the BPI-R3s over wlan. Do some intensive TCP data transfers from the PC with the BPI-R3 you are not directly connected to. I am using the unison tool to backup my files, which uses rsync over ssh. This would fail to finish 99% of the time, with the error @sparkie printed above.

sparkie · September 9, 2024, 7:46am

it’s fantastic!

I can reproduce my issue within seconds in my LAN environment. So I tried to toggle between ‘tx on’ and ‘tx off’ on the interface originally named ‘eth0’ (thanks @meehien for the hint). The actual test runs between interfaces originally named ‘lan0’ and ‘lan2’ though.

My issue no longer appears with ‘tx off’, but it instantly reappears after setting ‘tx on’.

sparkie · September 9, 2024, 8:29am

after lots of experimenting this finally is the easiest setup I found to reproduce the issue within seconds:

hardware setup:

desktop A (Gbit) connected to BPI-R3 (lan0) via Gbit switch A (in 192.168.140.0/24)
desktop B (Gbit) connected to BPI-R3 (lan2) via Gbit switch B (in 192.168.150.0/24)
RaspberryPi (100Mbit) connected to BPI-R3 (lan2) via Gbit switch B (in 192.168.150.0/24)

some illustrating ASCII art:

    ------------------------
    desktop A (debian 11.11)
    ------------------------
                |
           -------------
           Gbit switch A
           -------------
                |
     -------------------------
      lan0 (192.168.140.0/24)
        BPI-R3 (debian 12.7)
      lan2 (192.168.150.0/24)
     -------------------------
                |
           -------------
           Gbit switch B
           -------------
                |        \
                |         \
-------------------------- \
Raspberry Pi (debian 11.9)  \
--------------------------   \
                              \
                             -----------------------
                             desktop B (debian 12.5)
                             -----------------------

software setup:

basically only 2 concurrent ‘rsyncs’ are needed copying some files around. All commands are started from desktop A

desktopA# ssh raspberrypi rm -frv /tmp/YYYYY; rsync --delete -vaX --numeric-ids source_dir raspberrypi:/tmp/YYYYY
desktopA# ssh desktopB rm -frv /tmp/YYYYY; rsync --delete -vaX --numeric-ids source_dir desktopB:/tmp/YYYYY

(started in different shells concurrently)
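
If two shells are inconvenient, the same concurrency can be had from one script. This is only a sketch: `run_pair` is a hypothetical helper that starts two command strings in the background and returns non-zero if either one fails; the rsync lines in the comment are the ones from this post.

```shell
# Sketch: run two commands concurrently and report whether either failed.
# Each argument is a command string handed to `sh -c`.
run_pair() {
    sh -c "$1" & pid1=$!
    sh -c "$2" & pid2=$!
    rc=0
    wait "$pid1" || rc=1   # collect the exit status of the first command
    wait "$pid2" || rc=1   # ... and of the second
    return "$rc"
}

# Intended use from desktop A:
#   run_pair \
#     'rsync --delete -vaX --numeric-ids source_dir raspberrypi:/tmp/YYYYY' \
#     'rsync --delete -vaX --numeric-ids source_dir desktopB:/tmp/YYYYY' \
#     || echo "at least one rsync failed"
```

Wrapping this in a loop while toggling the offload setting on the router would give a quick pass/fail check per configuration.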

error symptoms:

in case of error the ‘rsync’ running between ‘desktop A’ and ‘desktop B’ breaks with:

client_loop: send disconnect: Broken pipe
rsync: [sender] write error: Broken pipe (32)
rsync error: unexplained error (code 255) at io.c(823) [sender=3.2.3]

the other ‘rsync’ running between ‘desktop A’ and ‘raspberrypi’ is mostly not affected

successful workaround:

ethtool -K eth0 tx off

thanks to @meehien for providing this

caveats:

setting ‘tx’ to ‘off’ impacts network performance.

with ‘tx on’ (the default) the ‘iftop’ utility shows a stunning 117 MB/s when running a simple ‘netcat’ between ‘desktop A’ and ‘desktop B’. Excellent for a true routing/firewalling device.

alas, with ‘tx off’ (the workaround) ‘iftop’ shows no more than about 94 MB/s for the same test

desktopA# netcat desktopB 9000 < /dev/zero
desktopB# netcat -l 9000 > /dev/zero

without workaround:

desktopA# iftop -B

238MB          477MB         715MB          954MB    1.16GB
└─────────────┴──────────────┴─────────────┴──────────────┴──────────────
desktopA              <=> desktopB                 117MB   117MB   116MB

with workaround:

desktopA# iftop -B

238MB          477MB         715MB          954MB    1.16GB
└─────────────┴──────────────┴─────────────┴──────────────┴──────────────
desktopA              <=> desktopB                94.6MB  94.2MB  93.0MB
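
To put a number on the cost of the workaround, here is a tiny sketch that computes the relative drop between two rates. `pct_drop` is a made-up helper; the inputs are the first iftop samples from the two runs above.

```shell
# Sketch: percentage drop from a baseline rate to a reduced rate.
pct_drop() {
    # $1 = baseline (tx on), $2 = reduced (tx off); both in the same unit
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / a * 100 }'
}

pct_drop 117 94.6
```

This prints 19.1, so in this netcat test the workaround costs roughly a fifth of the routed throughput.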

sparkie · September 12, 2024, 6:35am

Is there anyone who feels responsible for fixing the problem? As far as we now know, it’s a firmware bug.

Since I can easily recreate the issue within seconds, I could always test new drivers/firmware if you so desire.

Thanks · September 12, 2024, 6:29pm

Maybe installing the latest OpenWrt SNAPSHOT firmware will solve your problem:

OpenWrt Firmware Selector

meehien · September 13, 2024, 7:39am

It actually doesn’t. This problem is also present on OpenWrt.

sparkie · October 11, 2024, 7:23am

Any news on this? None of the newest driver and firmware releases fixed it.

ericwoud · October 11, 2024, 8:02am

I have been testing networking extensively these last couple of weeks in the R3.

However, this was on the BPI-R3 running archlinuxarm, connected to a R3mini, a rk3588 and a R64 (all on archlinuxarm), in all sorts of setups.

Except for some retries on iperf3, I do not experience any networking problem…

ericwoud · October 11, 2024, 2:25pm

Did you try the solution as here:

https://superuser.com/questions/1355421/rsync-stopped-working-and-returns-rsync-error-unexplained-error-code-255-at

sparkie · October 11, 2024, 2:43pm

thank you for the tip. I changed ssh configuration on client and both servers as suggested.

Unfortunately the error still strikes within 20 - 30 seconds after the start of the reproduction test. I do not copy large files. The error mostly appears when both machines copy files of about 100kB in size. I run both rsyncs with the verbose flag so I can see the output flying by rapidly

setting

    ethtool -K eth0 tx off

“fixes” the issue as outlined above

ericwoud · October 11, 2024, 7:09pm

And how does iperf3 perform? Also with the -R option (reverse direction)?

Any difference with the different setting tx on/off

sparkie · October 12, 2024, 4:55am

I ran the iperf3 test according to my setup from above between desktop A (client) and desktop B (server)

configure router to: /sbin/ethtool -K eth0 tx off

 desktopA# iperf3 -fM -c desktopB
 Connecting to host desktopB, port 5201
 [  5] local 192.168.140.196 port 60240 connected to 192.168.150.150 port 5201
 [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
 [  5]   0.00-1.00   sec  91.4 MBytes  91.3 MBytes/sec    0   2.90 MBytes
 [  5]   1.00-2.00   sec  91.2 MBytes  91.2 MBytes/sec    0   2.90 MBytes
 [  5]   2.00-3.00   sec  90.0 MBytes  90.0 MBytes/sec    0   2.90 MBytes
 [  5]   3.00-4.00   sec  91.2 MBytes  91.2 MBytes/sec    0   2.90 MBytes
 [  5]   4.00-5.00   sec  90.0 MBytes  90.0 MBytes/sec    0   2.90 MBytes
 [  5]   5.00-6.00   sec  90.0 MBytes  90.0 MBytes/sec    0   2.90 MBytes
 [  5]   6.00-7.00   sec  91.2 MBytes  91.2 MBytes/sec    0   2.90 MBytes
 [  5]   7.00-8.00   sec  87.5 MBytes  87.5 MBytes/sec    0   2.90 MBytes
 [  5]   8.00-9.00   sec  88.8 MBytes  88.8 MBytes/sec    0   2.90 MBytes
 [  5]   9.00-10.00  sec  87.5 MBytes  87.5 MBytes/sec    0   2.90 MBytes
 - - - - - - - - - - - - - - - - - - - - - - - - -
 [ ID] Interval           Transfer     Bitrate         Retr
 [  5]   0.00-10.00  sec   899 MBytes  89.9 MBytes/sec    0             sender
 [  5]   0.00-10.02  sec   898 MBytes  89.7 MBytes/sec                  receiver

 iperf Done.

 desktopA# iperf3 -R -fM -c desktopB
 Connecting to host desktopB, port 5201
 Reverse mode, remote host desktopB is sending
 [  5] local 192.168.140.196 port 36590 connected to 192.168.150.150 port 5201
 [ ID] Interval           Transfer     Bitrate
 [  5]   0.00-1.00   sec  63.7 MBytes  63.7 MBytes/sec
 [  5]   1.00-2.00   sec  64.4 MBytes  64.4 MBytes/sec
 [  5]   2.00-3.00   sec  63.6 MBytes  63.6 MBytes/sec
 [  5]   3.00-4.00   sec  64.2 MBytes  64.2 MBytes/sec
 [  5]   4.00-5.00   sec  63.8 MBytes  63.8 MBytes/sec
 [  5]   5.00-6.00   sec  64.3 MBytes  64.3 MBytes/sec
 [  5]   6.00-7.00   sec  64.6 MBytes  64.6 MBytes/sec
 [  5]   7.00-8.00   sec  64.3 MBytes  64.3 MBytes/sec
 [  5]   8.00-9.00   sec  63.4 MBytes  63.4 MBytes/sec
 [  5]   9.00-10.00  sec  64.3 MBytes  64.3 MBytes/sec
 - - - - - - - - - - - - - - - - - - - - - - - - -
 [ ID] Interval           Transfer     Bitrate         Retr
 [  5]   0.00-10.02  sec   644 MBytes  64.3 MBytes/sec    0             sender
 [  5]   0.00-10.00  sec   641 MBytes  64.1 MBytes/sec                  receiver

 iperf Done.

configure router to: /sbin/ethtool -K eth0 tx on

 desktopA# iperf3 -fM -c desktopB
 Connecting to host desktopB, port 5201
 [  5] local 192.168.140.196 port 45992 connected to 192.168.150.150 port 5201
 [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
 [  5]   0.00-1.00   sec   115 MBytes   115 MBytes/sec    0   1.07 MBytes
 [  5]   1.00-2.00   sec   111 MBytes   111 MBytes/sec    0   1.61 MBytes
 [  5]   2.00-3.00   sec   111 MBytes   111 MBytes/sec    0   2.47 MBytes
 [  5]   3.00-4.00   sec   111 MBytes   111 MBytes/sec    0   2.70 MBytes
 [  5]   4.00-5.00   sec   112 MBytes   113 MBytes/sec    0   2.76 MBytes
 [  5]   5.00-6.00   sec   110 MBytes   110 MBytes/sec    0   2.84 MBytes
 [  5]   6.00-7.00   sec   108 MBytes   108 MBytes/sec    0   2.84 MBytes
 [  5]   7.00-8.00   sec   109 MBytes   109 MBytes/sec    0   2.84 MBytes
 [  5]   8.00-9.00   sec   110 MBytes   110 MBytes/sec    0   2.84 MBytes
 [  5]   9.00-10.00  sec   111 MBytes   111 MBytes/sec    0   2.84 MBytes
 - - - - - - - - - - - - - - - - - - - - - - - - -
 [ ID] Interval           Transfer     Bitrate         Retr
 [  5]   0.00-10.00  sec  1.08 GBytes   111 MBytes/sec    0             sender
 [  5]   0.00-10.01  sec  1.08 GBytes   111 MBytes/sec                  receiver

 iperf Done.

 desktopA# iperf3 -R -fM -c desktopB
 Connecting to host desktopB, port 5201
 Reverse mode, remote host desktopB is sending
 [  5] local 192.168.140.196 port 58096 connected to 192.168.150.150 port 5201
 [ ID] Interval           Transfer     Bitrate
 [  5]   0.00-1.00   sec  73.9 MBytes  73.9 MBytes/sec
 [  5]   1.00-2.00   sec  74.5 MBytes  74.5 MBytes/sec
 [  5]   2.00-3.00   sec  74.0 MBytes  74.0 MBytes/sec
 [  5]   3.00-4.00   sec  73.6 MBytes  73.6 MBytes/sec
 [  5]   4.00-5.00   sec  75.5 MBytes  75.5 MBytes/sec
 [  5]   5.00-6.00   sec  74.3 MBytes  74.3 MBytes/sec
 [  5]   6.00-7.00   sec  73.2 MBytes  73.2 MBytes/sec
 [  5]   7.00-8.00   sec  74.1 MBytes  74.1 MBytes/sec
 [  5]   8.00-9.00   sec  73.9 MBytes  73.9 MBytes/sec
 [  5]   9.00-10.00  sec  73.7 MBytes  73.7 MBytes/sec
 - - - - - - - - - - - - - - - - - - - - - - - - -
 [ ID] Interval           Transfer     Bitrate         Retr
 [  5]   0.00-10.01  sec   744 MBytes  74.2 MBytes/sec    0             sender
 [  5]   0.00-10.00  sec   741 MBytes  74.1 MBytes/sec                  receiver

 iperf Done.
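
For comparing runs like these without reading the whole log, a small sketch that pulls the rate out of the final receiver line of iperf3 text output. `receiver_rate` is a hypothetical helper and assumes output in the format shown above.

```shell
# Sketch: read iperf3 text output on stdin and print the rate from the last
# "receiver" summary line (the number just before the ".../sec" unit field).
receiver_rate() {
    awk '/receiver[ \t]*$/ {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /\/sec$/) rate = $(i - 1)
         }
         END { if (rate != "") print rate }'
}

# Intended use on desktop A:
#   iperf3 -fM -c desktopB | receiver_rate
```

Applied to the four runs above, the receiver rates are 89.7 and 64.1 MBytes/sec with tx off versus 111 and 74.1 MBytes/sec with tx on (forward and reverse respectively).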