The weird problem. For the last year or so I have been having this weird networking issue: large https streams (GitHub, file downloads, online gaming, streaming, etc.) randomly fail with an SSL decryption error and need to be restarted, leading to connection interruptions (see below at [1] and [2]). This problem seems to only happen for eth (wired) connections that operate around 300 Mbps; faster connections (or different network adapters) do not seem to have this issue. I have been able to reproduce the problem consistently between (1) two BPI-R3s connected with PowerLAN adapters (both Devolo and TP-Link); (2) a BPI-R3 and any PC connected through the PowerLAN; and, most importantly, (3) a BPI-R3 directly connected to a Cable Matters USB-to-Ethernet adapter [3]. Additional relevant system specs below at [4].
The weird solution. As I was trying to diagnose this, I “mistakenly” used the following iptables-nft rule (note the missing --tcp-flags option), which seems to “fix” the problem.
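Presumably the rule was something along these lines (a reconstruction based on the missing --tcp-flags remark and the clamping discussion further down; not the verbatim original):

# note: no "--tcp-flags SYN,RST SYN" / "--syn" match, so this hits every forwarded TCP packet
iptables -t mangle -A FORWARD -p tcp -j TCPMSS --clamp-mss-to-pmtu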
As soon as the rule is removed, or the --tcp-flags/--syn match is added, the problem reappears. The problem also manifests, at all times, with native nftables rules (probably because, as far as I can tell, there is no way to create a “partial” rule such as the one above).
Help requested.
Does anyone have any idea/hunch why the above rule addresses the problem?
So far I have found it quite difficult to debug the issue. Can anyone suggest how they would approach debugging this? (I tried traffic dumps, sketched below, but am not sure exactly what to look for.)
At this point I suspect it might be a driver-related issue. Is anyone aware of any relevant patches?
Thanks. I can provide additional details if needed.
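For reference, the kind of traffic dump I have been attempting (standard tcpdump usage, nothing BPI-specific; interface and port match my setup):

# capture full packets (-s 0) so checksums can be verified, limited to the failing https stream
tcpdump -i eth0 -s 0 -w r3-side.pcap 'tcp port 443'
# -vv annotates bad checksums; note that with tx checksum offload enabled,
# locally *generated* packets normally show "incorrect" checksums in a local capture
tcpdump -r r3-side.pcap -vv | grep -i 'incorrect'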
Extra details:
[1]
$ wget -O /dev/null "https://software.download.prss.microsoft.com/dbazure/Win11_23H2_EnglishInternational_x64v2.iso?t=<RANDOM_TOKEN>"
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving software.download.prss.microsoft.com (software.download.prss.microsoft.com)... 152.199.21.175, 2606:2800:233:1cb7:261b:1f9c:2074:3c
Connecting to software.download.prss.microsoft.com (software.download.prss.microsoft.com)|152.199.21.175|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6797694976 (6.3G) [application/octet-stream]
Saving to: ‘/dev/null’
/dev/null 0%[ ] 41.30M 30.1MB/s in 1.4s
2024-02-23 10:23:29 (30.1 MB/s) - Read error at byte 43302901/6797694976 (Decryption has failed.). Retrying.
[2]
curl -o /dev/null "https://software.download.prss.microsoft.com/dbazure/Win11_23H2_English_x64v2.iso"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
15 6497M 15 1015M 0 0 28.8M 0 0:03:45 0:00:35 0:03:10 31.4M
curl: (56) OpenSSL SSL_read: OpenSSL/3.2.1: error:0A000119:SSL routines::decryption failed or bad record mac, errno 0
That said, yes, on my device the kernel module is present. I have modified the kernel config to include all netfilter/iptables modules; you can probably do the same if you need the MSS functionality. Also, to the best of my knowledge, clamping should go in the mangle table, not filter.
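For reference, the canonical (well-formed) clamp rule looks like this:

# clamp MSS on SYN packets only, in the mangle table
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu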
Also, in the particular situation I described I want to apply the rule only to the affected interface, i.e. brlan.
I am not trying to do MSS clamping, but rather to figure out how to prevent the https connections from resetting. The “malformed” iptables rule given above seems to be the only thing that fixes this (in case anyone visits this forum later), but I don’t understand why, nor what the actual cause of the problem is to begin with (and it is not the clamping/MSS, as far as I can tell).
I have obtained another clue. It would seem that whenever the TCP connection reset happens, the calculated window size abruptly changes from 68096 to 4175104 (see attached pics). Not sure what to make of this and would appreciate some help.
Not really… do you have a trace like the one reported on OpenWrt (transmit queue timeout)? Is this happening in both directions or only one? You can test this using iperf3. Also, is this only with the R3? Test with a different device at both ends and the same kernel; maybe it is an underlying bug and not mediatek/R3-specific.
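For reference, a minimal way to run that test (standard iperf3 usage; the server address is whatever end point you have at hand):

# on the far end
iperf3 -s
# on the near end: forward direction, then reverse (-R) to test the other direction
iperf3 -c <server-ip> -t 60
iperf3 -c <server-ip> -t 60 -R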
This is a BPI-R3-related issue. I am not observing this with any of my other devices. I don’t have any OpenWrt systems currently at hand but might set something up. Also, the above error might not be related; see below.
In the meantime, I have some new observations.
I have been unable to reproduce the error on a direct iperf3 connection (e.g. device to bpi-r3).
I have been unable to reproduce the error on a remote iperf3 connection (e.g. device to an external server such as iperf3.moji.fr).
So it would seem it has something to do with http/https. To test this, I set up a local https server and retried the wget test. I have not observed the error there (mostly; this needs some more thorough testing). The error, however, comes up as soon as I try to fetch something external.
I did, however, notice a new symptom: whenever the error occurs, my ssh connection to the bpi-r3 stalls. To me this looks like a buffer-related problem in the data shuffling between the wan port and the (e.g.) lan0 switch port.
I would like to try to use the sfp2 port, the one connected to the other eth interface, for some more tests. Is it possible to connect a standard copper (RJ45) Ethernet SFP module to it, or does it only support fiber (which I don’t have)? Also, related: are all the possible port pairs on the main switch (eth0) equivalent, or is the wan port special in some way?
Also, does anyone know where the mediatek switch is implemented in the kernel? Maybe I could get a glimpse from there of how the shuffling between ports is done.
There are several sfp modules. At 1Gbps, you could use a module that is reported to support direct phy access through i2c. These modules usually have hardware inside that is supported in the kernel; get one with an 88E1111 inside. At 2.5Gbps, I’ve been working on the rtl8221b modules. There are now 2 known modules. I’m sending something upstream soon.
With other rj45 modules, the phy is not accessible and control over the hardware is very limited. They present themselves as optical modules and are treated as such by the kernel.
I have a quick update on this. The problem went away in kernel 6.9, but unfortunately it resurfaced with kernels 6.10 and 6.11. I was able to do more testing recently and I now believe that the problem is related to checksum offload calculations on the BPI-R3. The iptables “fix” from above seems to have been just a lucky positive interaction.
My new way of addressing this issue is to disable tx offloading:
/sbin/ethtool -K eth0 tx off
I haven’t noticed any speed reduction (probably because the bandwidth bottleneck is caused by the PowerLAN adapters), but if anyone encounters this problem in the future they might try this.
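Since ethtool settings do not survive a reboot, one way to make the workaround persistent (a sketch, assuming an ifupdown/Debian-style setup like the one described later in this thread) is a post-up hook:

# /etc/network/interfaces fragment (hypothetical; adjust to your own config)
auto eth0
iface eth0 inet manual
    post-up /sbin/ethtool -K eth0 tx off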
Yes, tso gets turned off by tx off, and cannot then be turned back on by itself. With just tso off I still get the error. The following is the only config that works for me currently.
~ # ethtool -k eth0 | grep ": on"
rx-checksumming: on
scatter-gather: on
tx-scatter-gather: on
generic-segmentation-offload: on
generic-receive-offload: on
tx-vlan-offload: on
hw-tc-offload: on
I have only tried turning things off from the defaults. I should probably also have mentioned that eth0 is the generic switch (DSA) interface.
I have also had a weird issue for months with my BPI-R3. I don’t know if it’s related to the issue discussed here.
If it’s related I would have a very quick way to reproduce the problem.
Basically my BPI-R3 is running Debian 12.7, configured as a firewall routing packets between different subnets.
I can reproduce my issue very quickly. It even happens when running only two 'rsync's across ‘lan0’ and ‘lan2’, with 3 machines involved. One machine must use a 100Mbit LAN interface; the others all have 1Gbit.
Packets between ‘lan0’ and ‘lan2’ are forwarded/firewalled with iptables rules.
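(A hypothetical, simplified version of such rules, just to illustrate the kind of setup; not the literal ruleset:)

iptables -A FORWARD -i lan0 -o lan2 -j ACCEPT
iptables -A FORWARD -i lan2 -o lan0 -j ACCEPT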
Within a few seconds after starting the rsync test, the transfer aborts with an error, probably because it gets a tcp-reset.
I experimented a lot to work around the issue (with kernels between 6.8 and 6.11) but nothing helped, except replacing ‘lan0’ with an external USB Ethernet adapter.
I tried to check whether
/sbin/ethtool -K lan0 tx off
could help me, but the change is not applied:
# ethtool -K lan0 tx off
Actual changes:
tx-checksum-ipv4: on [requested off]
tx-checksum-ipv6: on [requested off]
Is this because the settings are all ‘fixed’?
# ethtool -k lan0 | less
Features for lan0:
rx-checksumming: on [fixed]
tx-checksumming: on
tx-checksum-ipv4: on [fixed]
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on [fixed]
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on [fixed]
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: on [fixed]
tx-tcp6-segmentation: on [fixed]
[...]
I currently run a kernel (6.10.0-bpi-r3-main) built from:
git clone git@github.com:frank-w/BPI-Router-Linux.git
cd BPI-Router-Linux
git checkout 6.10-main
Both ‘lan0’ and ‘lan2’ are first connected to simple 1Gbit switches, so the BPI-R3 basically sees only 1Gbit ports.
‘Traffic between lan0 and lan2’ means: the BPI-R3 router forwards packets between ‘lan0’ and ‘lan2’, which are located on different subnets (namely 192.168.140.0/24 and 192.168.150.0/24).
There are no ‘dmesg’ messages at all on any machine involved at the time of the error. But I will try to disable autoneg.
Turning off tx-checksumming would be much more interesting IMHO, but it appears not to work on the current kernel/drivers (please see above). Do you know what to do to turn it off?
This is the same error I was having. You need to run it on the switch (DSA master) interface, i.e. the part after the ‘@’ in the ip link output, eth0 in this case. Really curious to see if it fixes the issue for you too.
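For example (interface names as in this thread; output abbreviated):

# lan0 is a DSA user port; its master shows up after the '@'
$ ip -br link show lan0
lan0@eth0        UP     ...
# so the offload knob has to be toggled on eth0, not on lan0
$ ethtool -K eth0 tx off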
Hmm. I can describe the setup that was failing most ‘reliably’ for me. Connect two BPI-R3s with an ethernet cable. Use a PC/device to connect to one of the BPI-R3s over wlan. Do some intensive TCP data transfers between the PC and the BPI-R3 you are not directly connected to. I am using the unison tool to back up my files, which uses rsync over ssh. This would fail to finish 99% of the time, with the error @sparkie printed above.
I can reproduce my issue within seconds in my LAN environment. So I tried to toggle between ‘tx on’ and ‘tx off’ on the interface originally named ‘eth0’ (thanks @meehien for the hint). The actual test runs between the interfaces originally named ‘lan0’ and ‘lan2’, though.
My issue no longer appears with ‘tx off’, but instantly reappears after setting ‘tx on’.
However, setting ‘tx’ to ‘off’ impacts network performance:
With ‘tx on’ (the default), the ‘iftop’ utility shows a stunning 117MB/s when running a simple ‘netcat’ between ‘desktop A’ and ‘desktop B’. Excellent for a truly routing/firewalling device.
Alas, with ‘tx off’ (the workaround), ‘iftop’ shows no more than about 94MB/s for the same test.
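For reference, the kind of netcat throughput test meant here (a hypothetical invocation; flags differ between netcat variants):

# desktop A: receive and discard (OpenBSD netcat syntax)
nc -l 9999 > /dev/null
# desktop B: push a few GB of zeros through the router
dd if=/dev/zero bs=1M count=4096 | nc <desktop-A-ip> 9999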