[BPI-R3] weird networking issue (and weirder "solution")

Hi all

The weird problem. For the last year or so I have been having this weird networking issue: large HTTPS streams (GitHub, file downloads, online gaming, streaming, etc.) randomly fail with an SSL decryption error and need to be restarted, leading to connection interruptions (see [1] and [2] below). This problem seems to happen only on wired (Ethernet) connections that operate at around 300 Mbps. Faster connections (or different network adapters) do not seem to have this issue. I have been able to reproduce the problem consistently between (1) two BPI-R3s connected through PowerLAN adapters (both Devolo and TP-Link); (2) a BPI-R3 and any PC connected through the PowerLAN; and, most importantly, (3) a BPI-R3 directly connected to a Cable Matters USB to Ethernet Adapter [3]. Additional relevant system specs are given in [4] below.

The weird solution. As I was trying to diagnose this, I “mistakenly” used the following iptables-nft rule (note the missing --tcp-flags option) which seems to “fix” the problem.

iptables-nft -t mangle -A FORWARD -o brlan -p tcp -j TCPMSS --clamp-mss-to-pmtu

As soon as the rule is removed, or --tcp-flags/--syn is added, the problem reappears. The problem also manifests, at all times, with native nftables rules (probably because, as far as I can tell, there is no way to create a “partial rule” such as the one above).
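
For reference, here is a minimal sketch of the arithmetic behind --clamp-mss-to-pmtu. It assumes IPv4 with 20-byte IP and TCP headers (no options), so the limit is the path MTU minus 40, and it assumes the MSS is only ever lowered, never raised. The function name and the MTU values are hypothetical and only illustrate the calculation; this is not the kernel code. Since the MSS option is only sent in SYN segments, the usual form of the rule matches on --tcp-flags SYN,RST SYN.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def clamp_mss(advertised_mss, path_mtu):
    """Return the MSS after clamping it to what fits in the path MTU."""
    limit = path_mtu - IPV4_HEADER - TCP_HEADER
    # Only lower the advertised value; never raise it.
    return min(advertised_mss, limit)


print(clamp_mss(1460, 1500))  # 1460: already fits a 1500-byte path
print(clamp_mss(1460, 1492))  # 1452: lowered for a hypothetical 1492-byte path
```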

Help requested.

  1. Does anyone have any idea/hunch why the above rule addresses the problem?
  2. So far I have found it quite difficult to debug the issue. Can anyone suggest how they would approach debugging this? (I have tried traffic dumps but am not sure exactly what to look for.)
  3. At this point I suspect it might be a driver-related issue. Is anyone aware of any relevant patches?

Thanks. I can provide additional details if needed.

Extra details:

[1]

$ wget -O /dev/null "https://software.download.prss.microsoft.com/dbazure/Win11_23H2_EnglishInternational_x64v2.iso?t=<RANDOM_TOKEN>"
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving software.download.prss.microsoft.com (software.download.prss.microsoft.com)... 152.199.21.175, 2606:2800:233:1cb7:261b:1f9c:2074:3c
Connecting to software.download.prss.microsoft.com (software.download.prss.microsoft.com)|152.199.21.175|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6797694976 (6.3G) [application/octet-stream]
Saving to: ‘/dev/null’

/dev/null                       0%[                                                  ]  41.30M  30.1MB/s    in 1.4s

2024-02-23 10:23:29 (30.1 MB/s) - Read error at byte 43302901/6797694976 (Decryption has failed.). Retrying.

[2]

curl -o /dev/null "https://software.download.prss.microsoft.com/dbazure/Win11_23H2_English_x64v2.iso"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 15 6497M   15 1015M    0     0  28.8M      0  0:03:45  0:00:35  0:03:10 31.4M
curl: (56) OpenSSL SSL_read: OpenSSL/3.2.1: error:0A000119:SSL routines::decryption failed or bad record mac, errno 0

[3] Amazon.co.uk

[4] ArchLinuxArm system with various 6.x kernels from:

  1. GitHub - ericwoud/linux: Linux kernel source tree
  2. GitHub - frank-w/BPI-Router-Linux: Linux kernel 4.14+ for BPI-R2, 5.4+ for R64, 6.1+ for R2Pro and R3

Is the kernel module present at all? I would like to use --clamp-mss-to-pmtu together with ‘iptables’, but in frank-w’s kernel no such module exists :sleepy: :

    # /sbin/iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
    Warning: Extension TCPMSS revision 0 not supported, missing kernel module?
    iptables v1.8.9 (nf_tables):  RULE_APPEND failed (No such file or directory): rule in chain FORWARD

Why do you specify an interface (‘-o brlan’) at all? Simply skip this to clamp the MSS for both the forward and the backward packet flow.

Hi @sparkie, please don’t hijack my thread.

That said, yes, on my device the kernel module is present. I have modified the kernel config to include all netfilter/iptables modules. You can probably do the same if you need the MSS functionality. Also, ttbomk, clamping should go in the mangle table, not filter.

Also, in the particular situation I described I want to apply the rule only for the affected interface i.e. brlan.

I am not trying to do MSS clamping, but rather figure out how to prevent the https connections resetting. The “malformed” iptables rule given seems to be the only thing that fixes this (in case anyone visits this forum), but I don’t understand why, nor what the actual cause of the problem is to begin with (and it is not the clamping/MSS, as far as I can tell).

I have obtained another clue for this. It would seem that whenever the TCP connection reset happens, the calculated window size abruptly changes from 68096 to 4175104 (see attached pics). Not sure what to make of this and would appreciate some help.
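
To make the two numbers easier to reason about: the calculated window size is the raw 16-bit window field from the TCP header shifted left by the window-scale value negotiated in the handshake. The sketch below lists every shift (0 to 14) that is consistent with both observed values; the function name is hypothetical, and which shift is actually in use has to be read from the SYN/SYN-ACK of the affected connection.

```python
MAX_RAW_WINDOW = 0xFFFF  # the TCP header window field is 16 bits wide
MAX_WINDOW_SCALE = 14    # largest shift count allowed for window scaling


def candidate_scales(calculated):
    """Map each feasible window-scale shift to the raw window it implies."""
    result = {}
    for shift in range(MAX_WINDOW_SCALE + 1):
        raw, remainder = divmod(calculated, 1 << shift)
        if remainder == 0 and raw <= MAX_RAW_WINDOW:
            result[shift] = raw
    return result


before = candidate_scales(68096)
after = candidate_scales(4175104)
for shift in sorted(before.keys() & after.keys()):
    print(f"shift {shift}: raw window {before[shift]} -> {after[shift]}")
# shift 6: raw window 1064 -> 65236
# shift 7: raw window 532 -> 32618
# shift 8: raw window 266 -> 16309
```

Only shifts 6, 7 and 8 fit both values. With a shift of 6 the second raw value would sit close to the 16-bit maximum of 65535; whether that is meaningful depends on the shift the connection really negotiated.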

Good catch, maybe related…

Such a huge jump looks like some kind of overflow… not of a C integer datatype but of some register in hardware.

Or it is simply a calculation issue based on some wrong value read out of it.

Any idea where to start looking?

Not really… Do you have a trace like the one reported on OpenWrt (transmit queue timeout)? Is this in both directions or only one? You can test this using iperf3… Is this only with the R3 (test with a different device at both ends and the same kernel)? Maybe it is an underlying bug and not MediaTek/R3-specific.

This is a BPI-R3-related issue. I am not observing this with any of my other devices. I don’t have any OpenWrt systems currently at hand but might set something up. Also, the above error might not be related, see below.

In the meantime, I have some new observations.

  1. I have been unable to reproduce the error on a direct iperf3 connection: e.g. device to bpi-r3.
  2. I have been unable to reproduce the error on a remote iperf3 connection, e.g. device to an external server such as iperf3.moji.fr

So it would seem it has something to do with HTTP/HTTPS. For this I have set up a local HTTPS server and retried the wget test. I have not observed the error (mostly; this needs some more thorough tests). The error, however, comes up as soon as I try to fetch something external.
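
One possible way to separate TLS from the underlying transport is to fetch the same file several times over plain HTTP and compare digests. TLS reports a bad record MAC when the received bytes fail authentication; without TLS, any corruption that gets past the TCP checksum would be silent, and differing digests between runs would expose it. The sketch below is only an illustration under that assumption; the helper name is hypothetical and the self-check uses an in-memory stream instead of a real download.

```python
import hashlib
import io


def digest_stream(stream, chunk_size=1 << 16):
    """Read a binary stream to the end; return (byte_count, sha256_hex)."""
    sha = hashlib.sha256()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        sha.update(chunk)
        total += len(chunk)
    return total, sha.hexdigest()


# Self-check on an in-memory stream.
payload = b"x" * 200_000
size, digest = digest_stream(io.BytesIO(payload))
print(size, digest == hashlib.sha256(payload).hexdigest())  # 200000 True

# For a real test, pass the response object returned by
# urllib.request.urlopen(url), with url pointing at your own test file,
# and compare the digests of several runs.
```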

I did, however, notice a new symptom. Whenever the error occurs, my SSH connection to the BPI-R3 stalls. To me this seems like a buffer-like problem related to data shuffling between the wan port and (e.g.) the lan0 switch port.

I would like to try using the sfp2 port, the one connected to the other eth interface, for some more tests. Is it possible to connect a standard Cat (RJ45) Ethernet SFP module to this one, or does it only support fiber (which I don’t have :frowning: )? Also, related: are all the possible port pairs on the main switch (eth0) equivalent, or is the wan port special in some way?

Also, does anyone know where the MediaTek switch is implemented in the kernel? Maybe I could get a glimpse of how the shuffling between ports is done from there.

There are several SFP modules. At 1 Gbps, you could use a module that is reported to support direct PHY access through I2C. These modules usually have hardware inside that is supported in the kernel. Get one with an 88E1111 inside. At 2.5 Gbps, I’ve been working on the RTL8221B modules. There are now 2 known modules. I’m sending something upstream soon.

With other RJ45 modules, the PHY is not accessible and control over the hardware is very limited. They present themselves as optical modules and are treated as such by the kernel.