fewer interrupts may cause a buffer overflow
So there are some other interesting behaviors here, at least on 5.9.0. When the “transmit timeout” oops is triggered, the MAC gets reinitialized and everything resumes functioning properly. Other times there are no logs and the broken frames just start happening. More interestingly, sockets to clients don’t disconnect in the oops case; I guess because the kernel never sees the PHYs go away? (That is probably deeper into the socket plumbing than either of us cares to go.) Those little hiccups might not be such a big deal, though, if TCP connections don’t die and it’s as infrequent as it appears. I might try turning down the multiplier on that watchdog to see if I can make it trigger more often and reset the MAC. That assumes the two issues are even related, which I don’t actually know; perhaps the thing to do is to catch the timeout condition earlier and reset the MAC before it upsets communication with the switch. It will be a few days before I can try it, though. I really wish I had a better understanding of why longer frames stay broken.
Any pointers to documentation on the interlink between the MAC and the switch?
Also, it looks like they are still struggling with the same issues on MediaTek hardware over in OpenWrt land: https://forum.openwrt.org/t/mtk-soc-eth-watchdog-timeout-after-r11573
but this depends on packet steering, which seems to cause the problem… IMHO, RPS should be disabled by default, but I’m unsure how to check this.
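One way to check might be the sysfs mask (a sketch, assuming the usual Linux layout where each RX queue exposes a hex CPU bitmask in `rps_cpus`, and an all-zero mask means RPS is off for that queue; the device name `eth0` is just an example):

```python
# Sketch: check whether RPS is enabled, assuming the standard Linux
# sysfs layout: /sys/class/net/<dev>/queues/rx-*/rps_cpus holds a hex
# CPU bitmask, and an all-zero mask means RPS is disabled.
from pathlib import Path

def rps_enabled(mask_text):
    """Masks may be comma-grouped on big machines, e.g. '00000000,00000003'."""
    return int(mask_text.strip().replace(",", ""), 16) != 0

def check_rps(dev="eth0"):
    # Print the RPS state of every RX queue of the given device.
    for q in sorted(Path(f"/sys/class/net/{dev}/queues").glob("rx-*")):
        mask = (q / "rps_cpus").read_text()
        state = "on" if rps_enabled(mask) else "off"
        print(f"{q.name}: rps {state} (mask {mask.strip()})")
```

Writing a zero mask back (`echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus`) disables RPS per queue.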
No luck trying to make the watchdog more aggressive as a workaround. I changed the multiplier from 5 * HZ to 2 * HZ, and I still hit the broken-packet condition before triggering the timeout oops.
Rene also gave me the tip to disable pause, but not much more (i.e. where exactly the issue happens). As far as I have read, it seems to be a problem in the DMA handling of TX packets over the GMAC.
Just a hypothesis: maybe pause frames on the GMAC stall all traffic. A pause frame coming in on a DSA port gets a DSA tag, goes over the GMAC, and the tag is removed at the CPU port. So the CPU port “sees” a pause frame and stops sending out data, including to the other DSA ports, which maybe causes the timeout. If we disable flow control, pause frames on the CPU port are simply ignored, so this stall does not happen. But if you generate more traffic than the CPU port can handle, some packets are dropped, resulting in retransmits.
The trouble with that theory is that it does not explain why the broken-packet issue happens even when we feed the kernel a DTS with no pause frames enabled, unless that does not really disable pause fully. I was able to trigger both the timeout oops and the broken-frame issue with pause off on both the switch and the GMAC in the DTS. It’s not just retransmits either: once the broken frames start, they stay broken, and only short payloads like ICMP messages work.
The tests I have done seem to say that one or more of the following is true:
- removing pause in the DTS does not fully disable pause-frame handling.
- pause frames are not the trigger of the broken frames issue.
- pause frames are not the only trigger of the broken frames issue.
The transmit timeout oops, while likely related, appears to be a separate and recoverable condition: following that event, correctly formed packets can again be sent from the switch ports without a reboot. It does appear that the watchdog triggers the hardware init code again.
It would be good to know what the “broken” packets look like in comparison to the original packet. Are bytes cut off somewhere (e.g. because of adding the tag for the GMAC), overwritten, or completely scrambled?
That is a good question. I’ll get some remote side packet captures when I get home tonight.
I guess we need to modify the code to log the packet coming from the DSA port, going out on the GMAC (packet + mtk-tag), and received on the CPU side (tag removed), and then the reverse direction.
Well, it was worth doing a packet capture again, because I think I learned something probably important. I was working from some of my past assumptions and old information before. I captured packets at the client before running the samba abuse test to break the thing. The test case: connect to a simple netcat listener on the PC, running on port 3000.
If you open the pcap in Wireshark, it shows a SYN from the R2 to the PC with a correct TCP checksum, followed by a SYN,ACK to the R2 with a bad checksum (as expected, because the PC does checksum offload).
After running the abuse test until the network stops functioning, you see at the client a SYN from the R2 with a bad checksum (everything else about the packet is identical except for the identification and checksum fields). The client does not respond because the checksum is bad, and the R2 retransmits the SYNs, each time sending a packet with a bad TCP checksum. ICMP ECHO/ECHOREPLY work because they don’t validate those checksums. So what we actually appear to have is some condition where TCP checksum calculation stops working.
I might explore in the code whether it’s possible to disable TSO, and see if that lets things work dependably.
So I looked at the driver code and found it already supports disabling the checksum offload. I used
ethtool -K eth0 rx off tx off
It’s been running my abuse test all night. Not conclusive that it won’t break, but I’m pretty positive. Disabling the checksum offload does impact the CPU load a bit. I’ll see how bad the damage is in terms of throughput with iperf a bit later, once I’m satisfied it stops the failures.
Any ideas why checksums would start being calculated wrong and get stuck that way? I looked at the packet captures from the client again, hoping to see something like the same checksum value on every bad packet, as if the hardware had just stopped updating a register or something, but that does not seem to be the case.
I guess the data at the checksum position is no longer the checksum.
(H = header, D = data, T = DSA tag, C = checksum) — a very simplified Ethernet frame, based on a fixed frame size:
before: HHDDDDDDDDCC
after: TTHHDDDDDDDD
Normally the frame would need to be split to make space for the checksum, so you basically get two frames with this structure.
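The byte-shift idea above can be sketched like this (purely illustrative, reusing the H/D/T/C letters; the two-character tag just stands in for a real DSA tag):

```python
# Illustrative sketch of the hypothesis above: with a fixed frame size,
# prepending a tag (TT) pushes the trailing bytes, checksum included,
# out of the frame, so the bytes at the checksum position are payload.
FRAME_LEN = 12
header, data, csum = b"HH", b"D" * 8, b"CC"
before = (header + data + csum)[:FRAME_LEN]
after = (b"TT" + header + data + csum)[:FRAME_LEN]  # tag prepended, tail lost
print(before.decode())  # HHDDDDDDDDCC
print(after.decode())   # TTHHDDDDDDDD
```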
Thanks for the reply, but I am not quite sure I follow. Here are simple SYN frames as received at the client, both opening 192.168.16.10:3000 from the R2.
A frame before the error condition
0000  18 a9 05 d5 4c 33 de ad be ef 00 ff 08 00 45 00
0010  00 3c 02 3f 40 00 3f 06 95 21 c0 a8 13 01 c0 a8
0020  10 0a 8b b8 0b b8 f4 86 fd 29 00 00 00 00 a0 02
0030  fa f0 33 f8 00 00 02 04 05 b4 04 02 08 0a e3 42
0040  08 58 00 00 00 00 01 03 03 06
A frame with a bad checksum at offset 0x0032
0000  18 a9 05 d5 4c 33 de ad be ef 00 ff 08 00 45 00
0010  00 3c d8 15 40 00 3f 06 bf 4a c0 a8 13 01 c0 a8
0020  10 0a a0 4c 0b b8 e9 1f 17 7d 00 00 00 00 a0 02
0030  fa f0 16 30 00 00 02 04 05 b4 04 02 08 0a 6b a0
0040  cd b0 00 00 00 00 01 03 03 06
I don’t see any differences besides the checksums and the timestamps. All of the fields appear identical and correctly aligned to their respective offsets, so it seems like a calculation issue rather than a packing issue.
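For what it’s worth, the stored checksums in the two dumps above can be recomputed by hand. A small sketch (assuming plain IPv4/TCP frames with a 14-byte Ethernet header and no VLAN tag, which matches the dumps):

```python
# Recompute the TCP checksum (RFC 1071 one's-complement sum over the
# IPv4 pseudo-header plus the TCP segment) for the two captured SYNs.

GOOD = """
0000  18 a9 05 d5 4c 33 de ad be ef 00 ff 08 00 45 00
0010  00 3c 02 3f 40 00 3f 06 95 21 c0 a8 13 01 c0 a8
0020  10 0a 8b b8 0b b8 f4 86 fd 29 00 00 00 00 a0 02
0030  fa f0 33 f8 00 00 02 04 05 b4 04 02 08 0a e3 42
0040  08 58 00 00 00 00 01 03 03 06
"""

BAD = """
0000  18 a9 05 d5 4c 33 de ad be ef 00 ff 08 00 45 00
0010  00 3c d8 15 40 00 3f 06 bf 4a c0 a8 13 01 c0 a8
0020  10 0a a0 4c 0b b8 e9 1f 17 7d 00 00 00 00 a0 02
0030  fa f0 16 30 00 00 02 04 05 b4 04 02 08 0a 6b a0
0040  cd b0 00 00 00 00 01 03 03 06
"""

def parse_hexdump(dump):
    """Hex dump lines ("offset b0 b1 ...") to raw bytes, offsets dropped."""
    out = bytearray()
    for line in dump.strip().splitlines():
        out += bytes(int(b, 16) for b in line.split()[1:])
    return bytes(out)

def tcp_checksum(frame):
    """Recompute the TCP checksum of an untagged IPv4-over-Ethernet frame."""
    ip = frame[14:]                               # strip Ethernet header
    ihl = (ip[0] & 0x0F) * 4
    total_len = int.from_bytes(ip[2:4], "big")
    seg = bytearray(ip[ihl:total_len])
    seg[16:18] = b"\x00\x00"                      # zero the checksum field
    # pseudo-header: src addr, dst addr, zero, protocol 6, TCP length
    data = ip[12:20] + b"\x00\x06" + len(seg).to_bytes(2, "big") + seg
    if len(data) % 2:
        data += b"\x00"
    total = sum(int.from_bytes(data[i:i + 2], "big")
                for i in range(0, len(data), 2))
    while total >> 16:                            # fold the carries
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

for name, dump in (("good", GOOD), ("bad", BAD)):
    frame = parse_hexdump(dump)
    stored = int.from_bytes(frame[50:52], "big")  # 14 + 20 + 16
    print(f"{name}: stored {stored:#06x}, computed {tcp_checksum(frame):#06x}")
```

Running this, the first frame’s stored checksum (0x33f8) matches the recomputed value, while the second frame’s stored 0x1630 does not match the value recomputed from its own bytes, which agrees with the Wireshark reading that only the checksum field is wrong.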
The checksums have to be different, as e.g. the bytes at 0x12 and 0x22 and those following them also differ between the two frames.
I don’t know off the top of my head where the checksum is located…
Right, the checksum will be different because the ID and timestamps change. My point was simply that the only thing wrong with the ‘bad’ packet, as far as I can tell from dissecting it in Wireshark, is the checksum value. Everything else is right where it should be and appears correct in terms of values.
OK, I did another experiment.
I rebooted the R2 and ran the abuse test until the network stopped working with TX checksum offload enabled. I used ethtool to turn off checksum offload without a reboot, and the network recovered. I used ethtool to turn checksum offload back on, and the network broke immediately. I rebooted without power cycling, and the result was a working network again, with checksum offload enabled according to ethtool.
What was the result of running the test without checksum offload? Did you get broken packets or a timeout?
I am going to fire it back up here, just in case. However, I have not been able to produce a failure or the oops starting from rx off tx off.
So far I have been unable to cause the network failures with checksum offload disabled. I’m pretty sure this is the root cause at this point. The performance hit isn’t so lousy that I can’t live with it, but I am not quite ready to give up on trying to figure out why. The problem sure is strange, though, given how long it can often run before things break. Hopefully this weekend I can explore a little more.
Did you get timeouts with pause enabled on the GMAC and checksum offload disabled?
ethtool -K eth0 rx off tx off