BPI-R2 slow ethernet speed

frank-w · October 15, 2020, 7:35am

Retransmits may come from missing flow control on the switch or the other side. Show ethtool -S eth0/br0/lanX after getting these retransmits.
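To make the before/after comparison easier, here is a minimal Python sketch that parses the "name: value" lines printed by ethtool -S and reports only the counters that changed between two snapshots. The function names are made up for illustration, and which counters exist depends on the driver, so nothing here assumes specific counter names.

```python
import re
import subprocess


def parse_ethtool_stats(text):
    """Parse `ethtool -S` output ("name: value" lines) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        # The "NIC statistics:" header has no number after the colon, so it is skipped.
        m = re.match(r"\s*(.+?):\s*(-?\d+)\s*$", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats


def changed_counters(before, after):
    """Return {name: delta} for every counter whose value differs between snapshots."""
    return {
        name: value - before.get(name, 0)
        for name, value in after.items()
        if value != before.get(name, 0)
    }


def snapshot(iface):
    """Run `ethtool -S <iface>` and parse the result (ethtool must be installed)."""
    out = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    ).stdout
    return parse_ethtool_stats(out)
```

Typical use would be to call snapshot("eth0") before the test, again after the retransmits show up, and pass both results to changed_counters().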

The pause patch for the gmac is not yet in mainline 5.4, as it will be merged in 5.10; it is only in my 5.4. That can also cause the problem (but afair you use my repo). Just noting it for others not using my repo.

gwalton · October 15, 2020, 2:05pm

Hi Frank,

I am still seeing some issues I think are related to what is going on in this thread. I updated to your 5.4-main branch (at 5.4.70) from about two days ago. I get the “eth0: transmit timed out” error in dmesg and the LANx interfaces stop working until the unit is restarted. There is a stack trace as well in dmesg but it does not look to contain anything helpful.

It’s not easy to reproduce. Running things like iperf does not seem to trigger it; you have to create some CPU load on the R2 as well. Running samba on the R2 and sharing a volume on a large SATA disk, I can trigger the failure by mounting the volume on a client and doing something like “cat * > /dev/null” (on the client) in a directory with a few 100GB of files. Sometimes it will run for six or seven hours, sometimes it will die in 15min.
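For anyone who wants to repeat this kind of read load with a throughput figure attached, here is a rough Python equivalent of “cat * > /dev/null”, meant to be run on the client against the mounted share. It is only a sketch: the function name and chunk size are my own choices, and like the shell glob it reads only the regular files directly in the given directory.

```python
import os
import time


def read_and_discard(directory, chunk_size=1 << 20):
    """Read every regular file directly in `directory` and throw the data away.

    Returns (bytes_read, elapsed_seconds). Subdirectories are not descended into.
    """
    total = 0
    start = time.monotonic()
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            with open(entry.path, "rb") as fh:
                while True:
                    chunk = fh.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
    return total, time.monotonic() - start
```

Dividing bytes_read * 8 by the elapsed seconds gives the average bit rate, which makes it easier to compare runs with and without pause enabled.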

I realize this is perhaps not a helpful bug report, but I only just got a second unit to play with and cloned my images to it. Now that I have a separate test-pig setup I can try things out and iterate faster.

frank-w · October 15, 2020, 2:09pm

I know these rx timeout problems, but thought they were fixed by the phylink conversion (5.2/5.3).

But somewhere I saw a fix for timeouts; I do not remember if it was for 5.4. Maybe in openwrt-master or some fork.

gwalton · October 15, 2020, 2:10pm

is your 5.4-main phylink?

frank-w · October 15, 2020, 2:14pm

5.4 mainline is phylink…it got merged with 5.3 or 5.2

Can you give me the exact error message/stack trace?
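If the log is long, something like the following Python sketch can pull out just the relevant part of a saved dmesg dump. It searches for the “transmit timed out” text quoted above and keeps a number of following lines so the stack trace comes along. The function name and the default of 20 context lines are arbitrary choices, not anything the kernel defines.

```python
def find_tx_timeouts(log_text, needle="transmit timed out", context=20):
    """Return a list of (line_number, lines) for each line containing `needle`.

    `lines` holds the matching line plus up to `context` lines that follow it,
    which is usually enough to capture the stack trace printed afterwards.
    """
    lines = log_text.splitlines()
    hits = []
    for index, line in enumerate(lines):
        if needle in line:
            hits.append((index + 1, lines[index:index + 1 + context]))
    return hits
```

For example, save the log with dmesg > dmesg.txt, read the file into a string, and print the blocks that find_tx_timeouts() returns.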

gwalton · October 15, 2020, 2:18pm

Ok, that is what I thought. I apologize, I have been away from this project for a bit. I tried to get caught up, but the way this forum is organized it’s a bit challenging, especially with the 64 bit board topics all in the same place here.

So if that is the case, I am curious if anyone else is still seeing the timeout issue. I don’t know if others here are not hitting it because they are not trying to use the r2 for any NAS functions. I definitely have not hit the problem just routing/NATing through it. I had 5.4.2 up for like 60 days with samba and nfs disabled.

Were there any other changes related to timeouts that went in with phylink? I read some stuff about a gmac pause patch. If there is a timeout or wait, maybe it needs to be more aggressive/pessimistic?

frank-w · October 15, 2020, 2:21pm

The missing pause was added to my 5.4-main a long time ago, but with 5.6/5.7 I stumbled over it again, so I posted it to mainline. It is not in mainline 5.4 yet (about to be merged in 5.10).

frank-w · October 15, 2020, 2:52pm

Maybe one of these patches fixes your problem:

https://git.openwrt.org/?p=openwrt/staging/nbd.git;a=commit;h=59d236f11df7539cfd9524a8d7857faefe40f74b

Maybe this one patchwork

Found in this thread: https://forum.openwrt.org/t/mtk-soc-eth-watchdog-timeout-after-r11573/50000/301

gwalton · October 15, 2020, 3:32pm

thanks I’ll take a look

gwalton · October 15, 2020, 5:08pm

The patches look interesting, but mtk_eth_soc.c looks to have changed a lot between 4.14 and 5.4.70. It is going to take some effort to see if this interrupt vs poll order issue is even still present.

frank-w · October 15, 2020, 5:35pm

Seems the fe_poll_rx/tx is openwrt specific…I have not found these functions in 4.14 either…mhm.

gwalton · October 15, 2020, 5:38pm

https://git.openwrt.org/?p=openwrt/staging/nbd.git;a=commitdiff;h=59d236f11df7539cfd9524a8d7857faefe40f74b;hp=318c5d25556268f512bc0c2870d6d411ddc02690#patch13

Looking at this one for clues now

frank-w · October 15, 2020, 5:41pm

This one is named eth1-tx-timeout…currently we have only eth0…or do you have eth1 enabled?

gwalton · October 15, 2020, 5:54pm

eth0 only; just trying to understand the logic, hoping maybe the same issue applies to both.

With only eth0 in use, from the looks of things that patch won’t change the behavior.

gwalton · October 15, 2020, 11:03pm

Another little update here. I figured I’d change the DTS and test things without pause frames enabled. I’d done this on 4.14, but I had so many other issues back then that it was tough to isolate anything. This test was easy enough to do on my 5.4.70 build (frank-w’s 5.4-main tree) since I could just recompile the device tree and make a new uImage.

Interestingly, I have been running a file copy via smbd now for about 4 1/2 hours. Certainly not the longest it’s ever held up, but well past where it usually crashes.

I am going to leave it running for some more hours to see if it’s really solid without the pause frames. However, assuming it is, does that mean we are looking at an underrun on eth0 -> switch that the mtk_eth_soc driver isn’t handling? I also wonder if it really makes sense to have pause enabled on both eth0 and the switch. I might repeat my experiment with it enabled on the switch only.

Still curious if anyone else is hitting this. Again, the use case would be high data volume with at least some CPU load, and the main direction of flow being R2 -> client, where the R2 is generating the packets rather than forwarding them to some other interface where the receive side might be determining the timing.

frank-w · October 16, 2020, 1:43pm

Interesting…you have only disabled pause on both positions (switch+gmac) in dts and it works better?

gwalton · October 16, 2020, 3:25pm

Correct, the only changes are removing pause from the dts, on both the switch and the gmac. I left it running overnight. More than 300GB transferred via smbd and the network is still working this morning. Pretty sure that is a record.
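To double-check which fixed-link nodes in a DTS still carry the pause property before rebuilding, a small Python sketch like this can help. It assumes the property is written as a bare pause; inside a fixed-link { ... } block, as the generic fixed-link binding describes; the function name is made up, and the regex approach is a simplification that ignores includes and overlays.

```python
import re


def fixed_link_pause(dts_text):
    """Return one bool per fixed-link block, in file order: True if it sets `pause;`."""
    # Drop // and /* */ comments so a commented-out property does not count.
    text = re.sub(r"//[^\n]*|/\*.*?\*/", "", dts_text, flags=re.S)
    # fixed-link blocks contain only properties, so they have no nested braces.
    blocks = re.findall(r"fixed-link\s*\{([^{}]*)\}", text)
    # The lookbehind keeps `asym-pause;` from being mistaken for `pause;`.
    return [bool(re.search(r"(?<![\w-])pause\s*;", body)) for body in blocks]
```

On a DTS with one fixed-link under the gmac and one under the switch CPU port, the result is a two-element list, so [False, False] would correspond to pause removed in both places.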

frank-w · October 16, 2020, 3:39pm

Have you run iperf? Is the speed ~940mbit/s from/to the r2?
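For context on the ~940mbit/s figure: it is roughly the best TCP payload rate a gigabit link can carry once per-frame overhead is subtracted. A quick Python check, assuming a 1500-byte MTU, IPv4, the TCP timestamp option enabled and no VLAN tag (all assumptions of this sketch, not measured values):

```python
LINE_RATE_MBIT = 1000.0
MTU = 1500
# Bytes on the wire around each IP packet:
# preamble + SFD (8), Ethernet header (14), FCS (4), inter-frame gap (12).
WIRE_OVERHEAD = 8 + 14 + 4 + 12
# Bytes inside the MTU that are not payload:
# IPv4 header (20), TCP header (20), TCP timestamp option (12).
IP_TCP_HEADERS = 20 + 20 + 12


def tcp_goodput_mbit(line_rate=LINE_RATE_MBIT, mtu=MTU):
    """Upper bound on TCP payload throughput in Mbit/s for full-size frames."""
    payload = mtu - IP_TCP_HEADERS
    on_wire = mtu + WIRE_OVERHEAD
    return line_rate * payload / on_wire


print(round(tcp_goodput_mbit(), 1))  # 941.5
```

So 1448 payload bytes out of every 1538 bytes on the wire gives about 941 Mbit/s, which is why a healthy gigabit link usually shows up as roughly 940 in iperf.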

gwalton · October 16, 2020, 6:08pm

Well, I started another experiment to see if it’s stable with pause enabled on the switch but not on the gmac. I’ll have to revert to do iperf tests with it off on both. However, before I run my burn-in test again, I did run iperf: R2 as a client about 700mbit/s, R2 as the server 860mbit/s. That is with pause on the switch, not on the gmac.

So not hitting your 940, but not failing is way more important here; not that I don’t want all the speed I can get too.

I also want to note test was run with the mac directly cabled to the R2 no other switch in the works.

frank-w · October 17, 2020, 6:44am

A single pause results in retransmits; I fixed that by adding the second pause that was missing in mainline. Btw. my r2 has been running 5.4.51 for 6 days 15h now without any timeout (I replaced my ssd 1 week ago, so I had powered it off manually; no crash). The only messages I see are the mt76 airtime warning, which I patched out later, and some messages from netlink/hostapd because I’m still on stretch.