Retransmits may come from missing flow control on the switch or the other side. Show the output of ethtool -S eth0/br0/lanX after getting these retransmits.
The pause patch for the gmac is not yet in mainline 5.4, as it will be merged into 5.10; it is only in my 5.4 tree. That can also cause the problem (but afair you use my repo). Just mentioning it for others not using my repo.
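For comparing the counters before and after the retransmits show up, something like the following might help. It is only a minimal sketch: it assumes ethtool is installed and prints the usual "name: value" lines for -S, and the helper names (parse_ethtool_stats, changed_counters, read_stats) are made up for illustration.

```python
import subprocess


def parse_ethtool_stats(text):
    """Parse `ethtool -S` output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        # Split on the last colon so counter names containing ':' survive.
        name, sep, value = line.rpartition(":")
        if not sep:
            continue
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            # Header line ("NIC statistics:") or a non-numeric value: skip.
            continue
    return stats


def changed_counters(before, after):
    """Return {name: delta} for every counter whose value changed."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] != before.get(name, 0)
    }


def read_stats(iface):
    """Run `ethtool -S <iface>` and parse the result (may need root)."""
    out = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    ).stdout
    return parse_ethtool_stats(out)
```

Take one snapshot with read_stats("eth0") before the test and one after the retransmits, then print changed_counters(before, after). The same works for the lanX ports, so only the counters that actually moved need to be posted.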
I am still seeing some issues that I think are related to what is going on in this thread. I updated to your 5.4-main branch (at 5.4.70) from about two days ago. I get the “eth0: transmit timed out” error in dmesg and the LANx interfaces stop working until the unit is restarted. There is a stack trace in dmesg as well, but it does not appear to contain anything helpful.
It's not easy to reproduce. Running things like iperf does not seem to trigger it; you have to create some CPU load on the R2 as well. Running samba on the R2 and sharing a volume on a large SATA disk, I can trigger the failure by mounting the volume on a client and doing something like “cat * > /dev/null” (on the client) in a directory with a few 100GB of files. Sometimes it will run for six or seven hours, sometimes it will die in 15 min.
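For anyone who wants to repeat this from a client without a shell loop, here is a rough Python equivalent of the “cat * > /dev/null” read load. It is a sketch under assumptions: the share is already mounted on the client, and the function names (read_and_discard, soak) are hypothetical. Like the shell glob, it only reads regular files directly inside the directory.

```python
import os
import time


def read_and_discard(directory, chunk_size=1024 * 1024):
    """Read every regular file directly in `directory` and throw the data away.

    Returns the total number of bytes read.
    """
    total = 0
    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
        if not entry.is_file():
            continue  # skip subdirectories and special files
        with open(entry.path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    return total


def soak(directory, passes=1):
    """Run several read passes and print the throughput of each one."""
    for n in range(1, passes + 1):
        start = time.monotonic()
        nbytes = read_and_discard(directory)
        elapsed = time.monotonic() - start
        rate = nbytes / elapsed / 1e6 if elapsed > 0 else 0.0
        print(f"pass {n}: {nbytes} bytes in {elapsed:.1f}s ({rate:.1f} MB/s)")
```

Calling soak() with the mount point of the share and a large passes value keeps the R2 sending data for hours. The data set should be much larger than the client's RAM, otherwise later passes may be served from the client's page cache and never touch the network.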
I realize this is perhaps not a helpful bug report, but I only just got a second unit to play with and cloned my images to it. Now that I have a separate test-pig setup, I can try things out and iterate faster.
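Since the time to failure varies so much, it may help to record it for each configuration tested. A minimal sketch, assuming it runs on the R2 itself, that dmesg is readable by the user running it, and that the log still contains the “transmit timed out” text quoted above; the function names are made up for illustration.

```python
import subprocess
import time

MARKER = "transmit timed out"


def find_timeouts(log_text, marker=MARKER):
    """Return every kernel log line that contains the marker text."""
    return [line for line in log_text.splitlines() if marker in line]


def read_dmesg():
    """Return the current kernel log as text (may need root)."""
    return subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    ).stdout


def wait_for_timeout(poll_seconds=30, read_log=read_dmesg):
    """Poll the kernel log until the marker appears.

    Returns (seconds_waited, first_matching_line).
    """
    start = time.monotonic()
    while True:
        hits = find_timeouts(read_log())
        if hits:
            return time.monotonic() - start, hits[0]
        time.sleep(poll_seconds)
```

Start wait_for_timeout() right after a fresh boot and together with the load test. If an old timeout is already in the log, it returns immediately.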
Ok, that is what I thought. I apologize, I have been away from this project for a bit. I tried to get caught up, but the way this forum is organized makes it a bit challenging, especially with the 64-bit board topics all in the same place here.
So if that is the case, I am curious if anyone else is still seeing the timeout issue. I don’t know if others here are not hitting it because they are not trying to use the R2 for any NAS functions. I definitely have not hit the problem just routing/NATing through it. I had 5.4.2 up for about 60 days with samba and nfs disabled.
Were there any other changes related to the timeout that went in with phylink? I read some stuff about a gmac pause patch. If there is a timeout or wait, maybe it needs to be more aggressive/pessimistic?
The missing pause was added to my 5.4-main a long time ago, but with 5.6/5.7 I stumbled over it again, so I posted it to mainline. It is not in 5.4 yet (about to be merged in 5.10).
The patches look interesting, but mtk_eth_soc.c looks to have changed a lot between 4.14 and 5.4.70. It is going to take some effort to see if this interrupt vs. poll order issue is even still present.
Another little update here. I figured I’d change the DTS and test things without pause frames enabled. I’d done this on 4.14, but I had so many other issues back then that it was tough to isolate anything. This test was easy enough to do on my 5.4.70 build (frank-w’s 5.4-main tree), since I could just recompile the device tree and make a new uImage.
Interestingly, I have been running a file copy via smbd for about 4 1/2 hours now. Certainly not the longest it's ever held up, but well past where it usually crashes.
I am going to leave it running for some more hours to see if it's really solid without the pause frames. However, assuming it is, does that mean we are looking at an underrun from eth0 -> switch that the mtk_eth_soc driver isn’t handling? I also wonder if it really makes sense to have pause enabled on both eth0 and the switch. I might repeat my experiment with it enabled on the switch only.
Still curious if anyone else is hitting this. Again, the use case would be high data volume, at least some CPU load, and the main direction of flow being R2 -> client, where the R2 is generating the packets rather than forwarding them from some other interface where the receive side might be determining the timing.
Correct, the only changes are removing pause from the DTS, on both the switch and the gmac. I left it running overnight. More than 300GB transferred via smbd and the network is still working this morning. Pretty sure that is a record.
Well, I started another experiment to see if it's stable with pause enabled on the switch but not on the gmac. I’ll have to revert to do iperf tests with it off on both. However, before I run my burn-in test again, I did run iperf: R2 as the client about 700 Mbit/s, R2 as the server 860 Mbit/s. That is with pause on the switch, not on the gmac.
So not hitting your 940, but not failing is way more important here. Not that I don’t want all the speed I can get, too.
I also want to note that the test was run with the mac cabled directly to the R2, with no other switch in between.
Only one pause results in retransmits; I fixed that by adding the second pause that was missing in mainline.
Btw., my R2 has been running 5.4.51 for 6 days 15h now without any timeout (I replaced my SSD 1 week ago, so I had powered it off manually; no crash). The only messages I see are the mt76 airtime warning, which I patched out later, and some messages from netlink/hostapd because I’m still on stretch.