That is interesting. Here is a summary of the results so far:
Pause enabled on switch and GMAC -> I can consistently trigger the timeout, somewhere between 15 min and 8 hrs of load testing
No pause enabled -> no timeout
Pause enabled on switch only -> no timeout
Pause enabled on GMAC only -> no timeout so far… looking good, but it needs to run longer to be sure
iperf - no pause -> need to test
iperf - pause on switch -> ~860 Mbps
iperf - pause on GMAC -> ~820 Mbps
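For reference, the pause toggling and throughput numbers above came from commands along these lines (the interface name eth0 and the iperf3 server address are examples from my setup; switch-side pause was configured separately):

```shell
# Toggle pause-frame (flow control) settings on the GMAC.
# ethtool -A sets pause parameters; ethtool -a shows the current state.
ethtool -A eth0 rx off tx off   # disable pause on the GMAC
ethtool -a eth0                 # verify the change took

# Throughput test from the laptop against the R2
# (the R2 runs `iperf3 -s`; 192.168.1.2 is a placeholder address):
iperf3 -c 192.168.1.2 -t 60     # 60-second TCP test
```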
I am testing with a cable directly from a MacBook to the R2. The ethernet on the MacBook is connected via the USB bus, so I don't know how good it really is. I'll need to cable up to my network and test between the MacBook and a PC to see whether that is the ~800 Mbps bottleneck. It could be.
Other notes: I don't have the wifi enabled at all. I am still using a Slackware ARM userspace from a little after the 14.2 release. I really don't think userspace figures into this much, but it's worth noting. I am compiling the kernel on the device, which is still running what is at this point an older gcc 7.3.0.
Maybe a newer gcc builds code that runs a little quicker for the interrupt handlers? (Speculating wildly.)
Can I ask what you use to load test? My timeout trigger is running samba on the R2, sharing out a SATA SSD directory with about 40 GB of files in it, mounting it on the client, and doing cat * > /dev/null.
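In script form, the abuse test is roughly this (SHARE_DIR and the pass count are stand-ins so the sketch is self-contained; the real run reads the mounted samba share and loops until something breaks):

```shell
#!/bin/sh
# Sketch of the abuse test: repeatedly read every file on the mounted
# smb share and throw the bytes away. In the real test SHARE_DIR is the
# samba mount point (e.g. /mnt/r2) and the loop runs indefinitely.
SHARE_DIR=${SHARE_DIR:-$(mktemp -d)}
PASSES=${PASSES:-3}

# Stand-in data for this sketch; the real share holds ~40 GB of files.
if [ -z "$(ls -A "$SHARE_DIR")" ]; then
    for n in 1 2 3; do
        head -c 1048576 /dev/zero > "$SHARE_DIR/file$n"
    done
fi

passes=0
while [ "$passes" -lt "$PASSES" ]; do
    cat "$SHARE_DIR"/* > /dev/null || break
    passes=$((passes + 1))
done
echo "completed $passes passes"
```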
On my main router (which has now been running for 7 days) I generate normal traffic. On my test R2 I use iperf3 for testing (R2 as server or client, connected over the switch to my laptop). No samba or similar is installed, only "router tools" on the main device and some additional things in LXC containers (VNC server, web server + MySQL).
I was running samba in a container, passing the lan1 port in as a PHY, but I installed the package on the "host" and shut down all the containers just to take more testing variables out.
Well, this is disappointing. I thought maybe I'd found a solid workaround with the pause frames, but apparently not. Although I did not get the kernel oops, I did trigger the broken-frames issue on another abuse test.
I really thought it was working with pause removed on both sides. I had not been able to produce a crash, so I left it running in a loop over the weekend; same job, just kept cat'ing an smb share full of files to /dev/null.
While there was no oops, and no "eth0: timed out" in the log, it did start producing broken packets. Ping would still work, but not TCP. So the answer is: with no pause enabled it seems more stable, but it can still break.
So I guess that while pause frames might create the conditions that often trigger the failure, they are not causal.
Could you try whether 5.9-main is also affected? It has an additional patch supporting MTU setting. Maybe packets are stripped due to the additional DSA tag if they grow to max size. Ping packets are not big, but TCP packets can go up to the max MTU and may be damaged if adding the additional tag results in packets larger than the max MTU.
I upgraded to the 5.9-main branch in your repo. So far, same behavior. The abuse test ran for hours, followed by the system only being able to send short frames until a restart. Nothing in the dmesg output, no oops, just not working.
I have not tried lowering the MTU yet. I started the test again with /proc/sys/net/core/dev_weight doubled from 64 to 128, in case it's an interrupt-handling thing. Assuming that does not work (I don't expect it will), is there an MTU value you'd suggest? I know you have spent a lot of time studying the hardware; is there a specific transmission-size corner case you are trying to avoid with that suggestion?
I see the default MTU on the eth0 MAC is 1504, while on the ports it is 1500. Are the extra four bytes on the MAC there to handle the port tags while still allowing normal 1500-byte ethernet frames? Which MTUs are you interested in my trying to lower first: the MAC, the ports, or both? My unscientific instinct is to set the MAC to 1500 and the ports to either 1496 or a handful of bytes fewer.
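If we do end up experimenting with this, I assume it would go something like the following (interface names from my setup; 1496 is just my guess above about leaving headroom for the 4-byte DSA tag):

```shell
# Check the current MTUs on the MAC and one of the DSA ports:
ip link show eth0 | grep -o 'mtu [0-9]*'
ip link show lan1 | grep -o 'mtu [0-9]*'

# Lower the MAC to a plain ethernet MTU, and the port below it
# so a tagged frame still fits:
ip link set dev eth0 mtu 1500
ip link set dev lan1 mtu 1496
```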
I didn't mean to set the MTU manually; I just wanted the driver to be capable of handling an MTU-change request (maybe from the DSA core). Hmm, then I guess it is not caused by the MTU setting (which is fixed in 5.4). Then I'm still digging in the dark, also because I do not see your errors on my machines. Is everything correctly soldered?
It's hard to hit the errors, which is why it's so hard to even know whether a fix is actually a fix. Sometimes I have to keep an smb copy running for 8+ hours before it breaks, so I am not surprised you don't see the issues. It's not just samba, either; you can do something like a btrfs send through netpipes to back up the attached disk, and that may or may not succeed. I had the other unit die on me after sending 50 GB once.
I've reverted my R2 to the original 4.4.70-BPI-R2-Kernel (from https://github.com/BPI-SINOVOIP/BPI-R2-bsp.git)
And the results are good: 900 Mbps DL, 300 Mbps UL. No problems with TCP traffic; HTTPS downloads run above 500 Mbps. So it doesn't look like a HW issue.
In the future I plan to test SINOVOIP's 4.14 and 5.4 kernels.
Not being able to use the switch ports individually is sort of a deal breaker for me. At that point I'd have to put a switch under it to break out VLANs, and I might as well use Raspberry Pi machines with better vendor hardware support (sorry SinoVoip, but if not for Frank's efforts your product would deliver on basically nothing it advertised).
SinoVoip's 4.14 and 5.4 are basically mainline, or parts taken from my repo, so I guess you will see the same issues. 4.4 is too old: no port separation, and too heavily modified in ways that are not documented. I had massive problems merging updates into the 4.4 tree due to the changed files.
For now, it works. I don't need port separation; the lack of swconfig is not good, but VLANs work as expected.
I was thinking about those lanX interfaces. I don't know how they are made in the kernel, but I see the MTU impact here. I prefer not to do anything special; eth0/eth1 is enough for me, and I can live with VLANs on all ports.
My understanding is that the parameter changes how many packets the system will drain from the queue in one pass of the interrupt handler. A higher dev_weight should mean fewer interrupts that each block longer. On a typical server machine this usually costs some latency but gets you a little more throughput.
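For anyone following along, the knob I changed is this (128 is just double the default from my test above; the change does not persist across a reboot):

```shell
# Read the current per-poll packet budget (default is 64):
cat /proc/sys/net/core/dev_weight

# Double it for the test:
sysctl -w net.core.dev_weight=128
# or, equivalently:
echo 128 > /proc/sys/net/core/dev_weight
```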