Strange network behaviour with some R2 cards

I am using the R2 in a product as an embedded application processor and have come across some very odd behaviour. Most cards (45 out of 50) work fine, but a few are showing some odd network problems.

The setup: wan is connected to the external network and gets a DHCP address. That works fine.

lan0 and lan1 are set up to interface to a couple of specialised boards using link-local addresses on two separate networks: 169.254.0.1/24 and 169.254.1.1/24.

(Why? Because at boot-up the internal boards both have the same fixed IP address, and I need to address them independently to initialise them and persuade them to pick up a DHCP address from the R2.)
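As a quick sanity check of the addressing described above, here is a minimal sketch using Python's standard `ipaddress` module. It only confirms that the two /24 networks from the post are link-local and do not overlap; the `describe` helper is made up for illustration, and nothing here touches the real interfaces.

```python
import ipaddress

# Addresses taken from the post; "lan0"/"lan1" are only labels here.
lan0 = ipaddress.ip_interface("169.254.0.1/24")
lan1 = ipaddress.ip_interface("169.254.1.1/24")


def describe(name, iface):
    """Return a one-line summary of an interface address (illustrative helper)."""
    net = iface.network
    return f"{name}: {iface.ip} in {net} (link-local: {net.is_link_local})"


print(describe("lan0", lan0))
print(describe("lan1", lan1))
# Two distinct /24 networks, so each board can be reached on its own port.
print("subnets overlap:", lan0.network.overlaps(lan1.network))
```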

lan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 169.254.0.1  netmask 255.255.255.0  broadcast 169.254.0.255
        inet6 fe80::280:6fff:fe10:2001  prefixlen 64  scopeid 0x20
        ether 00:80:6f:10:20:01  txqueuelen 1000  (Ethernet)
        RX packets 727193  bytes 168838050 (161.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1629  bytes 98071 (95.7 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lan1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 169.254.1.1  netmask 255.255.255.0  broadcast 169.254.1.255
        inet6 fe80::280:6fff:fe10:3002  prefixlen 64  scopeid 0x20
        ether 00:80:6f:10:30:02  txqueuelen 1000  (Ethernet)
        RX packets 615141  bytes 304152093 (290.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1258  bytes 70711 (69.0 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

This usually works fine, but in 5 cases the networking looks messed up: strange ARP entries, lost packets, and the product fails QA. BTW, the lost packets are not shown in the interface stats, so I assume the packets are not being lost at the physical level.

Swapping the R2 for another board fixes this.

I can’t believe this is a hardware problem; if I only use one of lan0/lan1 then it seems OK.

I’m running an Ubuntu-based system with the 4.19.138-bpi-r2 kernel, which I compile myself.

Could there be differences in hardware revs, or subtle differences between boards? Before I try different kernels does anyone have any thoughts?

AFAIR there were strange network issues before 5.4 because trgmii was not set up properly, the second-gmac patches were merged but were not stable, and so on… so please try a newer kernel before trying much else :)

Hi Frank, thanks for the suggestion. I’ve installed 5.4.46 and also tried 5.4.50 and the behaviour is the same. Lost packets.

I think it’s a problem with throughput on a particular batch of MT7530.

The cards connected to lan0 and lan1 output multicast video streams at around 10Mbps, so not huge. If I remove both LANs then an ICMP ping to the wan interface from my network gives a 0.150 ms round trip, which is fine.

Plugging lan0 and lan1 into the external cards I get round trips of 150-250 ms, which is bad. I also get lost packets from the encoder (UDP).

Unplugging lan0 I get back to the 0.150 ms round trip, but I still lose packets from the encoder card.

Forcing lan0 and lan1 to 100Mbps also shows packet loss.

This is only happening on R2 boards whose MT7530 is from batch 2028-BMSL/DTPSPB05 or 2010-BMSL/DTPRSH29.

Boards with 1913-BMSL/DTPN7S08 and 2036-BMSL/DTPSTP76 seem to work fine. Others may also be fine, but they are now installed at customer sites, so it is not convenient to pull the chip markings.

Hi Frank, I see you’ve got a bpi-r2_5.10.38-hnat+gmac2 package on your Google Drive. Do you have the kernel sources in git that I could clone and test in my own workflow here?

Pete

of course

https://github.com/frank-w/BPI-R2-4.14/commits/5.10-hnat+gmac2

In my tests I saw some retransmits, but this seems to be caused by my laptop if the iperf3 server is running there… (I see it without the gmac patches too, and on all interfaces.)

Thanks Frank, I’ve tested the setup with your 5.4 and the 5.14-rc1 as well. These do seem to be better than 4.19, but I’m still getting packet loss on the lan0/1 interfaces from the cards feeding those interfaces (the packets have sequence numbers, so I can see which ones are missing).
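For anyone wanting to reproduce this kind of check, here is a minimal sketch of counting missing packets from a list of captured sequence numbers. The counter width is an assumption, since the post does not say which counter is used: pass `modulus=65536` for a 16-bit counter such as RTP's, or `modulus=16` for the 4-bit MPEG-TS continuity counter. The function name is made up, and the sketch assumes no reordering.

```python
def count_missing(seqs, modulus):
    """Count sequence numbers skipped between consecutive packets.

    seqs    -- sequence numbers in arrival order
    modulus -- value at which the counter wraps (assumption, see lead-in)

    Assumes packets are not reordered; a repeated number (step 0) is
    treated as a duplicate and not counted as loss.
    """
    missing = 0
    prev = None
    for seq in seqs:
        if prev is not None:
            # Forward distance from the previous packet, allowing for wrap.
            step = (seq - prev) % modulus
            if step > 1:
                missing += step - 1
        prev = seq
    return missing


# Made-up example: numbers 3 and 4 never arrived.
print(count_missing([1, 2, 5, 6], 65536))
```

Note that with a small modulus a burst loss of a full wrap or more is invisible to this method, so a 4-bit counter only gives a lower bound on the loss.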

netstat -s and so on don’t show drops or errors.
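The same counters can be watched from a script. Below is a hedged sketch that parses the `Udp:` rows of `/proc/net/snmp`, which as far as I know is one of the files `netstat -s` reads on Linux. The file consists of pairs of rows, a header row of counter names followed by a row of values. `parse_snmp` is a made-up helper name and the numbers in `SAMPLE` are invented for illustration, not taken from the boards in this thread.

```python
def parse_snmp(text, proto="Udp"):
    """Return {counter_name: value} for one protocol from /proc/net/snmp text."""
    prefix = proto + ":"
    # Each protocol has two rows: counter names first, then the values.
    rows = [line.split()[1:] for line in text.splitlines()
            if line.startswith(prefix)]
    if len(rows) < 2:
        return {}
    names, values = rows[0], rows[1]
    return {name: int(value) for name, value in zip(names, values)}


# Invented sample in the same layout as the real file.
SAMPLE = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 1000 2 5 900 5 0
"""

stats = parse_snmp(SAMPLE)
print("InErrors:", stats["InErrors"], "RcvbufErrors:", stats["RcvbufErrors"])
```

On the board itself the text would come from `open("/proc/net/snmp").read()`. Sampling it before and after a test run and comparing the two dictionaries shows whether the UDP layer counted any drops during that run.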

As I said this is happening on a few boards, but I’m concerned that the next batch will be newer rev chips and this might be a bigger problem down the line.

I’ll reach out to the support guys directly.

A small update on this: there is at least a pattern emerging.

On the good boards both the 4.19.138 and all 5.4/5.14 builds work properly. I don’t get any dropped packets from the lan0/1 ports and the streams going out on wan are error free. The application can communicate with the cards without errors. This is what I have in production in about 100 units in the field.

On the bad boards, 4.19 does not drop packets from the lan ports, but I get dropped ICMP pings on the lan ports and the application sometimes loses communication with the cards. On the bad boards using 5.4/5.14 I get dropped packets from the lans, but ICMP and the application work OK.

I’ve tried a few changes, mainly from the thread "BPI-R2 slow ethernet speed" (e.g. removing the pause), but there is no real difference on the bad boards. I can’t disable rx checksum offload though:

Cannot get device udp-fragmentation-offload settings: Operation not supported
Cannot change rx-checksumming

There are differences between the two as far as some boot messages go, and /proc/interrupts is different: the bad boards have fewer entries than the good boards.

The big question for me is why identical boards are behaving differently. This is not encouraging from a production standpoint, and my production team are getting jumpy. Is this an efuse difference, hardware tolerances, chip stepping differences (though I can’t see any differences from my quick dig - chiprev registers seem the same for the mt7530)?
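One way to pin down which entries are missing is to save `/proc/interrupts` from a good and a bad board and compare the source descriptions while ignoring the per-CPU counts. The sketch below assumes the usual layout (a header row of CPU columns, then `label: counts... description`). The helper names and the sample rows are invented for illustration and are not the real R2 interrupt list.

```python
def irq_sources(text):
    """Map each /proc/interrupts row label to its description, counts dropped."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    sources = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Everything after the per-CPU counters describes the source.
        sources[label.strip()] = " ".join(fields[ncpu:])
    return sources


def only_in_first(a_text, b_text):
    """Return the descriptions present in a_text but absent from b_text."""
    a = set(irq_sources(a_text).values())
    b = set(irq_sources(b_text).values())
    return sorted(a - b)


# Invented captures: the second one lacks the "devB" row.
GOOD = """\
           CPU0       CPU1
 16:        100        200     chipA  29 Level     timerA
 40:         10          0     chipA 231 Level     devA
 41:         20          0     chipA 230 Level     devB
"""
BAD = """\
           CPU0       CPU1
 16:        150        180     chipA  29 Level     timerA
 40:         12          0     chipA 231 Level     devA
"""

print(only_in_first(GOOD, BAD))
```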

My next steps are to see if I can get any error stats from the drivers, though I’m struggling to get my head around how the drivers and device tree work together.

Oh, and maybe some response from Bananapi support themselves?

Well, I’m now out of ideas as to why I’ve got different behaviour on a small number of boards. I’m using the 5.14-rc1 kernel (thanks Frank) and I now have identical interrupts and almost identical boot messages apart from some minor timings, but still some boards are dropping packets.

From what I can see tcpdump on lan1 shows all the packets arrive at the interface (the MPEG sequence counters are all present and correct), but the packets being written to wan are not all there - dropped packets. I’ve increased the kernel buffers for udp and the application socket buffers, but no change. Also tried using a kernel level multicast router - smcroute - so nothing is going through user land processes - still the same problem.
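The post does not say how the application socket buffers were raised, so the following is only a sketch of one common way to do it, using the standard `SO_RCVBUF` socket option. The function name and the 4 MiB figure are arbitrary examples. On Linux the value read back is normally double the requested size and is capped by `net.core.rmem_max`, so reading it back is a cheap way to confirm that the request actually took effect.

```python
import socket


def udp_socket_with_rcvbuf(requested_bytes):
    """Create a UDP socket, request a receive buffer size, and
    return the socket together with the size the kernel reports."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted


REQUESTED = 4 * 1024 * 1024  # arbitrary example value
sock, granted = udp_socket_with_rcvbuf(REQUESTED)
print("requested", REQUESTED, "bytes, kernel reports", granted)
sock.close()
```

If the reported value is far below the request, the system-wide limit is the bottleneck and raising only the application setting has no effect.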

Using 5.14 I also see terrible TCP download from the wan interface from a PC - 500Kbps!!! On a good board this is at 1Gbps line speed. Removing the traffic to the lan0/1 restores reasonable speed on the bad boards.

I’ve tried moving the external interface to lan2 instead of the wan interface - still dropped packets, so it’s not a wan interface problem.

Does anyone have any other thoughts on where I can look, add debug messages and so on?

You could check the solder joints of the Ethernet ports. There have been reports of broken connections between the ports and the circuit board.

You can try adding this patch to modify the network driver:

then compile and update the kernel.

The second part cannot be applied in 5.4, as the driver was rewritten with the phylink conversion.

I guess the best place is here, after reading the old value

OR with mcr_new

Thanks. I’ve applied this to my 4.19.138 kernel and it seems to have resolved the network issue I was having with two of my boards.

I still want to understand why only 2 out of 70 boards have this problem, though.

It has uncovered another problem though: I’m using /dev/ttyS1 as a serial port for an LCD panel, and on the two boards with the network problems I am getting timeouts, data errors etc. This software works on all other boards.

Thanks Frank, I’m staying with 4.19 for now, but if I get to test this on 5.4 or later I’ll let you know.