Problem with NAT/ip_forward

it should not crash :wink:

have reported it to the dsa-maintainers: https://lkml.org/lkml/2019/8/12/376

OK. Thanks. So now we have to wait for their response and test again once they have done their work :slight_smile:

seems to be fixed by this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=58799865be84e2a895dab72de0e1b996ed943f22

have added it to my repo (5.3-rc) so you can fetch and cherry-pick it

git fetch
git cherry-pick 17494d3884cd0c5cf8367ae6e8219e00fa53983c

and build again…


As I remember, this repo has the mainline network driver. With it, NAT does not work on every BPI-R2 board, so I have to use the phylink network driver. Thanks for the info about this commit.

You can cherry-pick this commit into your merged (phylink+hdmi) version

Great. Thanks for info and your time/help.

ed1: @frank-w Do I need a full kernel rebuild, or is it enough to run ./build.sh build and then ./build.sh pack to get a kernel with this commit applied?

ed2: OK, I see that ./build.sh build and then ./build.sh pack is enough.

You can also just run build.sh (without param). This compiles (build-option) and lets you choose to install on sdcard, pack, create deb or upload to tftp after compilation

This is the preferred way for interactive use… the commands build, pack, … are designed for automated builds (like travis-ci)

Seems the mdb-fix was merged shortly after rc4… i will rebase 5.3-rc on rc6 tomorrow

[answer delayed, as I was on holiday]

With the mdb-fix (for bridge) applied on my merged phylink+hdmi branch, NAT seems to work correctly. I’m able to download a 100 MB test file from the internet on a test-PC connected behind the BPI-R2 acting as a router.

One little thing: on the BPI-R2 on which NAT doesn’t work at all on kernel 4.16, and which shows the problem with the second gmac (switched ports) where some of the bigger ICMP packets are lost, the speed via NAT is lower than directly on the BPI-R2:

On this BPI-R2:

root@slackarm:~# wget http://noc.pirx.pl/100mb.bin -O /dev/null
--2019-08-26 11:11:40--  http://noc.pirx.pl/100mb.bin
Resolving noc.pirx.pl (noc.pirx.pl)... 217.73.181.197
Connecting to noc.pirx.pl (noc.pirx.pl)|217.73.181.197|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104857600 (100M) [application/octet-stream]
Saving to: ‘/dev/null’

/dev/null           100%[===================>] 100.00M  9.23MB/s    in 10s     

2019-08-26 11:11:51 (9.71 MB/s) - ‘/dev/null’ saved [104857600/104857600]

On test-PC behind the above BPI-R2:

root@slackware:~# /usr/bin/wget http://noc.pirx.pl/100mb.bin -O /dev/null
--2009-03-28 22:27:34--  http://noc.pirx.pl/100mb.bin
Resolving noc.pirx.pl (noc.pirx.pl)... 217.73.181.197
Connecting to noc.pirx.pl (noc.pirx.pl)|217.73.181.197|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104857600 (100M) [application/octet-stream]
Saving to: '/dev/null'

/dev/null                                    100%[=============================================================================================>] 100.00M  4.13MB/s    in 24s     

2009-03-28 22:27:58 (4.10 MB/s) - '/dev/null' saved [104857600/104857600]

Of course, this is better than not working at all :slight_smile:

ed1. Of course, I tested this kernel (5.3.0-rc1-bpi-r2-hdmi-phylink with the patch applied) on all the problematic BPI-R2 boards that I previously used for tests on older kernel versions.
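To put the two wget results above side by side: wget reports rates in binary units here (100.00M for 104857600 bytes), so a small conversion gives the rates in Mbit/s and shows how much of the direct rate is reached through NAT. A minimal sketch; the function names are made up for illustration and the two rates are simply copied from the outputs above:

```shell
#!/bin/sh
# Convert a wget rate in MB/s (1 MB = 1048576 bytes here) to Mbit/s.
mibs_to_mbit() {
    awk -v r="$1" 'BEGIN { printf "%.1f\n", r * 1048576 * 8 / 1000000 }'
}

# Percentage of the first rate that the second rate reaches.
rate_percent() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", 100 * b / a }'
}

mibs_to_mbit 9.71        # directly on the BPI-R2
mibs_to_mbit 4.10        # test-PC behind NAT
rate_percent 9.71 4.10   # NAT rate as a percentage of the direct rate
```

With these numbers that is roughly 81 Mbit/s directly and 34 Mbit/s behind NAT, so a bit over 40 % of the direct rate.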

if i understand you right, you have only problems on 4.16?

5.3-phylink works without any issues?

phylink seems to get merged on 5.4 (next LTS) so we don’t need to focus on older kernel-version, at least not on non-lts 4.16

btw. 4.16 does not have second gmac…

imho also when bridging ports all traffic goes to the CPU and you have only 1 Gbit/s between switch and SoC… so traffic from e.g. lan0 to lan1 is 1 Gbit/s in total (if you try to make bi-directional 1 Gbit/s); iperf should work in both directions (1 Gbit/s 0->1 and 1 Gbit/s 1->0)

1 Gbit/s = ~940 Mbit/s
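The ~940 Mbit/s figure follows from the per-frame overhead on Gigabit Ethernet. A rough sketch of the arithmetic, assuming a 1500-byte MTU, IPv4 and TCP headers of 20 bytes each plus 12 bytes of TCP timestamp option, and 38 bytes of Ethernet framing per packet; the function name is only for illustration:

```shell
#!/bin/sh
# tcp_goodput_mbit LINK_MBIT MTU L3L4_HEADER_BYTES
# Estimates the TCP payload rate on an Ethernet link of LINK_MBIT Mbit/s.
tcp_goodput_mbit() {
    awk -v link="$1" -v mtu="$2" -v hdr="$3" 'BEGIN {
        # Bytes on the wire per frame: MTU plus preamble/SFD (8),
        # MAC header (14), FCS (4) and inter-frame gap (12).
        wire = mtu + 38
        # TCP payload per frame: MTU minus IP and TCP headers.
        payload = mtu - hdr
        printf "%.1f\n", link * payload / wire
    }'
}

tcp_goodput_mbit 1000 1500 52   # assumed: IPv4 (20) + TCP (20) + timestamps (12)
```

Under these assumptions the result is about 941 Mbit/s. Without TCP timestamps (40 bytes of headers) it would be about 949 Mbit/s.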

I have problems on 4.16 and 4.19, which I have recompiled/prepared for my requirements and tested (for the tests I use the standard config to avoid other issues :slight_smile: ). These problems are not seen on kernel 4.4 (I ran this version only to test NAT).

Yes, on 5.3-phylink all NAT tests passed, so it looks like NAT works well.

Thanks for this description.

On 4.19-main there is the second gmac i have ported from 4.14, but afaik the switch-setup is wrong (that's why we need to set both gmacs to trgmii, and maybe this causes the transmit-timeouts). You can try 4.19 without 2nd gmac (https://github.com/frank-w/BPI-R2-4.14/tree/4.19-without2ndgmac) to see if it has the same problems.

Phylink does the switch-setup right (info from rene, who’s the author of these phylink-patches)

btw. 5.3-rc is now based on rc6

I think tests on 4.19 are not needed if we have a working later kernel version - 5.3-rc or 5.4 :wink:

4.19 is also an LTS-Kernel while 5.4 is not out yet… but i guess the problems are still in the non-gmac-4.19 because 4.16 (which does not have the second gmac) is also affected

I’m very confused about whether these are only software problems or hardware problems too.

if you have the problems only with mainline-driver and not with phylink it is a software-problem in mainline-driver (the part which is replaced by phylink)

If I read @mariaczi correctly, the problem is board-dependent. On one particular R2 things work fine but on some other R2 boards there are NAT failures. That leaves a big possibility for the problem to be hardware too. For example, it might be some communication line whose traces on the board happen to be laid out in a way that makes it pick up some EMI. EMI severity might differ from board to board, leading to a situation where things seem to work fine on some boards but totally fail on others. As often happens, HW problems might have a workaround or a fix using some software trick. With things like EMI it might be configuring PLLs to latch to a slightly different freq or configuring hardware to use a more robust communication protocol. Even things like a slightly different HW usage pattern might be a game changer.

Offtop example

For example, last week I was debugging RAM problems on my home workstation after a CPU and MB upgrade. Four DDR4 2400 modules that had worked fine together in the previous MB started to sometimes produce errors in the “Random pattern” memtest86+ test. OS behaviour also became unstable, sometimes failing with BSODs/kernel panics, but only on reboots. Tried testing each module on its own - no errors. Tried testing modules in pairs for every possible combination - no errors as long as only one pair is used. Installed new CPU into old MB and tested four modules in there - no problems. Installed old CPU into new MB and tested all four modules for this variant - no errors. Installed new CPU back into new MB - got errors back. I’ve spent several days trying to tweak memory timings, integrated memory controller base frequency, tried increasing memory and then CPU voltages - nothing helped. Decreasing RAM frequency down to 1866 made the system stable, but it was not a good option as this is even slower than the typical minimum DDR4 freq - 2133. And then I tried to decrease memory voltage by 0.05V (from standard 1.2V to 1.15V) and it was like a magic wand swing: it fixed the problem and allowed these four modules to run stable even when overclocked to 2866. The main conclusion here is that HW problems nowadays might be really obscure and dependent on the HW configuration as done from the software side of things.

@LeXa2 thanks for your comment. You understood me and the problem I described very, very well :slight_smile: