I’ve found (I hope) a typo in the rhashtable_params definition that likely causes the halts and slow-downs.
For now I’m testing this patch - it looks great after 40 minutes. Previously, problems started after 15-30 minutes, depending on traffic.
Nice, head_offset was defined twice.
Also, the lookup should use the key, not the head, and it uses cookie as the key field.
Can you test it on 5.10? If not, I can, but a bit later. @graphine, can you also test it?
P.S. My R2 uptime is 1:21 - far beyond the previous maximum.
Is the patch above enough? I’ve added it to my repo in 5.12 and 5.10 (the hnat branches).
Btw. I rebased my 5.12 trees (added the mt6625l patches to rc, updated them to rc6 and rebased hnat onto rc).
I see you did a conversion from mutex to spinlock... but this didn’t fix anything, right? Was it pre-initialized? I see no DEFINE_SPINLOCK() and no spin_lock_init().
Had run 50 iterations of iperf3 without slowdown (100 Mbit/s only so far).
On 1 Gbit/s I see a throughput of ~850 Mbit/s with 10-30 retransmits each round, no CPU usage on test-r2 (main-r2, where the iperf3 server runs, is at ~150% CPU: ksoftirqd ~90% + iperf3 ~55%). Sometimes I see "connection refused" (on 1G only) when starting a new round (rounds 4, 50)... iperf_hnat.log (59,9 KB)
Seems like the connection reset is fixed by a sleep 1:
( for i in {1..50};do echo "======== ROUND $i ========"; iperf3 -c 192.168.0.10;sleep 1;done ) 2>&1 | tee -a ~/iperf_hnat_sleep.log
iperf_hnat_sleep.log (62,3 KB)
The retransmits may be caused by a ~25 m cable between my switch and main-r2 (inside a wall).
With a second laptop I still see retransmits (maybe switch config) but a throughput of ~930 Mbit/s. I checked the switch config and enabled flow-control auto-negotiation for the second laptop’s port, but still the same retransmits - no slowdown, though. After ~150 iterations (~930 Mbit/s, 20-35 retransmits each round), test-r2 is now up 3h30m.
The last 2 iperf sessions: iperf_hnat_sleep2.log (62,3 KB) iperf_hnat_sleep.log (187,1 KB)
If I run iperf3 between the client laptop and test-r2, I see no retransmits; test-r2 to the iperf-server laptop has <10 retransmits in each full round. So the retransmits seem to be caused by hnat, but not that much.
It doesn’t fix anything, so we likely don’t need it, or the mutex. I use the spinlock defined in rhashtable; spin_lock_init is here:
https://elixir.bootlin.com/linux/v5.12-rc6/source/lib/rhashtable.c#L1027
P.S. R2 uptime is 9:12 - still alive.
Did 250 iperf3 iterations (~42 min) after the last log, still stable, no hang... always 930 Mbit/s, ~20-35 retransmits on each run (I guess the laptops’ network cards do more than the R2, but this is OK).
root@bpi-r2:~# uname -a
Linux bpi-r2 5.10.25-bpi-r2-hnat #1 SMP Sat Apr 10 13:36:00 CEST 2021 armv7l GNU/Linux
root@bpi-r2:~# uptime
19:37:19 up 5:04, 1 user, load average: 0.00, 0.00, 0.00
root@bpi-r2:~# nft --version
nftables v0.9.8 (E.D.S.)
root@bpi-r2:~#
iperf_hnat_sleep3.log.gz (22,0 KB)
So it looks like 5.10-hnat (including the 2 fixes: mutex + rhashtable_params) is stable enough for production use.
root@bpi-r2:~# nft list ruleset
table ip filter {
flowtable f {
hook ingress priority filter + 10
devices = { lan3, lan0, wan }
flags offload;
}
chain input {
type filter hook input priority filter; policy accept;
}
chain output {
type filter hook output priority filter; policy accept;
}
chain forward {
type filter hook forward priority filter; policy accept;
ip protocol { tcp, udp } flow add @f
}
}
table ip nat {
chain post {
type nat hook postrouting priority filter; policy accept;
oifname "wan" masquerade
}
chain pre {
type nat hook prerouting priority filter; policy accept;
}
}
root@bpi-r2:~#
Btw. I’ve noticed that nft reverses the order of the devices. I have defined it as
devices = { wan, lan0, lan3 }
in the file and loaded it with nft -f file.
But it seems that HTTPS connections are not NAT’ed.
In the firewall I sometimes see my client laptop’s IP (192.168.90.x instead of the test-r2 WAN IP I see in iperf3); http/80 (e.g. this forum) goes through correctly... very strange... I already rebooted test-r2... maybe it’s firewall/routing related.
Applied it to OpenWrt; the previous issues seem to be gone.
How did you measure retransmits?
See my iperf logs…
I see…
I’ve also tested for it:
9 rounds with offload: 2 attempts with retransmits (56 and 1); 7 attempts without any.
perf_hnat.log (11.6 KB)
(Slow speed because of Wi-Fi on the server, but R2-NAT is connected via wan and lan0.)
Both cables are ~2 m, both links are 1G.
UPD:
I’ve also deleted the spinlock/mutex patches - after fixing rhashtable_params it seems to be functional without locks. (I’ll test it for a while.)
Do you use 5.10 or 5.12? AFAIR 5.12 has a patch for the trgmii clock.
Hmm, seems like I accidentally removed it:
https://patchwork.kernel.org/project/linux-mediatek/patch/[email protected]/
5.12 - your branch with the mutex rolled back.
P.S. Sometimes there are retransmits, but I may also get them with offload off.
Deng Qingfang posted your fix:
https://patchwork.kernel.org/project/linux-mediatek/patch/[email protected]/
And there is another fix for the PPE:
https://patchwork.kernel.org/project/linux-mediatek/patch/[email protected]/
Tried to add a flowtable to my router running 5.10-hnat:
define iflan="lanbr0"
define ifwan="ppp8"
define ifoffload={$ifwan,$iflan}
table ip filter {
flowtable ft {
hook ingress priority filter + 10;
devices = $ifoffload;
flags offload;
}
...
# nft.sh
/usr/local/sbin/firewall/ruleset_new_v4+v6.nft:89:12-13: Error: Could not process rule: Operation not supported
flowtable ft {
^^
If I remove the devices line, I get no error... I guess I need something to allow a flowtable on ppp or bridge devices,
but I found nothing related ;(
If I hardcode the interfaces, I get this:
/usr/local/sbin/firewall/ruleset_new_v4+v6.nft:93:14-19: Error: syntax error, unexpected quoted string, expecting string or '$'
devices = {"ppp8","lanbr0"};
If I drop the quotes, I get the "Operation not supported" error again.
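For what it’s worth, nft wants bare (unquoted) interface names inside a flowtable’s devices list - the quotes are what triggers the "unexpected quoted string" error. A fragment like this should at least parse (whether the kernel then accepts ppp/bridge devices is the separate "Operation not supported" problem; ppp8/lanbr0 are of course just my interface names):

```
table ip filter {
	flowtable ft {
		hook ingress priority filter + 10;
		devices = { ppp8, lanbr0 };
		flags offload;
	}
}
```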
There are mentions of PPPoE in the mtk_offload code (do you use it?). Maybe you should try the WAN interface that carries the ppp session instead of ppp itself?
Traffic is forwarded to ppp8, not to a VLAN or the WAN port. I see patches for PPPoE push and bridge support,
so I guess the problem is more in the nft command, but I found no device-type check:
http://git.netfilter.org/nftables/commit/src?id=92911b362e9067a9a335ac1a63e15119fb69a47d
In my case I use a bridge (lan0-3 + wifi) for LAN and wan for WAN, but in the firewall I use lan0-3 and wan:
Bridge:
r2-gentoo /dev # brctl show
bridge name bridge id STP enabled interfaces
br0 8000.6ac4b308d134 no lan0
lan1
lan2
lan3
wlp1s0
ip addr:
r2-gentoo /dev # ip a s br0
16: br0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 6a:c4:b3:08:d1:34 brd ff:ff:ff:ff:ff:ff
inet 10.0.1.1/24 brd 10.0.1.255 scope global br0
valid_lft forever preferred_lft forever
inet6 fe80::68c4:b3ff:fe08:d134/64 scope link
valid_lft forever preferred_lft forever
firewall:
nft list ruleset
....
flowtable f {
hook ingress priority filter + 1
devices = { lan2, lan1, lan0, wan }
flags offload;
}
....
All my ethernet devices are connected via lan0-3, which are members of br0.
I suppose ppp and bridge devices have no offload flags as they are virtual interfaces, and you still need to use the physical interfaces in your firewall.
P.S. Only in the offload flowtable, of course - you still need ppp8 for NAT.
Thanks, I’ll try it... After thinking about it, it basically makes sense: each virtual device is only software and only adds headers to the packets (ppp, vlan), and the hardware only knows the hardware ports...
For ppp I have packets containing Dot1q header + PPP header + payload, sent out of wan.
For a bridge there are no additional headers, as the traffic is only grouped on the software side; the hardware does not see it.
Btw. where does the flowtable rule have to be linked - before or after the other rules in the forward chain?
chain FORWARD {
type filter hook forward priority 0; policy drop;
#ip protocol { tcp, udp } flow add @f
oifname $ifexternal ip saddr $iprangesblocked jump REJECTED comment "block internal ip ranges to have only internal access"
#ipv6 in ipv4 tunnel
udp dport {41,43,44,58,59,60} jump REJECTED comment "block ipv6 in ipv4 tunnel"
oifname $ifwan ip saddr 192.168.0.9 jump REJECTED comment "Block internet-access for cisco switch"
oifname $ifwan tcp dport domain jump REJECTED comment "block external dns in forward"
It seems that at the position I had prepared it breaks forwarding; but IMHO accept/drop exits the chain, so a later "flow add" position would never be reached.
So it looks like a bug in the ppp handling (maybe introduced while porting to 5.10):
table ip filter {
flowtable f {
hook ingress priority 10
#devices = $ifoffload;
#use HW interfaces here!
devices = { wan, lan0, lan1, lan2, lan3 }
flags offload
}
If I look into the entries, I see only UNB ones, many with "new=0.0.0.0:0->0.0.0.0:0 eth=00:00:00:00:00:00->00:00:00:00:00:00 etype=0000 vlan=0,0", but some (also with public IPs) have MAC addresses and the VLAN:
01ac6 UNB IPv4 5T orig=192.168.0.21:52136->195.20.250.26:443 new=217.61.147.xx:57010->217.72.196.71:443 eth=02:11:02:03:04:05->00:00:5e:00:01:02 etype=0101 vlan=140,0 ib1=1000019d ib2=007ff020
I wonder why the target IP has changed... IMHO only the source needs to be changed to the public IP (the IP and VLAN 140 are correct).
IMHO the ethtype should be 8100/88a8 for VLAN, or maybe 8863/8864 for PPPoE, and not 101.
I’ve found out that I can install a PPPoE server on Ubuntu for testing (apt install pppoe), but this needs some configuration, of course.
OK, it seems to be only an etype problem... I activated it and made a request to https://wiki.ubuntuusers.de/tcpdump/ (I see the packets with tcpdump on the ppp interface, not on wan.140 or wan, but I do see PPPoE packets with IP address information, so it seems the port filter does not work in this case):
# tcpdump -n -i ppp8 port 443
16:32:01.256873 IP 80.245.76.249.33630 > 213.95.41.4.443: Flags [.], ack 764, win 501, options [nop,nop,TS val 128775260 ecr 153299079], length 0
16:32:01.261032 IP 213.95.41.4.443 > 80.245.76.249.33634: Flags [P.], seq 4984:5298, ack 1985, win 262, options [nop,nop,TS val 153299081 ecr 128775254], length 314
# nslookup 213.95.41.4
4.41.95.213.in-addr.arpa name = ha.ubuntu-eu.org.
cat /sys/kernel/debug/mtk_ppe/entries | grep BND | grep 140 | grep '213.95.41.4'
00c22 BND IPv4 5T orig=192.168.0.21:33676->213.95.41.4:443 new=80.245.76.249:33676->213.95.41.4:443 eth=02:11:02:03:04:05->00:00:5e:00:01:02 etype=0101 vlan=140,0 ib1=214949a7 ib2=007ff020
The MAC is the one for wan.140 (wan has a different one, as I need to set it for both VLANs).
If the etype in the entries is really the Ethernet type, it needs to be 8100/88xx (VLAN/PPPoE); I guess the first Ethernet type needs to be 8100 for the VLAN.
Hmm, on my previous test with AFAIR wan+lan3 I got an etype with bits 12/8 set, so again it looks like the ethtype here is set to the DSA port and not to the VLAN... but I cannot see whether all headers are added.
Did some more tests with a local PPPoE server, and it seems it works with 5.12 but not with 5.10... so I guess I missed something while porting.
It seems the flowtable itself is breaking, not the HW offload (it is broken with the offload flag disabled too, but works if the "flow add" line is disabled); yet offload with 5.10-hnat works without ppp... very strange.
It looks like with 5.10 I only have an MTU problem when the flowtable is active; it seems the flowtable breaks the normal path-MTU-discovery/fragmentation behaviour. Without the flowtable I can access websites through the ppp tunnel (MTU 1492); with the flowtable (even without offload) I only get "connection refused". If I reduce the MTU, it works with flowtable + offload. Idk why I don’t see these problems with 5.12 or without the flowtable. I will now try MSS-fix settings:
https://wiki.nftables.org/wiki-nftables/index.php/Mangling_packet_headers
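What I plan to try from that page is clamping the TCP MSS to the route MTU, so the segment size is negotiated to fit the 1492-byte ppp MTU even when the flowtable bypasses the normal PMTU handling. A sketch of what I mean (the placement in my forward chain is an assumption):

```
table ip filter {
	chain forward {
		type filter hook forward priority filter; policy accept;
		# clamp the TCP MSS to the outgoing route's MTU on SYN packets
		tcp flags syn tcp option maxseg size set rt mtu
		ip protocol { tcp, udp } flow add @f
	}
}
```

The clamp rule only has to match the SYN packets, which always traverse the forward chain before the flow is offloaded.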
I got an answer about the location of the rule here: https://marc.info/?l=netfilter&m=162012856832116&w=2
"flow add" should be the last rule in forward, so that all other rules are processed first. The forward chain is only traversed for SYN and SYN-ACK (the first 2 TCP packets of each connection).
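For my FORWARD chain above, that would mean something like this sketch (the final ct rule is my assumption of where the chain accepts traffic): the block/REJECT rules come first so that filtered traffic is never offloaded, and "flow add" sits just before the accepting rule, because an accept verdict ends chain traversal.

```
chain FORWARD {
	type filter hook forward priority 0; policy drop;
	# block rules first, so rejected traffic never gets offloaded
	oifname $ifexternal ip saddr $iprangesblocked jump REJECTED
	udp dport {41,43,44,58,59,60} jump REJECTED
	oifname $ifwan ip saddr 192.168.0.9 jump REJECTED
	oifname $ifwan tcp dport domain jump REJECTED
	# offload last, just before the rule that finally accepts the traffic
	ip protocol { tcp, udp } flow add @f
	ct state established,related accept
}
```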
Have you succeeded with the tests? I’ve tested 5.10 + PPPoE with MSSFIX on the server’s side - it works without any problems.
Not yet, due to missing time; my timeslots are currently only 30 min max, too short for setting up the complete test environment.