ericwoud
(Eric W.)
August 4, 2021, 10:40am
41
The issue is now solved, so the IVL bit is always set.
Assisted learning is enabled
in the following commit:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d851798584ffd497b6cf0ae68f9ba75afced0ec3
Thanks to DENG Qingfang
Hey, I came across a 4-year-old thread that seems to describe similar issues we’re facing on OpenWrt:
opened 06:07PM - 30 Dec 22 UTC
bug
### Describe the bug
Subject: Regression: DSA breaks roaming to WLAN bridged to… VLAN
# Summary
When bridging a WLAN with a VLAN on DSA-enabled OpenWRT versions, an erroneous FDB entry creates 100% packet loss for 300 seconds.
# Summary analysis
When a VLAN-tagged packet is received by the device from the Ethernet (LAN) port side, 2 FDB entries for the source MAC address of the packet are created in the forwarding database (FDB) of the bridge:
1. one VLAN-tagged FDB entry and
2. one VLAN-untagged FDB entry
Both FDB entries tell the switch to forward packets whose destination is that MAC address to that Ethernet port.
When a packet from the same source is then received by the device from the WiFi (WLAN) side (e.g. after roaming a client device to that WiFi) and forwarded to the Ethernet (LAN) side (with a VLAN tag), only 1 of these 2 FDB entries is updated (namely the VLAN-tagged FDB entry). The VLAN-untagged FDB entry erroneously remains unchanged.
When another VLAN-tagged packet (e.g. an ARP reply packet or a DHCP reply packet) is then received by the device from the Ethernet (LAN) side and that packet's destination is the client device's MAC address (= the VLAN-untagged FDB entry's MAC address), the switch believes that this packet ought to be forwarded to the Ethernet port (even though it was received from exactly this port), while it should actually be forwarded to the CPU of the router (and then further to the WLAN). This packet then gets dropped.
This packet loss continues until, after about 300 seconds (5 minutes), the erroneous FDB entry expires.
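The drop described above follows from the usual bridging rule that a frame is never sent back out of the port it arrived on. The following is only a toy model of that decision, not driver code; the function name and the port name `cpu` are hypothetical.

```sh
# Toy model of a switch's unicast forwarding decision (illustration only).
# $1 = ingress port, $2 = port stored in the matching FDB entry ("" if none).
forward_decision() {
    if [ -z "$2" ]; then
        echo "flood"            # unknown destination: flood within the VLAN
    elif [ "$1" = "$2" ]; then
        echo "drop"             # never forward back out of the ingress port
    else
        echo "forward to $2"
    fi
}

# A reply for the roamed client arrives on lan2; the stale entry still says lan2.
forward_decision lan2 lan2
# If the entry pointed at the CPU port (hypothetical name "cpu"), the frame
# would be handed to the router and from there to the WLAN.
forward_decision lan2 cpu
```

The first call prints `drop`, which corresponds to the packet loss seen until the stale entry expires.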
# Impact of this bug
This bug potentially affects all DSA-enabled platforms, or at least a sizable subset of them.
It has been verified to exist on:
1. OpenWRT 22.03 on BT HomeHub 5a (hh5a).
2. Turris OS 4.0 on [Turris Omnia](https://forum.turris.cz/t/omnia-vlan-on-dsa-port-breaks-arp-responses-tos-4-0-5/12584/37) already since the year 2020.
# How to reproduce
## Steps
1. Ensure that your OpenWRT device A and its software support Distributed Switch Architecture (DSA).
2. Set up your OpenWRT device A to have at least one WiFi interface and at least one Ethernet port with VLAN support:
   1. Create bridge device "br-switch", with the list of bridge ports only consisting of port "lan2".
   2. Enable "VLAN filtering" for that bridge device and create a VLAN with VLAN ID 31, local enabled and Egress tagged for port "lan2".
   3. Create OpenWRT interface "users" with device "br-switch.31".
   4. Create an OpenWRT WiFi interface with SSID "test". Ensure (under Interface Configuration/General Setup/Network) that it is connected to network "users".
   5. Verify by running "brctl show" on the command line (e.g. using SSH) that there is a bridge called "br-switch" with at least 2 members (one member being "lan2" and the other member being a WiFi interface, for example "wlan1-1").
   6. Reboot the OpenWRT device A.
3. Connect your OpenWRT device A to another WiFi router B.
   1. Ensure that WiFi router B also supports VLANs.
   2. Connect an Ethernet cable to port "lan2" of OpenWRT device A and a suitable Ethernet port of WiFi router B.
   3. Set up WiFi router B such that it has a VLAN with VLAN ID 31 which is available (tagged) at the Ethernet port of WiFi router B.
   4. Set up a WiFi interface on WiFi router B with the same SSID "test". As on OpenWRT device A, ensure that this WiFi interface is bridged to VLAN ID 31.
4. Connect a client device C to OpenWRT device A.
   1. Run "ping" from the client device C to the IP address of WiFi router B.
   2. Observe that client device C receives ping replies.
5. Roam to WiFi router B.
   1. Change the physical position of client device C to be close to WiFi router B and away from OpenWRT device A. (You may need to reduce the output power of OpenWRT device A if the devices are close to each other.)
   2. Wait 15 seconds.
   3. Verify that client device C is now associated with WiFi router B. (You may verify this by looking at the UI of WiFi router B or at the output of running "iw dev wlan0 link" on client device C.)
   4. Observe that client device C still receives ping replies although its WLAN association has changed.
6. Roam to OpenWRT device A again.
   1. Change the physical position of client device C to be close to OpenWRT device A and away from WiFi router B. (You may need to reduce the output power of WiFi router B if the devices are close to each other.)
   2. Wait 15 seconds.
   3. Verify that client device C is now associated with OpenWRT device A. (You may verify this by looking at the UI of WiFi router B or at the output of running "iw dev wlan0 link" on client device C.)
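For readers who prefer the command line over the web UI, the bridge part of step 2 could be sketched as a small generator for `uci batch` input. This is a sketch under assumptions, not the reporter's configuration: the section names `br_switch` and `vlan_<id>` are hypothetical labels, `proto=none` is assumed for the "users" interface, and a configuration without an existing "br-switch" device section is assumed. The WiFi interface is not covered.

```sh
# Print `uci batch` commands for a VLAN-filtering bridge with one tagged port.
# $1 = bridge name, $2 = bridge port, $3 = VLAN ID, $4 = interface name.
# Section names "br_switch" and "vlan_<id>" are hypothetical labels.
gen_network_batch() {
    cat <<EOF
set network.br_switch=device
set network.br_switch.name=${1}
set network.br_switch.type=bridge
add_list network.br_switch.ports=${2}
set network.br_switch.vlan_filtering=1
set network.vlan_${3}=bridge-vlan
set network.vlan_${3}.device=${1}
set network.vlan_${3}.vlan=${3}
set network.vlan_${3}.local=1
add_list network.vlan_${3}.ports=${2}:t
set network.${4}=interface
set network.${4}.device=${1}.${3}
set network.${4}.proto=none
commit network
EOF
}

# On the router (needs root), review the output first, then apply it:
#   gen_network_batch br-switch lan2 31 users | uci batch
#   /etc/init.d/network reload
```

Generating the commands as text keeps the sketch reviewable before anything is written to `/etc/config/network`.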
## Expected result
It is expected that client device C still receives ping replies although its WLAN association has changed again.
## Observed result
It can be observed that client device C does not receive ping replies once its WLAN association has changed again.
# How to analyze
1. Install the OpenWRT package "ip-bridge" on your OpenWRT device A.
2. Re-run the reproduction steps 4, 5, and 6. Assume that the MAC address of client device C is "02:ff:04:05:06:07".
3. Run "bridge fdb show | grep 02:ff:04:05:06:07" after step 3 but before step 4. Observe that the output is empty.
4. Run "bridge fdb show | grep 02:ff:04:05:06:07" after step 4 but before step 5. Observe that the output resembles
```
02:ff:04:05:06:07 dev wlan1-1 vlan 31 master br-switch
```
This shows that there is one FDB entry which says that a packet tagged with VLAN 31 and with destination MAC address 02:ff:04:05:06:07 should be forwarded through device "wlan1-1".
5. Run "bridge -statistics fdb show | grep 02:ff:04:05:06:07" after step 5 but before step 6. Observe that the output resembles
```
02:ff:04:05:06:07 dev lan2 vlan 31 master br-switch
02:ff:04:05:06:07 dev lan2 self
```
This shows that there is one FDB entry which says that a packet tagged with VLAN 31 and with destination MAC address 02:ff:04:05:06:07 should be forwarded through device "lan2", and another FDB entry which says that any packet with destination MAC address 02:ff:04:05:06:07 should be forwarded through device "lan2".
6. Run "bridge -statistics fdb show | grep 02:ff:04:05:06:07" after step 6. Observe that the output resembles
```
02:ff:04:05:06:07 dev lan2 self
02:ff:04:05:06:07 dev wlan1-1 vlan 31 master br-switch
```
This shows that there is one FDB entry which says that a packet tagged with VLAN 31 and with destination MAC address 02:ff:04:05:06:07 should be forwarded through device "wlan1-1", and another FDB entry which says that any packet with destination MAC address 02:ff:04:05:06:07 should be forwarded through device "lan2". The two entries now disagree about where the client is.
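Reading these outputs by eye gets tedious when roaming repeatedly. The helper below is a hypothetical convenience, not part of iproute2: it assumes the `bridge fdb show` line layout shown above (MAC address, `dev`, port name, then optional `vlan <id>`) and condenses the entries for one MAC address.

```sh
# Summarise the FDB entries for one MAC address.
# Reads `bridge fdb show` output on stdin and prints one
# "vlan=<id|none> dev=<port>" line per entry, followed by "DISAGREE"
# if the entries point at different ports.
# Assumes field 1 is the MAC address and field 3 the port name.
fdb_summary() {
    awk -v mac="$1" '
        $1 != mac { next }
        {
            vlan = "none"
            for (i = 2; i < NF; i++)
                if ($i == "vlan") vlan = $(i + 1)
            print "vlan=" vlan " dev=" $3
            if (seen && $3 != first) disagree = 1
            if (!seen) { first = $3; seen = 1 }
        }
        END { if (disagree) print "DISAGREE" }'
}

# On the router:
#   bridge fdb show | fdb_summary 02:ff:04:05:06:07
```

For the output of step 6 this prints `vlan=none dev=lan2`, `vlan=31 dev=wlan1-1` and `DISAGREE`, which is the erroneous state.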
# Working workarounds
## Wait 5 minutes
After 5 minutes, the erroneous FDB entry expires automatically.
## Delete the erroneous FDB entry explicitly once
Run "bridge fdb del 02:ff:04:05:06:07 dev lan2". Once you do this, you will observe immediately that the packets are not dropped anymore.
## Delete the erroneous FDB entry explicitly automatically
This bug is so pervasive that [someone has created a workaround](https://forum.turris.cz/t/omnia-vlan-on-dsa-port-breaks-arp-responses-tos-4-0-5/12584/44) which deletes the erroneous FDB entries automatically.
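The general idea of such an automatic cleanup can be sketched as follows. This is not the linked script, only a heuristic under assumptions: a "self" (hardware) entry is treated as stale when the bridge's "master" entry for the same MAC address points at a different port, permanent entries are ignored, and a MAC address seen in several VLANs may be misjudged because only the last "master" entry is kept. The function name is hypothetical.

```sh
# Print a "bridge fdb del" command for every MAC whose hardware ("self")
# entry and bridge ("master") entry point at different ports.
# Reads `bridge fdb show` output on stdin; nothing is deleted by this function.
stale_fdb_del_cmds() {
    awk '
        {
            is_self = 0; is_master = 0; is_perm = 0
            for (i = 2; i <= NF; i++) {
                if ($i == "self") is_self = 1
                if ($i == "master") is_master = 1
                if ($i == "permanent") is_perm = 1
            }
            if (is_perm) next               # skip static/multicast entries
            if (is_self) hw[$1] = $3        # port known to the switch
            if (is_master) sw[$1] = $3      # port known to the bridge
        }
        END {
            for (mac in hw)
                if ((mac in sw) && hw[mac] != sw[mac])
                    print "bridge fdb del " mac " dev " hw[mac]
        }'
}

# On the router, review the printed commands before running them:
#   bridge fdb show | stale_fdb_del_cmds
```

Printing the commands instead of executing them makes it easy to check that only the intended entries would be removed.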
## Downgrade to swconfig-enabled-OpenWRT versions (instead of DSA-enabled OpenWRT versions)
This is possible and works perfectly. However, not upgrading is not viable in the long run.
# Non-working workarounds
## Enabling learning_sync
Running
```
bridge link set dev lan2 learning_sync on
```
has no effect on the bug; it still occurs.
## Disabling learning
Running
```
bridge link set dev lan2 learning off
```
fails with
```
RTNETLINK answers: Not supported
```
# Analysis
The root cause seems to be confusion over whether VLAN-untagged FDB entries should also apply to VLAN-tagged packets.
This confusion results in unequal treatment for VLAN-tagged packets incoming through the Ethernet port and forwarded to the CPU on the one hand and VLAN-tagged packets outgoing from the CPU through the Ethernet port on the other hand:
1. When a VLAN-tagged packet is incoming, 2 FDB entries are created (or updated)
2. When a VLAN-tagged packet is outgoing, only 1 FDB entry is created (or updated)
## Solution 1: VLAN-untagged FDB entries should not apply to VLAN-tagged packets
In this case,
1. when a VLAN-tagged packet is incoming, only 1 FDB entry should be created (or updated),
2. when a VLAN-tagged packet is outgoing, only 1 FDB entry should be created (or updated).
## Solution 2: VLAN-untagged FDB entries should also apply to VLAN-tagged packets
In this case,
1. when a VLAN-tagged packet is incoming, 2 FDB entries should be created (or updated),
2. when a VLAN-tagged packet is outgoing, 2 FDB entries should be created (or updated).
This confusion is evidently software-based, as non-DSA versions of OpenWRT do not exhibit this bug.
The exact location of this confusion is (currently) unknown.
### OpenWrt version
r19803-9a599fee93
### OpenWrt target/subtarget
lantiq/xrx200
### Device
hh5a (BT HomeHub 5a)
### Image kind
Official downloaded image
The fixes mentioned there should already be included, but it looks like the issue hasn’t been fully resolved in the current OpenWrt versions with the 6.6 kernel.
ericwoud
(Eric W.)
January 28, 2025, 9:33pm
43
Try disabling flow offloading.
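On a stock OpenWrt firewall configuration this could look roughly like the sketch below. It assumes the standard `flow_offloading` and `flow_offloading_hw` options in the firewall `defaults` section; the generator function name is hypothetical.

```sh
# Print `uci batch` commands that switch off software and hardware
# flow offloading in the firewall defaults section.
gen_disable_offload_batch() {
    cat <<'EOF'
set firewall.@defaults[0].flow_offloading=0
set firewall.@defaults[0].flow_offloading_hw=0
commit firewall
EOF
}

# On the router (needs root):
#   gen_disable_offload_batch | uci batch
#   /etc/init.d/firewall restart
```

After restarting the firewall, repeat the roaming test to see whether the packet loss is gone.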