[BPI-R3] Crash in sta_set_sinfo+0xa18

Has anyone seen a crash like this?

I’ve got two BPI-R3s running the latest OpenWrt snapshots, with a mesh network between them (managed by wpad-mesh-wolfssl; the hostapd in this trace is from that package).

It crashes frequently; I suspect it happens when devices move between the two APs. It’s worse with 802.11r turned on.

[  127.035755] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
[  127.044543] Mem abort info:
[  127.047322]   ESR = 0x0000000096000005
[  127.051054]   EC = 0x25: DABT (current EL), IL = 32 bits
[  127.056356]   SET = 0, FnV = 0
[  127.059394]   EA = 0, S1PTW = 0
[  127.062518]   FSC = 0x05: level 1 translation fault
[  127.067387] Data abort info:
[  127.070254]   ISV = 0, ISS = 0x00000005
[  127.074078]   CM = 0, WnR = 0
[  127.077033] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000044ef3000
[  127.083452] [0000000000000008] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[  127.092139] Internal error: Oops: 96000005 [#1] SMP
[  127.097000] Modules linked in: pppoe ppp_async nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet pppox ppp_generic nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject_bridge nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_meta_bridge nft_masq nft_log nft_limit nft_hash nft_fwd_netdev nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_dup_netdev nft_ct nft_counter nft_compat nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack_bridge nf_conntrack mt7915e mt76_connac_lib mt76 mac80211 iptable_mangle iptable_filter ipt_REJECT ip_tables cfg80211 xt_time xt_tcpudp xt_multiport xt_mark xt_mac xt_limit xt_comment xt_TCPMSS xt_LOG x_tables slhc sfp nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_dup_netdev nf_defrag_ipv6 nf_defrag_ipv4 mdio_i2c libcrc32c crc_ccitt compat crypto_safexcel pwm_fan hwmon i2c_gpio i2c_algo_bit tun sha1_generic seqiv md5 des_generic libdes authencesn authenc leds_gpio xhci_plat_hcd xhci_pci xhci_mtk_hcd
[  127.097141]  xhci_hcd gpio_button_hotplug usbcore usb_common
[  127.189709] CPU: 3 PID: 1566 Comm: hostapd Not tainted 5.15.104 #0
[  127.195870] Hardware name: Bananapi BPI-R3 (DT)
[  127.200382] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  127.207323] pc : sta_set_sinfo+0xa18/0xbb0 [mac80211]
[  127.212401] lr : sta_set_sinfo+0x5b8/0xbb0 [mac80211]
[  127.217450] sp : ffffffc00a1437a0
[  127.220748] x29: ffffffc00a1437a0 x28: 0000000000000001 x27: ffffffc00a143dd0
[  127.227864] x26: ffffff800007e880 x25: ffffff8006670900 x24: ffffff80049de735
[  127.234978] x23: 000000000000dad4 x22: ffffff8006452d00 x21: 0000058d1783bf82
[  127.242093] x20: ffffff80049de000 x19: ffffff800b4b1b00 x18: 0000000000000000
[  127.249209] x17: 00000000000015e0 x16: ffffffc008f05000 x15: 0000000000000af0
[  127.256324] x14: ffffff80049dedf8 x13: ffffff80049dedf8 x12: 0000000000000000
[  127.263439] x11: 0000000000000000 x10: ffffff80049dee00 x9 : 0000000000000000
[  127.270554] x8 : ffffff800b4b1c00 x7 : 0000000000002314 x6 : ffffff8003af03a0
[  127.277669] x5 : ffffff8003af0880 x4 : 0000000000000003 x3 : ffffff800b4b1b44
[  127.284784] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000000004
[  127.291899] Call trace:
[  127.294332]  sta_set_sinfo+0xa18/0xbb0 [mac80211]
[  127.299038]  sta_set_sinfo+0xb10/0xbb0 [mac80211]
[  127.303741]  sta_info_destroy_addr_bss+0x4c/0x70 [mac80211]
[  127.309310]  ieee80211_color_change_finish+0x1bf8/0x1e80 [mac80211]
[  127.315573]  cfg80211_check_station_change+0x1384/0x4720 [cfg80211]
[  127.321834]  genl_family_rcv_msg_doit+0xb4/0x110
[  127.326437]  genl_rcv_msg+0xd0/0x1c0
[  127.329997]  netlink_rcv_skb+0x58/0x120
[  127.333816]  genl_rcv+0x34/0x50
[  127.336942]  netlink_unicast+0x1f0/0x2ec
[  127.340848]  netlink_sendmsg+0x19c/0x3d0
[  127.344753]  ____sys_sendmsg+0x21c/0x260
[  127.348660]  ___sys_sendmsg+0x80/0xf0
[  127.352307]  __sys_sendmsg+0x44/0xa0
[  127.355865]  __arm64_sys_sendmsg+0x20/0x30
[  127.359944]  invoke_syscall.constprop.0+0x4c/0xe0
[  127.364631]  do_el0_svc+0x40/0xd0
[  127.367930]  el0_svc+0x14/0x50
[  127.370973]  el0t_64_sync_handler+0xe0/0x110
[  127.375227]  el0t_64_sync+0x158/0x15c
[  127.378878] Code: d3441c42 12000c00 8b020cc2 f9409c42 (f9400446) 
[  127.384951] ---[ end trace b902af5b08d1a620 ]---
[  127.393978] Kernel panic - not syncing: Oops: Fatal exception
[  127.399708] SMP: stopping secondary CPUs
[  127.403615] Kernel Offset: disabled
[  127.407086] CPU features: 0x00000000,20000802
[  127.411427] Memory Limit: none
[  127.418816] Rebooting in 3 seconds..

While I’m at it, any way to get it to reboot into the same system? It wouldn’t be so bad if I didn’t have to go reboot the device manually afterward.

I don’t know the OpenWrt-specific boot process well enough to make sure it boots into the same system instead of recovery, but maybe I can help fix the issue itself.

https://elixir.bootlin.com/linux/latest/source/net/mac80211/sta_info.c#L2505

You can try adding some debug output to this function to check which pointer is NULL:

printk(KERN_ALERT "DEBUG: Passed %s %d val:%px\n", __func__, __LINE__, val);

Replace val with each pointer that could be NULL, stopping before the final member access so that the print itself does not dereference it. (%px prints the raw pointer value; casting the pointer to unsigned int would truncate it on this 64-bit system.)

E.g.

printk(KERN_ALERT "DEBUG: Passed %s %d sdata:%px\n", __func__, __LINE__, sdata);
printk(KERN_ALERT "DEBUG: Passed %s %d sdata->local:%px\n", __func__, __LINE__, sdata->local);
sinfo->generation = sdata->local->sta_generation;

Heh, that’s the easy part (I speak enough C to just do that) but I don’t actually know what the process for building and booting a custom kernel is in a way that would leave my existing configuration likely to work. Any tips?

It should be possible to change it in OpenWrt’s patched source and recompile. But I don’t use OpenWrt, so I don’t know the build tools well enough.

https://openwrt.org/docs/guide-developer/start#using_the_toolchain

OpenWrt uses out-of-tree Wi-Fi drivers built from backports (compat-backport). That lets us use bleeding-edge Wi-Fi drivers on top of a stable Linux kernel. Hence, if you want to add debug prints anywhere in the Wi-Fi drivers, it has to happen via a patch in package/kernel/mac80211/patches/. The easiest and most convenient way is probably to use quilt, which looks roughly like this:

make package/mac80211/{clean,prepare} QUILT=1
cd build_dir/target-*/linux-*/backports-*
quilt push -a
quilt new patches/subsys/999-my-custom-patch.patch
quilt add ${files_to_be_edited}
# now edit files
quilt refresh
cd ../../../..
make package/mac80211/compile V=99
# if it builds successfully, proceed to build the complete image
make -j$(nproc)

Alrighty, let me give it a spin, thank you for the TL;DR version, that’s super helpful.

Another user stumbled over this here: https://github.com/openwrt/openwrt/issues/12143#issuecomment-1740868809

@aredridel have you found out why this happens?

That’s right, and like @aredridel I activated Wi-Fi roaming on my four BPI-R3s. I will try disabling that on Monday.

This bug was tracked down by me, and is fixed as described in this thread:

Patch here:

Also, be cautious. The OpenWrt stack traces are a bit of a wild goose chase. Although the offsets are mostly correct, the function names are wrong.

How can you debug if the function names are wrong? They are the information people rely on most. IMHO that would mean the debug symbols in the kernel are wrong (which I doubt) and that gdb or addr2line would give wrong results as well. That said, a crash in one function does not mean the root cause is also there.

The stack trace from openwrt is a little messed up.

Say an objdump of the .o file shows:

0x00000000: functionA (0x100 bytes long)

0x00000100: functionB (0x200 bytes long)

The OpenWrt stack trace might contain something like this:

functionA+0x150/0x300

As you can see, offset 0x150 is outside functionA, which is only 0x100 bytes long.

You would need to translate it to:

functionB+0x50/0x200

to get the function that actually contains the problem.

In this case functionB has the problem, even though the OpenWrt stack trace named functionA.

So, as long as you have the associated .o or .ko file that contains functionA and matches the kernel you were actually running, you can objdump the .o file to find where the actual function name and offset is.
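
The translation described above can be sketched as a small shell helper. This is only an illustration under some assumptions: resolve_offset is a made-up name, the input is a symbol listing in the "address type name" format that nm -n prints (sorted by ascending address), and all text symbols are assumed to live in one section so their addresses do not overlap. It finds the symbol named in the trace, adds the offset, and reports the last text symbol that starts at or before that address.

```shell
# resolve_offset SYMBOL OFFSET   (hypothetical helper, not an OpenWrt tool)
# Reads "address type name" lines (nm -n style, ascending addresses) on
# stdin and prints the symbol that really contains SYMBOL + OFFSET,
# formatted as name+0xoffset.
resolve_offset() {
    want_sym=$1
    want_off=$(( $2 ))          # accepts hex such as 0x150
    target=
    best_name=
    best_addr=0
    while read -r addr type name; do
        # Only text (code) symbols are candidates for functions.
        case $type in t|T) ;; *) continue ;; esac
        val=$(( 0x$addr ))
        if [ -z "$target" ]; then
            # Still searching for the symbol named in the trace.
            [ "$name" = "$want_sym" ] || continue
            target=$(( val + want_off ))
        elif [ "$val" -gt "$target" ]; then
            # This symbol starts past the target; the previous one wins.
            break
        fi
        best_name=$name
        best_addr=$val
    done
    # Fail if the symbol from the trace was never seen.
    [ -n "$best_name" ] || return 1
    printf '%s+0x%x\n' "$best_name" $(( target - best_addr ))
}
```

With the two-function layout from the example, feeding the listing for functionA and functionB into resolve_offset functionA 0x150 yields functionB+0x50. Against a real module it could be used as nm -n mac80211.ko | resolve_offset sta_set_sinfo 0xa18, assuming the .ko file matches the kernel that produced the trace.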

So, as I say, it’s a bit messed up on OpenWrt.

I have not observed this problem yet with mainline kernels.