How to make the kernel tolerate PCIe problems instead of panicking?

Following up on my adventure here: https://forum.banana-pi.org/t/bpi-r4-how-to-activate-key-b-pcie2-11280000-solved/17295 I got a card in that slot, an ath10k one, single band, single radio. I got it to boot correctly. It’s there to provide a fallback channel next to the main one, in case the mt76 card has issues. The problem is: the card is unstable. This is somewhat expected, as it’s a software hack on an old mini PCIe card mounted in an M.2 slot. The pstore error:

[49765.058844] ath10k_pci 0003:01:00.0: bss channel survey timed out
[49768.098858] ath10k_pci 0003:01:00.0: wmi command 36956 timeout, restarting hardware
[49768.106723] ath10k_pci 0003:01:00.0: failed to send pdev bss chan info request
[49768.163693] SError Interrupt on CPU1, code 0x00000000bf000002 -- SError
[49768.163699] CPU: 1 PID: 14557 Comm: kworker/u8:3 Tainted: G           O       6.6.47 #0
[49768.163705] Hardware name: Bananapi BPI-R4 (DT)
[49768.163707] Workqueue: ath10k_wq ath10k_core_restart [ath10k_core]
[49768.163739] pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[49768.163744] pc : el1_interrupt+0x1c/0x4c
[49768.163752] lr : el1h_64_irq_handler+0x14/0x1c
[49768.163757] sp : ffffffc08295bb00
[49768.163758] x29: ffffffc08295bb00 x28: ffffff80c023a880 x27: 61c8864680b583eb
[49768.163763] x26: dead000000000122 x25: dead000000000100 x24: ffffffc079ce63a8
[49768.163768] x23: 0000000080400005 x22: ffffffc079d0cf48 x21: 0000000046cb4000
[49768.163772] x20: ffffffc080010080 x19: ffffffc08295bb30 x18: 0000000000000000
[49768.163776] x17: 0000000000000000 x16: 0000000000000000 x15: 0000007fba485c40
[49768.163781] x14: 0000001c60eb42b2 x13: 000072e3f63e4a90 x12: 0000000000000002
[49768.163785] x11: 00000000000003a0 x10: 00000000000015a8 x9 : 0000000000000000
[49768.163789] x8 : ffffff80d6bb75e0 x7 : 0000000000000000 x6 : 000000000000003f
[49768.163793] x5 : 000000000000002c x4 : 0000000000000001 x3 : ffffffc079d470d0
[49768.163797] x2 : 0000000000000000 x1 : 00000000000000c0 x0 : ffffffc08295bb30
[49768.163802] Kernel panic - not syncing: Asynchronous SError Interrupt
[49768.163804] CPU: 1 PID: 14557 Comm: kworker/u8:3 Tainted: G           O       6.6.47 #0
[49768.163807] Hardware name: Bananapi BPI-R4 (DT)
[49768.163809] Workqueue: ath10k_wq ath10k_core_restart [ath10k_core]
[49768.163827] Call trace:
[49768.163828]  dump_backtrace+0x9c/0xd8
[49768.163836]  show_stack+0x14/0x1c
[49768.163840]  dump_stack_lvl+0x44/0x58
[49768.163845]  dump_stack+0x14/0x1c
[49768.163849]  panic+0x2d0/0x330
[49768.163854]  nmi_panic+0x68/0x6c
[49768.163858]  arm64_serror_panic+0x68/0x78
[49768.163860]  do_serror+0x24/0x60
[49768.163863]  el1h_64_error_handler+0x2c/0x40
[49768.163867]  el1h_64_error+0x68/0x6c
[49768.163869]  el1_interrupt+0x1c/0x4c
[49768.163872]  el1h_64_irq_handler+0x14/0x1c
[49768.163876]  el1h_64_irq+0x68/0x6c
[49768.163878]  ath10k_ce_disable_interrupt.part.0+0x70/0x128 [ath10k_core]
[49768.163895]  ath10k_ce_disable_interrupts+0x44/0x64 [ath10k_core]
[49768.163910]  ath10k_pci_hif_stop+0x18/0x110 [ath10k_pci]
[49768.163919]  ath10k_core_stop+0x38/0x70 [ath10k_core]
[49768.163934]  ath10k_halt+0x1fc/0x2d8 [ath10k_core]
[49768.163950]  ath10k_core_restart+0x178/0x1f8 [ath10k_core]
[49768.163965]  process_one_work+0x178/0x394
[49768.163969]  worker_thread+0x2e8/0x4d0
[49768.163971]  kthread+0xd8/0xdc
[49768.163976]  ret_from_fork+0x10/0x20
[49768.163980] SMP: stopping secondary CPUs
[49768.163985] Kernel Offset: disabled
[49768.163986] CPU features: 0x0,00000010,20000000,1000400b
[49768.163990] Memory Limit: none
[49768.177633] pstore: backend (ramoops) writing error (-28)

It can be fixed by powering off the device (for > 10 seconds) and booting it up again. Simply rebooting or pressing the hard reset button is useless. Here’s a panic after exiting recovery mode and rebooting to normal:

[   58.803901] bus: 'pci': really_probe: bound device 0001:01:00.0 to driver mt7915e
[   58.811448] bus: 'platform': add driver mt798x-wmac
[   58.831542] bus: 'pci': add driver ath10k_pci
[   58.835915] bus: 'pci': __driver_probe_device: matched device 0003:01:00.0 with driver ath10k_pci
[   58.844805] bus: 'pci': really_probe: probing driver ath10k_pci with device 0003:01:00.0
[   58.852905] ath10k_pci 0003:01:00.0: no default pinctrl state
[   58.907871] ath10k_pci 0003:01:00.0: assign IRQ: got 0
[   59.062877] ath10k_pci 0003:01:00.0: enabling device (0000 -> 0002)
[   59.218874] ath10k_pci 0003:01:00.0: enabling bus mastering
[   59.323866] SError Interrupt on CPU2, code 0x00000000bf000002 -- SError
[   59.323871] CPU: 2 PID: 1645 Comm: kmodloader Tainted: G           O       6.6.47 #0
[   59.323877] Hardware name: Bananapi BPI-R4 (DT)
[   59.323879] pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   59.323884] pc : ath10k_pci_wake_wait+0x30/0xf0 [ath10k_pci]
[   59.323899] lr : ath10k_pci_force_wake.part.0+0x88/0xd4 [ath10k_pci]
[   59.323909] sp : ffffffc084cfb880
[   59.323911] x29: ffffffc084cfb880 x28: 0000000000000000 x27: ffffff80c0e740b8
[   59.323917] x26: ffffff80d6b79260 x25: ffffff80c0e74000 x24: 0000000000000000
[   59.323922] x23: 000000000000752f x22: ffffff80d6b79260 x21: ffffff80d6b71fc0
[   59.323926] x20: 0000000000000000 x19: 0000000000000005 x18: 0000000000000d7c
[   59.323930] x17: 0000000000001180 x16: 0000000000001160 x15: ffffffc080d72890
[   59.323934] x14: 0000000000002874 x13: 0000000000000d7c x12: 00000000ffffffea
[   59.323938] x11: 00000000ffffefff x10: ffffffc080dca890 x9 : 0000000000000000
[   59.323942] x8 : 0000003d6b297000 x7 : ffffffc080ef89e8 x6 : ffffff80fffff108
[   59.323946] x5 : 0068000020200711 x4 : fffffffe03051900 x3 : 0000000000000000
[   59.323950] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000003
[   59.323954] Kernel panic - not syncing: Asynchronous SError Interrupt
[   59.323957] CPU: 2 PID: 1645 Comm: kmodloader Tainted: G           O       6.6.47 #0
[   59.323960] Hardware name: Bananapi BPI-R4 (DT)
[   59.323962] Call trace:
[   59.323964]  dump_backtrace+0x9c/0xd8
[   59.323973]  show_stack+0x14/0x1c
[   59.323977]  dump_stack_lvl+0x44/0x58
[   59.323984]  dump_stack+0x14/0x1c
[   59.323987]  panic+0x2d0/0x330
[   59.323992]  nmi_panic+0x68/0x6c
[   59.323996]  arm64_serror_panic+0x68/0x78
[   59.323998]  do_serror+0x24/0x60
[   59.324000]  el1h_64_error_handler+0x2c/0x40
[   59.324006]  el1h_64_error+0x68/0x6c
[   59.324008]  ath10k_pci_wake_wait+0x30/0xf0 [ath10k_pci]
[   59.324018]  ath10k_pci_force_wake.part.0+0x88/0xd4 [ath10k_pci]
[   59.324026]  ath10k_pci_probe+0x384/0x8bc [ath10k_pci]
[   59.324035]  pci_device_probe+0x94/0x12c
[   59.324040]  really_probe+0x174/0x378
[   59.324044]  __driver_probe_device+0xa4/0x150
[   59.324047]  driver_probe_device+0x3c/0xd4
[   59.324049]  __driver_attach+0x9c/0x1a4
[   59.324052]  bus_for_each_dev+0x60/0x9c
[   59.324055]  driver_attach+0x20/0x28
[   59.324057]  bus_add_driver+0xf4/0x214
[   59.324059]  driver_register+0x58/0x114
[   59.324062]  __pci_register_driver+0x48/0x50
[   59.324064]  __init_backport+0x28/0x1000 [ath10k_pci]
[   59.324073]  do_one_initcall+0x2c/0x218
[   59.324076]  do_init_module+0x54/0x1dc
[   59.324079]  load_module+0x19e4/0x1ad0
[   59.324082]  __do_sys_init_module+0x210/0x258
[   59.324084]  __arm64_sys_init_module+0x18/0x20
[   59.324086]  el0_svc_common.constprop.0+0x60/0x138
[   59.324090]  do_el0_svc+0x18/0x20
[   59.324094]  el0_svc+0x24/0x90
[   59.324097]  el0t_64_sync_handler+0x118/0x124
[   59.324101]  el0t_64_sync+0x150/0x154
[   59.324104] SMP: stopping secondary CPUs
[   59.324108] Kernel Offset: disabled
[   59.324109] CPU features: 0x0,00000010,20000000,1000400b
[   59.324113] Memory Limit: none
[   59.337006] pstore: backend (ramoops) writing error (-28)

So, how do I explain to the kernel “when this card fails, don’t panic, simply take it offline and continue”? A better option would of course be to turn off the power supply to the card via the pin controller and turn it back on, but I guess that’s a no-go, and fixing the link … yeah, I don’t believe that’s gonna happen.
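
To make “take it offline” concrete, here is a minimal sketch of what doing it by hand from userspace would look like, assuming the standard Linux sysfs PCI interface (the `remove` file of a device and the bus-wide `rescan` file). The function names and the `sysfs_root` parameter are made up for illustration, and `0003:01:00.0` is simply the address from the logs above. This is untested on the BPI-R4: removing the device makes the driver touch the hardware, so it may well trigger the same SError if the link is already dead.

```python
from pathlib import Path


def take_pci_device_offline(address: str, sysfs_root: str = "/sys") -> bool:
    """Ask the kernel to detach a PCI device via its sysfs 'remove' file.

    Returns False if the device (or its 'remove' file) is not present.
    `sysfs_root` is a hypothetical parameter, only here so the sketch can be
    tried against a fake directory tree instead of a real /sys.
    """
    remove_file = Path(sysfs_root) / "bus" / "pci" / "devices" / address / "remove"
    if not remove_file.exists():
        return False
    # Writing "1" unbinds the driver and removes the device from the bus.
    remove_file.write_text("1")
    return True


def rescan_pci_bus(sysfs_root: str = "/sys") -> None:
    """Ask the kernel to rescan all PCI buses so removed devices can reappear."""
    (Path(sysfs_root) / "bus" / "pci" / "rescan").write_text("1")


if __name__ == "__main__":
    # Needs root on a real system; address taken from the logs in this post.
    if take_pci_device_offline("0003:01:00.0"):
        print("ath10k card removed from the bus")
    else:
        print("device not found")
```

Note that this only covers the “take it offline” half of the question. It does not change how the kernel reacts to the asynchronous SError itself, which is the part that panics in both traces.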