Hi,
as described here, we have issues getting the second PCIe slot (CN8) working. I have now tried my Wi-Fi card in the other slot and loaded the ath10k kernel driver, and it crashes:
[ 10.459496] ath10k_pci 0000:01:00.0: assign IRQ: got 140
[ 10.481601] pci 0000:00:00.0: enabling device (0000 -> 0002)
[ 10.488761] pci 0000:00:00.0: enabling bus mastering
[ 10.494914] ath10k_pci 0000:01:00.0: enabling device (0000 -> 0002)
[ 10.502656] ath10k_pci 0000:01:00.0: enabling bus mastering
[ 10.509667] Unable to handle kernel paging request at virtual address 0000000
[ 10.517607] Mem abort info:
[ 10.520450] ESR = 0x96000005
[ 10.523529] EC = 0x25: DABT (current EL), IL = 32 bits
[ 10.528855] SET = 0, FnV = 0
[ 10.531917] EA = 0, S1PTW = 0
[ 10.535066] Data abort info:
[ 10.537939] ISV = 0, ISS = 0x00000005
[ 10.541781] CM = 0, WnR = 0
[ 10.544759] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000079416000
[ 10.551209] [0000000400000040] pgd=0000000000000000, pud=0000000000000000
[ 10.558010] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[ 10.563576] Modules linked in: ath10k_pci(+) ath10k_core mt7622(+) ath mt76 t
[ 10.573673] CPU: 0 PID: 123 Comm: systemd-udevd Not tainted 5.4.0-r64-main #2
[ 10.580886] Hardware name: Bananapi BPI-R64 (DT)
[ 10.585496] pstate: 00000005 (nzcv daif -PAN -UAO)
[ 10.590287] pc : mutex_can_spin_on_owner+0x30/0x5c
[ 10.595070] lr : mutex_can_spin_on_owner+0x24/0x5c
[ 10.599851] sp : ffffffc010ceb580
[ 10.603156] x29: ffffffc010ceb580 x28: 0000000000080000
[ 10.608461] x27: ffffffc010ceb838 x26: 0000000000000001
[ 10.613766] x25: 000000000000008d x24: 0000000000000002
[ 10.619071] x23: 000000000000008d x22: ffffff803e155500
[ 10.624375] x21: ffffff803dc0bc00 x20: ffffffc010838000
[ 10.629680] x19: ffffff803e155500 x18: 000000000000000a
[ 10.634984] x17: 0000000000000000 x16: 0000000000000000
[ 10.640289] x15: 000000000000008d x14: ffffff8039792604
[ 10.645593] x13: ffffffffffffffff x12: 0000000000000010
[ 10.650898] x11: 0101010101010101 x10: ffffff803dc0bc68
[ 10.656203] x9 : 0000000000080000 x8 : ffffff803dc0bc60
[ 10.661507] x7 : ffffff803e155500 x6 : ffffff800336a400
[ 10.666811] x5 : 0000000400000003 x4 : 0000000400000003
[ 10.672116] x3 : ffffffc0108f6000 x2 : ffffff800336a400
[ 10.677420] x1 : ffffff800336a400 x0 : 0000000400000000
[ 10.682726] Call trace:
[ 10.685169] mutex_can_spin_on_owner+0x30/0x5c
[ 10.689608] __mutex_lock.isra.9+0x58/0x2a4
[ 10.693784] __mutex_lock_slowpath+0x10/0x18
[ 10.698047] mutex_lock+0x44/0x68
[ 10.701358] mtk_pcie_irq_domain_alloc+0x38/0xc8
[ 10.705970] irq_domain_alloc_irqs_hierarchy+0x14/0x1c
[ 10.711100] irq_domain_alloc_irqs_parent+0x14/0x24
[ 10.715970] msi_domain_alloc+0x90/0x130
[ 10.719886] irq_domain_alloc_irqs_hierarchy+0x14/0x1c
[ 10.725017] __irq_domain_alloc_irqs+0x140/0x2b4
[ 10.729626] msi_domain_alloc_irqs+0x134/0x2c4
[ 10.734063] pci_msi_setup_msi_irqs+0x28/0x38
[ 10.738412] __pci_enable_msi_range+0x208/0x30c
[ 10.742935] pci_enable_msi+0x18/0x28
[ 10.746604] ath10k_pci_probe+0x50c/0x6d8 [ath10k_pci]
[ 10.751739] pci_device_probe+0xb4/0x144
[ 10.755658] really_probe+0x238/0x3f8
[ 10.759314] driver_probe_device+0x114/0x124
[ 10.763577] device_driver_attach+0x40/0x68
[ 10.767753] __driver_attach+0x134/0x138
[ 10.771668] bus_for_each_dev+0x78/0xbc
[ 10.775498] driver_attach+0x20/0x28
[ 10.779066] bus_add_driver+0x1a8/0x1ec
[ 10.782894] driver_register+0xac/0xe4
[ 10.786638] __pci_register_driver+0x40/0x48
[ 10.790913] ath10k_pci_init+0x28/0x1000 [ath10k_pci]
[ 10.795959] do_one_initcall+0x74/0x178
[ 10.799790] do_init_module+0x58/0x2fc
[ 10.803533] load_module+0x113c/0x1608
[ 10.807276] __do_sys_finit_module+0xd0/0xf0
[ 10.811539] __arm64_sys_finit_module+0x18/0x20
[ 10.816063] el0_svc_common.constprop.1+0xfc/0x168
[ 10.820846] el0_svc_handler+0x44/0x70
[ 10.824586] el0_svc+0x8/0xc
[ 10.827465] Code: 94005b39 f9400260 f27df000 54000080 (b9404013)
[ 10.833552] ---[ end trace 3741f6ce457a2bec ]---
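One way to read the "Code:" line: the word in parentheses (b9404013) is the instruction that faulted. Below is a minimal decoder I wrote for this post. It handles only the AArch64 "LDR (immediate, unsigned offset)" form and rejects every other encoding, and its helper names are made up. It decodes b9404013 as "ldr w19, [x0, #0x40]" and the earlier f9400260 as "ldr x0, [x19]".

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Fields of an AArch64 "LDR Rt, [Rn, #imm]" (unsigned-offset form). */
struct ldr_imm {
    int is_64bit;     /* 1 for an Xt destination, 0 for Wt */
    unsigned rt;      /* destination register number */
    unsigned rn;      /* base register number */
    unsigned offset;  /* byte offset = imm12 scaled by the access size */
};

/* Returns 1 and fills *out if insn is a 32- or 64-bit LDR (immediate,
 * unsigned offset); returns 0 for any other encoding. */
int decode_ldr_imm(uint32_t insn, struct ldr_imm *out)
{
    assert(out != NULL);
    uint32_t top = insn & 0xFFC00000u;  /* size, fixed bits, V, opc */
    if (top == 0xB9400000u)
        out->is_64bit = 0;
    else if (top == 0xF9400000u)
        out->is_64bit = 1;
    else
        return 0;

    unsigned imm12 = (insn >> 10) & 0xFFFu;
    out->rn = (insn >> 5) & 0x1Fu;
    out->rt = insn & 0x1Fu;
    out->offset = imm12 * (out->is_64bit ? 8u : 4u);
    return 1;
}

/* Prints the decoded form, e.g. for words copied from an oops "Code:" line. */
void print_ldr(uint32_t insn)
{
    struct ldr_imm d;
    if (decode_ldr_imm(insn, &d))
        printf("%08x: ldr %c%u, [x%u, #0x%x]\n", (unsigned)insn,
               d.is_64bit ? 'x' : 'w', d.rt, d.rn, d.offset);
    else
        printf("%08x: not an LDR (immediate, unsigned offset)\n", (unsigned)insn);
}
```

If this decoding is right, the faulting load reads 32 bits at x0 + 0x40. With x0 = 0000000400000000, that is exactly the address 0000000400000040 in the pgd line above. The preceding "ldr x0, [x19]" loads the first word of the lock (x19 = ffffff803e155500, presumably &port->lock, whose first member should be the owner field). So mutex_can_spin_on_owner seems to be following an owner pointer of 0x400000000, which is not a kernel address.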
Does anyone have an idea whether the problem is in the PCIe driver/tphy or in ath10k?
I used kernel 5.4.0-r64-main. Since I get a similar crash on 4.19, I guess it's also a PCIe driver issue:
[ 5.870751] Call trace:
[ 5.873272] __mutex_lock.isra.1+0x238/0x498
[ 5.877672] __mutex_lock_slowpath+0x10/0x18
[ 5.882070] mutex_lock+0x2c/0x34
[ 5.885486] mtk_pcie_irq_domain_alloc+0x38/0xc8
@sinovoip @ryder.lee @moore, can you please try using a PCIe card in both slots with its device driver loaded (e.g. a Wi-Fi driver), so that the card is fully initialized and not only detected by lspci?
As far as I currently know, the crash happens in mtk_pcie_irq_domain_alloc (drivers/pci/controller/pcie-mediatek.c) while taking the mutex (or directly after it); port is not NULL:
[ 11.530288] DEBUG: Passed mtk_pcie_irq_domain_alloc 441 port:0x00000000b3aaf77c
[ 11.537629] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000260
....
mtk_pcie_irq_domain_alloc+0x5c
(gdb) list *(mtk_pcie_irq_domain_alloc+0x5c)
0xffffffc0102e68cc is in mtk_pcie_irq_domain_alloc (drivers/pci/controller/pcie-mediatek.c:447).
442
443 WARN_ON(nr_irqs != 1);
444 mutex_lock(&port->lock);
445
446
447 printk(KERN_ALERT "DEBUG: Passed %s %d\n",__FUNCTION__,__LINE__);
448 bit = find_first_zero_bit(port->msi_irq_in_use, MTK_MSI_IRQS_NUM);
The printk on line 447 is not printed, and there is no pointer dereference between line 444 and line 447, so we are still inside mutex_lock. I guessed that port->lock is not yet initialized… but that is done in mtk_pcie_allocate_msi_domains, which runs before (I have checked)… so I still don't understand the crash.
I found out that the mutex is already locked…
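To test the "not initialized" theory more directly, the kernel option CONFIG_DEBUG_MUTEXES may help. As far as I know, the debug build stores a self-pointer (lock->magic) in mutex_init and warns in the mutex_lock slow path when it does not match. Below is only a user-space analogue of that idea with made-up names (dbg_mutex, fake_port), not kernel code. A lock records its own address when initialized, so the check fails both for a lock that was never initialized and for one that lives at a different address than the one that was initialized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct mutex: only what is needed
 * to show the self-pointer idea, not the kernel's real layout. */
struct dbg_mutex {
    unsigned long owner;  /* 0 = unlocked, otherwise an owner token */
    const void *magic;    /* set to the lock's own address by dbg_mutex_init() */
};

/* Hypothetical port struct; the name only mirrors the driver's. */
struct fake_port {
    int slot;
    struct dbg_mutex lock;
};

void dbg_mutex_init(struct dbg_mutex *m)
{
    assert(m != NULL);
    m->owner = 0;
    m->magic = m;  /* remember where this lock was initialized */
}

/* Returns 1 if the lock was initialized at its current address, else 0.
 * Catches never-initialized locks and locks copied or moved after init. */
int dbg_mutex_sane(const struct dbg_mutex *m)
{
    assert(m != NULL);
    return m->magic == (const void *)m;
}
```

A cheap kernel-side equivalent would be to print port and &port->lock in both mtk_pcie_allocate_msi_domains and mtk_pcie_irq_domain_alloc and compare the addresses.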
/* inserted before mutex_lock() */
if (mutex_is_locked(&port->lock))
	printk(KERN_ALERT "DEBUG: %s mutex already locked\n", __FUNCTION__);
[ 11.395077] DEBUG: mtk_pcie_irq_domain_alloc mutex already locked
So I added a mutex_unlock() inside that condition and got no crash at boot (but the function does not seem to continue past that point), and later an rcu_preempt self-detected stall happens:
[ 788.702325] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 788.708075] rcu: 1-....: (193565 ticks this GP) idle=a7e/1/0x4000000000000002 softirq=5105/5105 fqs=96767
[ 788.717806] (t=194286 jiffies g=805 q=7450)
[ 788.722068] Task dump for CPU 1:
[ 788.725288] systemd-udevd R running task 0 122 119 0x0000002b
[ 788.732331] Call trace:
[ 788.734773] dump_backtrace+0x0/0x160
[ 788.738428] show_stack+0x14/0x1c
[ 788.741738] sched_show_task+0xf8/0x130
[ 788.745566] dump_cpu_task+0x40/0x114
[ 788.749222] rcu_dump_cpu_stacks+0xc8/0xd4
[ 788.753310] rcu_sched_clock_irq+0x31c/0x7d4
[ 788.757574] update_process_times+0x2c/0x50
[ 788.761750] tick_sched_handle.isra.12+0x3c/0x44
[ 788.766360] tick_sched_timer+0x54/0x94
[ 788.770187] __hrtimer_run_queues+0xe4/0x13c
[ 788.774448] hrtimer_interrupt+0xb8/0x1c0
[ 788.778452] arch_timer_handler_phys+0x28/0x3c
[ 788.782893] handle_percpu_devid_irq+0x58/0xf8
[ 788.787328] generic_handle_irq+0x18/0x2c
[ 788.791330] __handle_domain_irq+0x94/0x98
[ 788.795418] gic_handle_irq+0x70/0xac
[ 788.799072] el1_irq+0xb8/0x180
[ 788.802207] __cmpwait_case_32+0x18/0x1c
[ 788.806121] do_raw_spin_lock+0x48/0x6c
[ 788.809952] _raw_spin_lock+0x20/0x2c
[ 788.813608] __mutex_unlock_slowpath.isra.19+0x70/0x114
[ 788.818825] mutex_unlock+0x2c/0x34
[ 788.822307] mtk_pcie_irq_domain_alloc+0x7c/0x148
[ 788.827004] irq_domain_alloc_irqs_hierarchy+0x14/0x1c
[ 788.832134] irq_domain_alloc_irqs_parent+0x14/0x24
[ 788.837005] msi_domain_alloc+0x90/0x130
[ 788.840919] irq_domain_alloc_irqs_hierarchy+0x14/0x1c
[ 788.846050] __irq_domain_alloc_irqs+0x140/0x2b4
[ 788.850658] msi_domain_alloc_irqs+0x134/0x2c4
[ 788.855094] pci_msi_setup_msi_irqs+0x28/0x38
[ 788.859443] __pci_enable_msi_range+0x208/0x30c
[ 788.863966] pci_enable_msi+0x18/0x28
[ 788.867633] ath10k_pci_probe+0x50c/0x6d8 [ath10k_pci]
[ 788.872765] pci_device_probe+0xb4/0x144
[ 788.876682] really_probe+0x238/0x3f8
[ 788.880336] driver_probe_device+0x114/0x124
[ 788.884598] device_driver_attach+0x40/0x68
[ 788.888774] __driver_attach+0x134/0x138
[ 788.892688] bus_for_each_dev+0x78/0xbc
[ 788.896515] driver_attach+0x20/0x28
[ 788.900082] bus_add_driver+0x1a8/0x1ec
[ 788.903911] driver_register+0xac/0xe4
[ 788.907652] __pci_register_driver+0x40/0x48
[ 788.911920] ath10k_pci_init+0x28/0x1000 [ath10k_pci]
[ 788.916964] do_one_initcall+0x74/0x178
[ 788.920793] do_init_module+0x58/0x2fc
[ 788.924535] load_module+0x113c/0x1608
[ 788.928277] __do_sys_finit_module+0xd0/0xf0
[ 788.932538] __arm64_sys_finit_module+0x18/0x20
[ 788.937061] el0_svc_common.constprop.1+0xfc/0x168
[ 788.941843] el0_svc_handler+0x44/0x70
[ 788.945583] el0_svc+0x8/0xc
[ 795.618344] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 195971 jiffies s: 21 root: 0x2/.
[ 795.629067] rcu: blocking rcu_node structures:
[ 795.634072] Task dump for CPU 1:
[ 795.637567] systemd-udevd R running task 0 122 119 0x0000002b
[ 795.644842] Call trace:
[ 795.647530] __switch_to+0xcc/0x118
[ 795.651238] 0xffffffc010808960
After a power-off and a few reboots this no longer seems to happen… but it is strange that the mutex is already locked when mtk_pcie_irq_domain_alloc is entered.
mutex_lock is only used in mtk_pcie_irq_domain_alloc / mtk_pcie_irq_domain_free, there is no obvious missing unlock, and mutex_init initializes the mutex to the unlocked state… so it looks like the mutex is locked somewhere else… but the struct holding the lock is defined in pcie-mediatek.c, so IMHO it can't be used anywhere else.
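One possibility does not need another user of the struct: something writes over the memory behind port (for example a stale pointer, a wrong host_data pointer handed to the IRQ domain, or an out-of-bounds write from a neighbouring allocation), and the lock then reads as held without any mutex_lock() call. The toy sketch below only illustrates that mechanism in user space. toy_port, toy_lock and stray_write() are hypothetical names with made-up fields, and the sketch says nothing about which of these cases, if any, actually happens in pcie-mediatek.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins, NOT the real mtk_pcie_port / struct mutex layout. */
struct toy_lock {
    uint64_t owner;  /* 0 = unlocked, anything else reads as "locked" */
};

struct toy_port {
    struct toy_lock lock;
    uint32_t msi_irq_in_use;
};

int toy_is_locked(const struct toy_lock *l)
{
    assert(l != NULL);
    return l->owner != 0;
}

/* Hypothetical unrelated writer that was handed the port's memory
 * (e.g. via a stale or wrongly typed pointer) and stores 8 bytes there. */
void stray_write(void *mem, uint64_t value)
{
    assert(mem != NULL);
    memcpy(mem, &value, sizeof value);
}
```

No lock function is ever called on p.lock here, yet after stray_write(&p, 0x0000000400000003ULL) it reports "locked". This is the same symptom as the "mutex already locked" debug line.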