[BPI-R64] PCIe issues

Hi,

As described here, we have issues getting the second PCIe slot (CN8) working. I have now tried my Wi-Fi card in the other slot and loaded the ath10k kernel driver, and it crashes:

[   10.459496] ath10k_pci 0000:01:00.0: assign IRQ: got 140                     
[   10.481601] pci 0000:00:00.0: enabling device (0000 -> 0002)           
[   10.488761] pci 0000:00:00.0: enabling bus mastering         
[   10.494914] ath10k_pci 0000:01:00.0: enabling device (0000 -> 0002)
[   10.502656] ath10k_pci 0000:01:00.0: enabling bus mastering                  
[   10.509667] Unable to handle kernel paging request at virtual address 0000000
[   10.517607] Mem abort info:                                                  
[   10.520450]   ESR = 0x96000005                                               
[   10.523529]   EC = 0x25: DABT (current EL), IL = 32 bits                     
[   10.528855]   SET = 0, FnV = 0                                               
[   10.531917]   EA = 0, S1PTW = 0                                              
[   10.535066] Data abort info:                                                 
[   10.537939]   ISV = 0, ISS = 0x00000005                                      
[   10.541781]   CM = 0, WnR = 0                                                
[   10.544759] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000079416000        
[   10.551209] [0000000400000040] pgd=0000000000000000, pud=0000000000000000    
[   10.558010] Internal error: Oops: 96000005 [#1] PREEMPT SMP                  
[   10.563576] Modules linked in: ath10k_pci(+) ath10k_core mt7622(+) ath mt76 t
[   10.573673] CPU: 0 PID: 123 Comm: systemd-udevd Not tainted 5.4.0-r64-main #2
[   10.580886] Hardware name: Bananapi BPI-R64 (DT)                             
[   10.585496] pstate: 00000005 (nzcv daif -PAN -UAO)                           
[   10.590287] pc : mutex_can_spin_on_owner+0x30/0x5c                           
[   10.595070] lr : mutex_can_spin_on_owner+0x24/0x5c                           
[   10.599851] sp : ffffffc010ceb580                                            
[   10.603156] x29: ffffffc010ceb580 x28: 0000000000080000                      
[   10.608461] x27: ffffffc010ceb838 x26: 0000000000000001                      
[   10.613766] x25: 000000000000008d x24: 0000000000000002                      
[   10.619071] x23: 000000000000008d x22: ffffff803e155500                      
[   10.624375] x21: ffffff803dc0bc00 x20: ffffffc010838000                      
[   10.629680] x19: ffffff803e155500 x18: 000000000000000a                      
[   10.634984] x17: 0000000000000000 x16: 0000000000000000                      
[   10.640289] x15: 000000000000008d x14: ffffff8039792604                      
[   10.645593] x13: ffffffffffffffff x12: 0000000000000010                      
[   10.650898] x11: 0101010101010101 x10: ffffff803dc0bc68                      
[   10.656203] x9 : 0000000000080000 x8 : ffffff803dc0bc60                      
[   10.661507] x7 : ffffff803e155500 x6 : ffffff800336a400                      
[   10.666811] x5 : 0000000400000003 x4 : 0000000400000003                      
[   10.672116] x3 : ffffffc0108f6000 x2 : ffffff800336a400                      
[   10.677420] x1 : ffffff800336a400 x0 : 0000000400000000                      
[   10.682726] Call trace:                                                      
[   10.685169]  mutex_can_spin_on_owner+0x30/0x5c                               
[   10.689608]  __mutex_lock.isra.9+0x58/0x2a4                                  
[   10.693784]  __mutex_lock_slowpath+0x10/0x18                                 
[   10.698047]  mutex_lock+0x44/0x68                                            
[   10.701358]  mtk_pcie_irq_domain_alloc+0x38/0xc8                             
[   10.705970]  irq_domain_alloc_irqs_hierarchy+0x14/0x1c                       
[   10.711100]  irq_domain_alloc_irqs_parent+0x14/0x24                          
[   10.715970]  msi_domain_alloc+0x90/0x130                                     
[   10.719886]  irq_domain_alloc_irqs_hierarchy+0x14/0x1c                       
[   10.725017]  __irq_domain_alloc_irqs+0x140/0x2b4                             
[   10.729626]  msi_domain_alloc_irqs+0x134/0x2c4                               
[   10.734063]  pci_msi_setup_msi_irqs+0x28/0x38                                
[   10.738412]  __pci_enable_msi_range+0x208/0x30c                              
[   10.742935]  pci_enable_msi+0x18/0x28                                        
[   10.746604]  ath10k_pci_probe+0x50c/0x6d8 [ath10k_pci]                       
[   10.751739]  pci_device_probe+0xb4/0x144                                     
[   10.755658]  really_probe+0x238/0x3f8                                        
[   10.759314]  driver_probe_device+0x114/0x124                                 
[   10.763577]  device_driver_attach+0x40/0x68                                  
[   10.767753]  __driver_attach+0x134/0x138                                     
[   10.771668]  bus_for_each_dev+0x78/0xbc                                      
[   10.775498]  driver_attach+0x20/0x28                                         
[   10.779066]  bus_add_driver+0x1a8/0x1ec                                      
[   10.782894]  driver_register+0xac/0xe4                                       
[   10.786638]  __pci_register_driver+0x40/0x48                                 
[   10.790913]  ath10k_pci_init+0x28/0x1000 [ath10k_pci]                        
[   10.795959]  do_one_initcall+0x74/0x178                                      
[   10.799790]  do_init_module+0x58/0x2fc                                       
[   10.803533]  load_module+0x113c/0x1608                                       
[   10.807276]  __do_sys_finit_module+0xd0/0xf0                                 
[   10.811539]  __arm64_sys_finit_module+0x18/0x20                              
[   10.816063]  el0_svc_common.constprop.1+0xfc/0x168                           
[   10.820846]  el0_svc_handler+0x44/0x70                                       
[   10.824586]  el0_svc+0x8/0xc                                                 
[   10.827465] Code: 94005b39 f9400260 f27df000 54000080 (b9404013)             
[   10.833552] ---[ end trace 3741f6ce457a2bec ]---

Does anyone have an idea whether the problem is in the PCIe driver/T-PHY or in ath10k?

I used kernel 5.4.0-r64-main. As I get a similar crash on 4.19, I guess it is a PCIe driver issue as well:

[    5.870751] Call trace:
[    5.873272]  __mutex_lock.isra.1+0x238/0x498
[    5.877672]  __mutex_lock_slowpath+0x10/0x18
[    5.882070]  mutex_lock+0x2c/0x34
[    5.885486]  mtk_pcie_irq_domain_alloc+0x38/0xc8

@sinovoip @ryder.lee @moore can you please try a PCIe card in both slots with its device driver loaded (e.g. a Wi-Fi driver), so that the card is fully initialized and not only recognized by lspci?

As far as I currently know, the crash happens in mtk_pcie_irq_domain_alloc (drivers/pci/controller/pcie-mediatek.c) when trying to take the mutex lock (or directly after it). port is not NULL:

[   11.530288] DEBUG: Passed mtk_pcie_irq_domain_alloc 441 port:0x00000000b3aaf77c                                                                 
[   11.537629] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000260    
....
mtk_pcie_irq_domain_alloc+0x5c                                        

(gdb) list *(mtk_pcie_irq_domain_alloc+0x5c)                                                                                                       
0xffffffc0102e68cc is in mtk_pcie_irq_domain_alloc (drivers/pci/controller/pcie-mediatek.c:447).                                                   
442                                                                                                                                                
443             WARN_ON(nr_irqs != 1);                                                                                                             
444             mutex_lock(&port->lock);                                                                                                           
445                                                                                                                                                
446                                                                                                                                                
447             printk(KERN_ALERT "DEBUG: Passed %s %d\n",__FUNCTION__,__LINE__);                                                                  
448             bit = find_first_zero_bit(port->msi_irq_in_use, MTK_MSI_IRQS_NUM);

The printk on line 447 is not printed and I have no pointer dereference there, so we must still be in mutex_lock. I guessed that port->lock is not yet initialized, but that is done in mtk_pcie_allocate_msi_domains, which runs before (I checked), so I don't understand the crash so far.

I found out that the mutex is already locked:

if (mutex_is_locked(&port->lock)) printk(KERN_ALERT "DEBUG: %s mutex already locked\n",__FUNCTION__); //before mutex_lock()

[   11.395077] DEBUG: mtk_pcie_irq_domain_alloc mutex already locked

So I added a mutex_unlock() inside that condition and got no crash on bootup (but it seems that the function is not executed any further), and later an rcu_preempt self-detected stall happens:

[  788.702325] rcu: INFO: rcu_preempt self-detected stall on CPU
[  788.708075] rcu:     1-....: (193565 ticks this GP) idle=a7e/1/0x4000000000000002 softirq=5105/5105 fqs=96767 
[  788.717806]  (t=194286 jiffies g=805 q=7450)
[  788.722068] Task dump for CPU 1:
[  788.725288] systemd-udevd   R  running task        0   122    119 0x0000002b
[  788.732331] Call trace:
[  788.734773]  dump_backtrace+0x0/0x160
[  788.738428]  show_stack+0x14/0x1c
[  788.741738]  sched_show_task+0xf8/0x130
[  788.745566]  dump_cpu_task+0x40/0x114
[  788.749222]  rcu_dump_cpu_stacks+0xc8/0xd4
[  788.753310]  rcu_sched_clock_irq+0x31c/0x7d4
[  788.757574]  update_process_times+0x2c/0x50
[  788.761750]  tick_sched_handle.isra.12+0x3c/0x44
[  788.766360]  tick_sched_timer+0x54/0x94
[  788.770187]  __hrtimer_run_queues+0xe4/0x13c
[  788.774448]  hrtimer_interrupt+0xb8/0x1c0
[  788.778452]  arch_timer_handler_phys+0x28/0x3c
[  788.782893]  handle_percpu_devid_irq+0x58/0xf8
[  788.787328]  generic_handle_irq+0x18/0x2c
[  788.791330]  __handle_domain_irq+0x94/0x98
[  788.795418]  gic_handle_irq+0x70/0xac
[  788.799072]  el1_irq+0xb8/0x180
[  788.802207]  __cmpwait_case_32+0x18/0x1c
[  788.806121]  do_raw_spin_lock+0x48/0x6c
[  788.809952]  _raw_spin_lock+0x20/0x2c
[  788.813608]  __mutex_unlock_slowpath.isra.19+0x70/0x114
[  788.818825]  mutex_unlock+0x2c/0x34
[  788.822307]  mtk_pcie_irq_domain_alloc+0x7c/0x148
[  788.827004]  irq_domain_alloc_irqs_hierarchy+0x14/0x1c
[  788.832134]  irq_domain_alloc_irqs_parent+0x14/0x24
[  788.837005]  msi_domain_alloc+0x90/0x130
[  788.840919]  irq_domain_alloc_irqs_hierarchy+0x14/0x1c
[  788.846050]  __irq_domain_alloc_irqs+0x140/0x2b4
[  788.850658]  msi_domain_alloc_irqs+0x134/0x2c4
[  788.855094]  pci_msi_setup_msi_irqs+0x28/0x38
[  788.859443]  __pci_enable_msi_range+0x208/0x30c
[  788.863966]  pci_enable_msi+0x18/0x28
[  788.867633]  ath10k_pci_probe+0x50c/0x6d8 [ath10k_pci]
[  788.872765]  pci_device_probe+0xb4/0x144
[  788.876682]  really_probe+0x238/0x3f8
[  788.880336]  driver_probe_device+0x114/0x124
[  788.884598]  device_driver_attach+0x40/0x68
[  788.888774]  __driver_attach+0x134/0x138
[  788.892688]  bus_for_each_dev+0x78/0xbc
[  788.896515]  driver_attach+0x20/0x28
[  788.900082]  bus_add_driver+0x1a8/0x1ec
[  788.903911]  driver_register+0xac/0xe4
[  788.907652]  __pci_register_driver+0x40/0x48
[  788.911920]  ath10k_pci_init+0x28/0x1000 [ath10k_pci]
[  788.916964]  do_one_initcall+0x74/0x178
[  788.920793]  do_init_module+0x58/0x2fc
[  788.924535]  load_module+0x113c/0x1608
[  788.928277]  __do_sys_finit_module+0xd0/0xf0
[  788.932538]  __arm64_sys_finit_module+0x18/0x20
[  788.937061]  el0_svc_common.constprop.1+0xfc/0x168
[  788.941843]  el0_svc_handler+0x44/0x70
[  788.945583]  el0_svc+0x8/0xc
[  795.618344] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 195971 jiffies s: 21 root: 0x2/.
[  795.629067] rcu: blocking rcu_node structures:
[  795.634072] Task dump for CPU 1:
[  795.637567] systemd-udevd   R  running task        0   122    119 0x0000002b
[  795.644842] Call trace:
[  795.647530]  __switch_to+0xcc/0x118
[  795.651238]  0xffffffc010808960

After a poweroff and some reboots this no longer seems to happen, but it is strange that the mutex is locked on entering mtk_pcie_irq_domain_alloc.

mutex_lock is only used in mtk_pcie_irq_domain_alloc / mtk_pcie_irq_domain_free and there is no obvious problem with a missing unlock; mutex_init initializes to the unlocked state. So it looks like the mutex is locked somewhere else, but the struct holding the lock is defined in pcie-mediatek.c, so IMHO it can't be used anywhere else.

Can you send an email with the description to ryder.lee@mediatek.com?