okay, but the “mdio timed out” is still active. Isn’t it something related?
I usually do not debug drivers/net/mdio/*
so donnow…
Is dumping the stack possible to fix the i2c loop?
[109103.153288] i2c_get_dma_safe_msg_buf:2646: i2c-core: using bounce buffer for addr=0x51, len=6
as the length isn’t always the same, but line number is, the backtrace would be nice here. Can I do that using dynamic debugging?
This is read using i2c and the i2c-to-rollball bridge and it seems to work.
well, I’m a lucky boy, I guess, we have sysrq triggers in a loop:
ip link set sfp-lan up && sleep 5 && echo l > /proc/sysrq-trig
ger && sleep 5 && echo l > /proc/sysrq-trigger && ip link set sfp-lan down
and bingo:
[109833.532118] vprintk_emit+0x20c/0x2a8
[109833.532122] vprintk_default+0x38/0x48
[109833.532125] vprintk+0x40/0xa0
[109833.532129] _printk+0x54/0x68
[109833.532133] __dynamic_pr_debug+0x7c/0xa0
[109833.532139] i2c_get_dma_safe_msg_buf+0xcc/0x100
[109833.532146] mtk_i2c_transfer+0x384/0x1118
[109833.532149] __i2c_transfer+0x1ac/0x600
[109833.532154] i2c_mux_alloc+0x530/0x5f8 [i2c_mux]
[109833.532160] __i2c_transfer+0x1ac/0x600
[109833.532165] mdio_i2c_alloc+0x240/0xe98 [mdio_i2c]
[109833.532170] mdio_i2c_alloc+0x3ac/0xe98 [mdio_i2c]
[109833.532173] mdio_i2c_alloc+0x560/0xe98 [mdio_i2c]
[109833.532176] __mdiobus_c45_read+0x3c/0x110
[109833.532181] mdiobus_c45_read+0x4c/0x78
[109833.532184] get_phy_c45_ids+0xe0/0x218
[109833.532187] get_phy_device+0x4c/0x150
[109833.532191] init_module+0x1f4/0xbe8 [sfp]
[109833.532194] cleanup_module+0x1d50/0x3f10 [sfp]
[109833.532197] cleanup_module+0x2b3c/0x3f10 [sfp]
[109833.532200] process_one_work+0x178/0x3e0
[109833.532206] worker_thread+0x290/0x450
[109833.532210] kthread+0xe0/0xf0
[109833.532214] ret_from_fork+0x10/0x20
so it does not tell a lot, but at least we got something. A decode_stacktrace gives me:
[109833.532118] vprintk_emit (kernel/printk/printk.c:2048 kernel/printk/printk.c:2406)
[109833.532122] vprintk_default (kernel/printk/printk.c:2423)
[109833.532125] vprintk (kernel/printk/printk_safe.c:74)
[109833.532129] _printk (kernel/printk/printk.c:2436)
[109833.532133] __dynamic_pr_debug (lib/dynamic_debug.c:881)
[109833.532139] i2c_get_dma_safe_msg_buf (drivers/i2c/i2c-core-base.c:2649)
[109833.532146] mtk_i2c_transfer (drivers/i2c/busses/i2c-mt65xx.c:1150 drivers/i2c/busses/i2c-mt65xx.c:1292)
[109833.532149] __i2c_transfer (drivers/i2c/i2c-core-base.c:2252)
[109833.532154] i2c_mux_alloc (drivers/i2c/i2c-mux.c:51) i2c_mux
[109833.532160] __i2c_transfer (drivers/i2c/i2c-core-base.c:2252)
[109833.532165] mdio_i2c_alloc (drivers/net/mdio/mdio-i2c.c:138 drivers/net/mdio/mdio-i2c.c:215) mdio_i2c
[109833.532170] mdio_i2c_alloc (drivers/net/mdio/mdio-i2c.c:259) mdio_i2c
[109833.532173] mdio_i2c_alloc (drivers/net/mdio/mdio-i2c.c:316) mdio_i2c
[109833.532176] __mdiobus_c45_read (drivers/net/phy/mdio_bus.c:992)
[109833.532181] mdiobus_c45_read (drivers/net/phy/mdio_bus.c:1133)
[109833.532184] get_phy_c45_ids (drivers/net/phy/phy_device.c:942)
[109833.532187] get_phy_device (drivers/net/phy/phy_device.c:1054)
[109833.532191] init_module (drivers/net/phy/sfp.c:1820) sfp
[109833.532194] cleanup_module (drivers/net/phy/sfp.c:1956 drivers/net/phy/sfp.c:2667 drivers/net/phy/sfp.c:2748) sfp
[109833.532197] cleanup_module (drivers/net/phy/sfp.c:2760 drivers/net/phy/sfp.c:2892) sfp
[109833.532200] process_one_work (./arch/arm64/include/asm/jump_label.h:32 ./include/linux/jump_label.h:207 ./include/trace/events/workqueue.h:110 kernel/workqueue.c:3234)
[109833.532206] worker_thread (kernel/workqueue.c:3304 (discriminator 2) kernel/workqueue.c:3391 (discriminator 2))
[109833.532210] kthread (kernel/kthread.c:389)
[109833.532214] ret_from_fork (arch/arm64/kernel/entry.S:863)
time for some investigation
Is your power supply powerful enough? Copper sfp need more power than fiber sfp…i guess it is power related. Maybe use another power supply.
yes, I’ve been thinking about this as well … I already bought a power supply of 60W (5A 12V) to feed the board, it should be enought to feed a board with only one of 2 3.0W modules and all WiFi cards deactivated, right?
sorry Frank, a simple “but”: the RJ-45 cable isn’t even inserted
EDIT
I found another stacktrace while the module is initializing but i2c debugging is turned off (turning on everything in mdio/ phy/ etc):
[ 8958.529701] desc_read_finalized_seq (kernel/printk/printk_ringbuffer.c:1893)
[ 8958.529706] _prb_read_valid (kernel/printk/printk_ringbuffer.c:1928 kernel/printk/printk_ringbuffer.c:2113)
[ 8958.529710] prb_read_valid (kernel/printk/printk_ringbuffer.c:2183)
[ 8958.529714] printk_get_next_message (kernel/printk/printk.c:2961)
[ 8958.529717] console_flush_all (kernel/printk/printk.c:3044 (discriminator 1) kernel/printk/printk.c:3178 (discriminator 1))
[ 8958.529720] console_unlock (kernel/printk/printk.c:3237 kernel/printk/printk.c:3277)
[ 8958.529723] vprintk_emit (kernel/printk/printk.c:2408)
[ 8958.529726] dev_vprintk_emit (drivers/base/core.c:4963)
[ 8958.529731] dev_printk_emit (drivers/base/core.c:4978)
[ 8958.529736] __dynamic_dev_dbg (lib/dynamic_debug.c:903 (discriminator 3))
[ 8958.529741] mdio_i2c_alloc (drivers/net/mdio/mdio-i2c.c:268) mdio_i2c
[ 8958.529746] mdio_i2c_alloc (drivers/net/mdio/mdio-i2c.c:316) mdio_i2c
[ 8958.529750] __mdiobus_c45_read (drivers/net/phy/mdio_bus.c:992)
[ 8958.529755] mdiobus_c45_read (drivers/net/phy/mdio_bus.c:1133)
[ 8958.529759] get_phy_c45_ids (drivers/net/phy/phy_device.c:947)
[ 8958.529762] get_phy_device (drivers/net/phy/phy_device.c:1054)
[ 8958.529765] init_module (drivers/net/phy/sfp.c:1820) sfp
[ 8958.529769] cleanup_module (drivers/net/phy/sfp.c:1956 drivers/net/phy/sfp.c:2667 drivers/net/phy/sfp.c:2748) sfp
[ 8958.529772] cleanup_module (drivers/net/phy/sfp.c:2760 drivers/net/phy/sfp.c:2892) sfp
[ 8958.529775] process_one_work (./arch/arm64/include/asm/jump_label.h:32 ./include/linux/jump_label.h:207 ./include/trace/events/workqueue.h:110 kernel/workqueue.c:3234)
[ 8958.529781] worker_thread (kernel/workqueue.c:3304 (discriminator 2) kernel/workqueue.c:3391 (discriminator 2))
[ 8958.529785] kthread (kernel/kthread.c:389)
[ 8958.529788] ret_from_fork (arch/arm64/kernel/entry.S:863)
so my theory would be: The mdio bus tries to access the i2c device but it doesn’t respond in time. this may be a locking problem. When turning on i2c debugging, it gets the answer because i2c debugging obviously slows down the whole thing and the chances of a locking problem are nonexistent. Obviously, this is not a soft lockup so the kernel lock detector won’t see it. What do you guys think?
it looks like this code in mdio-i2c is the issue:
/* By experiment it takes up to 70 ms to access a register for these
* SFPs. Sleep 20ms between iterations and try 10 times.
*/
i = 10;
do {
msleep(20);
ret = i2c_transfer_rollball(i2c, msgs, ARRAY_SIZE(msgs));
if (ret)
return ret;
if (*res == ROLLBALL_CMD_DONE)
return 0;
} while (i-- > 0);
dev_dbg(&bus->dev, "poll timed out\n");
so, what should I do?
EDIT I added the following hack:
--- a/drivers/net/mdio/mdio-i2c.c 2025-09-18 11:20:08.632502973 +0200
+++ b/drivers/net/mdio/mdio-i2c.c 2025-09-18 11:19:17.259796840 +0200
@@ -261,6 +261,7 @@
if (*res == ROLLBALL_CMD_DONE)
return 0;
+ dev_dbg(&bus->dev, "rollball command not done %d\n", *res);
} while (i-- > 0);
dev_dbg(&bus->dev, "poll timed out\n");
and it helps (=slows down so aquantia phy is detected), but ends up in an endless loop of read errors:
[ 3364.116730] [7858] mdio_i2c:i2c_rollball_mii_poll:264: mdio_bus i2c:sfp2: rollball command not done 2
[ 3364.156731] [7858] mdio_i2c:i2c_rollball_mii_poll:264: mdio_bus i2c:sfp2: rollball command not done 2
[ 3364.226734] [7858] mdio_i2c:i2c_rollball_mii_poll:264: mdio_bus i2c:sfp2: rollball command not done 2
[ 3364.266732] [7858] mdio_i2c:i2c_rollball_mii_poll:264: mdio_bus i2c:sfp2: rollball command not done 2
[ 3364.306735] [7858] mdio_i2c:i2c_rollball_mii_poll:264: mdio_bus i2c:sfp2: rollball command not done 2
EDIT2
I think I found it: during the initialization, in get_phy_c45_ids() a lot of probes are being sent. Is it possible the bus locks up due to a large amount of probes?
EDIT3
And this made it work:
--- a/drivers/net/mdio/mdio-i2c.c 2025-09-18 11:20:08.632502973 +0200
+++ b/drivers/net/mdio/mdio-i2c.c 2025-09-18 11:19:17.259796840 +0200
@@ -261,6 +261,7 @@
if (*res == ROLLBALL_CMD_DONE)
return 0;
+ msleep(50);
} while (i-- > 0);
dev_dbg(&bus->dev, "poll timed out\n");
so simply slowing down the mdio-i2c interface made it work …
Have you tried increasing the existing msleep (e.g. 20 => 50) instead of adding additional delay?
I assume it will, that would be obvious,
But, wouldn’t it be better to make a msleep(x) function depend on the iteration? eg: the msleep(20+ 10*(10-i)) here, is that a good idea? or pretty dangerous for other i2c devices I’m currently not testing?
edit corrected typo
I didn’t try adding a few i’s yet, but the idea of this seems to work as well:
- msleep(20);
+ msleep(20+(10*(10-i)));
isn’t this the best way to not-stress-sensitive-devices?
You can send your patch as RFC to netdev, or ask Marek privately first.