okay, but the “mdio timed out” is still active. Isn’t it something related?
I usually do not debug drivers/net/mdio/* so donnow…
Is dumping the stack possible to fix the i2c loop?
[109103.153288] i2c_get_dma_safe_msg_buf:2646: i2c-core: using bounce buffer for addr=0x51, len=6
as the length isn’t always the same, but line number is, the backtrace would be nice here. Can I do that using dynamic debugging?
This is read using i2c and the i2c-to-rollball bridge and it seems to work.
well, I’m a lucky boy, I guess, we have sysrq triggers in a loop:
ip link set sfp-lan up && sleep 5 && echo l > /proc/sysrq-trig
ger && sleep 5 && echo l > /proc/sysrq-trigger && ip link set sfp-lan down
and bingo:
[109833.532118] vprintk_emit+0x20c/0x2a8
[109833.532122] vprintk_default+0x38/0x48
[109833.532125] vprintk+0x40/0xa0
[109833.532129] _printk+0x54/0x68
[109833.532133] __dynamic_pr_debug+0x7c/0xa0
[109833.532139] i2c_get_dma_safe_msg_buf+0xcc/0x100
[109833.532146] mtk_i2c_transfer+0x384/0x1118
[109833.532149] __i2c_transfer+0x1ac/0x600
[109833.532154] i2c_mux_alloc+0x530/0x5f8 [i2c_mux]
[109833.532160] __i2c_transfer+0x1ac/0x600
[109833.532165] mdio_i2c_alloc+0x240/0xe98 [mdio_i2c]
[109833.532170] mdio_i2c_alloc+0x3ac/0xe98 [mdio_i2c]
[109833.532173] mdio_i2c_alloc+0x560/0xe98 [mdio_i2c]
[109833.532176] __mdiobus_c45_read+0x3c/0x110
[109833.532181] mdiobus_c45_read+0x4c/0x78
[109833.532184] get_phy_c45_ids+0xe0/0x218
[109833.532187] get_phy_device+0x4c/0x150
[109833.532191] init_module+0x1f4/0xbe8 [sfp]
[109833.532194] cleanup_module+0x1d50/0x3f10 [sfp]
[109833.532197] cleanup_module+0x2b3c/0x3f10 [sfp]
[109833.532200] process_one_work+0x178/0x3e0
[109833.532206] worker_thread+0x290/0x450
[109833.532210] kthread+0xe0/0xf0
[109833.532214] ret_from_fork+0x10/0x20
so it does not tell a lot, but at least we got something. A decode_stacktrace gives me:
[109833.532118] vprintk_emit (kernel/printk/printk.c:2048 kernel/printk/printk.c:2406)
[109833.532122] vprintk_default (kernel/printk/printk.c:2423)
[109833.532125] vprintk (kernel/printk/printk_safe.c:74)
[109833.532129] _printk (kernel/printk/printk.c:2436)
[109833.532133] __dynamic_pr_debug (lib/dynamic_debug.c:881)
[109833.532139] i2c_get_dma_safe_msg_buf (drivers/i2c/i2c-core-base.c:2649)
[109833.532146] mtk_i2c_transfer (drivers/i2c/busses/i2c-mt65xx.c:1150 drivers/i2c/busses/i2c-mt65xx.c:1292)
[109833.532149] __i2c_transfer (drivers/i2c/i2c-core-base.c:2252)
[109833.532154] i2c_mux_alloc (drivers/i2c/i2c-mux.c:51) i2c_mux
[109833.532160] __i2c_transfer (drivers/i2c/i2c-core-base.c:2252)
[109833.532165] mdio_i2c_alloc (drivers/net/mdio/mdio-i2c.c:138 drivers/net/mdio/mdio-i2c.c:215) mdio_i2c
[109833.532170] mdio_i2c_alloc (drivers/net/mdio/mdio-i2c.c:259) mdio_i2c
[109833.532173] mdio_i2c_alloc (drivers/net/mdio/mdio-i2c.c:316) mdio_i2c
[109833.532176] __mdiobus_c45_read (drivers/net/phy/mdio_bus.c:992)
[109833.532181] mdiobus_c45_read (drivers/net/phy/mdio_bus.c:1133)
[109833.532184] get_phy_c45_ids (drivers/net/phy/phy_device.c:942)
[109833.532187] get_phy_device (drivers/net/phy/phy_device.c:1054)
[109833.532191] init_module (drivers/net/phy/sfp.c:1820) sfp
[109833.532194] cleanup_module (drivers/net/phy/sfp.c:1956 drivers/net/phy/sfp.c:2667 drivers/net/phy/sfp.c:2748) sfp
[109833.532197] cleanup_module (drivers/net/phy/sfp.c:2760 drivers/net/phy/sfp.c:2892) sfp
[109833.532200] process_one_work (./arch/arm64/include/asm/jump_label.h:32 ./include/linux/jump_label.h:207 ./include/trace/events/workqueue.h:110 kernel/workqueue.c:3234)
[109833.532206] worker_thread (kernel/workqueue.c:3304 (discriminator 2) kernel/workqueue.c:3391 (discriminator 2))
[109833.532210] kthread (kernel/kthread.c:389)
[109833.532214] ret_from_fork (arch/arm64/kernel/entry.S:863)
time for some investigation ![]()
Is your power supply powerful enough? Copper sfp need more power than fiber sfp…i guess it is power related. Maybe use another power supply.
yes, I’ve been thinking about this as well … I already bought a power supply of 60W (5A 12V) to feed the board, it should be enought to feed a board with only one of 2 3.0W modules and all WiFi cards deactivated, right?
sorry Frank, a simple “but”: the RJ-45 cable isn’t even inserted
EDIT
I found another stacktrace while the module is initializing but i2c debugging is turned off (turning on everything in mdio/ phy/ etc):
[ 8958.529701] desc_read_finalized_seq (kernel/printk/printk_ringbuffer.c:1893)
[ 8958.529706] _prb_read_valid (kernel/printk/printk_ringbuffer.c:1928 kernel/printk/printk_ringbuffer.c:2113)
[ 8958.529710] prb_read_valid (kernel/printk/printk_ringbuffer.c:2183)
[ 8958.529714] printk_get_next_message (kernel/printk/printk.c:2961)
[ 8958.529717] console_flush_all (kernel/printk/printk.c:3044 (discriminator 1) kernel/printk/printk.c:3178 (discriminator 1))
[ 8958.529720] console_unlock (kernel/printk/printk.c:3237 kernel/printk/printk.c:3277)
[ 8958.529723] vprintk_emit (kernel/printk/printk.c:2408)
[ 8958.529726] dev_vprintk_emit (drivers/base/core.c:4963)
[ 8958.529731] dev_printk_emit (drivers/base/core.c:4978)
[ 8958.529736] __dynamic_dev_dbg (lib/dynamic_debug.c:903 (discriminator 3))
[ 8958.529741] mdio_i2c_alloc (drivers/net/mdio/mdio-i2c.c:268) mdio_i2c
[ 8958.529746] mdio_i2c_alloc (drivers/net/mdio/mdio-i2c.c:316) mdio_i2c
[ 8958.529750] __mdiobus_c45_read (drivers/net/phy/mdio_bus.c:992)
[ 8958.529755] mdiobus_c45_read (drivers/net/phy/mdio_bus.c:1133)
[ 8958.529759] get_phy_c45_ids (drivers/net/phy/phy_device.c:947)
[ 8958.529762] get_phy_device (drivers/net/phy/phy_device.c:1054)
[ 8958.529765] init_module (drivers/net/phy/sfp.c:1820) sfp
[ 8958.529769] cleanup_module (drivers/net/phy/sfp.c:1956 drivers/net/phy/sfp.c:2667 drivers/net/phy/sfp.c:2748) sfp
[ 8958.529772] cleanup_module (drivers/net/phy/sfp.c:2760 drivers/net/phy/sfp.c:2892) sfp
[ 8958.529775] process_one_work (./arch/arm64/include/asm/jump_label.h:32 ./include/linux/jump_label.h:207 ./include/trace/events/workqueue.h:110 kernel/workqueue.c:3234)
[ 8958.529781] worker_thread (kernel/workqueue.c:3304 (discriminator 2) kernel/workqueue.c:3391 (discriminator 2))
[ 8958.529785] kthread (kernel/kthread.c:389)
[ 8958.529788] ret_from_fork (arch/arm64/kernel/entry.S:863)
so my theory would be: The mdio bus tries to access the i2c device but it doesn’t respond in time. this may be a locking problem. When turning on i2c debugging, it gets the answer because i2c debugging obviously slows down the whole thing and the chances of a locking problem are nonexistent. Obviously, this is not a soft lockup so the kernel lock detector won’t see it. What do you guys think?
it looks like this code in mdio-i2c is the issue:
/* By experiment it takes up to 70 ms to access a register for these
* SFPs. Sleep 20ms between iterations and try 10 times.
*/
i = 10;
do {
msleep(20);
ret = i2c_transfer_rollball(i2c, msgs, ARRAY_SIZE(msgs));
if (ret)
return ret;
if (*res == ROLLBALL_CMD_DONE)
return 0;
} while (i-- > 0);
dev_dbg(&bus->dev, "poll timed out\n");
so, what should I do?
EDIT I added the following hack:
--- a/drivers/net/mdio/mdio-i2c.c 2025-09-18 11:20:08.632502973 +0200
+++ b/drivers/net/mdio/mdio-i2c.c 2025-09-18 11:19:17.259796840 +0200
@@ -261,6 +261,7 @@
if (*res == ROLLBALL_CMD_DONE)
return 0;
+ dev_dbg(&bus->dev, "rollball command not done %d\n", *res);
} while (i-- > 0);
dev_dbg(&bus->dev, "poll timed out\n");
and it helps (=slows down so aquantia phy is detected), but ends up in an endless loop of read errors:
[ 3364.116730] [7858] mdio_i2c:i2c_rollball_mii_poll:264: mdio_bus i2c:sfp2: rollball command not done 2
[ 3364.156731] [7858] mdio_i2c:i2c_rollball_mii_poll:264: mdio_bus i2c:sfp2: rollball command not done 2
[ 3364.226734] [7858] mdio_i2c:i2c_rollball_mii_poll:264: mdio_bus i2c:sfp2: rollball command not done 2
[ 3364.266732] [7858] mdio_i2c:i2c_rollball_mii_poll:264: mdio_bus i2c:sfp2: rollball command not done 2
[ 3364.306735] [7858] mdio_i2c:i2c_rollball_mii_poll:264: mdio_bus i2c:sfp2: rollball command not done 2
EDIT2
I think I found it: during the initialization, in get_phy_c45_ids() a lot of probes are being sent. Is it possible the bus locks up due to a large amount of probes?
EDIT3
And this made it work:
--- a/drivers/net/mdio/mdio-i2c.c 2025-09-18 11:20:08.632502973 +0200
+++ b/drivers/net/mdio/mdio-i2c.c 2025-09-18 11:19:17.259796840 +0200
@@ -261,6 +261,7 @@
if (*res == ROLLBALL_CMD_DONE)
return 0;
+ msleep(50);
} while (i-- > 0);
dev_dbg(&bus->dev, "poll timed out\n");
so simply slowing down the mdio-i2c interface made it work …
Have you tried increasing the existing msleep (e.g. 20 => 50) instead of adding additional delay?
I assume it will, that would be obvious,
But, wouldn’t it be better to make a msleep(x) function depend on the iteration? eg: the msleep(20+ 10*(10-i)) here, is that a good idea? or pretty dangerous for other i2c devices I’m currently not testing?
edit corrected typo
I didn’t try adding a few i’s yet, but the idea of this seems to work as well:
- msleep(20);
+ msleep(20+(10*(10-i)));
isn’t this the best way to not-stress-sensitive-devices?
You can send your patch as RFC to netdev, or ask Marek privately first.
I tried sending it as an RFC, I didn’t get too much useful info. Russel complained about my mentality blaming i2c on embedded devices. I thought something like “yeah, fine, if nobody else saw it before, it will likely be just me in an exotic config, not worth trying to convince vanilla devs”.
What I do know is that @frank-w was probably right: there is a power problem. When using 2 minipcie 3.3V adapters, a M.2 3.3V card and adding those 2 SFP+ SR modules on my R4, unexplainable errors occur: PCIE AER errors, and the link state of the SFP+ module going up and down again. When removing one of those 2 SFP+ cages, the error is gone. So, I don’t know what I should do with this patch, but should I send it to openwrt devs? Anyway, for anybody having issues with cheap-ass short range SFP+ modules, this patch may work for you:
--- a/drivers/net/mdio/mdio-i2c.c 2025-09-19 16:35:52.000000000 +0200
+++ b/drivers/net/mdio/mdio-i2c.c 2025-09-27 14:11:59.406323627 +0200
@@ -248,12 +248,15 @@ static int i2c_rollball_mii_poll(struct mii_bus *bus, int bus_addr, u8 *buf,
msgs[1].len = len;
msgs[1].buf = res;
- /* By experiment it takes up to 70 ms to access a register for these
- * SFPs. Sleep 20ms between iterations and try 10 times.
+ /* By experiment it takes up to 70 ms
+ * to access a register for normal SFPs.
+ * Sleep at least 20ms between iterations and try 10 times.
+ * Slower modules on embedded devices may need more.
+ * Dynamically increase sleep to avoid wasted i2c_transfer_rollball() calls
*/
i = 10;
do {
- msleep(20);
+ msleep(20+5*(10-i));
ret = i2c_transfer_rollball(i2c, msgs, ARRAY_SIZE(msgs));
if (ret)
If it is power-related imho a software patch is wrong.
Normally there are specifications that a device have to be ready after some time and drivers should follow that…if it takes longer than hw has to be fixed.
if you can, be my guest! I tried everything in my power: using ASPM, disabling PHYs of the WiFi cards, setting all power/control of all devices to “auto”, … is it possible the board is simply exhaused by my experiments and the 3.3V line doesn’t work properly anymore?
Which power supply are you using?
alteq altq-acdc-120, providing 90W, according to its label
Hey folks, On mtk feed build (because vanilla openwrt does not provide any logs), in SFP with rtl8261n I have such message:
~ ❯ grep sfp2 ~/sfp-issue 58s 16:23:05
[ 13.460606] sfp sfp2: Host maximum power 3.0W
[ 13.465052] sfp sfp2: tx disable 1 -> 1
[ 13.468900] sfp sfp2: SM: enter empty:detached:down event insert
[ 13.474923] sfp sfp2: SM: exit probe:detached:down
[ 13.479957] sfp sfp2: SM: enter probe:detached:down event dev_attach
[ 13.486410] sfp sfp2: SM: exit probe:down:down
[ 13.807892] sfp sfp2: SM: enter probe:down:down event timeout
[ 13.836464] sfp sfp2: module OEM SFP-10G-T rev A sn 2025xx8 dc 25xx6
[ 13.857507] sfp sfp2: module OEM SFP-10G-T rev A has been found in the quirk list
[ 13.866980] sfp sfp2: sfp: support mode 00,00000000,00000800,00007040
[ 13.873410] sfp sfp2: SM: exit present:down:down
[ 13.884134] sfp sfp2: SM: enter present:down:down event los_high
[ 13.890128] sfp sfp2: SM: exit present:down:down
[ 20.385470] sfp sfp2: SM: enter present:down:down event dev_up
[ 20.391315] sfp sfp2: tx disable 1 -> 0
[ 20.395310] sfp sfp2: SM: exit present:up:wait
[ 20.465077] sfp sfp2: SM: enter present:up:wait event timeout
[ 20.470840] sfp sfp2: probing phy device through the [MDIO_I2C_ROLLBALL] protocol
[ 47.876144] sfp sfp2: no PHY detected, 24 tries left
[ 47.881101] sfp sfp2: SM: exit present:up:init_phy
[ 47.927419] sfp sfp2: SM: enter present:up:init_phy event los_low
[ 47.933506] sfp sfp2: SM: exit present:up:init_phy
[ 48.957516] sfp sfp2: SM: enter present:up:init_phy event timeout
[ 76.306143] sfp sfp2: no PHY detected, 23 tries left
[ 76.311115] sfp sfp2: SM: exit present:up:init_phy
[ 76.316275] sfp sfp2: SM: enter present:up:init_phy event los_high
[ 76.322456] sfp sfp2: SM: exit present:up:init_phy
[ 77.354918] sfp sfp2: SM: enter present:up:init_phy event timeout
[ 104.726069] sfp sfp2: no PHY detected, 22 tries left
[ 104.731041] sfp sfp2: SM: exit present:up:init_phy
[ 105.216643] sfp sfp2: SM: enter present:up:init_phy event los_low
[ 105.230567] sfp sfp2: SM: exit present:up:init_phy
[ 105.754093] sfp sfp2: SM: enter present:up:init_phy event timeout
[ 133.105946] sfp sfp2: no PHY detected, 21 tries left
[ 133.110919] sfp sfp2: SM: exit present:up:init_phy
[ 134.153910] sfp sfp2: SM: enter present:up:init_phy event timeout
[ 161.495832] sfp sfp2: no PHY detected, 20 tries left
[ 161.500803] sfp sfp2: SM: exit present:up:init_phy
[ 162.553797] sfp sfp2: SM: enter present:up:init_phy event timeout
[ 189.905741] sfp sfp2: no PHY detected, 19 tries left
[ 189.910714] sfp sfp2: SM: exit present:up:init_phy
[ 190.953704] sfp sfp2: SM: enter present:up:init_phy event timeout
[ 218.275650] sfp sfp2: no PHY detected, 18 tries left
[ 218.280623] sfp sfp2: SM: exit present:up:init_phy
[ 219.353615] sfp sfp2: SM: enter present:up:init_phy event timeout
can someone give a hint where to check what’s going on?