I think we are now facing the same issue where the EEPROM just isn’t available. Both frank-w and ericwoud have suggested adding a delay during the module initialization. If you’re keen to try that, can you modify sfp.c to add this snippet after the quirk fixup?
This should force the quirk to be applied regardless of whether the EEPROM is available. If it is applied, you should hopefully see the “hack:…” message in your dmesg.
(I’m also surprised that your sfp still works without a moddef0 gpio. No idea how that is happening but it’s good that there’s power and modules are still getting probed)
With the above, and with moddef0 manually set low, when I boot up with a DAC cable in sfp1, the driver detects it and correctly reads the EEPROM of the DAC cable.
sfp sfp1: No tx_disable pin: SFP modules will always be emitting.
At this point, if I swap out the DAC cable for the S800E, it just works without further kernel messages or errors. The link is up and everything works as it should, but ethtool -m eth2 shows nothing. So the EEPROM is not probed or loaded, yet the module works.
ethtool -m eth2
netlink error: No such device or address
ethtool eth2
Settings for eth2:
Supported ports: [ ]
Supported link modes: 2500baseX/Full
1000baseX/Full
10000baseCR/Full
Supported pause frame use: Symmetric Receive-only
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 10000baseCR/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Other
PHYAD: 0
Transceiver: internal
Current message level: 0x000000ff (255)
drv probe link timer ifdown ifup rx_err tx_err
Link detected: yes
I will patch a build with a default startup delay as suggested and find the time and opportunity to test it.
The S800E XGS-PON is my main fiber on another setup. The bpi-r4 is more of a test/lab device.
During boot, for sfp2 the EEPROM is displayed and then the kernel warning “hack: applying slow gpon” appears (which means the patch is indeed applied). But for sfp1 it is still the same “please wait, module slow to respond”, then further down “failed to read EEPROM: -ENXIO”. sfp1 does not display the slow gpon warning, meaning the EEPROM has to be read first.
So no go on that front.
This patch set, “Allow slow to initialise GPON modules to work”, gives a bit more detail on the boot process and on reading the EEPROM of the SFP:
https://lore.kernel.org/all/[email protected]/T/
I tried increasing the probe interval from 5s to 10s and the retries from 12 to 50.
But even with the increased probe interval and retries, it still failed to read the EEPROM. I can actually see the probe attempt count climbing to 50 before it fails.
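To put rough numbers on that, here is a minimal sketch of the probe budget. It assumes a fast phase of 10 probes at 100ms (what the logic trace elsewhere in this thread shows) followed by the slow phase described above, and it ignores any initial delay before the first probe. The function name is made up for illustration and is not taken from sfp.c.

```python
def probe_window_ms(fast_retries, fast_interval_ms, slow_retries, slow_interval_ms):
    """Total time spent retrying: a fast phase followed by a slow phase."""
    return fast_retries * fast_interval_ms + slow_retries * slow_interval_ms


# Assumed fast phase: 10 probes at 100 ms (from the logic trace, not from sfp.c).
stock_ms = probe_window_ms(10, 100, 12, 5000)     # 12 slow retries at 5 s
patched_ms = probe_window_ms(10, 100, 50, 10000)  # 50 slow retries at 10 s

print(f"stock:   ~{stock_ms / 1000:.0f} s")    # ~61 s
print(f"patched: ~{patched_ms / 1000:.0f} s")  # ~501 s
```

So the patched build waits over eight minutes without a single acknowledged read, which suggests the module is not merely slow to respond.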
To me, the issue now seems to be the actual probing of the EEPROM (wrong address? wrong EEPROM response?). Maybe we are looking in the wrong place? Maybe it is an i2c issue on sfp1?
So, with just the moddef0 gpio removed from the device tree and the pin manually pulled low, I was able to provide power to the S800E. For testing: if the system is booted with a DAC cable in sfp1 and everything is detected correctly, quickly hot-swapping to the S800E works. But a reboot brings back the same failed-to-read-EEPROM error.
However I get nothing with a S800E on the same port.
i2cdetect -y 3
0 1 2 3 4 5 6 7 8 9 a b c d e f
00: -- -- -- -- -- -- -- --
10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
20: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
40: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
50: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
60: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
70: UU -- -- -- -- -- -- --
i2cdump -y 3 0x50
No size specified (using byte-data access)
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
00: XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX XXXXXXXXXXXXXXXX
10: XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX XXXXXXXXXXXXXXXX
Any ideas what’s happening here, and what the next steps for troubleshooting are?
I did not remove R106. As stated above, I removed the moddef0 gpio from the sfp1 device tree node, then manually changed its direction to out and pulled it low. This was the manual way to provide power to the SFP. Details are a few scrolls up. Yes, the module is fully functional if I hot-swap, but it fails when I reboot.
The S800E is providing a 10G fiber connection on another setup as my primary internet connection.
My next step is to telnet or ssh into the S800E and see what’s up with it.
From what I understand of the sfp.c code, tx-disable is held asserted (output high) until the module probe succeeds, as this is supposed to power down the laser in the SFP module. I can remove the line below from the sfp1 device node to free up the pin and manually pull it low.
tx-disable-gpios = <&pio 70 GPIO_ACTIVE_HIGH>;
But do you think this will help? The module primarily fails to work because of the failed module probe attempts.
I eventually would like to trace the same for the bpir4 to compare if there are any major differences. In the meantime, an idea to try is to emulate the signals here by hand and see if the module starts up properly?
Edit: uploading the raw pulseview trace here:
mlx_boot_export.sr (215.8 KB)
I don’t have any answers, but these are my observations:
After power-on, LOS, tx-fault and moddef0 go low after about 25s, which is almost identical to the working mellanox capture. It seems the S800E starts up as long as there is 3.3V, regardless of tx-disable.
tx-disable is mostly high during startup, then it goes low at about 24s. This is unexpected to me as I would think that tx-disable will only be toggled after moddef0 state has changed. Also for comparison, tx-disable is almost always held at GND on the mellanox.
rate select looks like it’s floating and generating a lot of noise, which is probably OK since “floating” is a valid state
However, tx-fault also looks noisy during the first ~20s, which is strange because it is supposed to be pulled up to 3v3 via a 4.7K resistor (R96). Maybe it is affected by tx-disable being high?
After moddef0 goes low, the sfp driver can be seen probing quickly (10x 100ms), then switching to probes at 5s intervals. None of the i2c requests are acknowledged. Eventually it times out with the EEPROM error.
The SCL clock rate looks okay at about 100k, in fact the mellanox i2c runs slightly faster at about 120k.
At the time of capture, the only active hardware modification is the mosfet bypass. The previously removed R106 has been bridged. Pulseview capture here:
bpir4_boot_annotate.sr (631.5 KB)
The EEPROM gets detected, and the sfp driver halts after detecting a checksum failure. Sure enough, after adding up all the bytes to verify, the original checksum from the S800E is indeed incorrect.
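For anyone who wants to repeat the check on their own dump, here is a minimal sketch. It assumes the usual SFF-8472 layout: CC_BASE at 0x3F is the low 8 bits of the sum of bytes 0x00-0x3E, and CC_EXT at 0x5F is the low 8 bits of the sum of bytes 0x40-0x5E. The function name and the dummy dump are made up for illustration; this is not the S800E data.

```python
def sfp_checksums(eeprom: bytes) -> dict:
    """Compare stored vs. expected checksums of an SFP 0x50 (A0h) page dump."""
    if len(eeprom) < 96:
        raise ValueError("need at least the first 96 bytes of the 0x50 page")
    return {
        "cc_base": {
            "stored": eeprom[0x3F],
            "expected": sum(eeprom[0x00:0x3F]) & 0xFF,  # bytes 0x00..0x3E
        },
        "cc_ext": {
            "stored": eeprom[0x5F],
            "expected": sum(eeprom[0x40:0x5F]) & 0xFF,  # bytes 0x40..0x5E
        },
    }


# Dummy 96-byte dump (NOT real module data), with both checksum bytes zeroed.
dump = bytearray(range(96))
dump[0x3F] = 0x00
dump[0x5F] = 0x00

for name, cc in sfp_checksums(bytes(dump)).items():
    state = "ok" if cc["stored"] == cc["expected"] else "MISMATCH"
    print(f"{name}: stored {cc['stored']:02x}, expected {cc['expected']:02x} ({state})")
```

Feed it the raw bytes from a hex dump of address 0x50 to see which of the two checksums is off.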
I compared the values just in case the dump was wrong. I ran ethtool -m eth0 hex on on the mellanox, and compared with another dump generated from the raw i2c signals, and they both match correctly. At least this can be quirked into the sfp driver, assuming the i2c issue ever finds a solution.
Anyway, base structure and the extended structure both have checksum failures. I used i2csfp to fix them (thank you ericwoud)
root@OpenWrt:/tmp# ./i2csfp sfp1 eepromfix
Checksum 0x00-0x3e failed, set at 5c, but should be 98
Checksum 0x40-0x5e failed, set at 55, but should be f9
Error: i2c_transfer() failed: No such device or address
Error: i2c_transfer() failed: No such device or address
Error: i2c_transfer() failed: No such device or address
Error: i2c_transfer() failed: No such device or address
RollBall Password used: 0xfffffffa
Error: i2c_transfer() failed: No such device or address
Error: Cannot fill in password!
...
root@OpenWrt:/tmp# ./i2csfp sfp1 byte write 0x50 0x3F 0x98
root@OpenWrt:/tmp# ./i2csfp sfp1 byte write 0x50 0x5F 0xF9
root@OpenWrt:/tmp# ./i2csfp sfp1 restore
With the checksums fixed, the module goes further into the initialization!
[ 749.594781] sfp sfp1: module removed
[ 1060.452804] sfp sfp1: Host maximum power 3.0W
[ 1060.773374] sfp sfp1: module HUAWEI S800E rev sn 4857XXXXXXXXXXXX dc 24082602
[ 1060.803238] hwmon hwmon2: temp1_input not attached to any thermal zone
Strangely, it gets stuck waiting for LOS even though the line was already deasserted:
I changed that field from 1C to 1A, wrote it into the EEPROM, and fixed the checksum again:
root@OpenWrt:/tmp# ./i2csfp sfp1 byte write 0x50 0x41 0x1A
root@OpenWrt:/tmp# ./i2csfp sfp1 eepromfix
Checksum 0x00-0x3e matched 98
Checksum 0x40-0x5e failed, set at f9, but should be f7
...
root@OpenWrt:/tmp# ./i2csfp sfp1 byte write 0x50 0x5F 0xF7
root@OpenWrt:/tmp# ./i2csfp sfp1 restore
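A rough sketch of what that one-byte change means, assuming byte 0x41 is the low byte of the SFF-8472 options field and using the bit names from the Linux kernel's SFP_OPTIONS_* flags. The helper names are made up, and reading 1C as "LOS inverted" is my interpretation rather than something confirmed above.

```python
# Low byte of the options field (EEPROM byte 0x41), bit number -> flag name.
OPTION_BITS = {
    1: "LOS_NORMAL",
    2: "LOS_INVERTED",
    3: "TX_FAULT",
    4: "TX_DISABLE",
    5: "RATE_SELECT",
}


def decode_options(value):
    """List the option flags set in the low options byte."""
    return [name for bit, name in OPTION_BITS.items() if value & (1 << bit)]


def patched_checksum(old_cc, old_byte, new_byte):
    """Checksum after changing one covered byte (8-bit additive checksum)."""
    return (old_cc - old_byte + new_byte) & 0xFF


print(decode_options(0x1C))  # ['LOS_INVERTED', 'TX_FAULT', 'TX_DISABLE']
print(decode_options(0x1A))  # ['LOS_NORMAL', 'TX_FAULT', 'TX_DISABLE']
print(hex(patched_checksum(0xF9, 0x1C, 0x1A)))  # 0xf7
```

Under that reading, the original value advertises an inverted LOS signal, which would explain the driver waiting on LOS while the pin was already deasserted. The checksum moves by the same amount as the byte (f9 to f7), matching what eepromfix reported.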
With that, I was greeted with this:
[ 499.274793] mtk_soc_eth 15100000.ethernet eth2: Link is Up - 10Gbps/Full - flow control off
[ 499.274823] br-wan: port 2(eth2) entered blocking state
[ 499.288355] br-wan: port 2(eth2) entered forwarding state
root@OpenWrt:/tmp# ifconfig
br-lan Link encap:Ethernet HWaddr 9A:90:63:9B:8C:BA
inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::9890:63ff:fe9b:8cba/64 Scope:Link
inet6 addr: fd44:c9c7:ddf::1/60 Scope:Global
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:14993 errors:0 dropped:0 overruns:0 frame:0
TX packets:7007 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2186957 (2.0 MiB) TX bytes:2191255 (2.0 MiB)
br-wan Link encap:Ethernet HWaddr 9A:90:63:9B:8C:BB
inet addr:202.XX.XX.XXX Bcast:202.XX.XX.255 Mask:255.255.255.0
inet6 addr: fe80::9890:63ff:fe9b:8cbb/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3021 errors:0 dropped:57 overruns:0 frame:0
TX packets:884 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:286989 (280.2 KiB) TX bytes:408756 (399.1 KiB)
This permanently “breaks” my sfp1 WAN port, but runs on stock OpenWrt and survives reboots. It’s not a real solution but is very cool to see the link up on a bpi-r4.
Happy New Year to you too… great to see some soldering porn on the 1st day of the year.
The first question that comes to mind: since the checksum of the S800E is wrong, why didn’t the mellanox flag it? That also raises the question of how important the EEPROM data really is.
The S800E is an SFP+ XGS-PON ONU. So for all intents and purposes, we don’t really care what the EEPROM says as long as we get
kern.info kernel: [ 11.624377] mtk_soc_eth 15100000.ethernet eth2: switched to inband/10gbase-r link mode
So it behaves more like a fixed-phy than an SFP, which gave me a really stupid/crazy idea… going to test it out now.
Mellanox (specifically this connectx-3) runs off kmod-mlx4-core and does not appear to involve the current sfp driver that we are hacking on, so the implementation is probably different and possibly more forgiving.
IMO the EEPROM data is “nice to have” but not absolutely necessary. LOS is already exposed as a pin; the only useful omission is the DDMI data such as rx power level and module temperature. Most of the actual link configuration appears to be done in-band once the soc is aware that it is possible to do so.
Going to be fun to see what you are up to, if I had to guess, likely something to do with gmac1 on the devicetree? :^)
Going to test a few more things, then clean things up. Hotplug (as in plugging in and pulling out), reboots and cold power-on all work. No hardware hacks are needed, only minor dts changes.
Here is how to get the S800E working on the bpi-r4 without any hardware modifications, on openwrt-24.10. Current mainline may have switched around the sfp1/sfp2 and/or gmac/eth definitions, so do your homework before you start.
Step one:
Edit mt7988a-bananapi-bpi-r4.dtsi
Remove the following:
/* SFP1 cage (WAN) */
sfp1: sfp1 {
	compatible = "sff,sfp";
	i2c-bus = <&i2c_sfp1>;
	los-gpios = <&pio 54 GPIO_ACTIVE_HIGH>;
	mod-def0-gpios = <&pio 82 GPIO_ACTIVE_LOW>;
	tx-disable-gpios = <&pio 70 GPIO_ACTIVE_HIGH>;
	tx-fault-gpios = <&pio 69 GPIO_ACTIVE_HIGH>;
	rate-select0-gpios = <&pio 21 GPIO_ACTIVE_LOW>;
	maximum-power-milliwatt = <3000>;
};
Edit &gmac2 as follows. You must change phy-mode from usxgmii to 10gbase-r and remove the sfp reference.
&gmac2 {
	managed = "in-band-status";
	phy-mode = "10gbase-r";
	status = "okay";
};
Step two:
Add the following to /etc/rc.local to manually force power on to the S800E in the sfp1 slot at startup.
In my testing, hotplugging works and the link goes up and down. Warm restarts, sysupgrade and cold boot all work.
One thing to note: the S800E does run warm. It’s reading 57 degC after running for the last 12+ hours, up from 50 degC without the S800E module. This is with the factory casing (all cased up) and a passive heatsink on the processor, and it is also overclocked from 1.8GHz to 2.2GHz at the same default voltage. All within safety margins, I suppose.
Nice, excellent to see that it is possible to get a working link without any physical mods!
Exciting, as the bpi-r4 is probably one of the best performance-to-value options for 10g-capable routing (assuming that the bpi-r4 can concurrently push line rate on both SFP+ ports).
For step two, I wonder if it is possible to move it into the devicetree, something like this:
If that works, it’ll keep all the changes within the devicetree and might make it easier to maintain a single dtso to hack in support for the S800E.
For step 3, how optional is that configuration? i.e. can I plug in the S800E and expect a working 10g link to come up automatically without touching /etc/config/network?
57 degC should be a reasonably comfortable temperature for the S800E. Assuming the EEPROM values are sane, the temperature should only be a cause for concern above 70 degC.
Module temperature : 58.23 degrees C / 136.82 degrees F
Module temperature high alarm threshold : 80.00 degrees C / 176.00 degrees F
Module temperature high warning threshold : 75.00 degrees C / 167.00 degrees F
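If you want to keep an eye on this over time, here is a small sketch that pulls the Celsius values out of ethtool -m style lines like the three above and reports the headroom to each threshold. The function name and the idea of scripting this are my own and were not tested on the bpi-r4; it assumes the line format shown above.

```python
import re

SAMPLE = """\
Module temperature : 58.23 degrees C / 136.82 degrees F
Module temperature high alarm threshold : 80.00 degrees C / 176.00 degrees F
Module temperature high warning threshold : 75.00 degrees C / 167.00 degrees F
"""


def thermal_headroom(text):
    """Return degrees C of margin to the high warning and high alarm thresholds."""
    values = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        match = re.search(r"(-?\d+(?:\.\d+)?)\s*degrees C", rest)
        if sep and match:
            values[label.strip()] = float(match.group(1))
    current = values["Module temperature"]
    return {
        "to_warning": round(values["Module temperature high warning threshold"] - current, 2),
        "to_alarm": round(values["Module temperature high alarm threshold"] - current, 2),
    }


print(thermal_headroom(SAMPLE))  # {'to_warning': 16.77, 'to_alarm': 21.77}
```

With the values above, that leaves roughly 17 degC before the warning threshold.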