Huawei OptiXstar S800E XGSPON SFP+ ONU not detected

I think we are now facing the same issue where the EEPROM just isn’t available. Both frank-w and ericwoud have suggested adding a delay during the module initialization. If you’re keen to try that, can you modify sfp.c to add this snippet after the quirk fixup?

In sfp_sm_mod_probe (linux/drivers/net/phy/sfp.c at 059dd502b263d8a4e2a84809cf1068d6a3905e6f, torvalds/linux on GitHub):

	if (sfp->quirk && sfp->quirk->fixup)
		sfp->quirk->fixup(sfp);

+	dev_warn(sfp->dev, "hack: applying slow gpon\n");
+	sfp->module_t_start_up = T_START_UP_BAD_GPON;

	sfp->state_hw_mask &= ~sfp->state_ignore_mask;
	mutex_unlock(&sfp->st_mutex);

This should force the quirk to be applied regardless of whether the EEPROM is available. If it is applied, you should hopefully see the “hack:…” message in your dmesg.

(I’m also surprised that your sfp still works without a moddef0 gpio. No idea how that is happening but it’s good that there’s power and modules are still getting probed)

Slow Sunday here, so…

I removed the lines below, in addition to the moddef0 gpio, from the device tree:

		tx-disable-gpios = <&pio 70 GPIO_ACTIVE_HIGH>;
		tx-fault-gpios = <&pio 69 GPIO_ACTIVE_HIGH>;

With the above, and moddef0 manually set low, when I boot up with a DAC cable in sfp1 it shows up and the EEPROM of the DAC cable is read correctly.

sfp sfp1: No tx_disable pin: SFP modules will always be emitting.

At this point, if I swap out the DAC cable for the S800E, it just works without further kernel messages/errors. Link is up and everything works as it should. But ethtool -m eth2 shows nothing, so the EEPROM is not probed/loaded, yet the link works.

ethtool -m eth2
netlink error: No such device or address
ethtool eth2
Settings for eth2:
	Supported ports: [  ]
	Supported link modes:   2500baseX/Full
	                        1000baseX/Full
	                        10000baseCR/Full
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: Yes
	Supported FEC modes: Not reported
	Advertised link modes:  10000baseCR/Full
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: Yes
	Advertised FEC modes: Not reported
	Speed: 10000Mb/s
	Duplex: Full
	Auto-negotiation: on
	Port: Other
	PHYAD: 0
	Transceiver: internal
        Current message level: 0x000000ff (255)
                               drv probe link timer ifdown ifup rx_err tx_err
	Link detected: yes
cat /sys/kernel/debug/sfp1/state
Module state: present
Module probe attempts: 0 0
Device state: up
Main state: link_up
Fault recovery remaining retries: 5
PHY probe remaining retries: 12
Signalling rate: 10313 kBd
Rate select threshold: 0 kBd
moddef0: 1
rx_los: 0
tx_fault: 0
tx_disable: 0
rs0: 0
rs1: 0

However, if I do a reboot, the same error comes back.

sfp sfp1: please wait, module slow to respond

…further down the bootlog

sfp sfp1: failed to read EEPROM: -ENXIO
cat /sys/kernel/debug/sfp1/state
Module state: error
Module probe attempts: 10 12
Device state: up
Main state: down
Fault recovery remaining retries: 0
PHY probe remaining retries: 0
Signalling rate: 10313 kBd
Rate select threshold: 0 kBd
moddef0: 1
rx_los: 0
tx_fault: 0
tx_disable: 1
rs0: 0
rs1: 0

I will patch a build with a default startup delay as suggested and find time/opportunity to test. The S800E XGS-PON is my main fiber on another setup; the bpi-r4 is more of a test/lab device.

You have tx disable set to 1, so the other side does not get link, and so you get none either.

Applied your suggested default slow gpon patch…

During boot, for sfp2 the EEPROM is displayed and then the kernel warning “hack: applying slow gpon” appears (which means the patch is indeed applied). But for sfp1 it’s still the same “please wait, module slow to respond”, then further down “failed to read EEPROM: -ENXIO”. sfp1 does not display the slow gpon warning, meaning the EEPROM has to be read first.

So no go on that front.

This patch set, “Allow slow to initialise GPON modules to work”, gives a bit more detail on the boot process and on reading the EEPROM of the sfp: https://lore.kernel.org/all/[email protected]/T/

I tried increasing the slow probe interval from 5s to 10s and the retries from 12 to 50.

#define T_PROBE_RETRY_INIT     msecs_to_jiffies(100)
#define R_PROBE_RETRY_INIT     10
#define T_PROBE_RETRY_SLOW     msecs_to_jiffies(5000)
#define R_PROBE_RETRY_SLOW     12

These are the two numbers shown under probe attempts, 10 and 12.

cat /sys/kernel/debug/sfp1/state
Module state: error
Module probe attempts: 10 12

But even with the increased probe time and retries, it still failed to read the EEPROM. I can actually see the probe attempts number increasing to 50 before it fails.
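For a rough sense of how long the driver keeps trying before it gives up, the retry constants above can be multiplied out. This is only a back-of-the-envelope sketch in Python, assuming each retry simply waits one full interval (it ignores the time taken by the i2c transfers themselves); the function name is made up for illustration.

```python
def probe_window_s(init_ms, init_retries, slow_ms, slow_retries):
    """Approximate total time (seconds) spent probing before giving up."""
    return (init_ms * init_retries + slow_ms * slow_retries) / 1000.0

# Stock values quoted above: 10 x 100 ms, then 12 x 5 s
stock = probe_window_s(100, 10, 5000, 12)

# Values tried in this test: slow interval 10 s, 50 retries
patched = probe_window_s(100, 10, 10000, 50)

print(f"stock:   ~{stock:.0f} s")
print(f"patched: ~{patched:.0f} s")
```

That works out to roughly one minute stock and over eight minutes patched, so if the EEPROM still never answers, the length of the probe window is probably not the limiting factor.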

The issue now seems to me to be the actual probing of the EEPROM (wrong address? wrong EEPROM response?). Maybe we are looking in the wrong place? Maybe an i2c issue on sfp1?

So, with just the moddef0 gpio removed from the device tree and the pin manually pulled low, I was able to provide power to the S800E. For testing, if the system is booted with a DAC cable in sfp1 and everything is detected correctly, quickly hot-swapping to the S800E works. But a reboot brings back the same failed to read EEPROM error.

ideas?

just installed i2c-tools to troubleshoot

I can detect and dump i2c eeprom from a DAC cable connected on SFP1

i2cdetect -y 3
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:                         -- -- -- -- -- -- -- -- 
10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
20: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
40: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
50: 50 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
60: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
70: UU -- -- -- -- -- -- -- 

i2cdump -y 3 0x50
No size specified (using byte-data access)
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
00: 03 04 23 01 00 00 04 41 84 80 d5 00 67 00 00 00    ??#?..?A???.g...
10: 00 00 01 00 55 62 69 71 75 69 74 69 20 49 6e 63    ..?.Ubiquiti Inc
20: 2e 20 20 20 00 24 5a 4c 44 41 43 2d 53 46 50 31    .   .$ZLDAC-SFP1

However, I get nothing with an S800E on the same port.

i2cdetect -y 3
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:                         -- -- -- -- -- -- -- -- 
10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
20: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
40: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
50: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
60: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
70: UU -- -- -- -- -- -- --

i2cdump -y 3 0x50
No size specified (using byte-data access)
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
00: XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX    XXXXXXXXXXXXXXXX
10: XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX    XXXXXXXXXXXXXXXX

Any ideas what’s happening here, and next steps for troubleshooting?

You need a circuit diagram.

Some SFPs use a real EEPROM at 0x50, so try to locate one on the module. Likely it is not there.

Otherwise the module needs to be powered first, if it is not already.

Maybe removing R106 was not such a great idea…

Is the module still functional?

I did not remove R106. As stated above, I removed the moddef0 gpio from the sfp1 device tree node, then manually changed its direction to out and pulled it low. This was the manual way to provide power to the sfp; details are a few scrolls up. Yes, the module is fully functional if I hot-swap, but it fails when I do a reboot. The S800E is providing a 10G fiber connection on another setup as my primary internet connection.

My next step is to telnet or ssh into the S800E and see what’s up with it.

So looks like tx disable is the trouble?

Change the function of the pin to a led in dts and you can easily play with it.

From what I understand of the sfp.c code, tx-disable (output, high) is enforced until the module probe is successful, as this is supposed to power down the laser in the sfp module. I can remove the line below from the sfp1 device tree node to free up the resource and manually pull it low.

		tx-disable-gpios = <&pio 70 GPIO_ACTIVE_HIGH>;

But do you think this will help? Primarily, the module does not work because of the failed module probe attempts:

Module state: error
Module probe attempts: 10 12

The emulated EEPROM may only be functional when tx is not disabled. Not really according to spec, but still…

good point. worth a shot.

ericwoud: I was the one who removed R106, it should be okay, I can substitute with a piece of wire since it is 0R.

glassdoor: I am hoping to capture the signals between my working setup (mellanox) and bpir4 to compare the pin states during module initialization.

Right now, I’ve soldered leads onto my mellanox card to observe the pin states as the S800E is initialized correctly:

This is what the trace roughly looks like:

  • S800E plugged into ConnectX-3, x86 openwrt, cold start
  • Power on at ~0.7s, all pins high
  • tx-disable goes low ~2.2s
  • ~6s, moddef0 pulses low very briefly, followed by tx-fault pulsing low
  • ~25s, LOS goes low, quickly followed by tx-fault, then followed by moddef0
  • ~27s, bunch of i2c reads
  • no significant signal state changes for ~30s, link appears to be up

i2c reads during setup:

; Identifier
107877393-107878989 24xx EEPROM: Operations: Sequential random read (addr=00, 2 bytes): 03 04

; Transceiver
107879118-107880683 24xx EEPROM: Operations: Sequential random read (addr=07, 2 bytes): 00 00
107880768-107882332 24xx EEPROM: Operations: Sequential random read (addr=08, 2 bytes): 00 00
107882416-107883980 24xx EEPROM: Operations: Sequential random read (addr=09, 2 bytes): 00 00
107884066-107885630 24xx EEPROM: Operations: Sequential random read (addr=0A, 2 bytes): 00 03

107885715-107887279 24xx EEPROM: Operations: Sequential random read (addr=03, 2 bytes): 20 00
107887364-107888928 24xx EEPROM: Operations: Sequential random read (addr=04, 2 bytes): 00 00
107889013-107890577 24xx EEPROM: Operations: Sequential random read (addr=05, 2 bytes): 00 00
107890660-107892225 24xx EEPROM: Operations: Sequential random read (addr=06, 2 bytes): 00 00

107913442-107915007 24xx EEPROM: Operations: Sequential random read (addr=08, 2 bytes): 00 00
107915126-107916691 24xx EEPROM: Operations: Sequential random read (addr=03, 2 bytes): 20 00
107916804-107918368 24xx EEPROM: Operations: Sequential random read (addr=06, 2 bytes): 00 00

; Vendor OUI
107918484-107920049 24xx EEPROM: Operations: Sequential random read (addr=25, 2 bytes): 00 00
107920132-107921696 24xx EEPROM: Operations: Sequential random read (addr=26, 2 bytes): 00 00
107921781-107923346 24xx EEPROM: Operations: Sequential random read (addr=27, 2 bytes): 00 53

; Transceiver
107923460-107925024 24xx EEPROM: Operations: Sequential random read (addr=06, 2 bytes): 00 00

I eventually would like to trace the same for the bpir4 to compare if there are any major differences. In the meantime, an idea to try is to emulate the signals here by hand and see if the module starts up properly?

Edit: uploading the raw pulseview trace here: mlx_boot_export.sr (215.8 KB)

Hi folks, I’ve added a bunch of test leads on my bpi-r4 and traced the s800e startup:

I don’t have any answers, but these are my observations:

  • after poweron, LOS, tx-fault and moddef0 go low after about 25s, which is almost identical to the working mellanox capture. Seems like the S800E starts up as long as there is 3.3V, regardless of tx-disable.
  • tx-disable is mostly high during startup, then it goes low at about 24s. This is unexpected to me as I would think that tx-disable will only be toggled after moddef0 state has changed. Also for comparison, tx-disable is almost always held at GND on the mellanox.
  • rate select looks like it’s floating and generating a lot of noise, which is probably OK since “floating” is a valid state
  • however tx-fault also looks noisy during the first ~20s which is strange because it is supposed to be pulled up to 3v3 via a 4.7K resistor (R96). Maybe it is affected by tx-disable being high?
  • after moddef0 goes low, the sfp driver can be seen probing quickly (10x 100ms) then switching to 5s interval probes. All i2c requests are not acknowledged. Eventually it times out with the EEPROM error
  • The SCL clock rate looks okay at about 100k, in fact the mellanox i2c runs slightly faster at about 120k.

At the time of capture, the only active hardware modification is the mosfet bypass. The previously removed R106 has been bridged. Pulseview capture here: bpir4_boot_annotate.sr (631.5 KB)

Looks like the primary difference between the two is that tx-disable is pulled low about 2s into bootup on the mellanox.

As tx-disable is an input on the bpi-r4, the question is: what is causing the S800E to pull the tx-disable pin low on the mellanox?

tx-disable switches between an input (high impedance) and output:

As far as I can tell, it is enabled once during initialization:


Today in hardware gore, I thought it would be interesting to stick an actual at24 eeprom directly on the bus, using values dumped from the S800E.

The EEPROM gets detected, and the sfp driver halts after detecting a checksum failure. Sure enough, after adding up all the bytes to verify, the original checksum from the S800E is indeed incorrect.

[  182.279685] sfp sfp1: EEPROM base structure checksum failure: 0x98 != 0x5c
[  182.286582] sfp EE: 00000000: 03 04 01 20 00 00 00 00 00 00 00 03 64 00 14 c8  ... ........d...
[  182.295274] sfp EE: 00000010: 00 00 00 00 48 55 41 57 45 49 20 20 20 20 20 20  ....HUAWEI
[  182.303963] sfp EE: 00000020: 20 20 20 20 00 00 00 00 53 38 30 30 45 00 00 00      ....S800E...
[  182.312655] sfp EE: 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 04 f6 00 5c  ...............\
[  182.321342] sfp EE: 00000040: 90 d2 8c 80 01 00 00 00 30 bd d0 80 c0 ff ff ff  ........0.......
[  182.330054] sfp EE: 00000050: 24 35 01 80 c0 ff ff ff 70 bd d0 80 c0 ff ff ff  $5......p.......

I compared the values just in case the dump was wrong: on the mellanox I ran ethtool -m eth0 hex on and compared the output with another dump generated from the raw i2c signals, and the two match. At least this can be quirked into the sfp driver, assuming the i2c issue ever finds a solution.
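For anyone who wants to redo the sum: as far as I understand SFF-8472, the base checksum (byte 0x3f) is the low 8 bits of the sum of bytes 0x00 to 0x3e. A small Python sketch, using only the first 64 bytes from the dmesg dump above (the helper name is just for illustration):

```python
# First 64 bytes (0x00-0x3f) as printed in the dmesg dump above.
BASE_HEX = """
03 04 01 20 00 00 00 00 00 00 00 03 64 00 14 c8
00 00 00 00 48 55 41 57 45 49 20 20 20 20 20 20
20 20 20 20 00 00 00 00 53 38 30 30 45 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 04 f6 00 5c
"""

def cc_base(eeprom):
    """Low 8 bits of the sum of bytes 0x00..0x3e."""
    return sum(eeprom[0x00:0x3F]) & 0xFF

# Strip all whitespace before parsing so this works on any Python 3.
base = bytes.fromhex("".join(BASE_HEX.split()))
stored = base[0x3F]
computed = cc_base(base)
print(f"stored 0x{stored:02x}, computed 0x{computed:02x}")
```

This prints stored 0x5c, computed 0x98, the same pair the driver complains about.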

Anyway, base structure and the extended structure both have checksum failures. I used i2csfp to fix them (thank you ericwoud)

root@OpenWrt:/tmp# ./i2csfp sfp1 eepromfix
Checksum 0x00-0x3e failed, set at 5c, but should be 98
Checksum 0x40-0x5e failed, set at 55, but should be f9
Error: i2c_transfer() failed: No such device or address
Error: i2c_transfer() failed: No such device or address
Error: i2c_transfer() failed: No such device or address
Error: i2c_transfer() failed: No such device or address
RollBall Password used: 0xfffffffa
Error: i2c_transfer() failed: No such device or address
Error: Cannot fill in password!

...

root@OpenWrt:/tmp# ./i2csfp sfp1 byte write 0x50 0x3F 0x98
root@OpenWrt:/tmp# ./i2csfp sfp1 byte write 0x50 0x5F 0xF9
root@OpenWrt:/tmp# ./i2csfp sfp1 restore

With the checksums fixed, the module goes further into the initialization!

[  749.594781] sfp sfp1: module removed
[ 1060.452804] sfp sfp1: Host maximum power 3.0W
[ 1060.773374] sfp sfp1: module HUAWEI           S800E rev  sn 4857XXXXXXXXXXXX dc 24082602
[ 1060.803238] hwmon hwmon2: temp1_input not attached to any thermal zone

Strangely, it gets stuck waiting for LOS even though the line was already deasserted:

root@OpenWrt:/tmp# cat /sys/kernel/debug/sfp1/state
Module state: present
Module probe attempts: 0 0
Device state: up
Main state: wait_los
Fault recovery remaining retries: 5
PHY probe remaining retries: 12
Signalling rate: 10313 kBd
Rate select threshold: 0 kBd
moddef0: 1
rx_los: 0
tx_fault: 0
tx_disable: 0
rs0: 0
rs1: 0

Turns out, for some whack reason, the S800E specifies that it requires the LOS to be inverted, so the driver waits indefinitely.

I changed that field from 1C to 1A, wrote it into the EEPROM, and fixed the checksum again:

root@OpenWrt:/tmp# ./i2csfp sfp1 byte write 0x50 0x41 0x1A
root@OpenWrt:/tmp# ./i2csfp sfp1 eepromfix
Checksum 0x00-0x3e matched 98
Checksum 0x40-0x5e failed, set at f9, but should be f7
...
root@OpenWrt:/tmp# ./i2csfp sfp1 byte write 0x50 0x5F 0xF7
root@OpenWrt:/tmp# ./i2csfp sfp1 restore
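The 1C to 1A change and the f9 to f7 checksum move can be sanity-checked in a few lines of Python. The bit names below follow my reading of the options byte at 0x41 in SFF-8472 (bit 1 = LOS normal, bit 2 = LOS inverted, bit 3 = TX_FAULT, bit 4 = TX_DISABLE); treat the table as an assumption and check the spec before relying on it. The function names are made up for this sketch.

```python
# Assumed bit layout of the options byte at 0x41 (my reading of SFF-8472).
OPTION_BITS = {
    1: "LOS normal",
    2: "LOS inverted",
    3: "TX_FAULT",
    4: "TX_DISABLE",
}

def decode_options(value):
    """Return the names of the option bits set in value."""
    return [name for bit, name in sorted(OPTION_BITS.items())
            if value & (1 << bit)]

def patched_checksum(old_cc, old_byte, new_byte):
    """Checksum after changing one covered byte (sum modulo 256)."""
    return (old_cc - old_byte + new_byte) & 0xFF

print(decode_options(0x1C))  # original value
print(decode_options(0x1A))  # patched value
print(hex(patched_checksum(0xF9, 0x1C, 0x1A)))
```

Under that bit layout, 0x1C has the inverted-LOS bit set and 0x1A swaps it for the normal-LOS bit, and the extended checksum drops by 2 from 0xf9 to 0xf7, matching what eepromfix reported.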

With that, I was greeted with this:

[  499.274793] mtk_soc_eth 15100000.ethernet eth2: Link is Up - 10Gbps/Full - flow control off
[  499.274823] br-wan: port 2(eth2) entered blocking state
[  499.288355] br-wan: port 2(eth2) entered forwarding state
root@OpenWrt:/tmp# ifconfig
br-lan    Link encap:Ethernet  HWaddr 9A:90:63:9B:8C:BA
          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::9890:63ff:fe9b:8cba/64 Scope:Link
          inet6 addr: fd44:c9c7:ddf::1/60 Scope:Global
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:14993 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7007 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2186957 (2.0 MiB)  TX bytes:2191255 (2.0 MiB)

br-wan    Link encap:Ethernet  HWaddr 9A:90:63:9B:8C:BB
          inet addr:202.XX.XX.XXX  Bcast:202.XX.XX.255  Mask:255.255.255.0
          inet6 addr: fe80::9890:63ff:fe9b:8cbb/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3021 errors:0 dropped:57 overruns:0 frame:0
          TX packets:884 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:286989 (280.2 KiB)  TX bytes:408756 (399.1 KiB)

This permanently “breaks” my sfp1 WAN port, but runs on stock OpenWrt and survives reboots. It’s not a real solution but is very cool to see the link up on a bpi-r4.

Edit: happy new year folks!

Happy New Year to you too… great to see some soldering porn on the 1st day of the year.

The first question that pops to mind: since the checksum of the S800E is wrong, why didn’t the mellanox flag it? Which also raises the question of how important the EEPROM data really is.

The S800E is an SFP+ XGS-PON ONU. So for all intents and purposes, we don’t really care what the EEPROM says as long as we get:

kern.info kernel: [   11.624377] mtk_soc_eth 15100000.ethernet eth2: switched to inband/10gbase-r link mode

So it is more like a fixed-phy than an sfp, which gave me a really stupid/crazy idea… going to test it out now.

Mellanox (specifically this connectx-3) runs off kmod-mlx4-core and does not appear to involve the current sfp driver that we are hacking on, so the implementation is probably different and possibly more forgiving.

IMO the EEPROM data is “nice to have” but not absolutely necessary. LOS is already exposed as a pin; the only useful omission is the DDMI data such as rx power level and module temperature. Most of the actual link configuration appears to be done in-band once the soc is aware that it is possible to do so.

Going to be fun to see what you are up to, if I had to guess, likely something to do with gmac1 on the devicetree? :^)

Crazy/stupid idea worked!

Going to test a few more things, then clean things up. Hotplug (as in plugging in and out), reboots and cold power-on all work. No hardware hacks needed, and only minor dts changes.

Details to follow.

Special thanks to @j_g and all who chipped in.

Here is how to get the S800E working on the bpi-r4 without any hardware modifications, on openwrt-24.10. Current mainline may have switched around the sfp1/sfp2 and/or gmac/eth definitions, so do your homework before you start.

Step one: Edit mt7988a-bananapi-bpi-r4.dtsi

Remove the following:

	/* SFP1 cage (WAN) */
	sfp1: sfp1 {
		compatible = "sff,sfp";
		i2c-bus = <&i2c_sfp1>;
		los-gpios = <&pio 54 GPIO_ACTIVE_HIGH>;
		mod-def0-gpios = <&pio 82 GPIO_ACTIVE_LOW>;
		tx-disable-gpios = <&pio 70 GPIO_ACTIVE_HIGH>;
		tx-fault-gpios = <&pio 69 GPIO_ACTIVE_HIGH>;
		rate-select0-gpios = <&pio 21 GPIO_ACTIVE_LOW>;
		maximum-power-milliwatt = <3000>;
	};

Edit &gmac2 to the following. You must change the phy-mode from usxgmii to 10gbase-r and remove the sfp reference.

&gmac2 {
	managed = "in-band-status";
	phy-mode = "10gbase-r";
	status = "okay";
};

Step two: Add the following to /etc/rc.local to manually force power on for the S800E in the sfp1 slot at startup.

echo 594 > /sys/class/gpio/export
echo out > /sys/class/gpio/gpio594/direction
echo 0 > /sys/class/gpio/gpio594/value
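The 594 above is presumably the gpiochip base plus the pin offset of mod-def0 (pin 82 in the sfp1 node). A tiny Python sketch of that arithmetic, assuming a base of 512, which is only inferred from 594 - 82; the base can differ between kernels, so check /sys/class/gpio/gpiochip*/base on your own build before copying the number.

```python
GPIOCHIP_BASE = 512   # assumption: inferred from 594 - 82, verify on your system
MODDEF0_PIN = 82      # from mod-def0-gpios = <&pio 82 ...> in the sfp1 node

def sysfs_gpio(base, offset):
    """Legacy sysfs GPIO number for a pin offset on a given gpiochip."""
    return base + offset

print(sysfs_gpio(GPIOCHIP_BASE, MODDEF0_PIN))  # 594
```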

Step three: (optional) Edit eth2 device under /etc/config/network. Adjust to own taste.

config device                          
        option name 'eth2'                   
        option autoneg '0'             
        option speed '10000'           
        option duplex '1'              
        option rxpause '1'             
        option txpause '1'

In my testing, hotplugging works and the link goes up and down as expected. Warm restarts, sysupgrade and cold boot all work.

One thing to note: the S800E does run warm. It’s reading 57 deg C after running for the last 12+ hrs, up from 50 deg C without the S800E module. This is with the factory casing (all cased up) and a passive heatsink on the processor, and with the CPU overclocked from 1.8GHz to 2.2GHz at the same default voltage. All within safety margins, I suppose.

Have fun and don’t burn down the house.

Nice, excellent to see that it is possible to get a working link without any physical mods!

Exciting, as the bpi-r4 is probably one of the best performance-to-value options for 10g-capable routing (assuming that the bpi-r4 can concurrently push line rates at both SFP+ ports).

For step two, I wonder if it is possible to move it into the devicetree, something like this:

&pio {
+     sfp1-module-power-hog {
+          gpio-hog;
+          gpios = <&pio 82 GPIO_ACTIVE_HIGH>;
+          output-low;
+     };
     pwm0_pins: pwm0-pins {
          mux {
               groups = "pwm0";
               function = "pwm";
          };
     };
};

If that works, it’ll keep all the changes within the devicetree and might make it easier to maintain a single dtso to hack in support for the S800E.

For step 3, how optional is that configuration? i.e. can I plug in the S800E and expect a working 10g link to come up automatically without touching /etc/config/network?

57 degC should be a reasonably comfortable temperature for the S800E. Assuming the EEPROM values are sane, the temperature should only be a cause for concern above 70 degC.

        Module temperature                        : 58.23 degrees C / 136.82 degrees F
        Module temperature high alarm threshold   : 80.00 degrees C / 176.00 degrees F
        Module temperature high warning threshold : 75.00 degrees C / 167.00 degrees F