[R2] Can't get the kernel to load MT7615 firmware on neither debian 9 (4.14) nor 10 (5.4)

i guess this is the problem:

semaphore is aquired, function is left and so semaphore is not released (done in out:-block “sem = mt7615_mcu_patch_sem_ctrl(dev, 0);” )

i’m not sure how to handle it correctly, but i guess we need to replace the “return 0;” with “goto out;”

I don’t think that’s the actual problem (though it may be a problem later).

I’ve added prints to the mt7615_mcu_patch_sem_ctrl function, and the first lock attempt on dmesg already fails.

root@bpi-r2:~# dmesg | grep DEBUG
[   12.061474] DEBUG: Attempting to lock mt7615's semaphore.
[   32.474102] DEBUG: mt7615's semaphore returned an error.
[  202.445135] DEBUG: Attempting to lock mt7615's semaphore.
[  202.454843] DEBUG: mt7615's semaphore returned an error.

Upon closer inspection of the switch-case, it seems that…

  • PATCH_IS_DL means that the semaphore is already locked by something, and thus nothing more can be done (yet).
  • PATCH_NOT_DL_SEM_SUCCESS means that the semaphore has been locked successfully, and so the code continues.

Then my guess is that the default is an error, but it does return -EAGAIN. Assuming the kernel errnos mean the same as the userspace ones, that would mean that even though the resource is not available now, it may be later. Isn’t that the same as PATCH_IS_DL? It may be referring to the hardware itself (in the sense that the hardware is not ready), but that would imply there’s something wrong with my hardware… What do you think?

i guess we need somebody knowing the driver more…i do not understand this exactly…maybe it it like you wrote…but why the semaphore is blocked before first call of mt7615_mcu_patch_sem_ctrl ? that does not make sense for me

how is your exact debug-code ( mt7615's semaphore returned an error.) ? maybe you treat PATCH_IS_DL or PATCH_NOT_DL_SEM_SUCCESS as error. i guess first returns PATCH_IS_DL and returns 0 without releasing the semaphore…second failes with -EAGAIN because semaphore is still locked

i see in 893369b769c18ad14d57f731428d97a7c04dfec6 “mt7615: mcu: remove skb_ret from mt7615_mcu_msg_send” the mt7615_mcu_patch_sem_ctrl function is changed

 static int mt7615_mcu_patch_sem_ctrl(struct mt7615_dev *dev, bool get)
 {
        struct {
-               __le32 operation;
+               __le32 op;
        } req = {
-               .operation = cpu_to_le32(get ? PATCH_SEM_GET :
-                                        PATCH_SEM_RELEASE),
+               .op = cpu_to_le32(get ? PATCH_SEM_GET : PATCH_SEM_RELEASE),
        };
-       struct event {
-               u8 status;
-               u8 reserved[3];
-       } *resp;
        struct sk_buff *skb = mt7615_mcu_msg_alloc(&req, sizeof(req));
-       struct sk_buff *skb_ret;
-       int ret;
 
-       ret = mt7615_mcu_msg_send(dev, skb, -MCU_CMD_PATCH_SEM_CONTROL,
-                                 &skb_ret);
-       if (ret)
-               goto out;
-
-       resp = (struct event *)(skb_ret->data);
-       ret = resp->status;
-       dev_kfree_skb(skb_ret);
-
-out:
-       return ret;
+       return mt7615_mcu_msg_send(dev, skb, -MCU_CMD_PATCH_SEM_CONTROL);
 }

you cannot apply this directly because you still need the skb-param…but structure-change may help

maybe this fixes problem in later versions (but i still do not get the error on my r2 with mt7615 in pcie-slot)

i guess we need somebody knowing the driver more…i do not understand this exactly…maybe it it like you wrote…but why the semaphore is blocked before first call of mt7615_mcu_patch_sem_ctrl ? that does not make sense for me

Neither does to me. One explanation is that it may be locked by something else elsewhere, maybe not intentionally. Or even that the semaphore is stored in non-volatile memory on the mt7615 hardware, which of course, it doesn’t make sense.

how is your exact debug-code ( mt7615's semaphore returned an error. ) ? maybe you treat PATCH_IS_DL or PATCH_NOT_DL_SEM_SUCCESS as error. i guess first returns PATCH_IS_DL and returns 0 without releasing the semaphore…second failes with -EAGAIN because semaphore is still locked

    int ret;

	struct {
		__le32 op;
	} req = {
		.op = cpu_to_le32(get ? PATCH_SEM_GET : PATCH_SEM_RELEASE),
	};

    if (get)
        pr_alert("DEBUG: Attempting to lock mt7615's semaphore.\n");
    else
        pr_alert("DEBUG: Attempting to unlock mt7615's semaphore.\n");

	ret = __mt76_mcu_send_msg(&dev->mt76, MCU_CMD_PATCH_SEM_CONTROL,
				   &req, sizeof(req), true);

    switch(ret)
    {
    case PATCH_IS_DL:
        pr_alert("DEBUG: mt7615's semaphore was already locked (by whom?).\n");
        break;
    case PATCH_NOT_DL_SEM_SUCCESS:
        pr_alert("DEBUG: mt7615's semaphore was locked.\n");
        break;
    default:
        pr_alert("DEBUG: mt7615's semaphore returned an error.\n");
        break;
    }

    return ret;

I know I could reduce some lines of code using ternary operators, but I preferred to avoid issues later.

i see in 893369b769c18ad14d57f731428d97a7c04dfec6 “mt7615: mcu: remove skb_ret from mt7615_mcu_msg_send” the mt7615_mcu_patch_sem_ctrl function is changed

Now that you mention this change, it seems that only the first byte is status, while the others are reserved. Knowing that, I’ve added another print: pr_alert("DEBUG: __mt76_mcu_send_msg() returned 0x%08x", ret);.

It worked, more or less… It printed 0xffffff92. I assume the communication is done using little-endian since the request uses __le32. Assuming that, 0x92 is the status, while 0xff are the reserved bytes…

Of course, if the 4 bytes are compared, the switch is doomed to fail miserably. But even if the reserved bytes are ignored we’re left with 0x92, which is still invalid:

enum {
	PATCH_NOT_DL_SEM_FAIL	 = 0x0,
	PATCH_IS_DL		 = 0x1,
	PATCH_NOT_DL_SEM_SUCCESS = 0x2,
	PATCH_REL_SEM_SUCCESS	 = 0x3
};

As you say the lowest 4 bits are the status you could compare

switch(ret & 0xf)

The status is 2,so not-dl-success…but i’m not sure this is right way

Shouldn’t be the lowest 8 bits? The structure you mentioned wasn’t a bitfield or anything special.

I guess to be sure we should check __mt76_mcu_send_msg, but it’s a macro that calls a function pointer, so I have no idea where it points…

I’ve just tried ignoring all but the lowest 4 bits, but it still throws the exact same error if I try to interact with it.

EDIT: We’re wrong assuming that a mask is the solution:

[   11.989938] DEBUG: Attempting to lock mt7615's semaphore.
[   32.473472] DEBUG: __mt76_mcu_send_msg() returned 0xffffff92
[   32.473477] DEBUG: mt7615's semaphore was locked.
[   52.953480] DEBUG: Attempting to unlock mt7615's semaphore.
[   73.433468] DEBUG: __mt76_mcu_send_msg() returned 0xffffff92
[   73.433473] DEBUG: mt7615's semaphore was locked.

It returned exactly the same value when unlocking…

Alright, some progress (or deductions, more likely):

It seems the kernel returns -<errno> for errors. If we assume the 0xffffff92 is not from the mt7615 but from the kernel, then decoding it as a little-endian signed integer gives -110, which corresponds to -ETIMEOUT.

This makes complete sense since the time from when requesting the semaphore until when the response is given is consistent to about 20 seconds.

Seeing as the default timeout for sockets on Linux >3.something is 20 seconds (don’t quote me on that, I haven’t actually checked it) it doesn’t seem strange that a similar timeout would be applied to hardware.

Sadly, that makes the hypothesis that there’s something wrong with my hardware even stronger :C

EDIT: Since I guessed the most likely explanation is the ETIMEOUT, I decided, just in case, to power off my BPi, disconnect the mt7615, let it cool down and clean it, then reconnected it and booted the BPi again.

Guess what, it works now… I feel stupid heh.

Well, lesson learnt: Always remove dust from appliances before plugging them in…
At the very least I’ve learnt quite a few things on the Linux kernel that will surely come handy some day.
Sorry for wasting your time, and thanks a lot.

Good to hear problem is solved

But code should show the etimeout and not semaphore error…

You mean the pr_alerts I added? Those aren’t there anymore.

If you mean the driver, then I guess I could give it a try so that it returns ETIMEOUT instead of Input/output error, but I can’t right now.