PXE/Grub network problems after version 1.16 #113

lory696 · 2020-11-20T19:39:50Z

The network does not seem to work properly in grub from version 1.17 onward. I tried a pendrive with only grub on it, and most of the network-related commands (like net_dhcp) fail with the error “couldn’t send network packet”. However, version 1.16 works. I also recompiled version 1.21 with the 1.16 network drivers, and everything seems to work properly. I think this commit, introduced in version 1.17, broke something.

samerhaj · 2020-11-25T16:43:03Z

@andreiw

samerhaj · 2021-03-30T14:03:40Z

For reference, this issue is listed as one of the issues blocking some automated deployment/provisioning scenarios (using MAAS) as explained here: https://discourse.maas.io/t/raspberry-pi-4-provisioning-and-kvm-pod-setup/3607

Versions after 1.16 do not work. Apparently, a commit related to version 1.17 broke the network: the UEFI firmware can download the grub bootloader, but then it always shows the error message “couldn’t send network packet”. In grub, most of the network-related commands (like net_dhcp) show that error. However, version 1.16 works, and it managed to download kernel files. I opened an issue 21.

jlinton · 2021-04-14T19:07:03Z

I think I just fixed this.

@@ -589,10 +617,16 @@ GenetSimpleNetworkTransmit (
   }

   if (!Genet->SnpMode.MediaPresent) {
-    //
-    // Don't bother transmitting if there's no link.
-    //
-    return EFI_NOT_READY;
+       Status = GenericPhyUpdateConfig (&Genet->Phy);
+       if (EFI_ERROR (Status)) {
+               //
+               // Don't bother transmitting if there's no link.
+               //
+               DEBUG ((DEBUG_ERROR, "%a: no link\n", __FUNCTION__));
+               return EFI_NOT_READY;
+       } else {
+               Genet->SnpMode.MediaPresent = TRUE;
+       }
   }

   if (HeaderSize != 0) {

jlinton · 2021-04-15T14:52:33Z

I think we also want

+++ b/Platform/RaspberryPi/RPi4/RPi4.dsc
@@ -719,7 +719,7 @@
   Silicon/Broadcom/Drivers/Net/BcmGenetDxe/BcmGenetDxe.inf {
     <PcdsFixedAtBuild>
       gEmbeddedTokenSpaceGuid.PcdDmaDeviceOffset|0x00000000
-      gEmbeddedTokenSpaceGuid.PcdDmaDeviceLimit|0xffffffff
+      gEmbeddedTokenSpaceGuid.PcdDmaDeviceLimit|0xffffffffff
   }

   #
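
To make the size of that change concrete: the old limit 0xffffffff is a 32-bit value (4 GiB - 1), while the new limit 0xffffffffff is a 40-bit value (1 TiB - 1). The sketch below is only an illustration of what such a limit means for a buffer address. The helper `dma_range_allowed` and the example addresses are hypothetical, and treating the limit as an inclusive upper bound on the last byte of the buffer is an assumption, not something taken from the driver source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values from the .dsc diff above. */
#define DMA_LIMIT_OLD 0xffffffffULL    /* 32 bits set: 4 GiB - 1 */
#define DMA_LIMIT_NEW 0xffffffffffULL  /* 40 bits set: 1 TiB - 1 */

/*
 * Hypothetical helper: returns true if the buffer [base, base + size)
 * ends at or below `limit`. The inclusive-limit interpretation is an
 * assumption made for this illustration.
 */
bool dma_range_allowed(uint64_t base, uint64_t size, uint64_t limit) {
  assert(size > 0);
  uint64_t last = base + size - 1;
  if (last < base) {
    return false;  /* address arithmetic wrapped around */
  }
  return last <= limit;
}
```

Under these assumptions, a 4 KiB buffer placed at 0x100000000 (just above 4 GiB) is rejected with the old limit and accepted with the new one, while a buffer below 4 GiB passes either way.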

lory696 · 2021-04-16T14:02:54Z

I just tested the code. It behaves slightly differently from version 1.16 because it still prints some "couldn’t send network packet" errors, but after a while it boots correctly. So, I think the patch fixes this issue. This is the serial log:

>>Start PXE over IPv4InstallProtocolInterface: 245DCA21-FB7B-11D3-8F01-00A0C969723B 35386190
.InstallProtocolInterface: 41D94CD2-35B6-455A-8258-D4E51334AADD 3535C6A0
InstallProtocolInterface: 3AD9DF29-4501-478D-B1F8-7F7FE70E50F3 3535D138
InstallProtocolInterface: F4B427BB-BA21-4F16-BC4E-43E416AB619C 3535BEB0

  Station IP address is 192.168.10.105

  Server IP address is 192.168.10.1
  NBP filename is grubaa64.efi
  NBP filesize is 2157568 BytesInstallProtocolInterface: 245DCA21-FB7B-11D3-8F01-00A0C969723B 35386190
InstallProtocolInterface: 245DCA21-FB7B-11D3-8F01-00A0C969723B 35386190

 Downloading NBP file...

  NBP file downloaded successfully.
[Bds] Expand MAC(DCA632E22291,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0) -> MAC(DCA632E22291,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)
[Security] 3rd party image[0] can be loaded after EndOfDxe: MAC(DCA632E22291,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0).
InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 3535AD40
Loading driver at 0x00033768000 EntryPoint=0x00033768400
Loading driver at 0x00033768000 EntryPoint=0x00033768400
FSOpen: Open 'RPI_EFI.FD' Success
Variables dumped!
InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 35359F98
ProtectUefiImageCommon - 0x3535AD40
  - 0x0000000033768000 - 0x000000000020EC00
Unknown command `#'.
Try `help' for usage
Unknown command `#'.
Try `help' for usage

GenericPhyDetect: PHY detected at address 0x01 (PHYIDR1=0x600D, PHYIDR2=0x84A2)
GenetSimpleNetworkTransmit: no link
error: couldn't send network packet.
GenetSimpleNetworkTransmit: no link
error: couldn't send network packet.
GenetSimpleNetworkTransmit: no link
error: couldn't send network packet.
GenetSimpleNetworkTransmit: no link
error: couldn't send network packet.
GenetSimpleNetworkTransmit: no link
error: couldn't send network packet.
GenetSimpleNetworkTransmit: no link
error: couldn't send network packet.
GenetSimpleNetworkTransmit: no link
error: couldn't send network packet.
GenetSimpleNetworkTransmit: no link
error: couldn't send network packet.
GenericPhyGetConfig: Link speed 1000 Mbps, full-duplex
Booting under MAAS direction...  [ grub.cfg-dc:a6:32:e2  710B  100%  10.52B/s ]
EFI stub: Booting Linux Kernel...    [ boot-initrd  89.13MiB  100%  3.84MiB/s ]
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services and installing virtual address map...
XhcClearBiosOwnership: called to clear BIOS ownership
SetUefiImageMemoryAttributes - 0x0000000037140000 - 0x0000000000040000 (0x0000000000000008)
SetUefiImageMemoryAttributes - 0x00000000370F0000 - 0x0000000000040000 (0x0000000000000008)
SetUefiImageMemoryAttributes - 0x00000000370B0000 - 0x0000000000030000 (0x0000000000000008)
SetUefiImageMemoryAttributes - 0x0000000033A60000 - 0x00000000000B0000 (0x0000000000000008)
SetUefiImageMemoryAttributes - 0x0000000037070000 - 0x0000000000030000 (0x0000000000000008)
SetUefiImageMemoryAttributes - 0x0000000037030000 - 0x0000000000030000 (0x0000000000000008)
SetUefiImageMemoryAttributes - 0x0000000033A20000 - 0x0000000000030000 (0x0000000000000008)
[    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd083]
[    0.000000] Linux version 5.8.0-34-generic

jlinton · 2021-04-16T15:48:08Z

Thanks for testing this. The posted patch should remove all those errors: https://edk2.groups.io/g/devel/topic/patch_bug_0_2_rpi_fix_pxe/82125865?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,40,82125865

Particularly: https://edk2.groups.io/g/devel/topic/patch_1_2/82125863?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,40,82125863

Under normal circumstances GenetSimpleNetworkTransmit won't be called unless the rest of the network stack detects the link is up. So, during normal operation, when the adapter is initialized the link naturally transitions to link up and is then ready for activity later in the boot sequence. If that hasn't happened by the time PXE runs, then it will itself wait for the link.

OTOH, the normal distro PXE sequence involves PXE loading shim, which in turn loads grub, which tries to read machine-specific configs, modules, and grub.cfg in order to prepare the boot menu. Then, once a grub selection is picked, it might try to load the kernel+initrd. In this sequence the network stack is shut down and restarted multiple times. Grub, though, starts up, notices it's been network booted, reads saved network parameters, and immediately tries to transmit data, assuming the link is still up. When that happens grub will print "couldn’t send network packet", and if that lasts long enough it fails to load grub.cfg and the user gets dropped to the grub prompt, because no one in the path bothers to ensure the link state has transitioned back up.

For reference: pftf/RPi4#113

This patch fixes that by polling the link state via GenericPhyUpdateConfig() for ten seconds in the transmit path when the link is down. If the link recovers within this timeout, the state machine is transitioned and we continue data transmission. If the 10 seconds expire without the link resuming, we will fail as before.

While full link negotiation can be fast, it frequently can take a second or two, or longer, depending on the remote peer on the other end of the Ethernet cable. It seems auto MDI-X can slow this down, and certain vendors' products seem to be slower than the norm. Ten seconds may not cover some of these possibilities, but the user should validate cabling and the switch/peer's port configuration if resuming the link is taking > 10 seconds.

Picking a longer timeout is a tradeoff between the machine appearing to hang for extended periods of time (due to grub retries) if the link is actually down vs. generally providing enough time for most endpoints to complete the negotiation.

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Reviewed-by: Jared McNeill <jmcneill@invisible.ca>
Reviewed-by: Andrei Warkentin <awarkentin@vmware.com>

asheplyakov · 2021-05-20T12:23:53Z

I've got a similar issue with a different SoC (BE-M1000) and network driver. Perhaps the problem should be fixed in grub (instead of adding a workaround to every driver)?

jlinton · 2021-06-18T22:55:35Z

The link-down issue? I would say this particular issue is mostly a genet bug rather than a grub one. The link resumes properly; the problem is that the genet transmit path doesn't update its own state to reflect that. So there isn't really anything grub should do in this case but retry, which it does. AFAIK, of course...

If you're referring to the second "fix" in my series, which adjusts the RPi DMA constraints, then that's a double bug. The DMA constraints are wrong, but grub is also doing something wrong and not calling the UEFI map/unmap correctly somewhere (I haven't had a chance to find that).

BTW: I don't remember hearing about that SoC; it looks pretty interesting.

pbatard · 2021-09-01T19:30:25Z

This should be fixed with the 1.30 release.

samerhaj added the genet Pi4 NIC driver work items label Nov 25, 2020

jlinton changed the title ~~Grub network problems after version 1.16~~ PXE/Grub network problems after version 1.16 Feb 22, 2021

jlinton mentioned this issue Mar 9, 2021

Linux kernels cannot load network: (failed to connect to PHY) when UEFI netbooted through iPXE or grub #137

Closed

samerhaj assigned jlinton Apr 16, 2021

pbatard closed this as completed Sep 1, 2021
