Nodes Install then fail to boot on successive boots
Howdy,
I'm testing out OpenHPC in a VirtualBox environment using the instructions for Warewulf + Slurm on CentOS 7.5 x86_64:
Any suggestions as to what I may be missing? Let me know if I can provide additional details. Details follow below:
The VirtualBox compute nodes PXE boot and install just fine; however, on subsequent boots they fail with the following console message:
[TXE: 1 x "Network unreachable (http://ipxe.org/28086011 )"]
Configuring (net0 08:00:27:8e:87:8b)................ No configuration methods succeeded (http://ipxe.org/040ee119 )
No more network devices
FATAL: INT18: BOOT FAILURE
Here's some information from the server side:

# wwsh node list
NAME                GROUPS              IPADDR              HWADDR
================================================================================
c0001               UNDEF               192.168.10.101      08:00:27:8e:87:8b

# wwsh provision list (<-- is there any way to expand the column width for this output?)
NODE                VNFS            BOOTSTRAP             FILES
================================================================================
c0001               centos7.5       3.10.0-862.3.3.el7... dynamic_hosts,grou...

# cat /srv/warewulf/ipxe/cfg/08\:00\:27\:8e\:87\:8b
#!ipxe
# Configuration for Warewulf node: c0001
# Warewulf data store ID: 30
echo Now booting c0001 with Warewulf bootstrap (3.10.0-862.3.3.el7.x86_64)
set base http://192.168.10.1/WW/bootstrap
initrd ${base}/x86_64/6/initfs.gz
kernel ${base}/x86_64/6/kernel ro initrd=initfs.gz wwhostname=c0001 net.ifnames=1,biosdevname=1 wwmaster=192.168.10.1 wwipaddr=192.168.10.101 wwnetmask=255.255.255.0 wwnetdev=enp0s3 wwhwaddr=08:00:27:8e:87:8b
boot
/var/log/messages output for the event:

Jun 22 09:21:09 ohpc systemd: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 8079 (find)
Jun 22 09:21:09 ohpc systemd: Mounting Arbitrary Executable File Formats File System...
Jun 22 09:21:09 ohpc systemd: Mounted Arbitrary Executable File Formats File System.
Jun 22 09:23:08 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:08 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:09 ohpc dhcpd: DHCPREQUEST for 192.168.10.101 (192.168.10.1) from 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:09 ohpc dhcpd: DHCPACK on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:09 ohpc in.tftpd[8550]: Error code 0: TFTP Aborted
Jun 22 09:23:09 ohpc in.tftpd[8551]: Client 192.168.10.101 finished /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe
Jun 22 09:23:12 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:12 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:13 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:13 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:15 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:15 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:19 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:19 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3
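For anyone wanting to spot this pattern quickly in a longer log, here is a minimal Python sketch that tallies the dhcpd message types per client MAC. The helper names (`summarize_dhcp`, `unrequested_offers`) are hypothetical and not part of Warewulf or dhcpd; the sample is an excerpt of the log above. The idea that a gap between offers and requests points at a client stuck in its DHCP phase is only an assumption for illustration.

```python
import re
from collections import Counter

# Matches dhcpd lines like the ones quoted above and captures the message
# type and the client MAC address. Hypothetical helper, not a dhcpd tool.
DHCP_RE = re.compile(
    r"dhcpd: (DHCP[A-Z]+) .*?((?:[0-9a-f]{2}:){5}[0-9a-f]{2})"
)

# Excerpt of the /var/log/messages output from this post.
SAMPLE_LOG = """\
Jun 22 09:23:08 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:08 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:09 ohpc dhcpd: DHCPREQUEST for 192.168.10.101 (192.168.10.1) from 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:09 ohpc dhcpd: DHCPACK on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:09 ohpc in.tftpd[8551]: Client 192.168.10.101 finished /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe
Jun 22 09:23:12 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:12 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:13 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3
Jun 22 09:23:13 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3
"""


def summarize_dhcp(lines):
    """Return {mac: Counter of DHCP message types} for the given log lines."""
    summary = {}
    for line in lines:
        match = DHCP_RE.search(line)
        if match:
            msg_type, mac = match.group(1), match.group(2)
            summary.setdefault(mac, Counter())[msg_type] += 1
    return summary


def unrequested_offers(counts):
    """Number of DHCPOFFERs the client never followed up with a DHCPREQUEST."""
    return counts["DHCPOFFER"] - counts["DHCPREQUEST"]


if __name__ == "__main__":
    for mac, counts in summarize_dhcp(SAMPLE_LOG.splitlines()).items():
        print(mac, dict(counts), "unrequested offers:", unrequested_offers(counts))
```

In the excerpt, the first exchange completes and undionly.kpxe is fetched over TFTP, but the later offers are never requested, which matches the "No configuration methods succeeded" message on the console.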
---------------- Mike Hanby mhanby @ uab.edu Systems Analyst II - Enterprise IT Research Computing Services The University of Alabama at Birmingham
On Mon, 25 Jun 2018 at 16:53, Hanby, Mike <mhanby@...> wrote:
Hi,
Is the DHCP server enabled at boot on the master?
cheers,
--renato
Hi,
I have had this error. Did you install this node with stateful provisioning? Could you send the output of this command?
# wwsh provision print
Thanks.
Renato, yes, DHCP is running on the master. The /var/log/messages on the master shows that the client is requesting and receiving the address.
Juan, here's the output (note that the original notes I put in the ticket used c0001 as the compute node; I have since restored the master to a clean snapshot and re-provisioned the cluster to use the default c1, c2, c3, etc. node naming):
# wwsh provision print
#### c1 #######################################################################
c1: BOOTSTRAP        = 3.10.0-862.3.3.el7.x86_64
c1: VNFS             = centos7.5
c1: FILES            = dynamic_hosts,group,munge.key,network,passwd,shadow,slurm.conf
c1: PRESHELL         = FALSE
c1: POSTSHELL        = FALSE
c1: CONSOLE          = UNDEF
c1: PXELINUX         = UNDEF
c1: SELINUX          = DISABLED
c1: KARGS            = "net.ifnames=0 biosdevname=0 quiet"
c1: BOOTLOCAL        = FALSE
As shown above, the compute nodes weren't configured for stateful provisioning. I enabled it as follows and rebooted the node:
export CHROOT=/opt/ohpc/admin/images/centos7.5
# Add GRUB2 bootloader and re-assemble VNFS image
yum -y --installroot=$CHROOT install grub2
wwvnfs --chroot $CHROOT
# Using 'centos7.5' as the VNFS name
# Creating VNFS image from centos7.5
# Compiling hybridization link tree : 0.25 s
# Building file list : 0.84 s
# Compiling and compressing VNFS : 37.10 s
# Adding image to datastore : 33.47 s
# Total elapsed time : 71.66 s
# Select (and customize) appropriate parted layout example
export compute_regex='c[1-9]'
cp /etc/warewulf/filesystem/examples/gpt_example.cmds /etc/warewulf/filesystem/gpt.cmds
wwsh provision set --filesystem=gpt "${compute_regex}"
# SET: FS = select /dev/sda,mklabel gpt,mkpart primary 1MiB 3MiB,mkpart primary ext4 3MiB 513MiB,mkpart primary linux-swap 513MiB 50%,mkpart primary ext4 50% 100%,name 1 grub,name 2 boot,name 3 swap,name 4 root,set 1 bios_grub on,set 2 boot on,mkfs 2 ext4 -L boot,mkfs 3 swap,mkfs 4 ext4 -L root,fstab 4 / ext4 defaults 0 0,fstab 2 /boot ext4 defaults 0 0,fstab 3 swap swap defaults 0 0
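The FS value is hard to read as one long line. As a reading aid only, here is a small Python sketch that splits it into its individual commands and pulls out the fstab mappings. The function names are hypothetical, and the assumption that commas only ever separate commands holds for this layout but may not hold for every Warewulf filesystem file.

```python
# FS value copied from the "SET: FS" output in this post.
FS_VALUE = (
    "select /dev/sda,mklabel gpt,mkpart primary 1MiB 3MiB,"
    "mkpart primary ext4 3MiB 513MiB,mkpart primary linux-swap 513MiB 50%,"
    "mkpart primary ext4 50% 100%,name 1 grub,name 2 boot,name 3 swap,"
    "name 4 root,set 1 bios_grub on,set 2 boot on,mkfs 2 ext4 -L boot,"
    "mkfs 3 swap,mkfs 4 ext4 -L root,fstab 4 / ext4 defaults 0 0,"
    "fstab 2 /boot ext4 defaults 0 0,fstab 3 swap swap defaults 0 0"
)


def split_layout(fs_value):
    """Split the comma-separated FS value into a list of commands.

    Assumes commas are used only as command separators.
    """
    return [cmd.strip() for cmd in fs_value.split(",") if cmd.strip()]


def fstab_entries(commands):
    """Map partition number -> mount point for the 'fstab' commands."""
    entries = {}
    for cmd in commands:
        parts = cmd.split()
        if parts[0] == "fstab":
            # Layout of the command: fstab <partition> <mountpoint> <fstype> ...
            entries[parts[1]] = parts[2]
    return entries


if __name__ == "__main__":
    commands = split_layout(FS_VALUE)
    for number, cmd in enumerate(commands, start=1):
        print(f"{number:2d}. {cmd}")
    print(fstab_entries(commands))
```

Read this way, partition 1 is the small bios_grub partition, partition 2 is /boot, partition 3 is swap, and partition 4 is the root filesystem.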
wwsh provision set --bootloader=sda "${compute_regex}"
# SET: BOOTLOADER = sda
# Boot the node from the local disk
wwsh provision set --bootlocal=normal "${compute_regex}"
# wwsh provision print
#### c1 #######################################################################
c1: BOOTSTRAP        = 3.10.0-862.3.3.el7.x86_64
c1: VNFS             = centos7.5
c1: FILES            = dynamic_hosts,group,munge.key,network,passwd,shadow,slurm.conf
c1: PRESHELL         = FALSE
c1: POSTSHELL        = FALSE
c1: CONSOLE          = UNDEF
c1: PXELINUX         = UNDEF
c1: SELINUX          = DISABLED
c1: KARGS            = "net.ifnames=0 biosdevname=0 quiet"
c1: FS               = "select /dev/sda,mklabel gpt,mkpart primary 1MiB 3MiB,mkpart primary ext4 3MiB 513MiB,mkpart primary linux-swap 513MiB 50%,mkpart primary ext4 50% 100%,name 1 grub,name 2 boot,name 3 swap,name 4 root,set 1 bios_grub on,set 2 boot on,mkfs 2 ext4 -L boot,mkfs 3 swap,mkfs 4 ext4 -L root,fstab 4 / ext4 defaults 0 0,fstab 2 /boot ext4 defaults 0 0,fstab 3 swap swap defaults 0 0"
c1: BOOTLOADER       = sda
c1: BOOTLOCAL        = FALSE
The c1 console now shows after boot:
--- snip ---
Configuring (net0 08:00:27:8e:87:8b)... ok
net0: 192.168.10.101/255.255.255.0 gw 192.168.10.1
Next server: 192.168.10.1
Filename: http://192.168.10.1/WW/ipxe/cfg/08:00:27:8e:87:8b
08:00:27:8e:87:8b : 160 bytes [script]
Set to bootlocal (normal), booting local disk
Booting from SAN device 0x80
Boot from SAN device 0x80 failed: Input/output error (http://ipxe.org/1d852039 )
Could not boot image: Input/output error
No more network devices

FATAL: INT18: BOOT FAILURE
--- snip -----
Here's the output from the master /var/log/messages
Jun 25 11:54:59 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3
Jun 25 11:54:59 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3
Jun 25 11:55:01 ohpc dhcpd: DHCPREQUEST for 192.168.10.101 (192.168.10.1) from 08:00:27:8e:87:8b via enp0s3
Jun 25 11:55:01 ohpc dhcpd: DHCPACK on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3
Jun 25 11:55:01 ohpc xinetd[961]: START: tftp pid=25865 from=192.168.10.101
Jun 25 11:55:01 ohpc in.tftpd[25866]: Error code 0: TFTP Aborted
Jun 25 11:55:01 ohpc in.tftpd[25867]: Client 192.168.10.101 finished /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe
Jun 25 11:55:03 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3
Jun 25 11:55:03 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3
Jun 25 11:55:04 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3
Jun 25 11:55:04 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3
Jun 25 11:55:06 ohpc dhcpd: DHCPREQUEST for 192.168.10.101 (192.168.10.1) from 08:00:27:8e:87:8b via enp0s3
Jun 25 11:55:06 ohpc dhcpd: DHCPACK on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3
---------------- Mike Hanby mhanby @ uab.edu Systems Analyst II - Enterprise IT Research Computing Services The University of Alabama at Birmingham
Hi Mike,
I see that the node is installed on the hard disk. You could try changing the bootlocal setting:
wwsh provision set --bootlocal=UNDEF "c*"
Then reboot the node and, once it has booted, run:
wwsh provision set --bootlocal=normal node
Regards!
jprorama@gmail.com
Did some additional debugging.
The problem seems to relate to this stanza in the dhcpd.conf:

    if exists user-class and option user-class = "iPXE" {
        filename "http://192.168.10.1/WW/ipxe/cfg/${mac}";
    } else {
        if option architecture-type = 00:0B {
            filename "/warewulf/ipxe/bin-arm64-efi/snp.efi";
        } elsif option architecture-type = 00:0A {
            filename "/warewulf/ipxe/bin-arm32-efi/placeholder.efi";
        } elsif option architecture-type = 00:09 {
            filename "/warewulf/ipxe/bin-x86_64-efi/ipxe.efi";
        } elsif option architecture-type = 00:07 {
            filename "/warewulf/ipxe/bin-x86_64-efi/ipxe.efi";
        } elsif option architecture-type = 00:06 {
            filename "/warewulf/ipxe/bin-i386-efi/ipxe.efi";
        } elsif option architecture-type = 00:00 {
            filename "/warewulf/ipxe/bin-i386-pcbios/undionly.kpxe";
        }
    }

When the VirtualBox iPXE runs after a cold start (power cycle), the user-class = iPXE value appears to be set and dhcpd correctly provides the client with filename "http://192.168.10.1/WW/ipxe/cfg/${mac}". We can see it load in the httpd log files, and the client boots successfully.

When the VirtualBox VM is reset (no power cycle), the first conditional test on the user-class option fails. The conditions fall through to the last one:

    } elsif option architecture-type = 00:00 {
        filename "/warewulf/ipxe/bin-i386-pcbios/undionly.kpxe";
    }

Because this is not an HTTP URI, a TFTP request is started for the /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe file. We see that in the journalctl log. The client reads the file but can't boot using it. Oddly, the architecture-type isn't even correct; it should be x86_64.

It seems that some state in the iPXE firmware is stale or invalid when a VirtualBox VM is reset, as opposed to when it is started from the off state. As a result, its DHCP requests don't include all the values the dhcpd server expects, and the boot fails. The simple workaround is to power cycle the VM. It's odd, however, that it behaves this way.
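To make the fall-through easier to follow, here is a rough Python model of how that stanza picks a boot filename. It is only a sketch of the conditional logic quoted above, not dhcpd itself; `select_filename` and `transport` are hypothetical names, and the server address is taken from this thread.

```python
# Filenames per DHCP client architecture type, copied from the stanza above.
ARCH_FILENAMES = {
    0x0B: "/warewulf/ipxe/bin-arm64-efi/snp.efi",
    0x0A: "/warewulf/ipxe/bin-arm32-efi/placeholder.efi",
    0x09: "/warewulf/ipxe/bin-x86_64-efi/ipxe.efi",
    0x07: "/warewulf/ipxe/bin-x86_64-efi/ipxe.efi",
    0x06: "/warewulf/ipxe/bin-i386-efi/ipxe.efi",
    0x00: "/warewulf/ipxe/bin-i386-pcbios/undionly.kpxe",
}


def select_filename(user_class, arch_type, mac, server="192.168.10.1"):
    """Mimic the dhcpd.conf conditional: iPXE clients get the HTTP config URL,
    everything else gets a bootloader chosen by architecture type.

    Returns None when no branch matches, i.e. no filename would be set.
    """
    if user_class == "iPXE":
        return f"http://{server}/WW/ipxe/cfg/{mac}"
    return ARCH_FILENAMES.get(arch_type)


def transport(filename):
    """An http:// filename is fetched over HTTP; a bare path goes via TFTP."""
    return "http" if filename.startswith("http://") else "tftp"


if __name__ == "__main__":
    mac = "08:00:27:8e:87:8b"
    # Cold start: the request carries user-class "iPXE".
    cold = select_filename("iPXE", 0x00, mac)
    # Reset: user-class is missing, so the architecture branch is used.
    warm = select_filename(None, 0x00, mac)
    print("cold start:", transport(cold), cold)
    print("after reset:", transport(warm), warm)
```

On the architecture-type point: as far as I know, RFC 4578 defines client architecture type 0 as a BIOS-based x86 PC, so 00:00 may simply reflect a BIOS-mode PXE boot rather than a 32-bit CPU.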
I was able to resolve this issue by switching the VMs from booting via the VBox NIC to booting from this iPXE ISO: http://boot.ipxe.org/ipxe.iso from https://ipxe.org/download
Thanks for the assist.
Mike
---------------- Mike Hanby mhanby @ uab.edu Systems Analyst II - Enterprise IT Research Computing Services The University of Alabama at Birmingham