Nodes Install then fail to boot on successive boots


 

Howdy,

 

I'm testing out OpenHPC in a VirtualBox environment using the instructions for Warewulf + Slurm on CentOS 7.5 x86_64:

https://github.com/openhpc/ohpc/releases/download/v1.3.5.GA/Install_guide-CentOS7-Warewulf-SLURM-1.3.5-x86_64.pdf

 

Any suggestions as to what I may be missing? Let me know if I can provide additional details. Details follow below:

 

The VirtualBox compute nodes PXE boot and install just fine; however, on every boot after that they fail with the following console message:

 

[TXE: 1 x "Network unreachable (http://ipxe.org/28086011 )"]

Configuring (net0 08:00:27:8e:87:8b)................ No configuration methods succeeded (http://ipxe.org/040ee119 )

No more network devices

 

FATAL:  INT18:  BOOT FAILURE

 

 

Here's some information from the server side:

# wwsh node list

NAME                GROUPS              IPADDR              HWADDR

================================================================================

c0001               UNDEF               192.168.10.101      08:00:27:8e:87:8b

 

# wwsh provision list  (<-- is there any way to expand the column width for this output?)

NODE                VNFS            BOOTSTRAP             FILES

================================================================================

c0001               centos7.5       3.10.0-862.3.3.el7... dynamic_hosts,grou...

 

# cat /srv/warewulf/ipxe/cfg/08\:00\:27\:8e\:87\:8b

#!ipxe

# Configuration for Warewulf node: c0001

# Warewulf data store ID: 30

echo Now booting c0001 with Warewulf bootstrap (3.10.0-862.3.3.el7.x86_64)

set base http://192.168.10.1/WW/bootstrap

initrd ${base}/x86_64/6/initfs.gz

kernel ${base}/x86_64/6/kernel ro initrd=initfs.gz wwhostname=c0001 net.ifnames=1,biosdevname=1 wwmaster=192.168.10.1 wwipaddr=192.168.10.101 wwnetmask=255.255.255.0 wwnetdev=enp0s3 wwhwaddr=08:00:27:8e:87:8b

boot

 

/var/log/messages output for the event:

Jun 22 09:21:09 ohpc systemd: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 8079 (find)

Jun 22 09:21:09 ohpc systemd: Mounting Arbitrary Executable File Formats File System...

Jun 22 09:21:09 ohpc systemd: Mounted Arbitrary Executable File Formats File System.

Jun 22 09:23:08 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3

Jun 22 09:23:08 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3

Jun 22 09:23:09 ohpc dhcpd: DHCPREQUEST for 192.168.10.101 (192.168.10.1) from 08:00:27:8e:87:8b via enp0s3

Jun 22 09:23:09 ohpc dhcpd: DHCPACK on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3

Jun 22 09:23:09 ohpc in.tftpd[8550]: Error code 0: TFTP Aborted

Jun 22 09:23:09 ohpc in.tftpd[8551]: Client 192.168.10.101 finished /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe

Jun 22 09:23:12 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3

Jun 22 09:23:12 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3

Jun 22 09:23:13 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3

Jun 22 09:23:13 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3

Jun 22 09:23:15 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3

Jun 22 09:23:15 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3

Jun 22 09:23:19 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3

Jun 22 09:23:19 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3

 

---------------- 

Mike Hanby

mhanby @ uab.edu

Systems Analyst II - Enterprise

IT Research Computing Services

The University of Alabama at Birmingham

 

 


 

On Mon, 25 Jun 2018 at 16:53, Hanby, Mike <mhanby@...> wrote:

Howdy,

 

I'm testing out OpenHPC in a VirtualBox environment using the instructions for Warewulf + Slurm on CentOS 7.5 x86_64:

https://github.com/openhpc/ohpc/releases/download/v1.3.5.GA/Install_guide-CentOS7-Warewulf-SLURM-1.3.5-x86_64.pdf

 

Any suggestions as to what I may be missing?


Hi,

Is the DHCP server enabled at boot on the master?

cheers,
--renato


 

Hi,

 I have seen this error before. Did you install this node with stateful provisioning?
Could you send the output of this command?

# wwsh provision print

Thanks.


 

Renato, yes, DHCP is running on the master. /var/log/messages on the master shows the client requesting and receiving its address.
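
To check that at a glance, the dhcpd lines for one MAC address can be tallied by message type. This is only a sketch: dhcp_summary is a hypothetical helper name, and the field positions assume the syslog layout seen in the excerpts in this thread (month, day, time, host, then "dhcpd:").

```shell
#!/bin/sh
# Hypothetical helper: count dhcpd message types for one MAC address.
# Reads syslog-style lines on stdin; $1 is the MAC address to look for.
dhcp_summary() {
    awk -v mac="$1" '
        # Field 5 is the program tag, field 6 the DHCP message type.
        index($0, mac) && $5 == "dhcpd:" { count[$6]++ }
        END {
            printf "DISCOVER=%d OFFER=%d REQUEST=%d ACK=%d\n",
                count["DHCPDISCOVER"], count["DHCPOFFER"],
                count["DHCPREQUEST"], count["DHCPACK"]
        }'
}

# Usage (on the master):
#   dhcp_summary 08:00:27:8e:87:8b < /var/log/messages
```

Many DISCOVER/OFFER pairs with few REQUEST/ACK lines would suggest that the client keeps asking but does not accept the offers, which may be worth comparing against the console output.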

 

Juan, here's the output. (Note that my original post used c0001 as the compute node. I have since restored the master to a clean snapshot and re-provisioned the cluster with the default c1, c2, c3, ... node naming.)

 

# wwsh provision print

#### c1 #######################################################################

             c1: BOOTSTRAP        = 3.10.0-862.3.3.el7.x86_64

             c1: VNFS             = centos7.5

             c1: FILES            = dynamic_hosts,group,munge.key,network,passwd,shadow,slurm.conf

             c1: PRESHELL         = FALSE

             c1: POSTSHELL        = FALSE

             c1: CONSOLE          = UNDEF

             c1: PXELINUX         = UNDEF

             c1: SELINUX          = DISABLED

             c1: KARGS            = "net.ifnames=0 biosdevname=0 quiet"

             c1: BOOTLOCAL        = FALSE

 

As shown above, the compute nodes weren't configured for stateful provisioning. I enabled it as follows and rebooted the node:

 

export CHROOT=/opt/ohpc/admin/images/centos7.5

 

# Add GRUB2 bootloader and re-assemble VNFS image

yum -y --installroot=$CHROOT install grub2

wwvnfs --chroot $CHROOT

 

# Using 'centos7.5' as the VNFS name

# Creating VNFS image from centos7.5

# Compiling hybridization link tree                           : 0.25 s

# Building file list                                          : 0.84 s

# Compiling and compressing VNFS                              : 37.10 s

# Adding image to datastore                                   : 33.47 s

# Total elapsed time                                          : 71.66 s

 

# Select (and customize) appropriate parted layout example

export compute_regex='c[1-9]'

cp /etc/warewulf/filesystem/examples/gpt_example.cmds /etc/warewulf/filesystem/gpt.cmds

 

wwsh provision set --filesystem=gpt "${compute_regex}"

 

#        SET: FS                   = select /dev/sda,mklabel gpt,mkpart primary 1MiB 3MiB,mkpart primary ext4 3MiB 513MiB,mkpart primary linux-swap 513MiB 50%,mkpart primary ext4 50% 100%,name 1 grub,name 2 boot,name 3 swap,name 4 root,set 1 bios_grub on,set 2 boot on,mkfs 2 ext4 -L boot,mkfs 3 swap,mkfs 4 ext4 -L root,fstab 4 / ext4 defaults 0 0,fstab 2 /boot ext4 defaults 0 0,fstab 3 swap swap defaults 0 0

 

wwsh provision set --bootloader=sda "${compute_regex}"

 

#        SET: BOOTLOADER           = sda

 

# Boot the node from the local disk

wwsh provision set --bootlocal=normal "${compute_regex}"

 

# wwsh provision print

#### c1 #######################################################################

             c1: BOOTSTRAP        = 3.10.0-862.3.3.el7.x86_64

             c1: VNFS             = centos7.5

             c1: FILES            = dynamic_hosts,group,munge.key,network,passwd,shadow,slurm.conf

             c1: PRESHELL         = FALSE

             c1: POSTSHELL        = FALSE

             c1: CONSOLE          = UNDEF

             c1: PXELINUX         = UNDEF

             c1: SELINUX          = DISABLED

             c1: KARGS            = "net.ifnames=0 biosdevname=0 quiet"

             c1: FS               = "select /dev/sda,mklabel gpt,mkpart primary 1MiB 3MiB,mkpart primary ext4 3MiB 513MiB,mkpart primary linux-swap 513MiB 50%,mkpart primary ext4 50% 100%,name 1 grub,name 2 boot,name 3 swap,name 4 root,set 1 bios_grub on,set 2 boot on,mkfs 2 ext4 -L boot,mkfs 3 swap,mkfs 4 ext4 -L root,fstab 4 / ext4 defaults 0 0,fstab 2 /boot ext4 defaults 0 0,fstab 3 swap swap defaults 0 0"

             c1: BOOTLOADER       = sda

             c1: BOOTLOCAL        = FALSE

 

The c1 console now shows after boot:

 

--- snip ---

Configuring (net0 08:00:27:8e:87:8b)... ok

net0: 192.168.10.101/255.255.255.0 gw 192.168.10.1

Next server: 192.168.10.1

Filename: http://192.168.10.1/WW/ipxe/cfg/08:00:27:8e:87:8b

08:00:27:8e:87:8b : 160 bytes [script]

Set to bootlocal (normal), booting local disk

Booting from SAN device 0x80

Boot from SAN device 0x80 failed: Input/output error (http://ipxe.org/1d852039 )

Could not boot image: Input/output error

No more network devices

 

FATAL:  INT18:  BOOT FAILURE

---  snip -----

 

Here's the output from the master /var/log/messages

 

Jun 25 11:54:59 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3

Jun 25 11:54:59 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3

Jun 25 11:55:01 ohpc dhcpd: DHCPREQUEST for 192.168.10.101 (192.168.10.1) from 08:00:27:8e:87:8b via enp0s3

Jun 25 11:55:01 ohpc dhcpd: DHCPACK on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3

Jun 25 11:55:01 ohpc xinetd[961]: START: tftp pid=25865 from=192.168.10.101

Jun 25 11:55:01 ohpc in.tftpd[25866]: Error code 0: TFTP Aborted

Jun 25 11:55:01 ohpc in.tftpd[25867]: Client 192.168.10.101 finished /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe

Jun 25 11:55:03 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3

Jun 25 11:55:03 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3

Jun 25 11:55:04 ohpc dhcpd: DHCPDISCOVER from 08:00:27:8e:87:8b via enp0s3

Jun 25 11:55:04 ohpc dhcpd: DHCPOFFER on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3

Jun 25 11:55:06 ohpc dhcpd: DHCPREQUEST for 192.168.10.101 (192.168.10.1) from 08:00:27:8e:87:8b via enp0s3

Jun 25 11:55:06 ohpc dhcpd: DHCPACK on 192.168.10.101 to 08:00:27:8e:87:8b via enp0s3

 

---------------- 

Mike Hanby


 

 

 

From: <OpenHPC-users@groups.io> on behalf of Juan Ignacio Sánchez Morales <juani.sanchez.morales@...>
Reply-To: "OpenHPC-users@groups.io" <OpenHPC-users@groups.io>
Date: Monday, June 25, 2018 at 12:38 PM
To: "OpenHPC-users@groups.io" <OpenHPC-users@groups.io>
Subject: Re: [openhpc-users] Nodes Install then fail to boot on successive boots

 

Hi,

 

 I have seen this error before. Did you install this node with stateful provisioning?

Could you send the output of this command?

 

# wwsh provision print

 

Thanks.

 


 

Hi Mike,

I see that the node is installed on the hard disk.
 You could try changing the bootlocal setting:

wwsh provision set --bootlocal=UNDEF "c*"

Then reboot the node, and once it has booted, run:

wwsh provision set --bootlocal=normal node


Regards!


jprorama@gmail.com
 

Did some additional debugging.

The problem seems to relate to this stanza in the dhcpd.conf:

if exists user-class and option user-class = "iPXE" {
    filename "http://192.168.10.1/WW/ipxe/cfg/${mac}";
} else {
    if option architecture-type = 00:0B {
        filename "/warewulf/ipxe/bin-arm64-efi/snp.efi";
    } elsif option architecture-type = 00:0A {
        filename "/warewulf/ipxe/bin-arm32-efi/placeholder.efi";
    } elsif option architecture-type = 00:09 {
        filename "/warewulf/ipxe/bin-x86_64-efi/ipxe.efi";
    } elsif option architecture-type = 00:07 {
        filename "/warewulf/ipxe/bin-x86_64-efi/ipxe.efi";
    } elsif option architecture-type = 00:06 {
        filename "/warewulf/ipxe/bin-i386-efi/ipxe.efi";
    } elsif option architecture-type = 00:00 {
        filename "/warewulf/ipxe/bin-i386-pcbios/undionly.kpxe";
    }
}


When the VirtualBox iPXE runs after a cold start (power cycle), the user-class = iPXE value appears to be set, and dhcpd correctly gives the client the filename "http://192.168.10.1/WW/ipxe/cfg/${mac}". We can see it load in the httpd log files, and the client boots successfully.

When the VirtualBox VM is reset (no power cycle), the first conditional test on user-class fails, and the conditions fall through to the last one:

    } elsif option architecture-type = 00:00 {
        filename "/warewulf/ipxe/bin-i386-pcbios/undionly.kpxe";
    }

Because this is not an HTTP URI, a TFTP request is started for the /warewulf/ipxe/bin-i386-pcbios/undionly.kpxe file. We see that in the journalctl log. The client reads the file but can't boot using it. Oddly, the architecture-type isn't even correct; it should be x86_64.

It seems that some state in the iPXE firmware is stale or invalid when a VirtualBox VM is reset rather than started from the off state. This causes its DHCP requests to omit values that the dhcpd server expects, and hence the boot fails.
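
The two paths can be traced with a small shell function that mirrors the stanza's decision logic. This is an illustrative sketch only: pick_boot_filename is a hypothetical name, dhcpd itself does not run shell, and the MAC address is substituted here for the ${mac} placeholder.

```shell
#!/bin/sh
# Hypothetical helper mirroring the dhcpd.conf filename selection.
# $1 = user-class sent by the client (may be empty)
# $2 = architecture-type option, $3 = client MAC address
pick_boot_filename() {
    user_class=$1 arch=$2 mac=$3
    if [ "$user_class" = "iPXE" ]; then
        # Client already runs iPXE: hand it the per-node script over HTTP.
        echo "http://192.168.10.1/WW/ipxe/cfg/${mac}"
        return 0
    fi
    # Otherwise pick a boot binary by architecture; it is fetched over TFTP.
    case "$arch" in
        00:0B) echo "/warewulf/ipxe/bin-arm64-efi/snp.efi" ;;
        00:0A) echo "/warewulf/ipxe/bin-arm32-efi/placeholder.efi" ;;
        00:09|00:07) echo "/warewulf/ipxe/bin-x86_64-efi/ipxe.efi" ;;
        00:06) echo "/warewulf/ipxe/bin-i386-efi/ipxe.efi" ;;
        00:00) echo "/warewulf/ipxe/bin-i386-pcbios/undionly.kpxe" ;;
        *) return 1 ;;   # no filename is set for other architectures
    esac
}

# Cold start: user-class is present, so the HTTP script URL is returned.
pick_boot_filename "iPXE" "00:00" "08:00:27:8e:87:8b"
# Reset: user-class is missing, so the TFTP binary is returned instead.
pick_boot_filename "" "00:00" "08:00:27:8e:87:8b"
```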

The simple workaround is to power cycle the VM.

It's odd, however, that it behaves this way.
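
For anyone scripting the power cycle from the VirtualBox host, a minimal sketch follows. It assumes the VM is registered under the name passed in (for example c1) and that a headless start is acceptable; power_cycle_vm, VBOX and SETTLE_SECONDS are hypothetical names introduced here.

```shell
#!/bin/sh
# Cold-start a VirtualBox VM instead of resetting it.
# VBOX can be overridden (e.g. VBOX=echo) to preview the commands.
power_cycle_vm() {
    vm=$1
    # Hard power-off, equivalent to pulling the plug on the VM.
    ${VBOX:-VBoxManage} controlvm "$vm" poweroff || return 1
    # Give the VM session a moment to close before starting again.
    sleep "${SETTLE_SECONDS:-2}"
    ${VBOX:-VBoxManage} startvm "$vm" --type headless
}

# Usage (on the VirtualBox host):
#   power_cycle_vm c1
```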


 

I was able to resolve this issue by changing the VMs from booting using the VBox NIC to using this ISO: http://boot.ipxe.org/ipxe.iso from https://ipxe.org/download
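
For reference, the same change can probably be made from the command line. This is a sketch under assumptions, not the exact steps used here: the storage controller name "IDE", the port/device numbers, and the helper name boot_from_ipxe_iso are all hypothetical, and the VM should be powered off first.

```shell
#!/bin/sh
# Boot a VirtualBox VM from an iPXE ISO instead of the built-in NIC boot ROM.
# Assumptions: the VM has a storage controller named "IDE" with a free
# port 0 / device 0, and the ISO was already downloaded, for example with:
#   curl -fLo ipxe.iso http://boot.ipxe.org/ipxe.iso
# VBOX can be overridden (e.g. VBOX=echo) to preview the commands.
boot_from_ipxe_iso() {
    vm=$1 iso=$2
    # Attach the ISO as a virtual DVD drive.
    ${VBOX:-VBoxManage} storageattach "$vm" --storagectl "${STORAGECTL:-IDE}" \
        --port 0 --device 0 --type dvddrive --medium "$iso" || return 1
    # Make the DVD drive the first boot device.
    ${VBOX:-VBoxManage} modifyvm "$vm" --boot1 dvd
}

# Usage (on the VirtualBox host):
#   boot_from_ipxe_iso c1 ipxe.iso
```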

 

Thanks for the assist.

 

Mike

 


 

 

 

From: <OpenHPC-users@groups.io> on behalf of Juan Ignacio Sánchez Morales <juani.sanchez.morales@...>
Reply-To: "OpenHPC-users@groups.io" <OpenHPC-users@groups.io>
Date: Tuesday, June 26, 2018 at 2:45 AM
To: "OpenHPC-users@groups.io" <OpenHPC-users@groups.io>
Subject: Re: [openhpc-users] Nodes Install then fail to boot on successive boots

 

Hi Mike,

 

I see that the node is installed on the hard disk.

 You could try changing the bootlocal setting:

wwsh provision set --bootlocal=UNDEF "c*"

Then reboot the node, and once it has booted, run:

wwsh provision set --bootlocal=normal node

 

 

Regards!