[slurm-users] Cluster not booting after upgrade to debian jessie

Elisabetta Falivene e.falivene at ilabroma.com
Tue Jan 9 05:16:12 MST 2018


The root file system is on the master. I am able to boot the machine by
changing the kernel. GRUB offers two kernels to boot from:


kernel 3.2.0-4-amd64

kernel 3.16.0-4-amd64


Kernel 3.16 fails to boot, while kernel 3.2 boots correctly.


Anyway, after rebooting with kernel 3.2, Slurm (now updated to 14.03.9, was
2.3.4) doesn't work anymore and gives these errors:


The first time I launch sinfo after the reboot:

*sinfo: error: If munged is up, restart with --num-threads=10*

*sinfo: error: Munge encode failed: Failed to access
"/var/run/munge/munge.socket.2": No such file or directory*

*sinfo: error: Authentication: Socket communication error*

*slurm_load_partition: Protocol authentication error*


Re-launching sinfo:

*slurm_load_jobs error: Unable to contact slurm controller (connect
failure)*


What does it mean?
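
The first error complains that the MUNGE socket file does not exist, so one
thing that can be checked is whether any socket is present in that directory
at all. A minimal sketch, assuming Python is available on the node and that
/var/run/munge (taken from the error message) is the directory munged uses
here; the helper name find_sockets is made up for illustration:

```python
import os
import stat


def find_sockets(directory):
    """Return the paths of all Unix socket files directly inside `directory`."""
    try:
        names = sorted(os.listdir(directory))
    except OSError:
        # Directory missing or unreadable: nothing to report.
        return []
    found = []
    for name in names:
        path = os.path.join(directory, name)
        try:
            mode = os.lstat(path).st_mode
        except OSError:
            continue
        if stat.S_ISSOCK(mode):
            found.append(path)
    return found


if __name__ == "__main__":
    # Assumption: directory taken from the sinfo error message.
    sockets = find_sockets("/var/run/munge")
    if sockets:
        print("socket(s) present:", ", ".join(sockets))
    else:
        print("no socket found; munged is probably not running")
```

If no socket shows up, the likely reading is that munged did not start after
the reboot, which would also explain why the Slurm commands cannot
authenticate.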


betta


PS: With kernel 3.16, boot ends with the "gave up waiting" error, and
*before* that error is thrown another one appears:

"Running scripts/local-block

Unable to find lvm volume"

It retries this several times and then drops to the initramfs shell
(even when booted in recovery mode!).

Moreover, in this situation the USB keyboard does not seem to be loaded, so
I'm truly unable to do anything.
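
Since the "gave up waiting" message points at the boot arguments, the root=
and rootdelay= values can be compared between the two kernels. A minimal
sketch, assuming it is run under the working 3.2 kernel (there is no Python
in the initramfs shell); parse_cmdline is a made-up helper, and it splits on
whitespace only, so quoted values containing spaces are not handled:

```python
import os


def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (such as 'ro' or 'quiet') map to None.
    """
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


if __name__ == "__main__":
    path = "/proc/cmdline"
    if os.path.exists(path):
        with open(path) as handle:
            params = parse_cmdline(handle.read())
        print("root      =", params.get("root"))
        print("rootdelay =", params.get("rootdelay"))
    else:
        print(path, "not found")
```

The same function can be applied to the "linux" line of the 3.16 GRUB entry
to see whether it names the same root device.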



2018-01-08 12:26 GMT+01:00 Markus Köberl <markus.koeberl at tugraz.at>:

> On Monday, 8 January 2018 11:39:32 CET Elisabetta Falivene wrote:
> > Here I am again.
> > In the end, I did the upgrade from Debian 7 wheezy to Debian 8 jessie in
> > order to update Slurm and solve some issues with it. It all seemed to go
> > well. Even the Slurm problem seemed solved. Then I rebooted the machine and
> > the problems began. I can't boot the master anymore; it returns this error:
> >
> > *Gave up waiting for root device. Common problems:*
> > *- Boot args (cat /proc/cmdline)*
> > *- check rootdelay= (did the system wait long enough?)*
> > *- check root= (did the system wait for the right device?)*
> > *- missing modules (cat /proc/modules; ls /dev)*
> > *ALERT! /dev/mapper/system-root does not exist. Dropping to a shell!*
> > *modprobe: module ehci-pci not found in modules.dep*
> >
> > *modprobe: module ehci-orion not found in modules.dep*
> >
> > *modprobe: module ehci-hcd not found in modules.dep*
> >
> > *modprobe: module ohci-hcd not found in modules.dep*
> >
> > *Busybox v1.22.1 (Debian 1:1.22.0-9+deb8u1) built-in shell (ash)*
> > *Enter help for a list of built-in commands*
> >
> >
> > */bin/sh: can't access tty; job control turned off*
> > *(initramfs)*
> >
> > Have you ever had this type of problem?
>
> Where is your root file system located?
> If it is on a local disk check your /etc/fstab
> Maybe the device location has changed with the newer kernel?
>
>
> regards
> Markus Köberl
> --
> Markus Koeberl
> Graz University of Technology
> Signal Processing and Speech Communication Laboratory
> E-mail: markus.koeberl at tugraz.at
>
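
Following the /etc/fstab suggestion above, the device configured for / can be
pulled out of the file and compared with the /dev/mapper/system-root name in
the boot error. A minimal sketch, assuming the standard whitespace-separated
fstab layout; root_entry is a made-up helper name:

```python
import os


def root_entry(fstab_text):
    """Return the device field of the fstab line mounted on '/', or None."""
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            # Skip blank lines and comments.
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "/":
            return fields[0]
    return None


if __name__ == "__main__":
    path = "/etc/fstab"
    if os.path.exists(path):
        with open(path) as handle:
            print("root device in fstab:", root_entry(handle.read()))
    else:
        print(path, "not found")
```

If the result is a plain /dev/sdX name rather than a /dev/mapper path or a
UUID, a changed device order under the newer kernel would be one possible
explanation.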