[slurm-users] Cluster not booting after upgrade to debian jessie
John Hearns
hearnsj at googlemail.com
Tue Jan 9 05:27:44 MST 2018
Elisabetta, I am not an expert on Debian systems.
I think to solve your problem with the kernels, you need to recreate the
initial ramdisk and make sure it has the modules you need.
So boot the system with kernel 3.2 and then run:
mkinitrd 3.16.0-4-amd64
How was the kernel version 3.16.0-4-amd64 installed?
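On Debian jessie, mkinitrd has been superseded by update-initramfs; a minimal sketch of the equivalent steps (run as root after booting the working 3.2 kernel; the grep pattern just looks for the modules the boot messages complained about):

```shell
# Rebuild the initial ramdisk for the non-booting 3.16 kernel.
update-initramfs -u -k 3.16.0-4-amd64

# Check whether the rebuilt image contains the LVM and USB host-controller
# modules that the boot errors mentioned (dm-mod, ehci-*, ohci-*).
lsinitramfs /boot/initrd.img-3.16.0-4-amd64 | grep -E 'dm-mod|ehci|ohci'
```

If the modules are missing, make sure the lvm2 package and the matching linux-image package are fully installed before rebuilding.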
On 9 January 2018 at 13:16, Elisabetta Falivene <e.falivene at ilabroma.com>
wrote:
> Root file system is on the master. I am able to boot the machine by
> changing the kernel. Grub allows booting from either of two kernels:
>
>
> kernel 3.2.0-4-amd64
>
> kernel 3.16.0-4-amd64
>
>
> The problem is with kernel 3.16; the machine boots correctly with 3.2.
>
>
> Anyway, rebooting with kernel 3.2, slurm (now updated to 14.03.9, was
> 2.3.4) doesn't work anymore and gives this error:
>
>
> First time after reboot launching sinfo:
>
> *sinfo: error: If munged is up, restart with --num-threads=10*
>
> *sinfo: error: Munge encode failed: Failed to access
> /var/run/munge/munge.socket.2: No such file or directory*
>
> *sinfo: error: Authentication: Socket communication error*
>
> *slurm_load_partition: Protocol authentication error*
>
>
> Re-launching sinfo
>
> *slurm_load_jobs error: Unable to contact slurm controller (connect
> failure)*
>
>
> What does it mean?
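The socket errors above usually just mean the munge daemon is not running after the upgrade, so Slurm cannot authenticate. A minimal check, assuming jessie's systemd unit names (on some Debian packages the Slurm service is called slurm-llnl instead of slurmctld):

```shell
# Is munged running, and is its socket present?
systemctl status munge
ls -l /var/run/munge/

# Round-trip a credential through munged; success means auth should work.
munge -n | unmunge

# Then restart munge and the Slurm controller and retry sinfo.
systemctl restart munge slurmctld
```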
>
>
> betta
>
>
> PS: In the kernel 3.16 case, it gives the "gave up waiting" error, and
> *before* that error is thrown there is another one:
>
> "Running scripts/local-block
>
> Unable to find lvm volume"
>
> It keeps trying this several times and then falls back to initramfs
> (even if booted in recovery!).
>
> Moreover, in this situation it seems not to load the USB keyboard, so I'm
> truly unable to do anything.
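For reference, when the initramfs shell is reachable (i.e. a keyboard works), the missing LVM root can often be activated by hand, assuming the device-mapper tools made it into the image; a sketch:

```shell
# From the (initramfs) prompt: scan for volume groups and activate them.
lvm vgscan
lvm vgchange -ay

# If /dev/mapper/system-root now exists, resume the normal boot.
exit
```

If `lvm` itself is not found in the initramfs, that points back at rebuilding the image with the lvm2 package installed.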
>
>
>
> 2018-01-08 12:26 GMT+01:00 Markus Köberl <markus.koeberl at tugraz.at>:
>
>> On Monday, 8 January 2018 11:39:32 CET Elisabetta Falivene wrote:
>> > Here I am again.
>> > In the end, I did the upgrade from debian 7 wheezy to debian 8 jessie in
>> > order to update Slurm and solve some issues with it. It seemed it all
>> > went well. Even the slurm problem seemed solved. Then I rebooted the
>> > machine and the problems began. I can't boot the master anymore; it
>> > returns an error:
>> >
>> > *gave up waiting for root device. Common problems:
>> > - Boot args (cat /proc/cmdline)
>> > - check rootdelay= (did the system wait long enough?)
>> > - check root= (did the system wait for the right device?)
>> > - missing modules (cat /proc/modules; ls /dev)
>> > ALERT! /dev/mapper/system-root does not exist.
>> > Dropping to a shell!*
>> > *modprobe: module ehci-pci not found in modules.dep*
>> >
>> > *modprobe: module ehci-orion not found in modules.dep*
>> >
>> > *modprobe: module ehci-hcd not found in modules.dep*
>> >
>> > *modprobe: module ohci-hcd not found in modules.dep*
>> >
>> > *Busybox v1.22.1 (Debian 1:1.22.0-9+deb8u1) built-in shell (ash)*
>> > *Enter help for a list of built-in commands*
>> >
>> >
>> > */bin/sh: can't access tty; job control turned off*
>> > *(initramfs)*
>> >
>> > Have you ever had this type of problem?
>>
>> Where is your root file system located?
>> If it is on a local disk, check your /etc/fstab.
>> Maybe the device location has changed with the newer kernel?
>>
>>
>> regards
>> Markus Köberl
>> --
>> Markus Koeberl
>> Graz University of Technology
>> Signal Processing and Speech Communication Laboratory
>> E-mail: markus.koeberl at tugraz.at
>>
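Markus's fstab point can be checked with stable identifiers; a sketch (the UUID in the comment is purely illustrative):

```shell
# List filesystem UUIDs so /etc/fstab and the grub root= argument can
# refer to devices in a way that survives kernel or device renaming.
blkid

# Example fstab line using a UUID instead of a /dev/sdX name (illustrative):
# UUID=0a1b2c3d-...  /  ext4  errors=remount-ro  0  1
```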
>
>