[slurm-users] Cluster not booting after upgrade to debian jessie
Elisabetta Falivene
e.falivene at ilabroma.com
Tue Jan 9 05:40:19 MST 2018
Interesting. Going to try!
The new kernel was installed during the upgrade from Debian 7 Wheezy to
Debian 8 Jessie. The upgrade went fine on the 8 nodes of the cluster, but not
on the master. By the way, kernel 3.16 is working fine on the nodes.
Stupid question: is it worth trying to make the new kernel work, in your
opinion? If, in the worst case, I have to keep the 3.2 kernel on the master,
is that so bad?
Elisabetta
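
A note on the ramdisk suggestion quoted below: on Debian the initramfs is
normally rebuilt with update-initramfs (from initramfs-tools) rather than
mkinitrd, and lsinitramfs can list what an image contains. Below is a minimal
sketch, not a tested fix, for checking whether the 3.16 image lacks the pieces
the boot errors point at. The helper name missing_from_listing is hypothetical,
and the module file names (dm-mod.ko, ehci-pci.ko), the sbin/lvm path and the
image path /boot/initrd.img-3.16.0-4-amd64 are assumptions based on Debian
defaults, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: reads an initramfs file listing on stdin (for example
# the output of lsinitramfs) and prints every wanted name that does not
# appear anywhere in that listing.
missing_from_listing() {
    listing=$(cat)
    for want in "$@"; do
        case "$listing" in
            *"$want"*) ;;                  # name appears in the listing
            *) printf '%s\n' "$want" ;;    # name is absent: report it
        esac
    done
}

# Possible use on the master while booted into kernel 3.2, as root.
# Paths and names are assumed Debian defaults:
#   lsinitramfs /boot/initrd.img-3.16.0-4-amd64 \
#       | missing_from_listing dm-mod.ko ehci-pci.ko sbin/lvm
# If anything is reported missing, rebuilding the image for that kernel
# would be the next thing to try:
#   update-initramfs -u -k 3.16.0-4-amd64
```

The two commands are left as comments so the block can be pasted without
touching the system; an empty result from the helper means every name was found.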
2018-01-09 13:27 GMT+01:00 John Hearns <hearnsj at googlemail.com>:
> Elisabetta, I am not an expert on Debian systems.
> I think to solve your problem with the kernels, you need to recreate the
> initial ramdisk and make sure it has the modules you need.
>
> So boot the system in kernel 3.2 and then run:
> mkinitrd 3.16.0-4-amd64
>
>
> How was the kernel version 3.16.0-4-amd64 installed?
>
>
> On 9 January 2018 at 13:16, Elisabetta Falivene <e.falivene at ilabroma.com>
> wrote:
>
>> The root file system is on the master. I am able to boot the machine by
>> changing the kernel. GRUB allows booting from two kernels:
>>
>>
>> kernel 3.2.0-4-amd64
>>
>> kernel 3.16.0-4-amd64
>>
>>
>> The problem is with kernel 3.16; the machine boots correctly with 3.2.
>>
>>
>> Anyway, after rebooting with kernel 3.2, Slurm (now updated to 14.03.9, was
>> 2.3.4) doesn't work anymore and gives these errors:
>>
>>
>> The first time I launch sinfo after the reboot:
>>
>> *sinfo: error: If munged is up, restart with --num-threads=10*
>>
>> *sinfo: error: Munge encode failed: Failed to access
>> "/var/run/munge/munge.socket.2": No such file or directory*
>>
>> *sinfo: error: Authentication: Socket communication error*
>>
>> *slurm_load_partition: Protocol authentication error*
>>
>>
>> Launching sinfo again:
>>
>> *slurm_load_jobs error: Unable to contact slurm controller (connect
>> failure)*
>>
>>
>> What does it mean?
>>
>>
>> betta
>>
>>
>> PS: In the kernel 3.16 case, it gives the "gave up waiting" error, and
>> *before* that error is thrown there is another one:
>>
>> "Running scripts/local-block
>>
>> Unable to find lvm volume"
>>
>> It retries this several times and then drops to the initramfs shell
>> (even when booted in recovery mode!).
>>
>> Moreover, in this situation it does not seem to load the USB keyboard, so
>> I'm truly unable to do anything.
>>
>>
>>
>> 2018-01-08 12:26 GMT+01:00 Markus Köberl <markus.koeberl at tugraz.at>:
>>
>>> On Monday, 8 January 2018 11:39:32 CET Elisabetta Falivene wrote:
>>> > Here I am again.
>>> > In the end, I did the upgrade from Debian 7 Wheezy to Debian 8 Jessie in
>>> > order to update Slurm and solve some issues with it. It seemed it all went
>>> > well. Even the Slurm problem seemed solved. Then I rebooted the machine and
>>> > the problems began. I can't boot the master anymore; it returns an error:
>>> >
>>> > *Gave up waiting for root device. Common problems:*
>>> > *- Boot args (cat /proc/cmdline)*
>>> > *- Check rootdelay= (did the system wait long enough?)*
>>> > *- Check root= (did the system wait for the right device?)*
>>> > *- Missing modules (cat /proc/modules; ls /dev)*
>>> > *ALERT! /dev/mapper/system-root does not exist. Dropping to a shell!*
>>> > *modprobe: module ehci-pci not found in modules.dep*
>>> >
>>> > *modprobe: module ehci-orion not found in modules.dep*
>>> >
>>> > *modprobe: module ehci-hcd not found in modules.dep*
>>> >
>>> > *modprobe: module ohci-hcd not found in modules.dep*
>>> >
>>> > *Busybox v1.22.1 (Debian 1:1.22.0-9+deb8u1) built-in shell (ash)*
>>> > *Enter help for a list of built-in commands*
>>> >
>>> >
>>> > */bin/sh: can't access tty; job control turned off*
>>> > *(initramfs)*
>>> >
>>> > Have you ever had this type of problem?
>>>
>>> Where is your root file system located?
>>> If it is on a local disk check your /etc/fstab
>>> Maybe the device location has changed with the newer kernel?
>>>
>>>
>>> regards
>>> Markus Köberl
>>> --
>>> Markus Koeberl
>>> Graz University of Technology
>>> Signal Processing and Speech Communication Laboratory
>>> E-mail: markus.koeberl at tugraz.at
>>>
>>
>>
>