[slurm-users] Slurmd not starting

Antony Cleave antony.cleave at gmail.com
Wed Feb 13 14:12:21 UTC 2019


There is a very strong likelihood that you have configured
SlurmdUser=slurm and that one of the following is true:
1) there is no /var/spool/slurmd folder
2) the /var/spool/slurmd folder exists but is owned by root

Make sure the folder exists and is owned by whatever SlurmdUser is set to.
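
The two cases above can be told apart with a quick check. This is only a
rough sketch: check_spool_dir is a made-up helper name (not a Slurm
command), and the owner is read from the third field of `ls -ld`. It prints
a suggested fix but changes nothing itself, since mkdir/chown under
/var/spool normally need root.

```shell
#!/bin/sh
# check_spool_dir DIR USER: report whether DIR exists and is owned by USER.
# NOTE: check_spool_dir is a hypothetical helper, not part of Slurm.
check_spool_dir() {
    dir=$1
    user=$2
    # Case 1: the spool directory does not exist at all.
    if [ ! -d "$dir" ]; then
        echo "missing: $dir (as root: mkdir -p $dir && chown $user $dir)"
        return 1
    fi
    # Case 2: it exists but belongs to someone else (for example root).
    # The third field of `ls -ld` is the owning user.
    owner=$(ls -ld "$dir" | awk '{print $3}')
    if [ "$owner" != "$user" ]; then
        echo "wrong owner: $dir is owned by $owner, expected $user (as root: chown $user $dir)"
        return 2
    fi
    echo "ok: $dir is owned by $user"
}
```

With the values from this thread it would be called as
`check_spool_dir /var/spool/slurmd slurm`.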

Or change your SlurmdUser to run as root. That may not be acceptable to you
for security reasons, but if you were to change it, "doing cool stuff" in
prologs and epilogs becomes easier, as you can avoid complex passwordless
sudo configs on all nodes.
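
Before changing anything it can help to confirm what SlurmdUser is
currently set to. A minimal sketch, assuming the setting lives in a single
slurm.conf with no Include directives; slurmd_user is a made-up helper name
(not a Slurm command), and it falls back to root because that is the
documented default when SlurmdUser is not set.

```shell
#!/bin/sh
# slurmd_user CONF: print the SlurmdUser value from CONF, or "root" if unset.
# NOTE: slurmd_user is a hypothetical helper, not part of Slurm.
slurmd_user() {
    conf=$1
    # Match only uncommented SlurmdUser= lines (not SlurmUser=), ignoring
    # case, and keep the last one; strip trailing comments and whitespace.
    value=$(awk -F= '
        tolower($1) ~ /^[ \t]*slurmduser[ \t]*$/ {
            v = $2
            sub(/^[ \t]+/, "", v)
            sub(/[ \t#].*$/, "", v)
        }
        END { print v }
    ' "$conf")
    # Empty result means the key was not set, so report the default.
    echo "${value:-root}"
}
```

For example `slurmd_user /etc/slurm-llnl/slurm.conf`, where the path is a
guess based on the slurm-llnl directories in the quoted config. On a
cluster where slurmctld is up, `scontrol show config | grep SlurmdUser`
should report the same setting.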

Antony

On Wed, 13 Feb 2019 at 14:00, Nathalie Gocht <nathalie.gocht at outlook.com>
wrote:

> Hey,
>
>
>
> I am building up a one-node cluster. Master and node are on the same
> machine. My slurm.conf:
>
>
>
> ControlMachine=bayes
>
> #
>
> MpiDefault=none
>
> ProctrackType=proctrack/pgid
>
> ReturnToService=1
>
> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
>
> SlurmctldPort=6817
>
> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
>
> SlurmdPort=6818
>
> SlurmdSpoolDir=/var/spool/slurmd
>
> SlurmUser=slurm
>
> StateSaveLocation=/var/spool/slurmctld
>
> SwitchType=switch/none
>
> TaskPlugin=task/none
>
> #
>
> #
>
> # TIMERS
>
> InactiveLimit=0
>
> KillWait=30
>
> MinJobAge=300
>
> SlurmctldTimeout=120
>
> SlurmdTimeout=300
>
> Waittime=0
>
> #
>
> #
>
> # SCHEDULING
>
> FastSchedule=1
>
> SchedulerType=sched/builtin
>
> SelectType=select/linear
>
> #
>
> #
>
> # LOGGING AND ACCOUNTING
>
> AccountingStorageLoc=/var/log/slurm-llnl/job_accounting
>
> AccountingStorageType=accounting_storage/filetxt
>
> AccountingStoreJobComment=YES
>
> ClusterName=bayes
>
> JobCompLoc=/var/log/slurm-llnl/job_completion
>
> JobCompType=jobcomp/filetxt
>
> JobAcctGatherFrequency=60
>
> JobAcctGatherType=jobacct_gather/linux
>
> SlurmctldDebug=info
>
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
>
> SlurmdDebug=info
>
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
>
>
>
> # COMPUTE NODES
>
> GresTypes=gpu
>
>
>
> NodeName=bayes Gres=gpu:tesla:1 CPUs=48 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
>
> PartitionName=long Nodes=bayes Default=YES MaxTime=INFINITE State=UP
>
>
>
>
>
> I started the control daemon, but get this information:
>
> $ systemctl status slurmctld.service
>
> ● slurmctld.service - Slurm controller daemon
>
>    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor
> preset: enabled)
>
>    Active: failed (Result: exit-code) since Wed 2019-02-13 14:43:02 CET;
> 7min ago
>
>      Docs: man:slurmctld(8)
>
>   Process: 40552 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
> (code=exited, status=0/SUCCE
>
> Main PID: 40560 (code=exited, status=1/FAILURE)
>
>
>
> $ sinfo
>
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>
> long*        up   infinite      1   idle bayes
>
>
>
> I tried to start the slurm daemon, but it times out. slurmd -Dvvv
> gives:
>
>
>
> slurmd: error: chmod(/var/spool/slurmd, 0755): Operation not permitted
>
> slurmd: error: Unable to initialize slurmd spooldir
>
> slurmd: error: slurmd initialization failed
>
>
>
> Does someone know what's going on?
>
>
>
> Thanks,
>
> Nathalie
>