[slurm-users] Slurmd not starting

Nathalie Gocht nathalie.gocht at outlook.com
Wed Feb 13 14:37:04 UTC 2019


Both folders exist under /var/spool/:

drwxr-xr-x  4 slurm  slurm 4096 Feb 13 15:34 slurmctld
drwxr-xr-x  2 slurm  slurm 4096 Feb 13 14:05 slurmd

Thank you for the tip. I am thinking about setting the Slurm user to root.

From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Antony Cleave
Sent: Wednesday, 13 February 2019 15:12
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Slurmd not starting

There is a very strong likelihood that you have configured SlurmdUser=slurm and one of the following is true:
1) there is no /var/spool/slurmd folder
2) the /var/spool/slurmd folder exists but is owned by root

Make sure the folder exists and is owned by whatever SlurmdUser is set to.
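
The check described above can be sketched in shell. This is only an illustration: `check_spool_owner` is a hypothetical helper name, not a Slurm tool, and it assumes GNU coreutils `stat` (for `-c %U`). The directory and user in the example call are taken from this thread and may differ on your system.

```shell
#!/bin/sh
# Hypothetical helper: report whether DIR exists and is owned by USER.
# Returns 0 if ok, 1 if the directory is missing, 2 if the owner differs.
check_spool_owner() {
    dir=$1
    user=$2
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # GNU stat: %U prints the owner's user name.
    owner=$(stat -c %U "$dir")
    if [ "$owner" != "$user" ]; then
        echo "wrong owner: $dir is owned by $owner, expected $user"
        return 2
    fi
    echo "ok: $dir is owned by $user"
    return 0
}

# Example call (path and user as used in this thread):
# check_spool_owner /var/spool/slurmd slurm
```

If the check fails, creating the directory with `mkdir -p` and fixing ownership with `chown` (both run as root) would be the usual remedy. Note that chmod on a directory generally fails with "Operation not permitted" when the calling process is neither the owner nor root, so it also matters which user slurmd is actually started as.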

Alternatively, change your SlurmdUser so that slurmd runs as root. That may not be acceptable to you for security reasons, but it makes "doing cool stuff" in prologs and epilogs easier, since you can avoid complex passwordless sudo configs on all nodes.

Antony

On Wed, 13 Feb 2019 at 14:00, Nathalie Gocht <nathalie.gocht at outlook.com> wrote:
Hey,

I am setting up a one-node cluster. Master and node are on the same machine. My slurm.conf:

ControlMachine=bayes
#
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/builtin
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageLoc=/var/log/slurm-llnl/job_accounting
AccountingStorageType=accounting_storage/filetxt
AccountingStoreJobComment=YES
ClusterName=bayes
JobCompLoc=/var/log/slurm-llnl/job_completion
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=60
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log

# COMPUTE NODES
GresTypes=gpu

NodeName=bayes Gres=gpu:tesla:1 CPUs=48 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
PartitionName=long Nodes=bayes Default=YES MaxTime=INFINITE State=UP
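
A side note on the configuration above: it sets SlurmUser (the account slurmctld runs as) but shows no SlurmdUser line, and as far as I know SlurmdUser defaults to root when it is not set. The sketch below shows one way to read such keys out of a slurm.conf-style file. `conf_value` is a hypothetical helper, not part of Slurm; it matches keys case-sensitively and ignores commented-out lines, which is a simplification of how Slurm itself parses the file.

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY from a slurm.conf-style FILE.
# Lines starting with '#' are skipped because the key must start the line.
# Prints nothing if KEY is not set; the last occurrence wins.
conf_value() {
    key=$1
    file=$2
    sed -n "s/^[[:space:]]*${key}=//p" "$file" | tail -n 1
}

# Example calls (the file path is an assumption; use your own slurm.conf):
# conf_value SlurmUser      /etc/slurm-llnl/slurm.conf
# conf_value SlurmdUser     /etc/slurm-llnl/slurm.conf
# conf_value SlurmdSpoolDir /etc/slurm-llnl/slurm.conf
```

An empty result for SlurmdUser would mean the key is not set in the file and the built-in default applies.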


I started the control daemon, but get this information:
$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2019-02-13 14:43:02 CET; 7min ago
     Docs: man:slurmctld(8)
  Process: 40552 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCE
Main PID: 40560 (code=exited, status=1/FAILURE)

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
long*        up   infinite      1   idle bayes

I tried to start the slurm daemon, but the start times out. slurmd -Dvvv gives:

slurmd: error: chmod(/var/spool/slurmd, 0755): Operation not permitted
slurmd: error: Unable to initialize slurmd spooldir
slurmd: error: slurmd initialization failed

Does someone know what's going on?

Thanks,
Nathalie