<div dir="ltr">Hi there,<br><br>First of all, apologies for the rather verbose email.<br><br>Newbie here, wanting to set up a minimal slurm cluster on Debian 12.  I installed slurm-wlm (22.05.8) on the head node and slurmd (also 22.05.8) on the compute node via apt. I have one head, one compute node, and one partition.<br><br><div>I have written the simplest of jobs (slurm_hello_world.sh):</div><div><br></div><span style="font-family:monospace">#!/bin/env bash<br>#SBATCH --job-name=hello_word    # Job name<br>#SBATCH --output=hello_world_%j.log   # Standard output and error log<br><br>echo "Hello world, I am running on node $HOSTNAME"<br>sleep 5<br>date</span><br><br>Which I try to submit via sbatch slurm_hello_world.sh.<br><br><span style="font-family:monospace">$ squeue --long -u $USER<br>Tue Nov 07 08:37:58 2023<br>             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)<br>                 7 all_nodes hello_wo  myuser  PENDING       0:00 UNLIMITED      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)<br>                 9 all_nodes hello_wo  myuser  PENDING       0:00 UNLIMITED      1 (ReqNodeNotAvail, UnavailableNodes:compute-0)</span><br><br>sinfo shows that the node is drained (but this node is idle and has no processing)<br><br><span style="font-family:monospace">$ sinfo --Node --long<br>Tue Nov 07 08:29:51 2023<br>NODELIST   NODES  PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              <br>compute-0        1 all_nodes*     drained 32      2:8:2  60000        0      1   (null) batch job complete f</span><br><br><br>The slurm.conf (exact copy on head and compute nodes) is (mostly commented out stuff)<br><br><span style="font-family:monospace">#<br># Example slurm.conf file. 
The slurm.conf below is an exact copy of the file on both the head and compute nodes; it is mostly the commented-out defaults from configurator.html:

#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=mycluster
SlurmctldHost=head
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=debug3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=debug3
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=compute-0 RealMemory=60000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=all_nodes Nodes=ALL Default=YES MaxTime=INFINITE State=UP
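One thing I have not ruled out is a mismatch between the NodeName line above and the actual hardware. My understanding is that running slurmd -C on the compute node prints the configuration slurmd detects, and that the Sockets/CoresPerSocket/ThreadsPerCore/RealMemory values in slurm.conf should not exceed what it reports; I also assume that after editing slurm.conf on both nodes I need to restart the daemons or run scontrol reconfigure (the unit names below are the Debian defaults, as far as I know):

$ slurmd -C                          # on compute-0: print the detected hardware
$ sudo systemctl restart slurmd      # on compute-0
$ sudo systemctl restart slurmctld   # on head
$ scontrol reconfigure               # or push the updated config without a restart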
To my untrained eye, there is nothing obviously wrong in slurmd.log (compute) or slurmctld.log (head). In slurmctld.log:

[...SNIP...]
[2023-11-07T08:58:35.804] debug2: sched: JobId=10. unable to schedule in Partition=all_nodes (per _failed_partition()). Retaining previous scheduling Reason=ReqNodeNotAvail. Desc=ReqNodeNotAvail, UnavailableNodes:compute-0. Priority=4294901753.
[2023-11-07T08:58:36.396] debug:  sched/backfill: _attempt_backfill: beginning
[2023-11-07T08:58:36.396] debug:  sched/backfill: _attempt_backfill: 4 jobs to backfill
[2023-11-07T08:58:36.652] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from UID=1002
[2023-11-07T08:58:36.652] debug3: _set_hostname: Using auth hostname for alloc_node: head
[2023-11-07T08:58:36.652] debug3: JobDesc: user_id=1002 JobId=N/A partition=(null) name=hello_word
[2023-11-07T08:58:36.652] debug3:    cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1
[2023-11-07T08:58:36.652] debug3:    Nodes=4294967294-[4294967294] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
[2023-11-07T08:58:36.652] debug3:    pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
[2023-11-07T08:58:36.653] debug3:    immediate=0 reservation=(null)
[2023-11-07T08:58:36.653] debug3:    features=(null) batch_features=(null) cluster_features=(null) prefer=(null)
[2023-11-07T08:58:36.653] debug3:    req_nodes=(null) exc_nodes=(null)
[2023-11-07T08:58:36.653] debug3:    time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2023-11-07T08:58:36.653] debug3:    kill_on_node_fail=-1 script=#!/bin/env bash
#SBATCH --job-name=hello...
[2023-11-07T08:58:36.653] debug3:    argv="/home/myuser/myuser-slurm/tests/hello_world_slurm.sh"
[2023-11-07T08:58:36.653] debug3:    environment=SHELL=/bin/bash,LANGUAGE=en_GB:en,EDITOR=vim,...
[2023-11-07T08:58:36.653] debug3:    stdin=/dev/null stdout=/home/myuser/myuser-slurm/tests/hello_world_%j.log stderr=(null)
[2023-11-07T08:58:36.653] debug3:    work_dir=/home/myuser/ansible-slurm/tests alloc_node:sid=head:721
[2023-11-07T08:58:36.653] debug3:    power_flags=
[2023-11-07T08:58:36.653] debug3:    resp_host=(null) alloc_resp_port=0 other_port=0
[2023-11-07T08:58:36.653] debug3:    dependency=(null) account=(null) qos=(null) comment=(null)
[2023-11-07T08:58:36.653] debug3:    mail_type=0 mail_user=(null) nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
[2023-11-07T08:58:36.653] debug3:    network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2023-11-07T08:58:36.653] debug3:    end_time= signal=0@0 wait_all_nodes=-1 cpu_freq=
[2023-11-07T08:58:36.653] debug3:    ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1 ntasks_per_tres=-1
[2023-11-07T08:58:36.653] debug3:    mem_bind=0:(null) plane_size:65534
[2023-11-07T08:58:36.653] debug3:    array_inx=(null)
[2023-11-07T08:58:36.653] debug3:    burst_buffer=(null)
[2023-11-07T08:58:36.653] debug3:    mcs_label=(null)
[2023-11-07T08:58:36.653] debug3:    deadline=Unknown
[2023-11-07T08:58:36.653] debug3:    bitflags=0x1e000000 delay_boot=4294967294
[2023-11-07T08:58:36.654] debug2: found 1 usable nodes from config containing compute-0
[2023-11-07T08:58:36.654] debug3: _pick_best_nodes: JobId=11 idle_nodes 1 share_nodes 1
[2023-11-07T08:58:36.654] debug2: select/cons_tres: select_p_job_test: evaluating JobId=11
[2023-11-07T08:58:36.654] debug2: select/cons_tres: select_p_job_test: evaluating JobId=11
[2023-11-07T08:58:36.654] debug3: select_nodes: JobId=11 required nodes not avail
[2023-11-07T08:58:36.654] _slurm_rpc_submit_batch_job: JobId=11 InitPrio=4294901752 usec=822
[2023-11-07T08:58:38.807] debug:  sched: Running job scheduler for default depth.
[2023-11-07T08:58:38.807] debug3: sched: JobId=7. State=PENDING. Reason=Resources. Priority=4294901756. Partition=all_nodes.
[2023-11-07T08:58:38.807] debug2: sched: JobId=8. unable to schedule in Partition=all_nodes (per _failed_partition()). Retaining previous scheduling Reason=ReqNodeNotAvail. Desc=ReqNodeNotAvail, UnavailableNodes:compute-0. Priority=4294901755.
[2023-11-07T08:58:38.807] debug2: sched: JobId=9. unable to schedule in Partition=all_nodes (per _failed_partition()). Retaining previous scheduling Reason=ReqNodeNotAvail. Desc=ReqNodeNotAvail, UnavailableNodes:compute-0. Priority=4294901754.
[2023-11-07T08:58:38.807] debug2: sched: JobId=10. unable to schedule in Partition=all_nodes (per _failed_partition()). Retaining previous scheduling Reason=ReqNodeNotAvail. Desc=ReqNodeNotAvail, UnavailableNodes:compute-0. Priority=4294901753.
[2023-11-07T08:58:38.807] debug2: sched: JobId=11. unable to schedule in Partition=all_nodes (per _failed_partition()). Retaining previous scheduling Reason=ReqNodeNotAvail. Desc=ReqNodeNotAvail, UnavailableNodes:compute-0. Priority=4294901752.
[2023-11-07T08:58:39.008] debug3: create_mmap_buf: loaded file `/var/spool/slurmctld/job_state` as buf_t
[2023-11-07T08:58:39.008] debug3: Writing job id 12 to header record of job_state file
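If it would help, I am happy to also post the full output for one of the stuck jobs and for the partition, e.g. (job 11 is one of the pending jobs in the log above):

$ scontrol show job 11
$ scontrol show partition all_nodes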

Can you help me figure out what is wrong with my setup, please?

Many thanks
Jean-Paul Ebejer
University of Malta