- slurm-users - lists.schedmd.com

Suspending jobs and resuming
by Ratnasamy, Fritz 22 Nov '24

22 Nov '24

Hi, I am using an old slurm version 20.11.8 and we had to reboot our cluster today for maintenance. I suspended all the jobs on it with the command scontrol suspend list_job_ids and all the jobs paused and were suspended. However, when I tried to resume them after the reboot, scontrol resume did not work (it was showing in the reason column " (JobHeldAdmin)". I was able to release them with scontrol release and the jobs started to run back. However, the SLURM recorded time on it resetted (Time columns, showing 0:00 for all the jobs) though the jobs seem to have re-started from the last point before he got suspended. 1- Did I follow the right procedure to suspend, reboot and resume/release? 2- In this case, does the wall time for all the jobs goes into reset and therefore anyone with slurm admin rights will be able to have their jobs last longer than the wall time limit by suspending and resuming a job? Best, *Fritz Ratnasamy* Data Scientist Information Technology

2 1

Slurm PID Files
by Matthias Leopold 20 Nov '24

20 Nov '24

Hi, I compiled and installed Slurm 24.05 on Ubuntu 22.04 following this tutorial: https://www.schedmd.com/slurm/installation-tutorial/ Systemd service files are from deb packages that result from this. Do I have to worry that slurmctld and slurmd don't write PID files although SlurmctldPidFile and SlurmdPidFile are defined in slurm.conf? Paths for PID files exist and are writeable, logs don't show any error. slurmdbd does write a PID file as defined in slurmdbd.conf. thx Matthias

1 0

Does Slurm support DSP
by shaobo liu 20 Nov '24

20 Nov '24

Dear all Does slurm support DSP (Digital Signal Processing)? slurm website does not see DSP related content。

3 4

Does Slurm support DSP
by shaobo liu 20 Nov '24

20 Nov '24

Dear all Does slurm support DSP (Digital Signal Processing)? slurm website does not see DSP related content。

1 0

error: Unable to contact slurm controller (connect failure)
by Daniel Rodriguez Lopez (ext) 19 Nov '24

19 Nov '24

Dear all, We recently tried to fix our version of slurm in every node of our cluster. After the instalation (slurm 20.11.9) in one of the compute nodes, most of the commads (squeue, sinfo, scontrol show config etc) returns this error: error: Unable to contact slurm controller (connect failure) The .log files don't show any errors, we have both debugs values equal to debug5. Also, the rest of the cluster works as usual. I appreciate any insight on what could be the cause. Thank you and regards, Daniel

4 3

Slurm services status=1/FAILURE for CentOs cluster
by Mojtaba Farrokhbin 19 Nov '24

19 Nov '24

<https://superuser.com/posts/1862116/timeline> I have reinstalled slurm resource management on a HPC cluster. But it seems there is a problem on starting slurmd services. Here is the system status: "systemctl status slurmd" shows: ● slurmd.service - Slurm node daemon Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled) Active: failed (Result: exit-code) since Sun 2024-11-10 11:31:24 +0130; 1 weeks 1 days ago ------------------------------------------- The slurm.conf output is: # # Example slurm.conf file. Please run configurator.html # (in doc/html) to build a configuration file customized # for your environment. # # # slurm.conf file generated by configurator.html. # # See the slurm.conf man page for more information. # ClusterName=linux ControlMachine=master.cluster.... #ControlAddr= #BackupController= #BackupAddr= # SlurmUser=slurm #SlurmdUser=root SlurmctldPort=6817 SlurmdPort=6818 AuthType=auth/munge #JobCredentialPrivateKey= #JobCredentialPublicCertificate= StateSaveLocation=/var/spool/slurm/ctld SlurmdSpoolDir=/var/spool/slurm/d SwitchType=switch/none MpiDefault=none SlurmctldPidFile=/var/run/slurmctld.pid SlurmdPidFile=/var/run/slurmd.pid #ProctrackType=proctrack/pgid ProctrackType=proctrack/cgroup TaskPlugin=task/cgroup TaskPluginParam=cpusets #PluginDir= #FirstJobId= #MaxJobCount= #PlugStackConfig= #PropagatePrioProcess= #PropagateResourceLimits= #PropagateResourceLimitsExcept= #Prolog= #Epilog= #SrunProlog= #SrunEpilog= #TaskProlog= #TaskEpilog= #TaskPlugin= #TrackWCKey=no #TreeWidth=50 #TmpFS= #UsePAM= # # GPU definition (Added by S 2022/11) GresTypes=gpu # TIMERS SlurmctldTimeout=300 SlurmdTimeout=300 InactiveLimit=0 MinJobAge=300 KillWait=30 Waittime=0 # # SCHEDULING SchedulerType=sched/backfill #SchedulerAuth= #SelectType=select/linear SelectType=select/cons_res # select partial node #SelectTypeParameters=CR_CPU_Memory SelectTypeParameters=CR_Core_Memory #FastSchedule=1 FastSchedule=0 PriorityType=priority/multifactor PriorityDecayHalfLife=0 PriorityUsageResetPeriod=NONE PriorityWeightFairshare=100000 #PriorityWeightAge=1000 PriorityWeightPartition=10000 #PriorityType=priority/multifactor #PriorityDecayHalfLife=14-0 #PriorityUsageResetPeriod=14-0 #PriorityWeightFairshare=100000 #PriorityWeightAge=1000 #PriorityWeightPartition=10000 #PriorityWeightJobSize=1000 #PriorityMaxAge=1-0 DefMemPerCPU=2000 # LOGGING SlurmctldDebug=3 SlurmctldLogFile=/var/log/slurmctld.log SlurmdDebug=3 SlurmdLogFile=/var/log/slurmd.log JobCompType=jobcomp/none #JobCompLoc= # # ACCOUNTING #JobAcctGatherType=jobacct_gather/linux #JobAcctGatherFrequency=30 # AccountingStorageType=accounting_storage/slurmdbd AccountingStorageHost=master #AccountingStorageLoc= #AccountingStoragePass= #AccountingStorageUser= # # COMPUTE NODES # OpenHPC default configuration # Edited by Surin 1399/11 #AccountingStorageEnforce=limits AccountingStorageEnforce=QOS,Limits,Associations #TaskPlugin=task/affinity #PropagateResourceLimitsExcept=MEMLOCK #AccountingStorageType=accounting_storage/filetxt Epilog=/etc/slurm/slurm.epilog.clean NodeName=cn0[1-5] NodeHostName=cn0[1-5] RealMemory=128307 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Feature=HyperThread State=UNKNOWN NodeName=gp01 NodeHostName=gp01 RealMemory=128307 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Feature=HyperThread Gres=gpu:4 State=UNKNOWN #PartitionName=all Nodes=cn0[1-5],gp01 MaxTime=INFINITE State=UP Oversubscribe=EXCLUSIVE PartitionName=all Nodes=cn0[1-5],gp01 MaxTime=10-00:00:00 State=UP MaxNodes=1 PartitionName=normal Nodes=cn0[1-5] Default=YES MaxTime=10-00:00:00 State=UP MaxNodes=1 #PartitionName=normal Nodes=cn0[1-5] Default=YES MaxTime=INFINITE State=UP Oversubscribe=EXCLUSIVE PartitionName=gpu Nodes=gp01 MaxTime=10-00:00:00 State=UP SlurmctldParameters=enable_configless ReturnToService=1 HealthCheckProgram=/usr/sbin/nhc HealthCheckInterval=300 ------------------------------------------- scontrol show node cn01 shows: NodeName=cn01 CoresPerSocket=16 CPUAlloc=0 CPUTot=64 CPULoad=N/A AvailableFeatures=HyperThread ActiveFeatures=HyperThread Gres=(null) NodeAddr=cn01 NodeHostName=cn01 RealMemory=128557 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1 State=DOWN* ThreadsPerCore=2 TmpDisk=64278 Weight=1 Owner=N/A MCS_label=N/A Partitions=all,normal BootTime=None SlurmdStartTime=None CfgTRES=cpu=64,mem=128557M,billing=64 AllocTRES= CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s Reason=undraining ------------------------------------------- "scontrol ping" also works as: Slurmctld(primary) at master.cluster... is UP ------------------------------------------- "systemctl start slurmd" shows: Job for slurmd.service failed because the control process exited with error code. See "systemctl status slurmd.service" and "journalctl -xe" for details. ------------------------------------------- "systemctl status slurmd.service" shows: ● slurmd.service - Slurm node daemon Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled) Active: failed (Result: exit-code) since Mon 2024-11-18 17:02:50 +0130; 41s ago Process: 219025 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE) Nov 18 17:02:50 master.cluster.... systemd[1]: Starting Slurm node daemon... Nov 18 17:02:50 master.cluster.... slurmd[219025]: fatal: Unable to determine this slurmd's NodeName Nov 18 17:02:50 master.cluster.... systemd[1]: slurmd.service: control process exited, code=exited status=1 Nov 18 17:02:50 master.cluster.... systemd[1]: Failed to start Slurm node daemon. Nov 18 17:02:50 master.cluster.... systemd[1]: Unit slurmd.service entered failed state. Nov 18 17:02:50 master.cluster.... systemd[1]: slurmd.service failed. ------------------------------------------- "journalctl -xe" output is: Nov 18 17:04:54 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases Nov 18 17:05:04 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases Nov 18 17:05:08 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases Nov 18 17:05:12 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases Nov 18 17:05:22 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases Nov 18 17:05:26 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases Nov 18 17:06:34 master.cluster.... munged[2514]: Purged 2 credentials from replay hash ------------------------------------------- slurmd.services contains the following information: [Unit] Description=Slurm node daemon After=munge.service network.target remote-fs.target ConditionPathExists=/etc/slurm/slurm.conf [Service] Type=forking EnvironmentFile=-/etc/sysconfig/slurmd ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS ExecReload=/bin/kill -HUP $MAINPID PIDFile=/var/run/slurmd.pid KillMode=process LimitNOFILE=131072 LimitMEMLOCK=infinity LimitSTACK=infinity Delegate=yes [Install] WantedBy=multi-user.target [root@master system]# which slurmd /usr/sbin/slurmd [root@master system]# cat slurmd.service [Unit] Description=Slurm node daemon After=munge.service network.target remote-fs.target ConditionPathExists=/etc/slurm/slurm.conf [Service] Type=forking EnvironmentFile=-/etc/sysconfig/slurmd ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS ExecReload=/bin/kill -HUP $MAINPID PIDFile=/var/run/slurmd.pid KillMode=process LimitNOFILE=131072 LimitMEMLOCK=infinity LimitSTACK=infinity Delegate=yes [Install] WantedBy=multi-user.target ------------------------------------------- For "which slurmd" command I get this address: /usr/sbin/slurm ------------------------------------------- ls -l /usr/sbin/slurmd shows: lrwxrwxrwx 1 root root 69 Nov 10 10:54 /usr/sbin/slurmd -> /install/centos7.9/compute_gpu/rootimg/usr/sbin/slurmd

1 0

InvalidAccount
by hmeij＠wesleyan.edu 18 Nov '24

18 Nov '24

Manual compilation of 24.05.4. slurmctld and slurmd run on same server. All works ok but all test jobs end up pending with InvalidAccount message. I do not use slurm database and have not enabled accounting. Can not find an answer for this behavior or a misconfiguration. slurm.conf file was generated using easy config tool. Any ideas how to fix this? Thx, -Henk ## looks like all users have access to test queue [hmeij@sharptail2 slurm]$ sinfo -o "%g %.10R %.20l" GROUPS PARTITION TIMELIMIT all test infinite [hmeij@sharptail2 slurm]$ sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST test* up infinite 1 idle sharptail2 ## simple sleep job [hmeij@sharptail2 slurm]$ sbatch sleep Submitted batch job 8 [hmeij@sharptail2 slurm]$ squeue JOBID PARTITION NAME USER ST TIME NODES CPUS MIN_MEMORY NODELIST(REASON) 8 test sleep hmeij PD 0:00 1 1 1G (InvalidAccount) [hmeij@sharptail2 slurm]$ scontrol show job 8 JobId=8 JobName=sleep UserId=hmeij(8216) GroupId=its(623) MCS_label=N/A Priority=1 Nice=0 Account=(null) QOS=(null) JobState=PENDING Reason=InvalidAccount Dependency=(null) Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A SubmitTime=2024-11-11T13:27:14 EligibleTime=2024-11-11T13:27:14 AccrueTime=2024-11-11T13:27:14 StartTime=Unknown EndTime=Unknown Deadline=N/A SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-11-11T13:27:14 Scheduler=Main Partition=test AllocNode:Sid=sharptail2:644662 ReqNodeList=(null) ExcNodeList=(null) NodeList= NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:1:1 ReqTRES=cpu=1,mem=1G,node=1,billing=1 AllocTRES=(null) Socks/Node=1 NtasksPerN:B:S:C=1:0:*:* CoreSpec=* MinCPUsNode=1 MinMemoryNode=1G MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=/zfshomes/hmeij/slurm/sleep WorkDir=/zfshomes/hmeij/slurm StdErr=/zfshomes/hmeij/slurm/err StdIn=/dev/null StdOut=/zfshomes/hmeij/slurm/out TresPerTask=cpu=1 ## within a minute or so that InvalidAccount changes to None ( ## but job remains pending; 1-7 stuck over the weekend) [hmeij@sharptail2 slurm]$ squeue JOBID PARTITION NAME USER ST TIME NODES CPUS MIN_MEMORY NODELIST(REASON) 8 test sleep hmeij PD 0:00 1 1 1G (None) ## in the slurmctld.log slurmctld: sched: JobId=8 has invalid account slurmctld: debug: set_job_failed_assoc_qos_ptr: Filling in assoc for JobId=8 Assoc=0 slurmctld: debug: sched: Running job scheduler for full queue. slurmctld: error: _refresh_assoc_mgr_qos_list: no new list given back keeping cached one. ##and the slurm.conf accounting section (both AccountingStorageType lines yield same behavior) #AccountingStorageType= AccountingStorageType=accounting_storage/none #JobAcctGatherFrequency=30 #JobAcctGatherType= ## using SchedulerType = sched/builtin

3 5

Listen to job state changes
by egonle＠posteo.me 18 Nov '24

18 Nov '24

Hello, is there any way to listen to job state changes of slurm 23.x or newer? I’d like to kind of subscribe to job state changes instead of polling for job states. Adding this feature to slurm accounting DB seems to be last option right now, although I’d like to avoid it. Thanks&Best Egonle

4 3

How to power up all ~idle nodes and verify that they have started up without issue programmatically
by Xaver Stiensmeier 15 Nov '24

15 Nov '24

Dear Slurm User list, I would like to startup all ~idle (idle and powered down) nodes and check programmatically if all came up as expected. For context: this is for a program that sets up slurm clusters with on demand cloud scheduling. In the most easiest fashion this could be executing a command like *srun FORALL hostname* which would return the names of the nodes if it succeeds and an error message otherwise. However, there's no such input value like FORALL as far as I am aware. One could use -N{total node number} as all nodes are ~idle when this executes, but I don't know an easy way to get the total number of nodes. Best regards, Xaver

3 5

First setup of slurm with a GPU node
by Patrick Begou 13 Nov '24

13 Nov '24

Hi, I'm using slurm on a small 8 nodes cluster. I've recently added one GPU node with two Nvidia A100, one with 40Gb of RAM and one with 80Gb. As using this GPU resource increase I would like to manage this resource with Gres to avoid usage conflict. But at this time my setup do not works as I can reach a GPU without reserving it: srun -n 1 -p tenibre-gpu ./a.out can use a GPU even if the reservation do not specify this resource (checked with running nvidia-smi on the node). "tenibre-gpu" is a slurm partition with only this gpu node. From the documentation I've created a gres.conf file and it is propagated on all the nodes (9 compute nodes, 1 login node and the management node) and slurmd has been restarted. gres.conf is:* ## GPU setup on tenibre-gpu-0 NodeName=tenibre-gpu-0 Name=gpu Type=A100-40 File=/dev/nvidia0 Flags=nvidia_gpu_env NodeName=tenibre-gpu-0 Name=gpu Type=A100-80 File=/dev/nvidia1 Flags=nvidia_gpu_env * * In slurm.conf I have checked these flags: ## Basic scheduling SelectTypeParameters=CR_Core_Memory SchedulerType=sched/backfill SelectType=select/cons_tres ## Generic resources GresTypes=gpu ## Nodes list .... Nodename=tenibre-gpu-0 RealMemory=257270 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN .... #partitions PartitionName=tenibre-gpu MaxTime=48:00:00 DefaultTime=12:00:00 DefMemPerCPU=4096 MaxMemPerCPU=8192 Shared=YES State=UP Nodes=tenibre-gpu-0 ... May be I've missed something ? I'm running Slurm 20.11.7-1. Thanks for your advices. Patrick

5 6

2025

2024

slurm-users ----- 2025 ----- July 2025 June 2025 May 2025 April 2025 March 2025 February 2025 January 2025 ----- 2024 ----- December 2024 November 2024 October 2024 September 2024 August 2024 July 2024 June 2024 May 2024 April 2024 March 2024 February 2024 January 2024

slurm-users