- slurm-users - lists.schedmd.com

formatting node names
by Davide DelVento 07 Jan '25

Hi all, I remember seeing on this list a Slurm command to change a Slurm-friendly list such as gpu[01-02],node[03-04,12-22,27-32,36] into a bash-friendly list such as gpu01 gpu02 node03 node04 node12, etc. I made a note about it, but I can't find my note anymore, nor the relevant message. Can someone please refresh my memory? I'll be more careful with such a note this time, I promise! Thanks and happy new year!
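
For reference, the command usually used for this is `scontrol show hostnames 'gpu[01-02],node[03-04,12-22,27-32,36]'`, which prints one hostname per line. Where the Slurm tools are not at hand, the same expansion can be approximated in Python. The sketch below is an illustrative re-implementation, not Slurm's own parser: `expand_hostlist` is a hypothetical helper, and it assumes at most one bracket group per name with plain numeric ranges.

```python
import re


def expand_hostlist(hostlist):
    """Expand a Slurm-style hostlist into a flat list of hostnames.

    Assumption: each name has at most one [..] group containing
    comma-separated numbers or lo-hi ranges (zero padding is preserved).
    """
    # Split on commas that are outside square brackets.
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(hostlist):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(hostlist[start:i])
            start = i + 1
    parts.append(hostlist[start:])

    hosts = []
    for part in parts:
        if not part:
            continue
        match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if match is None:
            hosts.append(part)  # plain hostname, nothing to expand
            continue
        prefix, ranges, suffix = match.groups()
        for item in ranges.split(","):
            lo, _, hi = item.partition("-")
            hi = hi or lo
            width = len(lo)  # keep leading zeros, e.g. 03 -> node03
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    print(" ".join(expand_hostlist("gpu[01-02],node[03-04,12-22,27-32,36]")))
```

For example, `expand_hostlist("gpu[01-02],node36")` returns `['gpu01', 'gpu02', 'node36']`, which can be joined with spaces for use in a bash `for` loop.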

Change config file location of slurmdbd temporarily
by Sven Schulze 07 Jan '25

Hey all, is there a way to change the location of slurmdbd.conf temporarily when starting slurmdbd? For all other daemons I can specify "-f", but this doesn't seem to work for slurmdbd. Is there a way to edit the build files to achieve this? Kind regards, Sven

Permission denied for slurmdbd.conf
by sportlecon＠gmail.com 07 Jan '25

ls -ls /usr/local/slurm/etc/slurmdbd.conf
4 -rw------- 1 slurm slurm 497 Dec 28 16:34 /usr/local/slurm/etc/slurmdbd.conf

sudo -u slurm /usr/local/slurm/sbin/slurmdbd -Dvvv
slurmdbd: error: s_p_parse_file: unable to read "/usr/local/slurm/etc/slurmdbd.conf": Permission denied
slurmdbd: fatal: Could not open/read/parse slurmdbd.conf file /usr/local/slurm/etc/slurmdbd.conf
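
A file that is owned by slurm with mode 0600 yet unreadable by slurm may point at something other than the file itself, for example a parent directory that lacks search (x) permission for that user, or a security layer such as SELinux. That is a guess, not something the post confirms. A minimal sketch for inspecting every component of the path follows (similar in spirit to `namei -l`); `describe_path_chain` is a hypothetical helper, and it should be run as a user who can stat the whole path, such as root.

```python
import os
import stat
from pathlib import Path


def describe_path_chain(path):
    """Return (component, mode string, uid, gid) for each directory
    leading to `path`, and for `path` itself.

    Components that cannot be stat'ed are reported with the error text
    in place of the mode string and -1 for uid and gid.
    """
    target = Path(os.path.abspath(path))
    chain = list(reversed(target.parents)) + [target]
    rows = []
    for component in chain:
        try:
            st = os.stat(component)
            rows.append((str(component), stat.filemode(st.st_mode), st.st_uid, st.st_gid))
        except OSError as exc:
            rows.append((str(component), f"<{exc.strerror}>", -1, -1))
    return rows


if __name__ == "__main__":
    # Path taken from the post above.
    for name, mode, uid, gid in describe_path_chain("/usr/local/slurm/etc/slurmdbd.conf"):
        print(f"{mode:<12} uid={uid:<6} gid={gid:<6} {name}")
```

Any directory in the output whose mode does not grant x to the slurm user (as owner, group member, or other) would be enough to produce "Permission denied" on the file beneath it.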

Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions
by sportlecon sportlecon 07 Jan '25

JOBID PARTITION NAME     USER  ST TIME NODES NODELIST(REASON)
26    cpu       myscript user1 PD 0:00 4     (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)

Can anyone help to fix this?

All GPUs are Usable if no Gres is Defined
by Jacob Gordon 04 Jan '25

Hello,

We have a two-node GPU cluster with 8 NVIDIA GPUs. GRES is currently configured and works if a user defines it within their sbatch/interactive job submission (--gres=gpu:3). Users only have access to the GPUs they request. However, when they omit “--gres=gpu:n”, they can use every GPU, which interferes with running jobs that used the gres option. I’m at a loss as to why this is happening. Can someone please look at our configuration to see if anything stands out?

SLURM Version = 21.08.5

*Slurm.conf*

ClusterName=ommit
SlurmctldHost=headnode
ProctrackType=proctrack/cgroup
ReturnToService=2
SlurmdPidFile=/run/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm/slurmd
StateSaveLocation=/var/lib/slurm/slurmctld
SlurmUser=slurm
TaskPlugin=task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
AccountingStorageType=accounting_storage/slurmdbd
# AccountingStorageType for other resources
# AccountingStorageTRES=gres/gpu
#DebugFlags=CPU_Bind,gres
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
DefMemPerCPU=4000
#NodeName=n01 CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1000000
NodeName=n01 Gres=gpu:nvidia-l40:8 CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1000000
NodeName=n02 Gres=gpu:nvidia-l40:8 CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1000000
#Gres config for GPUs
GresTypes=gpu
PreemptType=preempt/qos
PreemptMode=REQUEUE
# reset usage after 1 week
PriorityUsageResetPeriod=WEEKLY
# The job's age factor reaches 1.0 after waiting in the
# queue for 2 weeks.
PriorityMaxAge=14-0
# This next group determines the weighting of each of the
# components of the Multifactor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=1500
# Primary partitions
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
PartitionName=all Nodes=n01,n02 Default=YES MaxTime=01:00:00 DefaultTime=00:30:00 State=UP
PartitionName=statds Nodes=n01 Default=NO MaxTime=48:00:00 State=UP Priority=100 State=UP OverSubscribe=FORCE AllowAccounts=statds
PartitionName=phil Nodes=n02 Default=NO MaxTime=48:00:00 State=UP Priority=100 State=UP OverSubscribe=FORCE AllowAccounts=phil
#Set up condo mode
# Condo partitions
PartitionName=phil_condo Nodes=n02 Default=NO MaxTime=48:00:00 DefaultTime=00:01:00 State=UP Priority=50 OverSubscribe=FORCE AllowQos=normal
PartitionName=statds_condo Nodes=n01 Default=NO MaxTime=48:00:00 DefaultTime=00:01:00 State=UP Priority=50 OverSubscribe=FORCE AllowQos=normal
JobSubmitPlugins=lua

*Gres.conf*

NodeName=n01 Name=gpu Type=nvidia-l40 File=/dev/nvidia0
NodeName=n01 Name=gpu Type=nvidia-l40 File=/dev/nvidia1
NodeName=n01 Name=gpu Type=nvidia-l40 File=/dev/nvidia2
NodeName=n01 Name=gpu Type=nvidia-l40 File=/dev/nvidia3
NodeName=n01 Name=gpu Type=nvidia-l40 File=/dev/nvidia4
NodeName=n01 Name=gpu Type=nvidia-l40 File=/dev/nvidia5
NodeName=n01 Name=gpu Type=nvidia-l40 File=/dev/nvidia6
NodeName=n01 Name=gpu Type=nvidia-l40 File=/dev/nvidia7
NodeName=n02 Name=gpu Type=nvidia-l40 File=/dev/nvidia0
NodeName=n02 Name=gpu Type=nvidia-l40 File=/dev/nvidia1
NodeName=n02 Name=gpu Type=nvidia-l40 File=/dev/nvidia2
NodeName=n02 Name=gpu Type=nvidia-l40 File=/dev/nvidia3
NodeName=n02 Name=gpu Type=nvidia-l40 File=/dev/nvidia4
NodeName=n02 Name=gpu Type=nvidia-l40 File=/dev/nvidia5
NodeName=n02 Name=gpu Type=nvidia-l40 File=/dev/nvidia6
NodeName=n02 Name=gpu Type=nvidia-l40 File=/dev/nvidia7

*Cgroup.conf*

CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes

*cgroup_allowed_devices_file.conf*

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/nvidia*

Resource guarantees
by christopher.furbee＠jhuapl.edu 23 Dec '24

Hello,

Long-time SGE admin, new SLURM admin here. I recently started the transition of all my clusters from SGE to SLURM, and everything was great until I hit the "Taco Bell" cluster (fake name). Taco Bell supports 4 projects, and under SGE we had a priority system set up using projects to balance the queue. For the life of me, I have been unable to replicate this in SLURM. We are looking to configure guaranteed resources based on the project. I had thought we could accomplish this with QOS and accounts, but so far we have failed.

What we would like to end up with is:

When project Gordita is running uncontested, 100% of the cluster is available. While Gordita is running, if Crunchwrap submits their jobs, we want the scheduler to prioritize those jobs until a 75% Gordita, 25% Crunchwrap balance of jobs is reached. No preempting or priority overriding: just, as a Gordita job finishes, if Crunchwrap is at less than 25%, start a Crunchwrap job. And then maintain that balance until one of the projects' jobs are 100% completed.

Any assistance or guidance is greatly appreciated.
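
To make the requested behaviour concrete, the rule described above can be written as a small selection function. This is only a sketch of the desired policy, not Slurm configuration and not how Slurm's scheduler is implemented; `next_project` and the job counts are hypothetical, and the 75/25 targets are taken from the post.

```python
from typing import Dict, Optional


def next_project(running: Dict[str, int],
                 queued: Dict[str, int],
                 targets: Dict[str, float]) -> Optional[str]:
    """Pick which project should get the slot that just became free.

    Among projects that still have queued jobs, choose the one whose
    share of running jobs is furthest below its target share.
    Returns None if nothing is queued.
    """
    total = sum(running.values())
    best, best_deficit = None, None
    for project, target in targets.items():
        if queued.get(project, 0) <= 0:
            continue  # nothing to start for this project
        share = running.get(project, 0) / total if total else 0.0
        deficit = target - share
        if best_deficit is None or deficit > best_deficit:
            best, best_deficit = project, deficit
    return best


if __name__ == "__main__":
    targets = {"Gordita": 0.75, "Crunchwrap": 0.25}
    running = {"Gordita": 8, "Crunchwrap": 0}    # Gordita running uncontested
    queued = {"Gordita": 20, "Crunchwrap": 20}   # Crunchwrap has just submitted
    for _ in range(6):
        running["Gordita"] -= 1                  # assume a Gordita job finishes
        chosen = next_project(running, queued, targets)
        running[chosen] += 1
        queued[chosen] -= 1
        print(chosen, running)
```

With 8 slots, the first two freed slots go to Crunchwrap, after which the mix settles at 6 Gordita and 2 Crunchwrap jobs. If only one project has queued work, it receives every free slot, which matches the "100% when uncontested" requirement.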

Slurm plugin for custom hardware allocation
by Laura Zharmukhametova 23 Dec '24

Hello, is there an existing Slurm plugin for FPGA allocation? If not, can someone please point me in the right direction for how to approach it? Many thanks

sending mails with smail on rocky9
by Marcus Wagner 19 Dec '24

Hi all,

I have a problem sending mails on Rocky 9 via Slurm. One needs to install s-nail to have "/bin/mail" available. There are some caveats in smail: in the second part (for the message sent when the job began), one needs to pipe something (e.g. echo "") into $MAIL, because even in a script with no input, s-nail wants to be interactive. It suffices to echo an empty text to s-nail.

Nonetheless, I don't get any mail through. It seems that the MailProg gets killed or errors out for some reason, while it works perfectly if run from the console :/

I keep getting the following in slurmctld.log:

27212:[2024-12-19T15:54:54.935] slurmscriptd: error: run_command: killing MailProg operation on shutdown
27213:[2024-12-19T15:54:54.945] slurmscriptd: _run_script: JobId=0 MailProg killed by signal 9
27214:[2024-12-19T15:54:54.945] error: MailProg returned error, it's output was ''
27395:[2024-12-19T15:55:55.540] slurmscriptd: error: run_command: killing MailProg operation on shutdown
27396:[2024-12-19T15:55:55.551] slurmscriptd: _run_script: JobId=0 MailProg killed by signal 9
27397:[2024-12-19T15:55:55.551] error: MailProg returned error, it's output was ''
27438:[2024-12-19T15:56:55.981] slurmscriptd: error: run_command: killing MailProg operation on shutdown
27439:[2024-12-19T15:56:55.981] slurmscriptd: error: run_command: killing MailProg operation on shutdown
27440:[2024-12-19T15:56:55.992] slurmscriptd: _run_script: JobId=0 MailProg killed by signal 9
27441:[2024-12-19T15:56:55.992] slurmscriptd: _run_script: JobId=0 MailProg killed by signal 9
27442:[2024-12-19T15:56:55.992] error: MailProg returned error, it's output was ''
27443:[2024-12-19T15:56:55.992] error: MailProg returned error, it's output was ''
27450:[2024-12-19T15:56:58.849] slurmscriptd: error: run_command: killing MailProg operation on shutdown
27451:[2024-12-19T15:56:58.859] slurmscriptd: _run_script: JobId=0 MailProg killed by signal 0

Any hints?

Best
Marcus

--
Dipl.-Inf. Marcus Wagner
Deputy Group Lead
IT Center
Group: Server, Storage, HPC
Department: Systems and Operations
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80 24383
wagner(a)itc.rwth-aachen.de
www.itc.rwth-aachen.de

IT Center social media channels:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/c/ITCenterRWTHAachen

lua, glib, gtk, kafka not found for slurm 24.11 and Alma 8
by Bernd Melchers 17 Dec '24

Dear all,

I tried to rpmbuild slurm-24.11.0 for Alma Linux 8. The build failed because some installed packages are not found by Slurm's configure script: rdkafka, glib, gtk and lua.

But all these packages are installed, and they are found by slurm-24.05.x:

librdkafka-1.6.1-1.el8.x86_64
librdkafka-devel-1.6.1-1.el8.x86_64
lua-5.3.4-12.el8.x86_64
lua-devel-5.3.4-12.el8.x86_64
glib2-2.56.4-165.el8_10.x86_64
glib2-devel-2.56.4-165.el8_10.x86_64
gtk2-2.24.32-5.el8.x86_64
gtk2-devel-2.24.32-5.el8.x86_64
gtk3-3.22.30-12.el8_10.x86_64
gtk3-devel-3.22.30-12.el8_10.x86_64

Kind regards,
Bernd Melchers

--
Archive and Backup Service | fab-service(a)zedat.fu-berlin.de
Freie Universität Berlin | Tel. +49-30-838-55905

AllocNode:Sid in scontrol but not sacct?
by Chris Taylor 17 Dec '24

Does the accounting database keep this? Maybe I'm missing something but I don't see a way to query for it in sacct. Chris
