March 2024 - slurm-users - lists.schedmd.com

Re: SLURM in K8s, any advice?
by LEAVY Alan 13 Mar '24

13 Mar '24

I'm a little late to this party but would love to establish contact with others using slurm in Kubernetes. I recently joined a research institute in Vienna (IIASA) and I'm getting to grips with slurm and Kubernetes (my previous role was data engineering / fintech). My current setup sounds like what Urban described in this thread, back in Nov 22. It has some rough edges though. Right now, I'm trying to upgrade to slurm-23.11.4 in Ubuntu 23.10 containers. I'm having trouble with the cgroup/v2 plugin. Are you still using slurm on K8s Urban? How did your installation work out Hans? Would either of you be willing to share your experiences? Regards, Alan.

4 3

Shards distribution over multiple GPUs
by Francois Broquedis 12 Mar '24

12 Mar '24

Hi all, We're trying to enable sharding on our compute cluster. On this cluster: - ensicompute-1 comes with 1 NVIDIA V100 GPU ; - ensicompute-13 comes with 3 NVIDIA A40 GPUs ; - all other nodes (for now, ensicompute-11 and ensicompute-12, but several others will come) come with 3 NVIDIA RTX 6000 GPUs. To enable sharding, I followed these steps: 1. [slurm.conf] Add "shard" to GresTypes ; 2. [slurm.conf] Add "shard:N" to Gres for each node. For testing purposes, I have set N to 9, so each GPU can execute up to 3 jobs concurrently: NodeName=ensicompute-[11-12] Gres=gpu:Quadro:3,shard:9 CPUs=40 RealMemory=128520 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Feature=gpu,ht 3. [gres.conf] Declare the shards after the definition of the GPUs GRES. For step 3, I tried different things, leading to different outcomes: a. Define a global number of shards, for the entire host: Name=shard Count=9 ==> This way, sharding seems to work ok, but all the jobs are executed on GPU#0. If running 12 jobs for example, 9 of them are assigned to GPU#0 and start executing, while 3 of them remain in a pending state. No job is assigned to GPU#1 or GPU#2. b. Define a per-GPU number of shards, associated to the device file representing the GPU: Name=shard Count=3 File=/dev/nvidia0 Name=shard Count=3 File=/dev/nvidia1 Name=shard Count=3 File=/dev/nvidia2 ==> In this case, the slurmd service fails to start on the compute node. The error message found in /var/log/slurmd.log is "fatal: Invalid GRES record for shard, count does not match File value". c. Don't define anything about shards in gres.conf. ==> Same behavior than in a.: all jobs are executed on GPU#0. I attach to this message the full content of the slurm.conf and gres.conf files. What is the proper way to configure sharding in a cluster with several GPUs per node? Is there a way to specify how many shards should be allocated to each GPU? Cheers, François === slurm.conf === # slurm.conf file generated by configurator.html. # Put this file on all nodes of your cluster. # See the slurm.conf man page for more information. # ClusterName=ensimag SlurmctldHost=nash ProctrackType=proctrack/cgroup SlurmctldPidFile=/var/run/slurmctld.pid SlurmctldPort=6817 SlurmdPidFile=/var/run/slurmd.pid SlurmdPort=6818 SlurmdSpoolDir=/var/spool/slurmd SlurmUser=slurm #SlurmdUser=root StateSaveLocation=/var/spool/slurmctld TaskPlugin=task/affinity,task/cgroup ReturnToService=2 # # # TIMERS InactiveLimit=0 KillWait=30 MinJobAge=300 SlurmctldTimeout=120 SlurmdTimeout=300 Waittime=0 # # # SCHEDULING SchedulerType=sched/backfill SelectType=select/cons_tres # # # LOGGING AND ACCOUNTING JobCompType=jobcomp/none JobAcctGatherFrequency=30 SlurmctldDebug=info SlurmctldLogFile=/var/log/slurmctld.log SlurmdDebug=info SlurmdLogFile=/var/log/slurmd.log # # # COMPUTE NODES GresTypes=gpu,shard NodeName=ensicompute-1 Gres=gpu:Tesla:1,shard:3 CPUs=40 RealMemory=128520 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Feature=gpu,ht NodeName=ensicompute-13 Gres=gpu:A40:3,shard:9 CPUs=40 RealMemory=128520 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Feature=gpu,ht NodeName=ensicompute-[11-12] Gres=gpu:Quadro:3,shard:9 CPUs=40 RealMemory=128520 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Feature=gpu,ht PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP === gres.conf === AutoDetect=off # ensicompute-1 NodeName=ensicompute-1 Name=gpu Type=Tesla File=/dev/nvidia0 NodeName=ensicompute-1 Name=shard Count=3 File=/dev/nvidia0 # ensicompute-11 NodeName=ensicompute-11 Name=gpu Type=Quadro File=/dev/nvidia0 NodeName=ensicompute-11 Name=gpu Type=Quadro File=/dev/nvidia1 NodeName=ensicompute-11 Name=gpu Type=Quadro File=/dev/nvidia2 NodeName=ensicompute-11 Name=shard Count=3 File=/dev/nvidia0 NodeName=ensicompute-11 Name=shard Count=3 File=/dev/nvidia1 NodeName=ensicompute-11 Name=shard Count=3 File=/dev/nvidia2 # ensicompute-12 NodeName=ensicompute-12 Name=gpu Type=Quadro File=/dev/nvidia0 NodeName=ensicompute-12 Name=gpu Type=Quadro File=/dev/nvidia1 NodeName=ensicompute-12 Name=gpu Type=Quadro File=/dev/nvidia2 NodeName=ensicompute-12 Name=shard Count=3 File=/dev/nvidia0 NodeName=ensicompute-12 Name=shard Count=3 File=/dev/nvidia1 NodeName=ensicompute-12 Name=shard Count=3 File=/dev/nvidia2 # ensicompute-13 NodeName=ensicompute-13 Name=gpu Type=A40 File=/dev/nvidia0 NodeName=ensicompute-13 Name=gpu Type=A40 File=/dev/nvidia1 NodeName=ensicompute-13 Name=gpu Type=A40 File=/dev/nvidia2 NodeName=ensicompute-13 Name=shard Count=3 File=/dev/nvidia0 NodeName=ensicompute-13 Name=shard Count=3 File=/dev/nvidia1 NodeName=ensicompute-13 Name=shard Count=3 File=/dev/nvidia2 -- François Broquedis, Ingénieur Service Informatique Grenoble INP - Ensimag, bureau E208 681 rue de la Passerelle BP 72, 38402 Saint Martin d'Hères CEDEX Tél.: +33 (0)4 76 82 72 78

1 0

Re: user_name attribute missing in job response from slurm rest api method jobs
by Acartürk, Ayhan 11 Mar '24

11 Mar '24

Sorry... my previous message should have been a reply to the post by Filip Holka from March, 25th 2022: https://lists.schedmd.com/pipermail/slurm-users/2022-March/008543.html Best wishes, Ayhan Acartürk

1 0

Re: user_name attribute missing in job response from slurm rest api method jobs
by Acartürk, Ayhan 11 Mar '24

11 Mar '24

Hello together, we have the same problem here. All jobs delivered by the api endpoint /slurm/v0.0.36/jobs have a user_id attached but no user_name. Currently, we solve this issue by an extra call for each job_id to the api endpoint /slurmdb/v.0.0.36/job/{job_id}. Any idea where to search for a possible solution is highly appreciated. Best wishes, Ayhan Acartürk

1 0

Problem building slurm with PMIx
by Patrick Begou 11 Mar '24

11 Mar '24

Hi ! I manage a small CentOS8 cluster using slurm slurm-20.11.7-1 and OpenMPI built from sources. - I know this OS is not maintained any more and I need to negotiate downtime to reinstall - I know Slurm 20.11.7 has security issue (I've built it from source some years ago with rpmbuild -ta --with mysql --with hwloc slurm-20.11.7.tar.bz) and I should update. All was running fine until I add a GPU Node and Nvidia sdk. This SDK provides an openMPI3 implementation GPU aware but I'm unable to launch an intranode parallel job with it using srun: -------------------------------------------------------------------------- The application appears to have been direct launched using "srun", but OMPI was not built with SLURM's PMI support and therefore cannot execute. There are several options for building PMI support under SLURM, depending upon the SLURM version you are using: version 16.05 or later: you can use SLURM's PMIx support. This requires that you configure and build SLURM --with-pmix. Versions earlier than 16.05: you must use either SLURM's PMI-1 or PMI-2 support. SLURM builds PMI-1 by default, or you can manually install PMI-2. You must then build Open MPI using --with-pmi pointing to the SLURM PMI library location. Please configure as appropriate and try again. -------------------------------------------------------------------------- I check with "srun --mpi=list" and got no pmx. srun: MPI types are... srun: pmi2 srun: cray_shasta srun: none So I decide to build the rpms from slurm-20.11.9.tar.bz2 as I had done previously for 20.11.7 and update. I've first installed pmix-2.1.1-1 from src as I had no pmix-devel rpm in my local CentOS8 repo: rpm --rebuild pmix-2.1.1-1.el8.src.rpm dnf install pmix-devel-2.1.1-1.el8.x86_64.rpm pmix-2.1.1-1.el8.x86_64.rpm Then build Slurm from slurm-20.11.9.tar.bz2 (just changing python3 to python38 in the spec file) rpmbuild -ta --with mysql --with hwloc --with pmix slurm-20.11.9.tar.bz2 And the try to install these package on the GPU node dnf install slurm-slurmd-20.11.9-1.el8.x86_64.rpm slurm-20.11.9-1.el8.x86_64.rpm slurm-devel-20.11.9-1.el8.x86_64.rpm slurm-libpmi-20.11.9-1.el8.x86_64.rpm But I get this strange error: Error: Problem: conflicting requests - nothing provides pmix = 20.11.9 needed by slurm-slurmd-20.11.9-1.el8.x86_64 (try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages) Why this request on PMIX with the slurm version number ? Am I wrong somewhere ? Thanks for your help Patrick

2 1

FairShare value is always 0.
by Zacarias Benta 07 Mar '24

07 Mar '24

Hi guys, We've just setup our new cluster and are facing some issues regading fairshare calculation. Our slurm directive regarding priority calculation are defines as follows: PriorityType=priority/multifactor PriorityFlags=MAX_TRES PriorityDecayHalfLife=14-0 PriorityFavorSmall=NO PriorityMaxAge=14-0 PriorityWeightAge=1000 PriorityWeightJobSize=1000 PriorityWeightPartition=10000000 PriorityWeightQOS=10000000 PriorityWeightTRES=CPU=2000,Mem=4000 PriorityWeightFairshare=100000 The partition we are submitinh out jobs to is setup as follows: PartitionName=mypartPriority=1000TRESBillingWeights="CPU=1.0,Mem=0.25G"Default=YESMaxTime=96:0:0DefMemPerCPU=5333Nodes=node[001-036] MaxNodes=20 Whenever we take a look at the fairshare value using sshare -l we see the following output: AccountUserRawSharesNormSharesRawUsageNormUsageEffectvUsageFairShareLevelFSGrpTRESMinsTRESRunMins ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- root10.0000002687245970.0000000.000000cpu=1098201,mem=5856709132,en+ rootroot10.10000000.0000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group110.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group210.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group310.1000002687245970.0000000.000000cpu=1098201,mem=5856709132,en+ group410.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group510.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group610.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group710.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group810.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group910.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ We think it is really weird that the FairShare value is 0 for the root account and "NULL" for all other groups, even the one who had the greatest raw usage. While taking a look at the data for our users we see the following: AccountUserRawSharesNormSharesRawUsageEffectvUsageFairShare ------------------------------------------------------------------------------------- root10.0000002689837210.000000 rootroot10.10000000.0000000.000000 group310.1000002689837210.000000 group3user110.090909121093740.0000000.000000 group3user210.09090900.0000000.000000 group3user310.09090900.0000000.000000 group3user410.09090900.0000000.000000 group3user510.09090900.0000000.000000 group3user610.09090900.0000000.000000 group3user710.09090900.0000000.000000 group3user810.0909092088245970.0000000.000000 group3user910.09090900.0000000.000000 group3user1010.09090900.0000000.000000 group3user1110.090909480497500.0000000.000000 group410.10000000.000000 group4user1310.0000004994520.0000000.000000 group510.10000000.000000 group5user1410.00000015396030.0000000.000000 This is a weird behavior, since user1, user8, user11, user13 and user14 are the ones who have more RawUsage and the FairShare value is the same for all of them, including the users that have no yet submited any job. We also noticed that in the slurmctld log there is the fillowing error message that appears with some regularity [2024-03-07T16:38:13.260] error: _append_list_to_array: unable to append NULL list to assoc list. [2024-03-07T16:38:13.260] error: _calc_tree_fs: unable to calculate fairshare on empty tree The error above looks like it is coming from: https://github.com/SchedMD/slurm/blob/b11bf689b270f1f5dfe4b0cd54c4fa84b4af3… Are we missing any setting on slurm.conf? This is kind of strange, because we have another cluster with pretty much the same configuration and the FairShare is calculated without any problems. Any help would be appreciated. -- Cumprimentos / Best Regards, Zacarias Benta LIP/INCD @ UMINHO ---------------------------------------------- / Use linux, and may the source be with you. / ---------------------------------------------- \ __ -=(o '. '.-.\ /| \\ '| || _\_):,_

1 0

Slurm management of dual-node server trays?
by Ole Holm Nielsen 07 Mar '24

07 Mar '24

We're in the process of installing some racks with Lenovo SD665 V3 [1] water-cooled servers. A Lenovo DW612S chassis contains 6 1U trays with 2 SD665 V3 servers mounted side-by-side in each tray. Lenovo delivers SD665 V3 servers including water-cooled NVIDIA InfiniBand "SharedIO" adapters [2] so that one node is the Primary including a PCIe adapter, and the other is Auxiliary with just a cable to the Primary's adapter. Obviously, servicing 2 "Siamese twin" Slurm nodes requires a bit of care and planning. What is worse is that when the Primary node is rebooted or powered down, the Auxiliary node will lose its Infiniband connection and may have a PCIe fault or an NMI as documented in [3]. And when nodes are powered up, the Primary must have completed POST before the Auxiliary gets started. I wonder how to best deal with power failures? It seems that when Slurm jobs are running on Auxiliary nodes, these jobs are going to crash when the possibly unrelated Primary node goes down. This looks like a pretty bad system design on the part of Lenovo :-( The goal was apparently to same some money on IB adapters and having fewer IB cables. Question: Do any Slurm sites out there already have experiences with Lenovo "Siamese twin" nodes with SharedIO IB? Have you developed some operational strategies, for example dealing with node pairs as a single entity for job scheduling? Thanks for sharing any ideas and insights! Ole [1] https://lenovopress.lenovo.com/lp1612-lenovo-thinksystem-sd665-v3-server [2] https://lenovopress.lenovo.com/lp1693-thinksystem-nvidia-connectx-7-ndr200-… [3] https://support.lenovo.com/us/en/solutions/ht510888-thinksystem-sd650-and-c… -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

3 3

Re: Lua script
by Paul Raines 06 Mar '24

06 Mar '24

Alternativelly consider setting EnforcePartLimits=ALL in slurm.conf -- Paul Raines (http://help.nmr.mgh.harvard.edu) The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Mass General Brigham Compliance HelpLine at https://www.massgeneralbrigham.org/complianceline <https://www.massgeneralbrigham.org/complianceline> . Please note that this e-mail is not secure (encrypted). If you do not wish to continue communication over unencrypted e-mail, please notify the sender of this message immediately. Continuing to send or respond to e-mail after receiving this message means you understand and accept this risk and wish to continue to communicate over unencrypted e-mail.

1 0

Re: Lua script
by Gestió Servidors 06 Mar '24

06 Mar '24

And how can I reject the job inside the lua script?

2 1

Lua script
by Gestió Servidors 06 Mar '24

06 Mar '24

Hello, I'm writing a small lua script that for modify "TimeLimit" of a submited job if user has configured a TimeLimit bigger that configured in the partition. So, is TimeLimit for partition is, for example, 4 hours (04:00:00) and user submit his/her job with a TimeLimit of 5 hours, lua script modify this submit and forces "TimeLimit" to 4 hours for avoiding that job remains at queue in state "pending" with reason "Partition Time Limit...". I have written these lines: function slurm_job_submit(job_desc, part_list, submit_uid) --[[ my id is 1008; for testing purposes, lua only applies to me --]] if (job_desc.user_id == 1008) then slurm.log_info("Submit job by me") --[[ I check partition --]] if (job_desc['partition'] == "nodo.q") then --[[ if (job_desc['time_limit'] > 14400) job_desc['time_limit'] = 14400 end end end return slurm.SUCCESS end However, if I submit a job with TimeLimit of 5 hours, lua script doesn't modify submit and job remains "pending"... What am I doing wrong? Thanks.

2 1

2025

2024

slurm-users March 2024

2025

2024

slurm-users March 2024 ----- 2025 ----- July 2025 June 2025 May 2025 April 2025 March 2025 February 2025 January 2025 ----- 2024 ----- December 2024 November 2024 October 2024 September 2024 August 2024 July 2024 June 2024 May 2024 April 2024 March 2024 February 2024 January 2024

slurm-users March 2024