- slurm-users - lists.schedmd.com

Re: user_name attribute missing in job response from slurm rest api method jobs
by Acartürk, Ayhan 11 Mar '24

11 Mar '24

Sorry... my previous message should have been a reply to the post by Filip Holka from March, 25th 2022: https://lists.schedmd.com/pipermail/slurm-users/2022-March/008543.html Best wishes, Ayhan Acartürk

1 0

Re: user_name attribute missing in job response from slurm rest api method jobs
by Acartürk, Ayhan 11 Mar '24

11 Mar '24

Hello together, we have the same problem here. All jobs delivered by the api endpoint /slurm/v0.0.36/jobs have a user_id attached but no user_name. Currently, we solve this issue by an extra call for each job_id to the api endpoint /slurmdb/v.0.0.36/job/{job_id}. Any idea where to search for a possible solution is highly appreciated. Best wishes, Ayhan Acartürk

1 0

Problem building slurm with PMIx
by Patrick Begou 11 Mar '24

11 Mar '24

Hi ! I manage a small CentOS8 cluster using slurm slurm-20.11.7-1 and OpenMPI built from sources. - I know this OS is not maintained any more and I need to negotiate downtime to reinstall - I know Slurm 20.11.7 has security issue (I've built it from source some years ago with rpmbuild -ta --with mysql --with hwloc slurm-20.11.7.tar.bz) and I should update. All was running fine until I add a GPU Node and Nvidia sdk. This SDK provides an openMPI3 implementation GPU aware but I'm unable to launch an intranode parallel job with it using srun: -------------------------------------------------------------------------- The application appears to have been direct launched using "srun", but OMPI was not built with SLURM's PMI support and therefore cannot execute. There are several options for building PMI support under SLURM, depending upon the SLURM version you are using: version 16.05 or later: you can use SLURM's PMIx support. This requires that you configure and build SLURM --with-pmix. Versions earlier than 16.05: you must use either SLURM's PMI-1 or PMI-2 support. SLURM builds PMI-1 by default, or you can manually install PMI-2. You must then build Open MPI using --with-pmi pointing to the SLURM PMI library location. Please configure as appropriate and try again. -------------------------------------------------------------------------- I check with "srun --mpi=list" and got no pmx. srun: MPI types are... srun: pmi2 srun: cray_shasta srun: none So I decide to build the rpms from slurm-20.11.9.tar.bz2 as I had done previously for 20.11.7 and update. I've first installed pmix-2.1.1-1 from src as I had no pmix-devel rpm in my local CentOS8 repo: rpm --rebuild pmix-2.1.1-1.el8.src.rpm dnf install pmix-devel-2.1.1-1.el8.x86_64.rpm pmix-2.1.1-1.el8.x86_64.rpm Then build Slurm from slurm-20.11.9.tar.bz2 (just changing python3 to python38 in the spec file) rpmbuild -ta --with mysql --with hwloc --with pmix slurm-20.11.9.tar.bz2 And the try to install these package on the GPU node dnf install slurm-slurmd-20.11.9-1.el8.x86_64.rpm slurm-20.11.9-1.el8.x86_64.rpm slurm-devel-20.11.9-1.el8.x86_64.rpm slurm-libpmi-20.11.9-1.el8.x86_64.rpm But I get this strange error: Error: Problem: conflicting requests - nothing provides pmix = 20.11.9 needed by slurm-slurmd-20.11.9-1.el8.x86_64 (try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages) Why this request on PMIX with the slurm version number ? Am I wrong somewhere ? Thanks for your help Patrick

2 1

FairShare value is always 0.
by Zacarias Benta 07 Mar '24

07 Mar '24

Hi guys, We've just setup our new cluster and are facing some issues regading fairshare calculation. Our slurm directive regarding priority calculation are defines as follows: PriorityType=priority/multifactor PriorityFlags=MAX_TRES PriorityDecayHalfLife=14-0 PriorityFavorSmall=NO PriorityMaxAge=14-0 PriorityWeightAge=1000 PriorityWeightJobSize=1000 PriorityWeightPartition=10000000 PriorityWeightQOS=10000000 PriorityWeightTRES=CPU=2000,Mem=4000 PriorityWeightFairshare=100000 The partition we are submitinh out jobs to is setup as follows: PartitionName=mypartPriority=1000TRESBillingWeights="CPU=1.0,Mem=0.25G"Default=YESMaxTime=96:0:0DefMemPerCPU=5333Nodes=node[001-036] MaxNodes=20 Whenever we take a look at the fairshare value using sshare -l we see the following output: AccountUserRawSharesNormSharesRawUsageNormUsageEffectvUsageFairShareLevelFSGrpTRESMinsTRESRunMins ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- root10.0000002687245970.0000000.000000cpu=1098201,mem=5856709132,en+ rootroot10.10000000.0000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group110.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group210.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group310.1000002687245970.0000000.000000cpu=1098201,mem=5856709132,en+ group410.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group510.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group610.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group710.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group810.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ group910.10000000.0000000.000000cpu=0,mem=0,energy=0,node=0,b+ We think it is really weird that the FairShare value is 0 for the root account and "NULL" for all other groups, even the one who had the greatest raw usage. While taking a look at the data for our users we see the following: AccountUserRawSharesNormSharesRawUsageEffectvUsageFairShare ------------------------------------------------------------------------------------- root10.0000002689837210.000000 rootroot10.10000000.0000000.000000 group310.1000002689837210.000000 group3user110.090909121093740.0000000.000000 group3user210.09090900.0000000.000000 group3user310.09090900.0000000.000000 group3user410.09090900.0000000.000000 group3user510.09090900.0000000.000000 group3user610.09090900.0000000.000000 group3user710.09090900.0000000.000000 group3user810.0909092088245970.0000000.000000 group3user910.09090900.0000000.000000 group3user1010.09090900.0000000.000000 group3user1110.090909480497500.0000000.000000 group410.10000000.000000 group4user1310.0000004994520.0000000.000000 group510.10000000.000000 group5user1410.00000015396030.0000000.000000 This is a weird behavior, since user1, user8, user11, user13 and user14 are the ones who have more RawUsage and the FairShare value is the same for all of them, including the users that have no yet submited any job. We also noticed that in the slurmctld log there is the fillowing error message that appears with some regularity [2024-03-07T16:38:13.260] error: _append_list_to_array: unable to append NULL list to assoc list. [2024-03-07T16:38:13.260] error: _calc_tree_fs: unable to calculate fairshare on empty tree The error above looks like it is coming from: https://github.com/SchedMD/slurm/blob/b11bf689b270f1f5dfe4b0cd54c4fa84b4af3… Are we missing any setting on slurm.conf? This is kind of strange, because we have another cluster with pretty much the same configuration and the FairShare is calculated without any problems. Any help would be appreciated. -- Cumprimentos / Best Regards, Zacarias Benta LIP/INCD @ UMINHO ---------------------------------------------- / Use linux, and may the source be with you. / ---------------------------------------------- \ __ -=(o '. '.-.\ /| \\ '| || _\_):,_

1 0

Slurm management of dual-node server trays?
by Ole Holm Nielsen 07 Mar '24

07 Mar '24

We're in the process of installing some racks with Lenovo SD665 V3 [1] water-cooled servers. A Lenovo DW612S chassis contains 6 1U trays with 2 SD665 V3 servers mounted side-by-side in each tray. Lenovo delivers SD665 V3 servers including water-cooled NVIDIA InfiniBand "SharedIO" adapters [2] so that one node is the Primary including a PCIe adapter, and the other is Auxiliary with just a cable to the Primary's adapter. Obviously, servicing 2 "Siamese twin" Slurm nodes requires a bit of care and planning. What is worse is that when the Primary node is rebooted or powered down, the Auxiliary node will lose its Infiniband connection and may have a PCIe fault or an NMI as documented in [3]. And when nodes are powered up, the Primary must have completed POST before the Auxiliary gets started. I wonder how to best deal with power failures? It seems that when Slurm jobs are running on Auxiliary nodes, these jobs are going to crash when the possibly unrelated Primary node goes down. This looks like a pretty bad system design on the part of Lenovo :-( The goal was apparently to same some money on IB adapters and having fewer IB cables. Question: Do any Slurm sites out there already have experiences with Lenovo "Siamese twin" nodes with SharedIO IB? Have you developed some operational strategies, for example dealing with node pairs as a single entity for job scheduling? Thanks for sharing any ideas and insights! Ole [1] https://lenovopress.lenovo.com/lp1612-lenovo-thinksystem-sd665-v3-server [2] https://lenovopress.lenovo.com/lp1693-thinksystem-nvidia-connectx-7-ndr200-… [3] https://support.lenovo.com/us/en/solutions/ht510888-thinksystem-sd650-and-c… -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

3 3

Re: Lua script
by Paul Raines 06 Mar '24

06 Mar '24

Alternativelly consider setting EnforcePartLimits=ALL in slurm.conf -- Paul Raines (http://help.nmr.mgh.harvard.edu) The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Mass General Brigham Compliance HelpLine at https://www.massgeneralbrigham.org/complianceline <https://www.massgeneralbrigham.org/complianceline> . Please note that this e-mail is not secure (encrypted). If you do not wish to continue communication over unencrypted e-mail, please notify the sender of this message immediately. Continuing to send or respond to e-mail after receiving this message means you understand and accept this risk and wish to continue to communicate over unencrypted e-mail.

1 0

Re: Lua script
by Gestió Servidors 06 Mar '24

06 Mar '24

And how can I reject the job inside the lua script?

2 1

Lua script
by Gestió Servidors 06 Mar '24

06 Mar '24

Hello, I'm writing a small lua script that for modify "TimeLimit" of a submited job if user has configured a TimeLimit bigger that configured in the partition. So, is TimeLimit for partition is, for example, 4 hours (04:00:00) and user submit his/her job with a TimeLimit of 5 hours, lua script modify this submit and forces "TimeLimit" to 4 hours for avoiding that job remains at queue in state "pending" with reason "Partition Time Limit...". I have written these lines: function slurm_job_submit(job_desc, part_list, submit_uid) --[[ my id is 1008; for testing purposes, lua only applies to me --]] if (job_desc.user_id == 1008) then slurm.log_info("Submit job by me") --[[ I check partition --]] if (job_desc['partition'] == "nodo.q") then --[[ if (job_desc['time_limit'] > 14400) job_desc['time_limit'] = 14400 end end end return slurm.SUCCESS end However, if I submit a job with TimeLimit of 5 hours, lua script doesn't modify submit and job remains "pending"... What am I doing wrong? Thanks.

2 1

Slurm billback and sreport
by Chip Seraphine 05 Mar '24

05 Mar '24

Hello, I am attempting to implement a billback model and finding myself stymied by the way that sreport handles job arrays. Basically, when a user submits a large array, their usage includes time that jobs in the back of the array spend waiting their turn. (My #1 user in “sreport user topusage” shows more “used” cpu*minutes than the cluster physically _has_ during that interval.) However, jobs that are idle pending resources are simply regarded as pending; as a result, a “polite” user who submits an array of 1000 jobs running N at a time is penalized over a user who just dumps 1000 loose jobs into the queue. This incentives my users to do exactly what I do not want! Has anyone tried to bill their users based on the results of sreport? If so, how did you work around this problem? What did you use to determine the # of CPU*Minutes that a user actually allocated on during a given interval? -- Chip Seraphine Grid Operations For support please use help-grid in email or slack. This e-mail and any attachments may contain information that is confidential and proprietary and otherwise protected from disclosure. If you are not the intended recipient of this e-mail, do not read, duplicate or redistribute it by any means. Please immediately delete it and any attachments and notify the sender that you have received it by mistake. Unintended recipients are prohibited from taking action on the basis of information in this e-mail or any attachments. The DRW Companies make no representations that this e-mail or any attachments are free of computer viruses or other defects.

3 3

Is SWAP memory mandatory for SLURM
by John Joseph 04 Mar '24

04 Mar '24

Dear All, Good morning I do have a 4 node SLURM instance up and running. Like to know if I disable the SWAP memory, will it effect the SLURM performance Is SWAP a mandatory requirement, I have each node more RAM, if my phsicall RAM is more, is there any need for the SWAP thanks Joseph John

4 3

2025

2024