- slurm-users - lists.schedmd.com

Best practices for tracking jobs started across multiple clusters for accounting purposes.
by Di Bernardini, Fabio 02 Sep '24

02 Sep '24

I need to account for jobs composed of multiple jobs launched on multiple federated (and non-federated) clusters, which therefore have different job IDs. What are the best practices to prevent users from bypassing this tracking? NICE SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2096882, Capitale Sociale: 10.329,14 EUR i.v., Cod. Fisc. e P.IVA 01133050052, Societa con Socio Unico

3 4

QOS MaxTRESPU node=X intepretation
by David Magda 30 Aug '24

30 Aug '24

Hello, Have a question on how to interpret a Node=X MaxTRESPU value for a QOS: If (e.g.) X=4, and each node has (say) 64 CPUs or cores: if a particular job needs 32 cores, then would two jobs count as the equivalent of one node (2*32=64)? And if X=4 for the QOS, would that mean that eight of these jobs would be able to run ((4*64)/32=8)? Or does this TRES limit only get activated -N/--nodes is used? Don't see anything obvious: https://slurm.schedmd.com/resource_limits.html https://slurm.schedmd.com/sacctmgr.html https://slurm.schedmd.com/tres.html Thanks for any info. Regards, David

1 0

playing with --nodes=<size_string>
by Matteo Guglielmi 30 Aug '24

30 Aug '24

Hello, I have a cluster with four Intel nodes (node[01-04], Feature=intel) and four Amd nodes (node[05-08], Feature=amd). # job file #SBATCH --ntasks=3 #SBATCH --nodes=2,4 #SBATCH --constraint="[intel|amd]" env | grep SLURM # slurm.conf PartitionName=DEFAULT MinNodes=1 MaxNodes=UNLIMITED # log SLURM_JOB_USER=software SLURM_TASKS_PER_NODE=1(x3) SLURM_JOB_UID=1002 SLURM_TASK_PID=49987 SLURM_LOCALID=0 SLURM_SUBMIT_DIR=/home/software SLURMD_NODENAME=node01 SLURM_JOB_START_TIME=1724932865 SLURM_CLUSTER_NAME=cluster SLURM_JOB_END_TIME=1724933465 SLURM_CPUS_ON_NODE=1 SLURM_JOB_CPUS_PER_NODE=1(x3) SLURM_GTIDS=0 SLURM_JOB_PARTITION=nodes SLURM_JOB_NUM_NODES=3 SLURM_JOBID=26 SLURM_JOB_QOS=lprio SLURM_PROCID=0 SLURM_NTASKS=3 SLURM_TOPOLOGY_ADDR=node01 SLURM_TOPOLOGY_ADDR_PATTERN=node SLURM_MEM_PER_CPU=0 SLURM_NODELIST=node[01-03] SLURM_JOB_ACCOUNT=dalco SLURM_PRIO_PROCESS=0 SLURM_NPROCS=3 SLURM_NNODES=3 SLURM_SUBMIT_HOST=master SLURM_JOB_ID=26 SLURM_NODEID=0 SLURM_CONF=/etc/slurm/slurm.conf SLURM_JOB_NAME=mpijob SLURM_JOB_GID=1002 SLURM_JOB_NODELIST=node[01-03] <<<=== why three nodes? Shouldn't this still be two nodes? Thank you.

2 5

Print Slurm Stats on Login
by Paul Edmon 29 Aug '24

29 Aug '24

We are working to make our users more aware of their usage. One of the ideas we came up with was to having some basic usage stats printed at login (usage over past day, fairshare, job efficiency, etc). Does anyone have any scripts or methods that they use to do this? Before baking my own I was curious what other sites do and if they would be willing to share their scripts and methodology. -Paul Edmon-

10 20

Multiple Counts Question
by Matteo Guglielmi 29 Aug '24

29 Aug '24

Hello, Does anyone know why this is possible in slurm: --constraint="[rack1*2&rack2*4]" and this is not: --constraint="[rack1*2|rack2*4]" ? Thank you.

1 0

REST API - get_user_environment
by jpuerto＠psc.edu 29 Aug '24

29 Aug '24

In previous versions (v0.0.36) of the REST API the job submission endpoint had a field titled "get_user_environment"; however, it doesn't appear to exist in v0.0.40. Is there an equivalent parameter that should be used? What is the suggested approach for mimicking this behavior in v0.0.40? Best regards, Juan

5 13

ResumeAfterTime - Lacking Info
by Sid Young 29 Aug '24

29 Aug '24

G'Day all, Can anyone shed light on the parameter "Resume AfterTime" returned from the command "scontrol show node XXX" Can it be used to automatically resume a "Down"ed node? Sid

1 0

Invitation for Feedback on the Slurm-Lab Project
by Patrick Pun 28 Aug '24

28 Aug '24

Hi Slurm User Group, I’m excited to share a project I’ve been working on: Slurm-Lab. It’s a containerised, easy-to-deploy environment designed for learning, testing, and experimenting with Slurm. This project has allowed me and some of my friends to explore various Slurm features directly on our laptops, and I hope it can be just as useful to some of you. Highlights: • Quick Deployment: Setting up a Slurm environment is a breeze—just clone the project and pull the image to get started. • User-Friendly Interface: Interact with your sandbox environment seamlessly through JupyterHub in your browser. • Educational Focus: Explore different Slurm features with provided examples and interactive notebooks, perfect for both new and experienced users. Why I Need Your Feedback: As members of the Slurm community, your insights and experience are incredibly valuable. I’m seeking your feedback on how Slurm-Lab can better serve our community. Whether it’s feature suggestions, bug reports, or general comments. Please take a look at the project and share your thoughts. You can contribute by opening issues or submitting merge requests directly on GitLab. GitLab Repo: https://gitlab.com/CSniper/slurm-lab GitHub Mirror: https://github.com/csniper-patrick/slurm-lab Thank you for your time and for helping make Slurm-Lab a valuable resource for the entire community! Cheers, Patrick

1 0

Spread a multistep job across clusters
by Di Bernardini, Fabio 28 Aug '24

28 Aug '24

Hi everyone, for accounting reasons, I need to create only one job across two or more federated clusters with two or more srun steps. I'm trying with hetjobs but it's not clear to me from the documentation (https://slurm.schedmd.com/heterogeneous_jobs.html) if this is possible and how to do it. I'm trying with this script, but the steps are executed on only the first cluster. Can you tell me if there is a mistake in the hetjob or if it has to be done in another way? #!/bin/bash #SBATCH hetjob #SBATCH --clusters=cluster2 srun -v --het-group=0 hostname #SBATCH hetjob #SBATCH --clusters=cluster3 srun -v --het-group=1 hostname NICE SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2096882, Capitale Sociale: 10.329,14 EUR i.v., Cod. Fisc. e P.IVA 01133050052, Societa con Socio Unico

3 2

Slurm versions 24.05.3 and 23.11.10 are now available
by Marshall Garey 27 Aug '24

27 Aug '24

We are pleased to announce the availability of Slurm versions 24.05.3 and 23.11.10. Version 24.05.3 fixes a potential database problem when deleting a qos. This bug only existed in 24.05. Both versions have fixes for jobs potentially being stuck when using cloud nodes when some nodes are powered down, a regression in 23.11.9 and 24.05.2 that caused sattach to crash, and some other minor issues. Slurm can be downloaded from https://www.schedmd.com/downloads.php . -- Marshall Garey Release Management, Support, and Development SchedMD LLC - Commercial Slurm Development and Support > * Changes in Slurm 24.05.3 > ========================== > -- data_parser/v0.0.40 - Added field descriptions > -- slurmrestd - Avoid creating new slurmdbd connection per request to > '* /slurm/slurmctld/*/*' endpoints. > -- Fix compilation issue with switch/hpe_slingshot plugin. > -- Fix gres per task allocation with threads-per-core. > -- data_parser/v0.0.41 - Added field descriptions > -- slurmrestd - Change back generated OpenAPI schema for > `DELETE /slurm/v0.0.40/jobs/` to RequestBody instead of using parameters > for request. slurmrestd will continue accept endpoint requests via > RequestBody or HTTP query. > -- topology/tree - Fix issues with switch distance optimization. > -- Fix potential segfault of secondary slurmctld when falling back to the > primary when running with a JobComp plugin. > -- Enable --json/--yaml=v0.0.39 options on client commands to dump data using > data_parser/v0.0.39 instead or outputting nothing. > -- switch/hpe_slingshot - Fix issue that could result in a 0 length state file. > -- Fix unnecessary message protocol downgrade for unregistered nodes. > -- Fix unnecessarily packing alias addrs when terminating jobs with a mix of > non-cloud/dynamic nodes and powered down cloud/dynamic nodes. > -- accounting_storage/mysql - Fix issue when deleting a qos that could remove > too many commas from the qos and/or delta_qos fields of the assoc table. > -- slurmctld - Fix memory leak when using RestrictedCoresPerGPU. > -- Fix allowing access to reservations without MaxStartDelay set. > -- Fix regression introduced in 24.05.0rc1 breaking srun --send-libs parsing. > -- Fix slurmd vsize memory leak when using job submission/allocation commands > that implicitly or explicitly use --get-user-env. > -- slurmd - Fix node going into invalid state when using CPUSpecList and > setting CPUs to the # of cores on a multithreaded node > -- Fix reboot asap nodes being considered in backfill after a restart. > -- Fix --clusters/-M queries for clusters outside of a federation when > fed_display is configured. > -- Fix scontrol allowing updating job with bad cpus-per-task value. > -- sattach - Fix regression from 24.05.2 security fix leading to crash. > -- mpi/pmix - Fix assertion when built under --enable-debug. > * Changes in Slurm 23.11.10 > =========================== > -- switch/hpe_slingshot - Fix issue that could result in a 0 length state file. > -- Fix unnecessary message protocol downgrade for unregistered nodes. > -- Fix unnecessarily packing alias addrs when terminating jobs with a mix of > non-cloud/dynamic nodes and powered down cloud/dynamic nodes. > -- Fix allowing access to reservations without MaxStartDelay set. > -- Fix scontrol allowing updating job with bad cpus-per-task value. > -- sattach - Fix regression from 23.11.9 security fix leading to crash.

1 0

2025

2024

slurm-users