- slurm-users - lists.schedmd.com

Slurm Remote Task Launch
by Bhaskar Chakraborty 04 Sep '24

04 Sep '24

Hello, I have a slurm job which needs to launch multiple tasks across the allocated hosts for the job. My criteria is that most of the tasks need to be launched from within the main task launchedby slurm in the launch compute node. So, if the allocated hosts are h1, h2 & h3 with h1 being the main launcher node then theinitial task, say launchTask, launched in h1 will need to launch RemoteTask1, RemoteTask2 in h2 & h3 at somepoint during execution. Can I use srun from inside launchTask to do so? If yes, what would be the syntax / args?If no, then what alternative I have other than using rsh/ssh which mayn't be available in the cluster. Thanks in advance! Regards,Bhaskar.

1 0

Best practices for tracking jobs started across multiple clusters for accounting purposes.
by Di Bernardini, Fabio 02 Sep '24

02 Sep '24

I need to account for jobs composed of multiple jobs launched on multiple federated (and non-federated) clusters, which therefore have different job IDs. What are the best practices to prevent users from bypassing this tracking? NICE SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2096882, Capitale Sociale: 10.329,14 EUR i.v., Cod. Fisc. e P.IVA 01133050052, Societa con Socio Unico

3 4

QOS MaxTRESPU node=X intepretation
by David Magda 30 Aug '24

30 Aug '24

Hello, Have a question on how to interpret a Node=X MaxTRESPU value for a QOS: If (e.g.) X=4, and each node has (say) 64 CPUs or cores: if a particular job needs 32 cores, then would two jobs count as the equivalent of one node (2*32=64)? And if X=4 for the QOS, would that mean that eight of these jobs would be able to run ((4*64)/32=8)? Or does this TRES limit only get activated -N/--nodes is used? Don't see anything obvious: https://slurm.schedmd.com/resource_limits.html https://slurm.schedmd.com/sacctmgr.html https://slurm.schedmd.com/tres.html Thanks for any info. Regards, David

1 0

playing with --nodes=<size_string>
by Matteo Guglielmi 30 Aug '24

30 Aug '24

Hello, I have a cluster with four Intel nodes (node[01-04], Feature=intel) and four Amd nodes (node[05-08], Feature=amd). # job file #SBATCH --ntasks=3 #SBATCH --nodes=2,4 #SBATCH --constraint="[intel|amd]" env | grep SLURM # slurm.conf PartitionName=DEFAULT MinNodes=1 MaxNodes=UNLIMITED # log SLURM_JOB_USER=software SLURM_TASKS_PER_NODE=1(x3) SLURM_JOB_UID=1002 SLURM_TASK_PID=49987 SLURM_LOCALID=0 SLURM_SUBMIT_DIR=/home/software SLURMD_NODENAME=node01 SLURM_JOB_START_TIME=1724932865 SLURM_CLUSTER_NAME=cluster SLURM_JOB_END_TIME=1724933465 SLURM_CPUS_ON_NODE=1 SLURM_JOB_CPUS_PER_NODE=1(x3) SLURM_GTIDS=0 SLURM_JOB_PARTITION=nodes SLURM_JOB_NUM_NODES=3 SLURM_JOBID=26 SLURM_JOB_QOS=lprio SLURM_PROCID=0 SLURM_NTASKS=3 SLURM_TOPOLOGY_ADDR=node01 SLURM_TOPOLOGY_ADDR_PATTERN=node SLURM_MEM_PER_CPU=0 SLURM_NODELIST=node[01-03] SLURM_JOB_ACCOUNT=dalco SLURM_PRIO_PROCESS=0 SLURM_NPROCS=3 SLURM_NNODES=3 SLURM_SUBMIT_HOST=master SLURM_JOB_ID=26 SLURM_NODEID=0 SLURM_CONF=/etc/slurm/slurm.conf SLURM_JOB_NAME=mpijob SLURM_JOB_GID=1002 SLURM_JOB_NODELIST=node[01-03] <<<=== why three nodes? Shouldn't this still be two nodes? Thank you.

2 5

Print Slurm Stats on Login
by Paul Edmon 29 Aug '24

29 Aug '24

We are working to make our users more aware of their usage. One of the ideas we came up with was to having some basic usage stats printed at login (usage over past day, fairshare, job efficiency, etc). Does anyone have any scripts or methods that they use to do this? Before baking my own I was curious what other sites do and if they would be willing to share their scripts and methodology. -Paul Edmon-

10 20

Multiple Counts Question
by Matteo Guglielmi 29 Aug '24

29 Aug '24

Hello, Does anyone know why this is possible in slurm: --constraint="[rack1*2&rack2*4]" and this is not: --constraint="[rack1*2|rack2*4]" ? Thank you.

1 0

REST API - get_user_environment
by jpuerto＠psc.edu 29 Aug '24

29 Aug '24

In previous versions (v0.0.36) of the REST API the job submission endpoint had a field titled "get_user_environment"; however, it doesn't appear to exist in v0.0.40. Is there an equivalent parameter that should be used? What is the suggested approach for mimicking this behavior in v0.0.40? Best regards, Juan

5 13

ResumeAfterTime - Lacking Info
by Sid Young 29 Aug '24

29 Aug '24

G'Day all, Can anyone shed light on the parameter "Resume AfterTime" returned from the command "scontrol show node XXX" Can it be used to automatically resume a "Down"ed node? Sid

1 0

Invitation for Feedback on the Slurm-Lab Project
by Patrick Pun 28 Aug '24

28 Aug '24

Hi Slurm User Group, I’m excited to share a project I’ve been working on: Slurm-Lab. It’s a containerised, easy-to-deploy environment designed for learning, testing, and experimenting with Slurm. This project has allowed me and some of my friends to explore various Slurm features directly on our laptops, and I hope it can be just as useful to some of you. Highlights: • Quick Deployment: Setting up a Slurm environment is a breeze—just clone the project and pull the image to get started. • User-Friendly Interface: Interact with your sandbox environment seamlessly through JupyterHub in your browser. • Educational Focus: Explore different Slurm features with provided examples and interactive notebooks, perfect for both new and experienced users. Why I Need Your Feedback: As members of the Slurm community, your insights and experience are incredibly valuable. I’m seeking your feedback on how Slurm-Lab can better serve our community. Whether it’s feature suggestions, bug reports, or general comments. Please take a look at the project and share your thoughts. You can contribute by opening issues or submitting merge requests directly on GitLab. GitLab Repo: https://gitlab.com/CSniper/slurm-lab GitHub Mirror: https://github.com/csniper-patrick/slurm-lab Thank you for your time and for helping make Slurm-Lab a valuable resource for the entire community! Cheers, Patrick

1 0

Spread a multistep job across clusters
by Di Bernardini, Fabio 28 Aug '24

28 Aug '24

Hi everyone, for accounting reasons, I need to create only one job across two or more federated clusters with two or more srun steps. I'm trying with hetjobs but it's not clear to me from the documentation (https://slurm.schedmd.com/heterogeneous_jobs.html) if this is possible and how to do it. I'm trying with this script, but the steps are executed on only the first cluster. Can you tell me if there is a mistake in the hetjob or if it has to be done in another way? #!/bin/bash #SBATCH hetjob #SBATCH --clusters=cluster2 srun -v --het-group=0 hostname #SBATCH hetjob #SBATCH --clusters=cluster3 srun -v --het-group=1 hostname NICE SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2096882, Capitale Sociale: 10.329,14 EUR i.v., Cod. Fisc. e P.IVA 01133050052, Societa con Socio Unico

3 2

2025

2024

slurm-users