Hi,
In our SLURM cluster, we are using the job_container/tmpfs plugin to
ensure that each user can use /tmp and that it gets cleaned up after them.
Currently, we map /tmp into the node's RAM, which means that cgroups
make sure users can only use a certain amount of storage inside /tmp.
Now we would like to use the node's local SSD instead of its RAM to
hold the files in /tmp. I have seen people define local storage as a GRES,
but I am wondering how to make sure that users do not exceed the storage
space they requested in a job. Does anyone have an idea how to configure
local storage as a properly tracked resource?
Thanks a lot in advance!
Best,
Tim
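For illustration, one way local scratch is sometimes made a countable resource is a plugin-less GRES; a minimal sketch, where the GRES name "tmp", the node name, and the sizes are assumptions (note that a plain GRES count is only scheduler-side bookkeeping and does not enforce how much a job actually writes):

    # slurm.conf (existing node line shortened; "tmp" is an arbitrary name):
    GresTypes=tmp
    NodeName=node01 ... Gres=tmp:800G

    # gres.conf on the node:
    NodeName=node01 Name=tmp Count=800G

    # job submission:
    sbatch --gres=tmp:100G job.sh

The enforcement gap is exactly the question raised here, which is what the quota-based discussion below is about.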
Hi Magnus,
I understand. Thanks a lot for your suggestion.
Best,
Tim
On 06.02.24 15:34, Hagdorn, Magnus Karl Moritz wrote:
> Hi Tim,
> in the end the InitScript didn't contain anything useful because
>
> slurmd: error: _parse_next_key: Parsing error at unrecognized key:
> InitScript
>
> At this stage I gave up. This was with SLURM 23.02. My plan was to
> setup the local scratch directory with XFS and then get the script to
> apply a project quota, i.e. a quota attached to the directory.
>
> I would start by checking if slurm recognises the InitScript option.
>
> Regards
> magnus
>
> On Tue, 2024-02-06 at 15:24 +0100, Tim Schneider wrote:
>> Hi Magnus,
>>
>> thanks for your reply! If you can, would you mind sharing the
>> InitScript
>> of your attempt at getting it to work?
>>
>> Best,
>>
>> Tim
>>
>> On 06.02.24 15:19, Hagdorn, Magnus Karl Moritz wrote:
>>> Hi Tim,
>>> we are using the container/tmpfs plugin to map /tmp to a local NVMe
>>> drive, which works great. I did consider setting up directory quotas.
>>> I thought the InitScript [1] option should do the trick. Alas, I
>>> didn't get it to work. If I remember correctly, slurm complained about
>>> the option being present. In the end we recommend that our users make
>>> exclusive use of a node if they are going to use a lot of local
>>> scratch space. I don't think this happens very often, if at all.
>>> Regards
>>> magnus
>>>
>>> [1]
>>> https://slurm.schedmd.com/job_container.conf.html#OPT_InitScript
>>>
>>>
>>> On Tue, 2024-02-06 at 14:39 +0100, Tim Schneider via slurm-users
>>> wrote:
>>>> Hi,
>>>>
>>>> In our SLURM cluster, we are using the job_container/tmpfs plugin to
>>>> ensure that each user can use /tmp and it gets cleaned up after them.
>>>> Currently, we are mapping /tmp into the node's RAM, which means that
>>>> the cgroups make sure that users can only use a certain amount of
>>>> storage inside /tmp.
>>>>
>>>> Now we would like to use the node's local SSD instead of its RAM to
>>>> hold the files in /tmp. I have seen people define local storage as
>>>> GRES, but I am wondering how to make sure that users do not exceed
>>>> the storage space they requested in a job. Does anyone have an idea
>>>> how to configure local storage as a proper tracked resource?
>>>>
>>>> Thanks a lot in advance!
>>>>
>>>> Best,
>>>>
>>>> Tim
>>>>
>>>>
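For reference, the kind of per-directory project quota described above could in principle be applied by a small script; this is only a sketch, assuming the scratch filesystem is XFS mounted with prjquota, that InitScript (or a job prolog) actually runs on your Slurm version, and that the per-job directory and job ID are available as shown (the paths and the 100g limit are placeholders):

    #!/bin/bash
    # Sketch: apply an XFS project quota to a per-job scratch directory.
    # MOUNT, JOB_DIR, and the 100g limit are placeholders; whether
    # SLURM_JOB_ID is set in this context is an assumption.
    MOUNT=/local/scratch
    JOB_DIR=$MOUNT/$SLURM_JOB_ID
    PROJ_ID=$SLURM_JOB_ID

    mkdir -p "$JOB_DIR"
    # Tag the directory tree with the project ID, then cap its block usage.
    xfs_quota -x -c "project -s -p $JOB_DIR $PROJ_ID" "$MOUNT"
    xfs_quota -x -c "limit -p bhard=100g $PROJ_ID" "$MOUNT"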
Hi Magnus,
thanks for your reply! If you can, would you mind sharing the InitScript
of your attempt at getting it to work?
Best,
Tim
On 06.02.24 15:19, Hagdorn, Magnus Karl Moritz wrote:
> Hi Tim,
> we are using the container/tmpfs plugin to map /tmp to a local NVMe
> drive, which works great. I did consider setting up directory quotas.
> I thought the InitScript [1] option should do the trick. Alas, I didn't
> get it to work. If I remember correctly, slurm complained about the
> option being present. In the end we recommend that our users make
> exclusive use of a node if they are going to use a lot of local scratch
> space. I don't think this happens very often, if at all.
> Regards
> magnus
>
> [1]
> https://slurm.schedmd.com/job_container.conf.html#OPT_InitScript
>
>
> On Tue, 2024-02-06 at 14:39 +0100, Tim Schneider via slurm-users wrote:
>> Hi,
>>
>> In our SLURM cluster, we are using the job_container/tmpfs plugin to
>> ensure that each user can use /tmp and it gets cleaned up after them.
>> Currently, we are mapping /tmp into the node's RAM, which means that
>> the cgroups make sure that users can only use a certain amount of
>> storage inside /tmp.
>>
>> Now we would like to use the node's local SSD instead of its RAM to
>> hold the files in /tmp. I have seen people define local storage as
>> GRES, but I am wondering how to make sure that users do not exceed
>> the storage space they requested in a job. Does anyone have an idea
>> how to configure local storage as a proper tracked resource?
>>
>> Thanks a lot in advance!
>>
>> Best,
>>
>> Tim
>>
>>
Hello
I have the following scenario:
I need to submit a sequence of up to 400 jobs where each even job depends
on the preceding odd job finishing, and each odd job depends on the presence
of a file generated by the preceding even job (availability of the file for
the first of those 400 jobs is guaranteed).
If I just submit all those jobs in a loop using dependencies, I end up with
a lot of pending jobs that might later not even run because no output file
has been produced by the preceding jobs. Is there a way to pause the
submission loop until the required file has been generated, so that at most
two jobs are submitted at the same time?
Here is a sample submission script showing what I want to achieve.
for i in {1..200}; do
    FILE=GHM_paramset_${i}.dat
    # How can I pause the submission loop until the FILE has been created?
    #if test -f "$FILE"; then
    jobid4=$(sbatch --parsable --dependency=afterok:$jobid3 job4_sub $i)
    jobid3=$(sbatch --parsable --dependency=afterok:$jobid4 job3_sub $i)
    #fi
done
Any help will be appreciated
Amjad
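One possible approach is to poll for the file inside the loop; a sketch reusing the names from the script above (the 60-second interval and the handling of the unset jobid3 on the first iteration are additions):

    for i in {1..200}; do
        FILE=GHM_paramset_${i}.dat
        # Block the submission loop until the preceding even job has
        # written the file (poll every 60 seconds; interval is arbitrary).
        until [ -f "$FILE" ]; do
            sleep 60
        done
        # ${jobid3:+...} drops the dependency on the very first iteration,
        # where jobid3 is not yet set.
        jobid4=$(sbatch --parsable ${jobid3:+--dependency=afterok:$jobid3} job4_sub "$i")
        jobid3=$(sbatch --parsable --dependency=afterok:$jobid4 job3_sub "$i")
    done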
Hi,
I am a little new to this, so please pardon my ignorance.
I have configured slurm in my cluster and it works fine with local
users. But I am not able to get it working with LDAP/SSSD authentication.
User logins using ssh are working fine. An LDAP user can log in to the
login, slurmctld, and compute nodes, but when they try to submit jobs,
slurmctld logs an error about an invalid account or partition for the user.
Someone said we need to add the user manually into the database using the
sacctmgr command. But I am not sure we need to do this for each and every
LDAP user. Yes, it does work if we add the LDAP user manually using
sacctmgr, but I am not convinced this manual way is the right way to do it.
The documentation is not very clear about using LDAP accounts.
Saw somewhere in the list about using UsePAM=1 and copying or creating a
softlink for the slurm PAM module under /etc/pam.d, but it didn't work for me.
Saw somewhere else that we need to specify
LaunchParameters=enable_nss_slurm in the slurm.conf file and put the slurm
keyword in the passwd/group entries in the /etc/nsswitch.conf file. Did
these, but they didn't help either.
I am bereft of ideas at present. If anyone has real world experience and
can advise, I will be grateful.
Thank you,
Richard
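For what it's worth, the manual registration mentioned above usually comes down to two sacctmgr calls, and it can be scripted over an LDAP group; a sketch, with made-up account, user, and group names:

    # Create an account once, then attach a user to it (-i skips the
    # confirmation prompt):
    sacctmgr -i add account research Description="LDAP users" Organization=dept
    sacctmgr -i add user alice Account=research

    # Bulk-add every member of an LDAP group (group name is hypothetical):
    for u in $(getent group hpcusers | cut -d: -f4 | tr ',' ' '); do
        sacctmgr -i add user "$u" Account=research
    done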
If I use the sbatch(1) option --export=NONE, wipe the environment with "env -i /usr/bin/sbatch ...", or use --export=NIL, then the environment is not properly constructed and I see
the following messages in the /var/log/*slurm* files:
[2024-02-03T11:50:33.052] _get_user_env: get env for user jsu here
[2024-02-03T11:52:33.152] timeout waiting for /bin/su to complete
[2024-02-03T11:52:34.152] error: Failed to load current user environment variables
[2024-02-03T11:52:34.153] error: _get_user_env: Unable to get user's local environment, running only with passed environment
This occurs at 120 seconds no matter whether I add --get-user-env=3600 or adjust many slurm.conf time-related parameters. It is easy to reproduce by adding "sleep 100" to a .cshrc file and submitting the following file with sbatch(1):
#!/bin/csh
#SBATCH --export=NONE --propagate=NONE --get-user-env=3600L
printenv HOME
printenv USER
printenv PATH
env
I have adjusted MANY time-related limits in the slurm.conf file to no avail. When the system is unresponsive or heavily loaded, or when users have prologues that set up complex environments via module commands (which can be notoriously slow), jobs fail or produce errors.
If I configure Slurm so that jobs that time out are requeued instead of run, then a user with a slow login setup can submit a large number of jobs and basically shut down a cluster, because this option not only requeues the jobs that fail but also puts the nodes they ran on in a DRAIN state.
We see this as very dangerous, as by default jobs proceed to execute even when their environment is not properly constructed.
I can see that "slurmrestd getenv" and the procedure get_user_env(3c) are involved, and a preliminary scan of the code suggested the --get-user-env=NNNN value is being parsed,
but I did not see a reason why the setup always times out at 120 seconds (at least on my system).
Does anyone know how to make the time allowed for building the default user environment honor the value of the --get-user-env option when no environment is exported to a job?
This is showing up sporadically and causing intermittent failures that are very confusing and disturbing to the users it occurs with.
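Since the log shows slurmd waiting on /bin/su, one quick diagnostic (a sketch, run as root on the affected node; "jsu" is the user from the log above) is to time how long building that user's login environment actually takes:

    # Mimic what --get-user-env does via /bin/su and measure its duration;
    # anything approaching two minutes would explain the timeouts above.
    time su - jsu -c /usr/bin/env > /dev/null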
The log files use many different strings to identify a job, and some job-related messages contain no job ID at all.
NUMBER=$SLURM_JOBID
egrep "\.\<$NUMBER\>\] |\<$NUMBER\>\.batch|jobid \<$NUMBER\>|JobId=\<$NUMBER\>|job id \<$NUMBER\>|job\.\<$NUMBER\>|job \<$NUMBER\>|jobid \[\<$NUMBER\>\]|task_p_slurmd_batch_request: \<$NUMBER\>" /var/log/slurm*
Even that misses crucial data that does not contain the job ID at all:
[2024-02-03T11:50:33.052] _get_user_env: get env for user jsu here
[2024-02-03T11:52:33.152] timeout waiting for /bin/su to complete
[2024-02-03T11:52:34.152] error: Failed to load current user environment variables
[2024-02-03T11:52:34.153] error: _get_user_env: Unable to get user's local environment, running only with passed environment
It would be very useful if all messages related to a job had a consistent string in them for grepping the log files;
even better might be a command like "scontrol show jobid=NNNN log_messages".
But I could not find what I wanted (an easy way to find all daemon log messages related to a specific job). I would find it particularly useful if there were a way to automatically append such information to the stdout of the job at job termination so users would automatically get information about job failures or warnings.
Is there such a feature available I have missed?
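In the meantime, wrapping the pattern list above in a small shell function keeps it reusable (a sketch; the function name is made up):

    # Grep the Slurm logs for every line that mentions a given job ID,
    # using the pattern variants listed above.
    slurm_job_log() {
        local id="$1"
        egrep "\.\<${id}\>\] |\<${id}\>\.batch|jobid \<${id}\>|JobId=\<${id}\>|job id \<${id}\>|job\.\<${id}\>|job \<${id}\>|jobid \[\<${id}\>\]|task_p_slurmd_batch_request: \<${id}\>" /var/log/slurm*
    }
    # Example: slurm_job_log "$SLURM_JOBID"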
I finally had downtime on our cluster running 20.11.3 and decided to
upgrade SLURM. All daemons were stopped on nodes and master.
The Rocky 8 Linux OS was updated but not changed configuration-wise
in any way.
On the master, when I first installed 23.11.1 and tried to run
slurmdbd -D -vvv at the command line, it balked because the jump was more
than two major versions. I then installed 22.05.6 and ran slurmdbd -D -vvv,
which took a long time but completed fine. I reinstalled 23.11.1
and ran slurmdbd -D -vvv, which went quickly. I then stopped it and
ran slurmdbd normally with systemctl.
When I ran slurmctld, it complained about incompatible state files.
I didn't care about losing state, so I ran it with the -i option to
ignore them. It seemed happy.
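For anyone following the same path, the database/daemon sequence described above boils down to roughly the following (a sketch; the package installation steps depend on your packaging, and -i throws away controller state):

    # Stop daemons, then step the database through an intermediate release.
    systemctl stop slurmctld slurmdbd
    # ... install 22.05.x ...
    slurmdbd -D -vvv     # first conversion (slow); stop it once finished
    # ... install 23.11.x ...
    slurmdbd -D -vvv     # second conversion (quick); stop it again
    systemctl start slurmdbd
    slurmctld -i         # -i ignores the incompatible state files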
I then started up slurmd on my various GPU and non-GPU nodes.
I ran a test interactive job just to see if I would get a shell
with the expected SLURM environment on the nodes, and it worked,
except that on the GPU nodes I got errors about not being able
to get/set GPU frequencies. I removed GpuFreqDef=medium from slurm.conf
and that error went away.
I had my people start running jobs last night.
This morning I came in and saw that a lot of nodes are stuck in the
completing state. All of these are GPU nodes.
For example, on node 'sevilla' squeue reports:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3679433 lcna100 sjob_123 sg1526 CG 8:29 1 sevilla
3679432 lcna100 sjob_123 sg1526 CG 8:35 1 sevilla
3679431 lcna100 sjob_123 sg1526 CG 8:42 1 sevilla
In slurmd.log on sevilla I see the following lines every minute:
[2024-01-28T09:37:52.003] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug: _rpc_terminate_job: uid = 1150 JobId=3679432
[2024-01-28T09:37:52.004] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug: _rpc_terminate_job: uid = 1150 JobId=3679431
[2024-01-28T09:37:52.004] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug: _rpc_terminate_job: uid = 1150 JobId=3679433
[2024-01-28T09:37:52.004] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
Back at the time job 3679433 was cancelled by the user, I see:
[2024-01-28T00:27:48.658] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T00:27:48.658] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T00:27:48.658] debug: _rpc_terminate_job: uid = 1150 JobId=3679433
[2024-01-28T00:27:48.658] debug: credential for job 3679433 revoked
[2024-01-28T00:27:48.658] debug: _rpc_terminate_job: sent SUCCESS for 3679433, waiting for prolog to finish
[2024-01-28T00:27:48.658] debug: Waiting for job 3679433's prolog to complete
On sevilla itself there are no processes currently running for user
sg1526, so there is no sign of these jobs. There ARE processes running
for user mu40, though:
[root@sevilla ~]# pgrep -f slurm_script | xargs -n1 pstree -p | grep ^slurm
slurm_script(30691)---starter-suid(30764)-+-python3(30784)-+-{python3}(30817)
slurm_script(30861)---starter-suid(30873)-+-python3(30888)-+-{python3}(30915)
slurm_script(30963)---starter-suid(30975)-+-python3(30992)-+-{python3}(31020)
[root@sevilla ~]# strings /proc/30691/cmdline
/bin/sh
/var/slurm/spool/d/job3679220/slurm_script
/usr/bin/sacct -p --jobs=3679220
--format=JobID,User,ReqTRES,NodeList,Start,End,Elapsed,CPUTime,State,ExitCode
JobID|User|ReqTRES|NodeList|Start|End|Elapsed|CPUTime|State|ExitCode|
3679220|mu40|billing=6,cpu=3,gres/gpu=1,mem=32G,node=1|sevilla|2024-01-27T22:33:27|2024-01-27T23:26:39|00:53:12|02:39:36|NODE_FAIL|1:0|
3679220.batch|||sevilla|2024-01-27T22:33:27|2024-01-27T23:26:39|00:53:12|02:39:36|CANCELLED||
3679220.extern|||sevilla|2024-01-27T22:33:27|2024-01-27T23:26:39|00:53:12|02:39:36|CANCELLED||
At the time the job started, slurmd.log has lines like:
[2024-01-27T22:33:28.679] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30691, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.092] [3679220.extern] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30650, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.166] [3679220.batch] debug2: profile signaling type Task
[2024-01-27T22:33:58.168] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30691, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.170] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30764, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.172] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30784, probably hasn't started yet or has already finished
...
[2024-01-27T22:46:39.018] debug: _step_connect: connect() failed for /var/slurm/spool/d/sevilla_3679220.4294967292: Connection refused
[2024-01-27T22:46:39.018] debug: Cleaned up stray socket /var/slurm/spool/d/sevilla_3679220.4294967292
[2024-01-27T22:46:39.018] debug: _step_connect: connect() failed for /var/slurm/spool/d/sevilla_3679220.4294967291: Connection refused
[2024-01-27T22:46:39.018] debug: Cleaned up stray socket /var/slurm/spool/d/sevilla_3679220.4294967291
[2024-01-27T22:46:39.018] _handle_stray_script: Purging vestigial job script /var/slurm/spool/d/job3679220/slurm_script
Then later, at the time of termination:
[2024-01-27T23:26:39.042] debug: _rpc_terminate_job: uid = 1150 JobId=3679220
[2024-01-27T23:26:39.042] debug: credential for job 3679220 revoked
[2024-01-27T23:26:39.042] debug2: No steps in jobid 3679220 to send signal 998
[2024-01-27T23:26:39.043] debug2: No steps in jobid 3679220 to send signal 18
[2024-01-27T23:26:39.043] debug2: No steps in jobid 3679220 to send signal 15
[2024-01-27T23:26:39.044] debug2: set revoke expiration for jobid 3679220 to 1706416119 UTS
[2024-01-27T23:26:39.044] debug: Waiting for job 3679220's prolog to complete
[2024-01-27T23:26:39.044] debug: Finished wait for job 3679220's prolog to complete
[2024-01-27T23:26:39.068] debug: completed epilog for jobid 3679220
[2024-01-27T23:26:39.071] debug: JobId=3679220: sent epilog complete msg: rc = 0
So I have a node stuck in the 'comp' state that claims to have
completing jobs that are definitely NOT running on the box, BUT
there are jobs running on the box that SLURM thinks are done.
---------------------------------------------------------------
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129 USA
Hey folks -
The mailing list will be offline for about an hour as we upgrade the
host, upgrade the mailing list software, and change the mail
configuration around.
As part of these changes, the "From: " field will no longer be the
original sender, but instead use the mailing list ID itself. This is to
comply with DMARC sending options, and allow us to start DKIM signing
messages to ensure deliverability once Google and Yahoo impose new
policy changes in February.
This is the last post on the current (mailman2) list. I'll send a
welcome message on the upgraded (mailman3) list once finished, and when
the list is open to new traffic again.
- Tim
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support