- slurm-users - lists.schedmd.com

SLURM_JOB_ACCOUNT var missing in prolog
by Jonás Arce 13 Mar '25

Hi everyone, The Slurm variable SLURM_JOB_ACCOUNT is blank in my Prolog. I currently have both a TaskProlog and a Prolog on my system. In TaskProlog (which I use to set environment variables) this variable works just fine, but in my Prolog (which I use to create temporary directories) it is simply blank. This is strange because other variables such as SLURM_JOB_PARTITION or SLURM_JOBID work just fine in both Prolog and TaskProlog. I looked at the Slurm documentation and it seems that this variable should work everywhere. If someone could shed some light on this, or point me to an equivalent variable, I'd appreciate it a lot. I need it because my Prolog has to create temporary directories of this form: /scratch-global/$SLURM_JOB_ACCOUNT/$SLURM_JOB_USER/$SLURM_JOBID.
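One possible workaround, sketched here under assumptions rather than offered as a confirmed fix: when SLURM_JOB_ACCOUNT is empty, the account could be read from the Account= field that `scontrol show job <jobid>` prints. The helper names below (parse_account, resolve_account, scontrol_show_job, scratch_dir) are hypothetical. Note that the Slurm documentation discourages calling Slurm commands from Prolog scripts, so this is probably better suited to diagnosis than to production use.

```python
import os
import re
import subprocess
from typing import Callable, Mapping, Optional


def parse_account(scontrol_output: str) -> Optional[str]:
    """Return the value of the Account= field, or None if absent or '(null)'."""
    match = re.search(r"\bAccount=(\S+)", scontrol_output)
    if match is None or match.group(1) == "(null)":
        return None
    return match.group(1)


def scontrol_show_job(job_id: str) -> str:
    """Run `scontrol show job <job_id>` and return its text output."""
    result = subprocess.run(
        ["scontrol", "show", "job", job_id],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def resolve_account(env: Mapping[str, str],
                    run_scontrol: Callable[[str], str]) -> Optional[str]:
    """Prefer SLURM_JOB_ACCOUNT; fall back to asking scontrol when it is empty."""
    account = env.get("SLURM_JOB_ACCOUNT")
    if account:
        return account
    job_id = env.get("SLURM_JOB_ID") or env.get("SLURM_JOBID")
    if not job_id:
        return None
    return parse_account(run_scontrol(job_id))


def scratch_dir(account: str, user: str, job_id: str,
                root: str = "/scratch-global") -> str:
    """Build the per-job scratch path in the layout described in the post."""
    return os.path.join(root, account, user, job_id)
```

In a Prolog written in Python, this would be called as resolve_account(os.environ, scontrol_show_job), with the result passed to scratch_dir and then os.makedirs. Passing the scontrol runner as an argument keeps the parsing logic testable on a machine without Slurm.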

Job Env Vars in Slurm Core
by Bhaskar Chakraborty 13 Mar '25

Hi everyone, I have tried my best to extract custom job environment variables inside a custom select plugin. However, I am not able to get them. I am able to get the environment in a custom client filter plugin during job submission (sbatch --export), but I need the same in slurmctld as well. Is there any way this can be achieved? I have looked over a few of the internal data structure pointers (job_record, job_details, etc.), but none seem to provide the required functionality. Any insight is appreciated. Regards, Bhaskar.

SLURM queue/partition/qos configuration
by vittorio.confuorto＠gmail.com 12 Mar '25

Hi everyone,

I need some advice on properly configuring my HPC cluster, which consists of three compute nodes (each with 256 cores and 800GB of RAM). The end users belong to three distinct groups: internal, external, and collaborator.

I would like the internal users to have full (and exclusive) access to the first two compute nodes. That is, by default, if they submit a job, it should be scheduled on Node1 or Node2, depending on availability (or placed in the queue if necessary). Internal users should also have access to 100 cores and 200GB of RAM on Node3, with this limit applying as a hard cap across all their jobs.

The remaining 156 cores and 600GB of RAM on Node3 should be exclusively available to external and collaborator users. Specifically, I would like the following setup:

- The resources on Node3 should be split as follows:
  - 100 cores and 400GB of RAM for external users.
  - 56 cores and 200GB of RAM for collaborator users.
- External users should be able to use the resources dedicated to collaborator users, but with the lowest priority. This means an external user can submit a job requesting the resources reserved for collaborators, but the job will only start if no collaborator jobs are in the queue. Any job submitted by a collaborator, even after an external user's job, should take priority. The same rule should apply in reverse for collaborators trying to use external resources.
- There should be a group/partition/queue that includes all the remaining 156 cores and 600GB of RAM on Node3. Both external and collaborator users should be able to submit jobs to this queue, but a job submitted here should only start if there are no pending jobs in the two dedicated partitions.

What is the best way to implement such an architecture?

Thanks, everyone, for your support!
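As a quick sanity check on the numbers in this request (not a Slurm configuration), the Node3 budget can be verified with a few lines of arithmetic. The dictionary and function names are hypothetical and the figures are taken directly from the post.

```python
# Per-node capacity and the requested Node3 shares, as stated in the post.
NODE3 = {"cores": 256, "mem_gb": 800}
INTERNAL_CAP = {"cores": 100, "mem_gb": 200}
DEDICATED = {
    "external": {"cores": 100, "mem_gb": 400},
    "collaborator": {"cores": 56, "mem_gb": 200},
}


def remaining(total, used):
    """Subtract one resource budget from another, key by key."""
    return {key: total[key] - used[key] for key in total}


def total_of(shares):
    """Sum several resource budgets, key by key."""
    summed = {"cores": 0, "mem_gb": 0}
    for share in shares.values():
        for key in summed:
            summed[key] += share[key]
    return summed


# What is left on Node3 once the internal cap is set aside.
shared_pool = remaining(NODE3, INTERNAL_CAP)
# What the external and collaborator shares add up to.
dedicated_sum = total_of(DEDICATED)
```

Both shared_pool and dedicated_sum come out to 156 cores and 600GB, so the two dedicated shares exactly fill what remains on Node3 after the internal cap, and the shared queue described in the last bullet overlaps them completely.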

Slurm Job Kill intimation
by Bhaskar Chakraborty 10 Mar '25

Hello all, We wish to do some cleanup when a user kills a pending Slurm job using scancel. Our setup has a client filter plugin that sets some internal variables while interacting with a remote process. This needs to be undone during job termination. Is there a way to do this (using the submitter's credentials)? Also, if one uses a custom select plugin, is there a way to inform the select plugin that the job has been terminated? At present, we don't find that slurmctld informs the select plugin when a pending job is killed. Thanks in advance for any pointers! Regards, Bhaskar.

Slurm upgrade using Debian packages
by Matthias Leopold 10 Mar '25

Hi, I'm building Slurm Debian packages from SchedMD sources using this tutorial https://www.schedmd.com/slurm/installation-tutorial/. Now I tried upgrading (minor release upgrade within 24.05) using these packages. https://slurm.schedmd.com/upgrades.html tells me to upgrade (a) slurmdbd (b) slurmctld (c) slurmd separately in this order, stopping each service for the upgrade. How can I follow this when the Debian packages have a dependency between slurmdbd + slurmctld that upgrades both packages at the same time? thx Matthias

allowing spyder-kernels to an interactive session without pam_slurm_adopt and older version of Slurm from OpenHPC repo; Paramiko?
by Robert Kudyba 06 Mar '25

So the Spyder IDE has this great feature to connect a local laptop or workstation to a spyder-kernels session <https://docs.spyder-ide.org/current/panes/ipythonconsole.html#using-externa…>. If a user starts an interactive session, e.g. with srun, they end up on a compute node, which is on one of our older clusters without the pam_slurm_adopt module. Is there a way to do this with a library like Paramiko or this slurmjob <https://github.com/daangeijs/deepops-slurmjob> package?

I see that for vscode an admin posted the following option <https://github.com/microsoft/vscode-remote-release/issues/1722#issuecomment…>:

    # Source the tunnel environment only for SSH sessions started by vscode
    if [[ -v SSH_AUTH_SOCK ]]; then
        if [[ ${SSH_AUTH_SOCK} =~ vscode ]]; then
            [ -f ~/.code-tunnel-env.bash ] && source ~/.code-tunnel-env.bash
        fi
    fi

Is there a way to work around the restriction that prevents non-admins from connecting to the spyder-kernels session? Is there anything similar to batchspawner for JupyterHub <https://github.com/jupyterhub/batchspawner>? Or does anyone know of an archive for Slurm slurm-ohpc-18.08 that includes pam_slurm_adopt?

Slurm version 24.11.3 is now available
by Marshall Garey 06 Mar '25

We are pleased to announce the availability of Slurm version 24.11.3.

24.11.3 fixes the database cluster ID generation not being random, a regression in which slurmd -G gave no output, a long-standing crash in slurmctld after updating a reservation with an empty nodelist, and some other minor to moderate bugs.

Downloads are available at https://www.schedmd.com/downloads.php .

--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support

> * Changes in Slurm 24.11.3
> ==========================
>  -- Fix race condition in slurmrestd that resulted in "Requested
>     data_parser plugin does not support OpenAPI plugin" error being returned
>     for valid endpoints.
>  -- If multiple partitions are requested, set the SLURM_JOB_PARTITION
>     output environment variable to the partition in which the job is running
>     for salloc and srun in order to match the documentation and the behavior of
>     sbatch.
>  -- Fix regression where slurmd -G gives no output.
>  -- Don't print misleading errors for stepmgr enabled steps.
>  -- slurmrestd - Avoid connection to slurmdbd for the following
>     endpoints:
>       GET /slurm/v0.0.41/jobs
>       GET /slurm/v0.0.41/job/{job_id}
>  -- slurmrestd - Avoid connection to slurmdbd for the following
>     endpoints:
>       GET /slurm/v0.0.40/jobs
>       GET /slurm/v0.0.40/job/{job_id}
>  -- Significantly increase entropy of clusterid portion of the
>     sluid by seeding the random number generator.
>  -- Avoid changing process name to "watch" from original daemon name.
>     This could potentially break some monitoring scripts.
>  -- Avoid slurmctld being killed by SIGALRM due to race condition
>     at startup.
>  -- Fix slurmctld crash after updating a reservation with an empty
>     nodelist. The crash could occur after restarting slurmctld, or if
>     downing/draining a node in the reservation with the REPLACE or REPLACE_DOWN
>     flag.
>  -- Fix race between task/cgroup cpuset and jobacctgather/cgroup.
>     The first was removing the pid from task_X cgroup directory causing
>     memory limits to not be applied.
>  -- srun - Fixed wrongly constructed SLURM_CPU_BIND env variable
>     that could get propagated to downward srun calls in certain mpi
>     environments, causing launch failures.
>  -- slurmrestd - Fix possible memory leak when parsing arrays with
>     data_parser/v0.0.40.
>  -- slurmrestd - Fix possible memory leak when parsing arrays with
>     data_parser/v0.0.41.
>  -- slurmrestd - Fix possible memory leak when parsing arrays with
>     data_parser/v0.0.42.

broken SLURM-PMIX out-of-band communication on v24.11.0 with PMIx v5
by Bertini, Denis Dr. 06 Mar '25

Hi there,

Using Slurm v24.11.0 together with Open MPI 5.0.7 built with OpenPMIx v5.0.6, I am facing a systematic crash in the process wiring-up phase when launching a standard MPI job (OSU benchmarks) on our new AMD compute nodes (AMD EPYC 9654, 192 physical cores + HT) running Rocky Linux 9.4.

The typical error reads:

slurmstepd: error: mpi/pmix_v5: pmixp_p2p_send: ccexe0094 [4]: pmixp_utils.c:469: send failed, rc=1001, exceeded the retry limit
slurmstepd: error: mpi/pmix_v5: _slurm_send: ccexe0094 [4]: pmixp_server.c:1581: Cannot send message to /var/spool/slurmd/stepd.slurm.pmix.656.0, size = 46979, hostlist: (null)
srun: error: Node failure on ccexe0091

After such an error, the node moves to the DOWN state.

It looks like the slurmstepd PMIx server cannot use the local socket at /var/spool/slurmd/stepd.slurm.pmix.job_id.0 for inter-node communication.

* On a single AMD node (same Slurm version, same cluster setup), wiring-up works smoothly even at core saturation (192 cores used).
* On Intel nodes (Intel Xeon Gold 6248R, 48 cores), wiring-up works without any problem, even across multiple nodes.
* When the problematic AMD nodes are set up as dynamic nodes <https://slurm.schedmd.com/dynamic_nodes.html>, the wiring-up phase with multiple nodes works perfectly, without any issue.

Has anybody experienced this kind of problem? Any idea what could be the reason for that?

Cheers,
Denis

---------
Denis Bertini
Department: CIT
Location: SB3 2.265a
Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bertini(a)gsi.de
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung: Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats: Ministerialdirigent Dr. Volkmar Dietz

error in logs, does it mean anything?
by Steven Jones 06 Mar '25

[2025-03-06T00:26:06.002] error: _add_registered_cluster: trying to register a cluster (poc-cluster) with no remote port
[2025-03-06T00:31:06.002] error: _add_registered_cluster: trying to register a cluster (poc-cluster) with no remote port
[2025-03-06T00:36:06.001] error: _add_registered_cluster: trying to register a cluster (poc-cluster) with no remote port
[2025-03-06T00:41:07.001] error: _add_registered_cluster: trying to register a cluster (poc-cluster) with no remote port

regards
Steven
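To see how regularly this message recurs, the quoted log lines can be parsed and the gaps between them computed. This is only a sketch: the regular expression and helper names are hypothetical and assume the `[timestamp] error: message` layout shown in the post.

```python
import re
from datetime import datetime

# The four lines quoted in the post.
LOG = """\
[2025-03-06T00:26:06.002] error: _add_registered_cluster: trying to register a cluster (poc-cluster) with no remote port
[2025-03-06T00:31:06.002] error: _add_registered_cluster: trying to register a cluster (poc-cluster) with no remote port
[2025-03-06T00:36:06.001] error: _add_registered_cluster: trying to register a cluster (poc-cluster) with no remote port
[2025-03-06T00:41:07.001] error: _add_registered_cluster: trying to register a cluster (poc-cluster) with no remote port
"""

LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] error: (?P<msg>.*)$")


def parse_errors(log_text):
    """Return (timestamp, message) pairs for every error line in the log text."""
    entries = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S.%f")
            entries.append((stamp, match["msg"]))
    return entries


def intervals_seconds(entries):
    """Return the gap in seconds between each consecutive pair of entries."""
    times = [stamp for stamp, _ in entries]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


entries = parse_errors(LOG)
gaps = intervals_seconds(entries)
```

For the four lines above, the gaps are all within about a second of 300 seconds, which suggests a periodic task repeating every five minutes rather than something triggered by user activity.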

Re: mariadb refusing access
by Steven Jones 05 Mar '25

Yes, that is set, and my settings are:

# slurmDBD info
DbdAddr=localhost
DbdHost=vuwunicoslurmd3.ods.vuw.ac.nz
#DbdHost=localhost
DbdPort=6819
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=verbose
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
#StorageHost=localhost
StorageHost=localhost
#StoragePort=1234
#StoragePort=3306
StoragePass=xxxxxxxx
StorageUser=slurm
StorageLoc=slurm_acct_db
#ssj

regards
Steven

________________________________
From: Sarlo, Jeffrey S <JSarlo(a)Central.UH.EDU>
Sent: Thursday, 6 March 2025 7:42 am
To: Steven Jones <steven.jones(a)vuw.ac.nz>
Subject: RE: [slurm-users] Re: mariadb refusing access

Do you have the correct password listed in the slurmdbd.conf file? That was toward the bottom of the Wiki that Ole sent.

Jeff

From: Steven Jones via slurm-users <slurm-users(a)lists.schedmd.com>
Sent: Wednesday, March 5, 2025 12:39 PM
To: slurm-users(a)lists.schedmd.com; Ole Holm Nielsen <Ole.H.Nielsen(a)fysik.dtu.dk>
Subject: [slurm-users] Re: mariadb refusing access

Sorry, I don't follow you. The password works fine and the grant looks OK.

[root@vuwunicoslurmd3 ~]# mysql -u slurm -p
Enter password:
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 208
Server version: 10.11.10-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show grants;
+--------------------------------------------------------------------------------------------------------------+
| Grants for slurm@localhost                                                                                   |
+--------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO `slurm`@`localhost` IDENTIFIED BY PASSWORD '*2F4EEC707189D8B5E329829B9303058EF7585569' |
| GRANT ALL PRIVILEGES ON `slurm_acct_db`.* TO `slurm`@`localhost`                                             |
+--------------------------------------------------------------------------------------------------------------+
2 rows in set (0.000 sec)

MariaDB [(none)]>

Isn't there something about password encryption / password hash?

regards
Steven

________________________________
From: Ole Holm Nielsen via slurm-users <slurm-users(a)lists.schedmd.com>
Sent: Wednesday, 5 March 2025 9:25 pm
To: slurm-users(a)lists.schedmd.com <slurm-users(a)lists.schedmd.com>
Subject: [slurm-users] Re: mariadb refusing access

On 3/5/25 02:23, Steven Jones via slurm-users wrote:
> In the logs I am seeing,
>
> [root@vuwunicoslurmd3 mariadb]# tail -f mariadb.log
> 2025-03-04 19:01:32 12565 [Warning] Access denied for user
> 'slurm'@'localhost' (using password: YES)
> 2025-03-04 19:06:19 12566 [Warning] Access denied for user
> 'slurm'@'localhost' (using password: YES)
>
> However mysql -u slurm -p works just fine so it seems to be a config
> error for slurmdbd

It seems that you didn't select a suitable database password for the slurm user, see
https://apc01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwiki.fysi…

IHTH,
Ole

--
slurm-users mailing list -- slurm-users(a)lists.schedmd.com
To unsubscribe send an email to slurm-users-leave(a)lists.schedmd.com
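When an interactive `mysql -u slurm -p` login succeeds but slurmdbd is refused, one thing worth ruling out is a difference between what was typed at the prompt and what the daemon actually reads from slurmdbd.conf. The sketch below uses hypothetical helper names and a deliberately simplified reading of the file format: `#` starts a comment, later keys override earlier ones, and keys are matched case-sensitively. These are assumptions, not Slurm's actual parser. It lists the effective Storage* settings with the password masked.

```python
def parse_conf(text):
    """Collect Key=Value pairs, ignoring blank lines and '#' comments.

    Simplification: everything after the first '#' on a line is dropped,
    so a value that itself contains '#' would be cut short here.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def storage_summary(settings):
    """Return the database-related settings, masking the password."""
    summary = {}
    for key in ("StorageHost", "StoragePort", "StorageUser", "StorageLoc"):
        if key in settings:
            summary[key] = settings[key]
    if "StoragePass" in settings:
        summary["StoragePass"] = f"<set, {len(settings['StoragePass'])} chars>"
    return summary
```

Running storage_summary(parse_conf(open("/etc/slurm/slurmdbd.conf").read())) would show whether the host, user, database name and password length match what was used in the manual test. The path here is an assumed location and may differ on a given installation.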
