I am using Red Hat's IdM/IPA for users.
Slurmctld is failing to run jobs and it is getting "invalid user id".
"2025-01-28T21:48:50.271] sched: Allocate JobId=4 NodeList=node4 #CPUs=1 Partition=debug [2025-01-28T21:48:50.280] Killing non-startable batch JobId=4: Invalid user id"
id on the slurm controller works fine.
[xxxjoness@xxx.ac.nz@hpcunidrslurmd2 ~]$ id xxxjoness@xxx.ac.nz
uid=1204805830(xxxjoness@xxx.ac.nz) gid=1204805830(xxxjoness@xxx.ac.nz) groups=1204805830(xxxjoness@xxx.ac.nz)
8><---
Any ideas, please? Because I am out of them.
I have tried RHEL 9.5; that seemed to run, but its srun is version 22 while the Rocky 8 nodes are on version 20, so it fails.
regards
Steven
Have you run id on a compute node?
Hi,
Yes, even ssh works OK.
[root@xxxunicobuildt1 warewulf]# ssh xxxjonesst@xxx.ac.nz@node1
(xxxjonesst@xxx.ac.nz@node1) Password:
Last login: Wed Jan 29 01:26:21 2025 from 130.195.87.12
[xxxjonesst@xxx.ac.nz@node1 ~]$ whoami | id
uid=1204805830(xxxjonesst@xxx.ac.nz) gid=1204805830(xxxjonesst@xxx.ac.nz)
tail -f /var/log/secure
=========
Jan 30 18:19:56 node1 sshd[15443]: pam_sss(sshd:auth): authentication success; logname= uid=0 euid=0 tty=ssh ruser= rhost=130.195.87.12 user=xxxjonesst@xxx.ac.nz
Jan 30 18:19:56 node1 sshd[15440]: Accepted keyboard-interactive/pam for xxxjonesst@xxx.ac.nz from 130.195.87.12 port 59402 ssh2
Would there be any relevant changes between RHEL8's slurm and RHEL9's slurm?
[root@node1 ~]# rpm -qa | grep slurm
slurm-libs-20.11.9-1.el8.x86_64
slurm-slurmd-20.11.9-1.el8.x86_64
slurm-20.11.9-1.el8.x86_64
[root@node1 ~]#
I would have to go back and check, but I do not think I hit this on RHEL 9; what I did get was that srun ver22 on the RHEL 9 server didn't like srun ver20 on the Rocky 8 nodes.
Can I compile / rpmbuild srun ver22 to run on Rocky 8? Or is that part of slurmd?
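Would it be the documented rpmbuild route on a Rocky 8 host, something like this? (I am guessing at the dependency list.)

dnf install rpm-build gcc make munge-devel pam-devel readline-devel perl
rpmbuild -ta slurm-22.05.9.tar.bz2    # resulting RPMs land under ~/rpmbuild/RPMS/x86_64/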
regards
Steven
On 29/1/25 10:44 am, Steven Jones via slurm-users wrote:
"2025-01-28T21:48:50.271] sched: Allocate JobId=4 NodeList=node4 #CPUs=1 Partition=debug [2025-01-28T21:48:50.280] Killing non-startable batch JobId=4: Invalid user id"
Looking at the source code, it looks like that second error is reported by slurmctld when it sends the RPC out to the compute node and gets a response back, so I would look at what's going on with node4 to see what's being reported there.
All the best, Chris
Hi,
Thanks for the reply. I already went through this 🙁. I checked all nodes; id works, as does an ssh login.
[root@node4 ~]# id xxxjonesst@xxx.ac.nz
uid=1204805830(xxxjonesst@xxx.ac.nz) gid=1204805830(xxxjonesst@xxx.ac.nz)
8><---
Connection to node1 closed.
[root@xxxunicobuildt1 warewulf]# ssh xxxjonesst@xxx.ac.nz@node4
(xxxjonesst@xxx.ac.nz@node4) Password:
[xxxjonesst@xxx.ac.nz@node4 ~]$ whoami
xxxjonesst@xxx.ac.nz
[xxxjonesst@xxx.ac.nz@node4 ~]$
regards
Steven
Hi,
[2025-01-29T00:33:32.123] CPU frequency setting not configured for this node
[2025-01-29T00:33:32.124] slurmd version 20.11.9 started
[2025-01-29T00:33:32.125] slurmd started on Wed, 29 Jan 2025 00:33:32 +0000
[2025-01-29T00:33:32.125] CPUs=20 Boards=1 Sockets=20 Cores=1 Threads=1 Memory=48269 TmpDisk=23308 Uptime=20 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2025-01-29T00:40:36.557] error: Security violation, ping RPC from uid 12002
[2025-01-29T00:43:56.866] error: Security violation, ping RPC from uid 12002
[2025-01-29T00:45:36.025] error: Security violation, ping RPC from uid 12002
[2025-01-29T00:47:16.204] error: Security violation, ping RPC from uid 12002
[2025-01-29T00:48:56.351] error: Security violation, ping RPC from uid 12002
8><----
[2025-01-30T19:43:49.773] error: Security violation, ping RPC from uid 12002
[2025-01-30T19:44:03.823] error: Security violation, batch launch RPC from uid 12002
[2025-01-30T19:44:03.836] error: Security violation: kill_job(8) from uid 12002
[2025-01-30T19:44:59.974] error: Security violation: kill_job(8) from uid 12002
[2025-01-30T19:45:29.024] error: Security violation, ping RPC from uid 12002
8><----
Doh, I was looking in /var/log/slurm/ and not /var/log/. I can try running a job again to get a fresh log?
regards
Steven
On 2/2/25 1:54 pm, Steven Jones via slurm-users wrote:
Thanks for the reply. I already went through this 🙁. I checked all nodes; id works, as does an ssh login.
What is in your slurmd logs on that node?
On 2/2/25 2:46 pm, Steven Jones via slurm-users wrote:
[2025-01-30T19:45:29.024] error: Security violation, ping RPC from uid 12002
Looking at the code, that seems to come from this:
if (!_slurm_authorized_user(msg->auth_uid)) {
    error("Security violation, batch launch RPC from uid %u",
          msg->auth_uid);
    rc = ESLURM_USER_ID_MISSING;    /* or bad in this case */
    goto done;
}
and what it is calling is:
/*
 * Returns true if "uid" is a "slurm authorized user" - i.e. uid == 0
 * or uid == slurm user id at this time.
 */
static bool _slurm_authorized_user(uid_t uid)
{
    return ((uid == (uid_t) 0) || (uid == slurm_conf.slurm_user_id));
}
Is it possible you're trying to run Slurm as a user other than root or the user designated as the "SlurmUser" in your config?
Also check that whoever you have set as the SlurmUser has the same UID everywhere (in fact every user should).
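A loop like this from the controller would confirm it (a sketch, assuming root ssh and the node names used earlier in this thread):

for h in node1 node2 node3 node4; do
    echo "== $h =="
    ssh "$h" id slurm    # uid/gid should match the controller on every node
done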
All the best, Chris
Hi,
I have never done an HPC before; it is all new to me, so I can be making "newbie errors". The old HPC has been dumped on us, so I am trying to build it "professionally", shall we say, i.e. documented and stable, and I will train people to build it (all this with no money at all).
My understanding is that I log in as a normal user and run a job, and this worked for me last time. It is possible I have missed something.
[xxxjonesst@xxx.ac.nz@xxxunicoslurmd1 ~]$ cat testjob.sh
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --partition=debug
#SBATCH --time=00:10:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

echo "Hello World"
echo "Hello Error" 1>&2
This worked on a previous setup; the outputs were in my home directory on the NFS server as expected.
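(For completeness, this is how I submit and check it; the job ID varies per run:)

sbatch testjob.sh
squeue -u "$USER"
cat test_*.out test_*.err    # --output=%x_%j.out expands to <jobname>_<jobid>.out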
regards
Steven
On 2/2/25 3:46 pm, Steven Jones wrote:
I have never done an HPC before; it is all new to me, so I can be making "newbie errors". The old HPC has been dumped on us, so I am trying to build it "professionally", shall we say, i.e. documented and stable, and I will train people to build it (all this with no money at all).
No worries at all! It would be good to know what this says:
scontrol show config | fgrep -i slurmuser
If that doesn't say "root" what does the "id" command say for that user on both the system where slurmctld is running and on node4?
Also on the node where slurmctld is running what does this say?
ps auxwww | fgrep slurmctld
Best of luck! Chris
(You can tell I'm stranded at SFO until tonight due to American Airlines pulling the plane for my morning flight out of service. Still, I'd rather that than be another news headline.)
On the Slurm server:
[root@xxxunidrslurmd2 slurm]# scontrol show config | fgrep -i slurmuser
SlurmUser = slurm(12002)
[root@xxxunidrslurmd2 slurm]# id slurm
uid=12002(slurm) gid=12002(slurm) groups=12002(slurm)
[root@xxxunidrslurmd2 slurm]#
[root@xxxunidrslurmd2 slurm]# ps auxwww | fgrep slurmctld
root 2114617 0.0 0.0 222016 1216 pts/2 S+ 00:15 0:00 grep -F --color=auto slurmctld
root 2314392 0.0 0.0 217196 856 pts/1 S+ Jan28 0:02 tail -f slurmctld.log
slurm 4076498 0.0 0.2 1122692 7868 ? Ssl Jan28 1:46 /usr/sbin/slurmctld -D
[root@xxxunidrslurmd2 slurm]#
Isn't it slurmd on the compute nodes?
[root@node4 log]# id slurm
uid=12002(slurm) gid=12002(slurm) groups=12002(slurm)
[root@node4 log]# ps auxwww | fgrep slurmctld
root 39462 0.0 0.0 16476 1184 pts/1 S+ 00:14 0:00 grep -F --color=auto slurmctld
[root@node4 log]# ps auxwww | fgrep slurmd
root 792 0.0 0.0 146400 5748 ? Ss Jan29 0:09 /usr/sbin/slurmd -D
root 39464 0.0 0.0 16476 1092 pts/1 S+ 00:14 0:00 grep -F --color=auto slurmd
[root@node4 log]#
regards
Steven
On 2/2/25 4:18 pm, Steven Jones via slurm-users wrote:
Isn't it slurmd on the compute nodes?
It is, but as this check is (I think) happening on the compute node I was wanting to check who slurmctld was running as.
The only other thought I have is: what is in the compute nodes' slurm.conf as the SlurmUser? I wonder if that's set to root? If so it wouldn't know that the "slurm" user was authorised.
Usually those are in step though. Everything else you've shown seems to be in order.
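A quick way to compare them (a sketch, assuming root ssh and the usual /etc/slurm/slurm.conf path):

for h in node1 node2 node3 node4; do
    ssh "$h" 'grep -i "^SlurmUser" /etc/slurm/slurm.conf; md5sum /etc/slurm/slurm.conf'
done
# every node should report SlurmUser=slurm and a checksum identical to the controller's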
All the best, Chris
slurm.conf is copied between nodes.
Just built 4 x Rocky 9 nodes and I do not get that error (but I get another I know how to fix, I think), so holistically I am thinking the version difference is too large.
regards
Steven
On 2/3/25 2:33 pm, Steven Jones via slurm-users wrote:
Just built 4 x Rocky 9 nodes and I do not get that error (but I get another I know how to fix, I think), so holistically I am thinking the version difference is too large.
Oh I think I missed this - when you say version difference do you mean the Slurm version or the distro version?
I was assuming you were building your Slurm versions yourselves for both, but that may be way off the mark, sorry!
What are the Slurm versions everywhere?
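For example, on each machine:

slurmctld -V     # controller
slurmd -V        # compute nodes
srun --version   # login/submit hosts

As a rule of thumb the versions need to satisfy slurmdbd >= slurmctld >= slurmd >= client commands, all within two major releases of each other.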
All the best, Chris
I rebuilt 4 nodes as Rocky 9.5.
8><---
[2025-02-03T21:40:11.978] Node node6 now responding
[2025-02-03T21:41:15.698] _slurm_rpc_submit_batch_job: JobId=17 InitPrio=4294901759 usec=501
[2025-02-03T21:41:16.055] sched: Allocate JobId=17 NodeList=node6 #CPUs=1 Partition=debug
[2025-02-03T21:41:16.059] Killing non-startable batch JobId=17: Invalid user id
[2025-02-03T21:41:16.059] _job_complete: JobId=17 WEXITSTATUS 1
[2025-02-03T21:41:16.060] _job_complete: JobId=17 done
So, the same error on RHEL 9.5 and Rocky 9.5.
🙁
Unless I am missing some sort of config setting, I am out of permutations I can try.
regards
Steven
Just double checking. Can you check on your worker node
1. ls -la /etc/pam.d/*slurm*
(just checking if there's a specific pam file for slurmd on your system)
2. scontrol show config | grep -i SlurmdUser
(checking if slurmd is set up with a different user to SlurmUser)
3. grep slurm /etc/passwd
Sean
________________________________ From: Steven Jones via slurm-users slurm-users@lists.schedmd.com Sent: Tuesday, 4 February 2025 08:56 To: slurm-users@lists.schedmd.com slurm-users@lists.schedmd.com; Christopher Samuel chris@csamuel.org Subject: [EXT] [slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld
External email: Please exercise caution
________________________________ I rebuilt 4 nodes as rocky9.5
8><--- [2025-02-03T21:40:11.978] Node node6 now responding [2025-02-03T21:41:15.698] _slurm_rpc_submit_batch_job: JobId=17 InitPrio=4294901759 usec=501 [2025-02-03T21:41:16.055] sched: Allocate JobId=17 NodeList=node6 #CPUs=1 Partition=debug [2025-02-03T21:41:16.059] Killing non-startable batch JobId=17: Invalid user id [2025-02-03T21:41:16.059] _job_complete: JobId=17 WEXITSTATUS 1 [2025-02-03T21:41:16.060] _job_complete: JobId=17 done
So same error RHEL9.5 to Rocky9.5
🙁
Unless I am missing some sort of config setting, I am out of permutations I can try.
regards
Steven
________________________________ From: Christopher Samuel via slurm-users slurm-users@lists.schedmd.com Sent: Tuesday, 4 February 2025 10:13 am To: slurm-users@lists.schedmd.com slurm-users@lists.schedmd.com Subject: [slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld
On 2/3/25 2:33 pm, Steven Jones via slurm-users wrote:
Just built 4 x rocky9 nodes and I do not get that error (but I get another I know how to fix, I think) so holistically I am thinking the version difference is too large.
Oh I think I missed this - when you say version difference do you mean the Slurm version or the distro version?
I was assuming you were building your Slurm versions yourselves for both, but that may be way off the mark, sorry!
What are the Slurm versions everywhere?
All the best, Chris -- Chris Samuel : https://apc01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.csamuel...http://www.csamuel.org/ : Berkeley, CA, USA
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
No:

[root@node5 log]# ls -la /etc/pam.d/*slurm*
ls: cannot access '/etc/pam.d/*slurm*': No such file or directory
Slurm is installed,
[root@node5 log]# rpm -qi slurm
Name        : slurm
Version     : 22.05.9
Release     : 1.el9
Architecture: x86_64
Install Date: Thu Dec 12 21:02:12 2024
Group       : Unspecified
Size        : 6308503
License     : GPLv2 and BSD
Signature   : RSA/SHA256, Fri May 12 03:36:18 2023, Key ID 8a3872bf3228467c
Source RPM  : slurm-22.05.9-1.el9.src.rpm
Build Date  : Fri May 12 03:21:04 2023
Build Host  : buildhw-x86-16.iad2.fedoraproject.org
Packager    : Fedora Project
Vendor      : Fedora Project
URL         : https://slurm.schedmd.com/
Bug URL     : https://bugz.fedoraproject.org/slurm
Summary     : Simple Linux Utility for Resource Management
Description :
Slurm is an open source, fault-tolerant, and highly scalable cluster
management and job scheduling system for Linux clusters. Components include
machine status, partition management, job management, scheduling and
accounting modules.
[root@node5 log]#
For SlurmdUser, I cannot run scontrol; I attempted an rpmbuild locally and this is failing:

[root@node5 log]# scontrol show config | grep -i SlurmdUser
slurm_load_ctl_conf error: Zero Bytes were transmitted or received
[root@node5 log]#
And grep slurm /etc/passwd:

[root@node5 log]# grep slurm /etc/passwd
slurm:x:12002:12002::/home/slurm:/bin/bash
slurm:x:12002:12002::/home/slurm:/bin/bash
[root@node5 log]#
regards
Steven
From the logs, two errors:
8><---
Feb 04 03:08:48 vuwunicoslurmd1.ods.vuw.ac.nz systemd[1]: Starting Slurm controller daemon...
Feb 04 03:08:48 vuwunicoslurmd1.ods.vuw.ac.nz slurmctld[1045020]: slurmctld: error: chdir(/var/log): Permission denied
Feb 04 03:08:48 vuwunicoslurmd1.ods.vuw.ac.nz slurmctld[1045020]: slurmctld: slurmctld version 24.11.1 started on cluster poc-cluster(2175)
Feb 04 03:08:48 vuwunicoslurmd1.ods.vuw.ac.nz systemd[1]: Started Slurm controller daemon.
Feb 04 03:08:48 vuwunicoslurmd1.ods.vuw.ac.nz slurmctld[1045020]: slurmctld: fatal: Can not recover assoc_usage state, incompatible version, got 9728 need >= 9984 <= 10752, start with '-i' to ignore this. Warning: using -i will lose the data that can't be recovered.
Feb 04 03:08:48 vuwunicoslurmd1.ods.vuw.ac.nz systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Feb 04 03:08:48 vuwunicoslurmd1.ods.vuw.ac.nz systemd[1]: slurmctld.service: Failed with result 'exit-code'.
No idea on "slurmctld: error: chdir(/var/log): Permission denied"; I need more info, but the log seems to be getting written OK, as we can see.
"fatal: Can not recover assoc_usage state, incompatible version,"
This seems to be from me attempting to upgrade from ver22 to ver24, but Google tells me ver22 "left a mess" and ver24 can't cope. Where would I go looking to clean up, please?
regards
Steven
Steven,
Looks like you may have had a secondary controller that took over and changed your StateSave files.
If you don't need the job info AND no jobs are running, you can just rename/delete your StateSaveLocation directory and things will be recreated. Job numbers will start over (unless you set FirstJobId, which you should if you want to keep your sacct data).
It also looks like your logging does not have the right permissions. Change SlurmctldLogFile to something like /var/log/slurm/slurmctld.log and set the owner of /var/log/slurm to the slurm user.
Ensure all slurmctld daemons are down, then start the first. Once it is up (you can run 'scontrol show config'), start the second. Run 'scontrol show config' again and you should see both daemons listed as 'up' at the end of the output.
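A minimal sketch, assuming StateSaveLocation is /var/spool/slurmctld (check scontrol show config or slurm.conf first):

scontrol show config | grep -i statesave
systemctl stop slurmctld                  # on every controller
mv /var/spool/slurmctld /var/spool/slurmctld.old
mkdir -p /var/spool/slurmctld && chown slurm:slurm /var/spool/slurmctld
mkdir -p /var/log/slurm && chown slurm:slurm /var/log/slurm
# then set SlurmctldLogFile=/var/log/slurm/slurmctld.log in slurm.conf
# and start the primary slurmctld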
-Brian Andrus
Hi,
After rpmbuilding slurm,
Do I need to install all of these, or just slurm-slurmctld-24.11.1-1.el9.x86_64.rpm on the controller and slurm-slurmd-24.11.1-1.el9.x86_64.rpm on the compute nodes?
-rw-r--r--. 1 root root 18508016 Feb  3 23:46 slurm-24.11.1-1.el9.x86_64.rpm
-rw-r--r--. 1 root root    21301 Feb  3 23:45 slurm-contribs-24.11.1-1.el9.x86_64.rpm
-rw-r--r--. 1 root root    82607 Feb  3 23:45 slurm-devel-24.11.1-1.el9.x86_64.rpm
-rw-r--r--. 1 root root    13443 Feb  3 23:45 slurm-example-configs-24.11.1-1.el9.x86_64.rpm
-rw-r--r--. 1 root root   161374 Feb  3 23:45 slurm-libpmi-24.11.1-1.el9.x86_64.rpm
-rw-r--r--. 1 root root    12829 Feb  3 23:45 slurm-openlava-24.11.1-1.el9.x86_64.rpm
-rw-r--r--. 1 root root   149940 Feb  3 23:45 slurm-pam_slurm-24.11.1-1.el9.x86_64.rpm
-rw-r--r--. 1 root root   839356 Feb  3 23:45 slurm-perlapi-24.11.1-1.el9.x86_64.rpm
-rw-r--r--. 1 root root   104020 Feb  3 23:45 slurm-sackd-24.11.1-1.el9.x86_64.rpm
-rw-r--r--. 1 root root  1613867 Feb  3 23:45 slurm-slurmctld-24.11.1-1.el9.x86_64.rpm
-rw-r--r--. 1 root root  1016317 Feb  3 23:45 slurm-slurmd-24.11.1-1.el9.x86_64.rpm
-rw-r--r--. 1 root root   925546 Feb  3 23:45 slurm-slurmdbd-24.11.1-1.el9.x86_64.rpm
-rw-r--r--. 1 root root   133254 Feb  3 23:45 slurm-torque-24.11.1-1.el9.x86_64.rpm
regards
Steven
Hi Steven,
You can find the list of packages to install based on the node role here:
https://slurm.schedmd.com/quickstart_admin.html#pkg_install
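Roughly, per that page (verify against it, since the package split changes between releases):

# controller:
dnf install ./slurm-24.11.1-1.el9.x86_64.rpm ./slurm-slurmctld-24.11.1-1.el9.x86_64.rpm
# compute nodes:
dnf install ./slurm-24.11.1-1.el9.x86_64.rpm ./slurm-slurmd-24.11.1-1.el9.x86_64.rpm
# accounting host, if you run one:
dnf install ./slurm-24.11.1-1.el9.x86_64.rpm ./slurm-slurmdbd-24.11.1-1.el9.x86_64.rpm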
Thanks, Marko
Steven, one tip if you are just starting with Slurm: "Use the logs, Luke. Use the logs."
By this I mean tail -f the slurmctld log (wherever SlurmctldLogFile points, e.g. /var/log/slurm/slurmctld.log) and restart the slurmctld service. On a compute node, tail -f the slurmd log.
Oh, and you probably are going to set up Munge also, which is easy.
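A minimal sketch (paths as in the EL packages; how you distribute the key is up to you):

dnf install munge munge-libs          # on every node
/usr/sbin/create-munge-key            # on ONE host only
scp /etc/munge/munge.key node1:/etc/munge/munge.key   # same key everywhere, owner munge, mode 0400
systemctl enable --now munge          # everywhere
munge -n | ssh node1 unmunge          # round-trip test; should report STATUS: Success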
Late to the party here, but depending on how much time you have invested, how much you can tolerate reformats or other more destructive work, etc., you might consider OpenHPC and its install guide ([1] for RHEL 8 variants, [2] or [3] for RHEL 9 variants, depending on which version of Warewulf you prefer). I’ve also got some workshop materials on building login nodes, GPU drivers, stateful provisioning, etc. for OpenHPC 3 and Warewulf 3 at [4].
At least in an isolated VirtualBox environment with no outside IdP or other dependencies, my student workers have usually been able to get their first batch job running within a day.
[1] https://github.com/openhpc/ohpc/releases/download/v2.9.GA/Install_guide-Rock...
[2] https://github.com/openhpc/ohpc/releases/download/v3.2.GA/Install_guide-Rock...
[3] https://github.com/openhpc/ohpc/releases/download/v3.2.GA/Install_guide-Rock...
[4] https://github.com/mikerenfro/openhpc-beyond-the-install-guide/blob/main/ohp...
Hi,
Thanks, but isolated isn't the goal in my case. The goal is to save admin time we can't afford, and to have a far-reaching setup.
So I have to link the HPC to IPA/IdM and on to AD in a trust; that way user admins can just drop a student or staff member into an AD group and the job is done. That also means we can use Globus to transfer large lumps of data globally in and out of the HPC.
I have taken your notes, as they look interesting for "extras"; I have not looked at making GPUs work yet, for example. If I can get the basics going then I'll look at the icing.
regards
Steven
We only do isolated on the students' VirtualBox setups because it's simpler for them to get started with. Our production HPC with OpenHPC is definitely integrated with our Active Directory (directly via sssd, not with an intermediate product). Not everyone does it that way, but our scale is small enough that we've never had a load or other performance issue with our AD.
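For reference, the direct AD route on EL nodes is roughly this (domain and account names here are placeholders):

dnf install realmd sssd oddjob oddjob-mkhomedir adcli samba-common-tools
realm join --user=joinaccount EXAMPLE.EDU    # prompts for the join account's password
id someuser@example.edu                      # confirm lookups work, then test an ssh login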