Forget what I just said. slurmctld had not been restarted in a month of Sundays and it was logging mismatches against the slurm.conf.
A Slurm reconfigure and a restart of all the slurmd daemons, and the problem looks fixed.
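For reference, the recovery amounted to something like the sequence below (assuming systemd-managed daemons; the clush -a "all nodes" group is just how my setup names things):

    # on the controller: restart slurmctld, then push the current config out
    systemctl restart slurmctld
    scontrol reconfigure

    # on the compute nodes, e.g. via clush
    clush -a "systemctl restart slurmd"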

On Sun, 10 Nov 2024 at 14:50, John Hearns <hearnsj@gmail.com> wrote:
I have a cluster which uses Slurm 23.11.6.

When I submit a multi-node job and run something like
clush -b -w $SLURM_JOB_NODELIST "date"
very often the ssh command fails with:
 Access denied by pam_slurm_adopt: you have no active jobs on this node

This happens on maybe 50% of the nodes.
The same behaviour occurs if I salloc a number of nodes and then try to ssh to one of them.
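The salloc case is reproduced by nothing more than this (node name is just an example picked from the allocation):

    salloc -N 4
    ssh node042
    # -> Access denied by pam_slurm_adopt: you have no active jobs on this node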

I have traced this to slurmstepd spawning a long sleep, which I believe is what allows the proctrack plugin (and hence pam_slurm_adopt) to 'see' whether a job is active on the node.
On nodes that I can ssh into:
root        3211       1  0 Nov08 ?        00:00:00 /usr/sbin/slurmd --systemd
root        3227       1  0 Nov08 ?        00:00:00 /usr/sbin/slurmstepd infinity
root       24322       1  0 15:40 ?        00:00:00 slurmstepd: [15709.extern]
root       24326   24322  0 15:40 ?        00:00:00  \_ sleep 100000000

On nodes where I cannot ssh:
root        3226       1  0 Nov08 ?        00:00:00 /usr/sbin/slurmd --systemd
root        3258       1  0 Nov08 ?        00:00:00 /usr/sbin/slurmstepd infinity
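For context, the [jobid.extern] step (with its long sleep) is what pam_slurm_adopt adopts incoming ssh sessions into, and it is only created when the controller is told to do so. A minimal slurm.conf fragment for that, as I understand the documentation, looks like:

    # slurm.conf (controller and compute nodes must agree on this file)
    PrologFlags=contain            # create an extern step for every job; required by pam_slurm_adopt
    ProctrackType=proctrack/cgroup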

Maybe I am not understanding something here?

PS. I have tried to run the pam_slurm_adopt module with options to debug it, and have not found anything useful.
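For anyone wanting to try the same, the options go on the pam_slurm_adopt line in /etc/pam.d/sshd; what I tried was along these lines (log_level is a documented module option as far as I know, and the extra output lands in the usual sshd/auth syslog):

    # /etc/pam.d/sshd
    account    required    pam_slurm_adopt.so log_level=debug5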

John H