I have a cluster which uses Slurm 23.11.6.
When I submit a multi-node job and run something like

    clush -b -w $SLURM_JOB_NODELIST "date"

the underlying ssh command very often fails with:

    Access denied by pam_slurm_adopt: you have no active jobs on this node
This happens on maybe 50% of the nodes. The behaviour is the same if I salloc a number of nodes and then try to ssh to one of them.
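To see which nodes are affected, I check each node in the allocation individually; roughly what I run is sketched below (options are just illustrative):

    # expand the Slurm nodelist and try a non-interactive ssh to each node
    for n in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
        if ssh -o BatchMode=yes "$n" true 2>/dev/null; then
            echo "$n: ok"
        else
            echo "$n: denied"
        fi
    done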
I have traced this to slurmstepd spawning a long sleep (the [jobid.extern] step), which I believe is what allows the proctrack plugin / pam_slurm_adopt to 'see' that a job is active on the node. On nodes that I can ssh into:

    root      3211      1  0 Nov08 ?    00:00:00 /usr/sbin/slurmd --systemd
    root      3227      1  0 Nov08 ?    00:00:00 /usr/sbin/slurmstepd infinity
    root     24322      1  0 15:40 ?    00:00:00 slurmstepd: [15709.extern]
    root     24326  24322  0 15:40 ?    00:00:00  \_ sleep 100000000
On nodes where I cannot ssh:

    root      3226      1  0 Nov08 ?    00:00:00 /usr/sbin/slurmd --systemd
    root      3258      1  0 Nov08 ?    00:00:00 /usr/sbin/slurmstepd infinity
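For context, my understanding is that the [jobid.extern] step (and its sleep) only gets launched when the extern step is enabled; the pieces of configuration I believe are relevant look roughly like this (a sketch, not my exact files):

    # slurm.conf -- Contain launches the [jobid.extern] step on every allocated node
    PrologFlags=Contain
    ProctrackType=proctrack/cgroup

    # /etc/pam.d/sshd -- pam_slurm_adopt adopts incoming ssh sessions into that step
    account    required    pam_slurm_adopt.so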
Maybe I am not understanding something here?
PS. I have tried running the pam_slurm_adopt module with debug options and have not found anything useful.
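(Roughly what I mean by debug options; the exact line may differ:)

    # /etc/pam.d/sshd -- raising the module's own log level
    account    required    pam_slurm_adopt.so log_level=debug5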
John H