[slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

Paul Raines raines at nmr.mgh.harvard.edu
Sun Jan 28 14:51:28 UTC 2024


I finally had downtime on our cluster running 20.11.3 and decided to
upgrade SLURM.  All daemons were stopped on the nodes and the master.
The Rocky 8 Linux OS was updated, but nothing was changed
configuration-wise in any way.

On the master, when I first installed 23.11.1 and tried to run
slurmdbd -D -vvv at the command line, it balked at the more than
two major version jump.  I then installed 22.05.6 and ran slurmdbd -D -vvv,
which took a long time but completed fine.  I reinstalled 23.11.1
and ran slurmdbd -D -vvv, which went quickly.  I stopped it, then
started slurmdbd normally with systemctl.
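
In case it matters, the rough sequence on the master was something
like the below (the package names are just stand-ins for our local
builds):

  systemctl stop slurmdbd
  dnf install slurm-slurmdbd-22.05.6   # intermediate release, name illustrative
  slurmdbd -D -vvv                     # does the DB conversion in the foreground
  # Ctrl-C once it is idle, then move to the target release
  dnf install slurm-slurmdbd-23.11.1
  slurmdbd -D -vvv
  systemctl start slurmdbd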

When I ran slurmctld it complained about incompatible state files.
I didn't care about losing state, so I ran with the -i option to
ignore them.  It seemed happy.
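
Roughly what that looked like (-i being the flag that ignores the
old state files):

  slurmctld -D -i -vvv      # once in the foreground, ignoring old state
  # then back to the normal service
  systemctl start slurmctld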

I then started up slurmd on my various GPU and non-GPU nodes.
I ran a test interactive job just to see if I would get a shell
with the expected SLURM environment on the nodes, and it worked,
except that on the GPU nodes I got errors about not being able
to get/set GPU frequencies.  I removed GpuFreqDef=medium from slurm.conf
and that went away.
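
For reference, that was the only config change, roughly:

  # slurm.conf: removed this one line
  #GpuFreqDef=medium
  # then, after syncing slurm.conf to the nodes
  scontrol reconfigure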

I had my people start running jobs last night.

This morning I came in and saw that a lot of nodes are stuck in the
completing state.  All of them are GPU nodes.

For example, on node 'sevilla', squeue reports:

   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 3679433   lcna100 sjob_123   sg1526 CG       8:29      1 sevilla
 3679432   lcna100 sjob_123   sg1526 CG       8:35      1 sevilla
 3679431   lcna100 sjob_123   sg1526 CG       8:42      1 sevilla
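
In case it helps, these are roughly the commands I'm using to look at
the stuck node (nothing exotic, just the standard squeue/scontrol
options):

  squeue -w sevilla -t CG       # the completing jobs shown above
  scontrol show node sevilla    # State still shows the COMPLETING flag
  scontrol show job 3679433     # job still listed as CG on sevilla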

In the slurmd.log on sevilla, I see these lines every minute:

[2024-01-28T09:37:52.003] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug:  _rpc_terminate_job: uid = 1150 JobId=3679432
[2024-01-28T09:37:52.004] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug:  _rpc_terminate_job: uid = 1150 JobId=3679431
[2024-01-28T09:37:52.004] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug:  _rpc_terminate_job: uid = 1150 JobId=3679433
[2024-01-28T09:37:52.004] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB

Back at the time 3679433 was cancelled by the user, I see:

[2024-01-28T00:27:48.658] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T00:27:48.658] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T00:27:48.658] debug:  _rpc_terminate_job: uid = 1150 JobId=3679433
[2024-01-28T00:27:48.658] debug:  credential for job 3679433 revoked
[2024-01-28T00:27:48.658] debug:  _rpc_terminate_job: sent SUCCESS for 3679433, waiting for prolog to finish
[2024-01-28T00:27:48.658] debug:  Waiting for job 3679433's prolog to complete

On sevilla itself there are no processes currently running for user
sg1526, so there is no sign of these jobs.  There ARE processes running
for user mu40, though:

[root@sevilla ~]# pgrep -f slurm_script | xargs -n1 pstree -p | grep ^slurm
slurm_script(30691)---starter-suid(30764)-+-python3(30784)-+-{python3}(30817)
slurm_script(30861)---starter-suid(30873)-+-python3(30888)-+-{python3}(30915)
slurm_script(30963)---starter-suid(30975)-+-python3(30992)-+-{python3}(31020)

[root@sevilla ~]# strings /proc/30691/cmdline
/bin/sh
/var/slurm/spool/d/job3679220/slurm_script

/usr/bin/sacct -p --jobs=3679220 --format=JobID,User,ReqTRES,NodeList,Start,End,Elapsed,CPUTime,State,ExitCode
JobID|User|ReqTRES|NodeList|Start|End|Elapsed|CPUTime|State|ExitCode|
3679220|mu40|billing=6,cpu=3,gres/gpu=1,mem=32G,node=1|sevilla|2024-01-27T22:33:27|2024-01-27T23:26:39|00:53:12|02:39:36|NODE_FAIL|1:0|
3679220.batch|||sevilla|2024-01-27T22:33:27|2024-01-27T23:26:39|00:53:12|02:39:36|CANCELLED||
3679220.extern|||sevilla|2024-01-27T22:33:27|2024-01-27T23:26:39|00:53:12|02:39:36|CANCELLED||

At the time that job started, the slurmd.log has lines like:

[2024-01-27T22:33:28.679] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30691, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.092] [3679220.extern] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30650, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.166] [3679220.batch] debug2: profile signaling type Task
[2024-01-27T22:33:58.168] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30691, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.170] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30764, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.172] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30784, probably hasn't started yet or has already finished
...
[2024-01-27T22:46:39.018] debug:  _step_connect: connect() failed for /var/slurm/spool/d/sevilla_3679220.4294967292: Connection refused
[2024-01-27T22:46:39.018] debug:  Cleaned up stray socket /var/slurm/spool/d/sevilla_3679220.4294967292
[2024-01-27T22:46:39.018] debug:  _step_connect: connect() failed for /var/slurm/spool/d/sevilla_3679220.4294967291: Connection refused
[2024-01-27T22:46:39.018] debug:  Cleaned up stray socket /var/slurm/spool/d/sevilla_3679220.4294967291
[2024-01-27T22:46:39.018] _handle_stray_script: Purging vestigial job script /var/slurm/spool/d/job3679220/slurm_script

Then later, at the time of termination:

[2024-01-27T23:26:39.042] debug:  _rpc_terminate_job: uid = 1150 JobId=3679220
[2024-01-27T23:26:39.042] debug:  credential for job 3679220 revoked
[2024-01-27T23:26:39.042] debug2: No steps in jobid 3679220 to send signal 998
[2024-01-27T23:26:39.043] debug2: No steps in jobid 3679220 to send signal 18
[2024-01-27T23:26:39.043] debug2: No steps in jobid 3679220 to send signal 15
[2024-01-27T23:26:39.044] debug2: set revoke expiration for jobid 3679220 to 1706416119 UTS
[2024-01-27T23:26:39.044] debug:  Waiting for job 3679220's prolog to complete
[2024-01-27T23:26:39.044] debug:  Finished wait for job 3679220's prolog to complete
[2024-01-27T23:26:39.068] debug:  completed epilog for jobid 3679220
[2024-01-27T23:26:39.071] debug:  JobId=3679220: sent epilog complete msg: rc = 0




So I have a node stuck in the 'comp' state that claims to have
jobs completing that are definitely NOT running on the box, BUT
there are jobs running on the box that SLURM thinks are done.
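
Unless someone has a better idea, my inclination for clearing the node
is the usual down/resume dance, though I'm not sure how safe that is
with the orphaned user processes still on the box:

  scontrol update nodename=sevilla state=down reason="stuck CG after upgrade"
  # clean up the leftover mu40 processes by hand, then
  scontrol update nodename=sevilla state=resume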


---------------------------------------------------------------
Paul Raines                     http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street     Charlestown, MA 02129	    USA






