Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

30 Jan 2024


We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in
a completing state and slurmd daemons can't be killed because they are left
in a CLOSE-WAIT state. See my previous mail to the mailing list for the
details, and also https://bugs.schedmd.com/show_bug.cgi?id=18561 for
another site having issues.

We've now downgraded the clients (slurmd and login nodes) to 23.02.7, which
gets rid of most issues. If possible, I would also try to downgrade
slurmctld to an earlier release, but this requires getting rid of all
running and queued jobs.

Kind regards,
Fokke
On Mon 29 Jan 2024 at 01:00, Paul Raines raines@nmr.mgh.harvard.edu wrote:
...
Some more info on what I am seeing after the 23.11.3 upgrade.
Here is a case where a job is cancelled but seems permanently
stuck in 'CG' state in squeue
[2024-01-28T17:34:11.002] debug3: sched: JobId=3679903 initiated
[2024-01-28T17:34:11.002] sched: Allocate JobId=3679903 NodeList=rtx-06
#CPUs=4 Partition=rtx8000
[2024-01-28T17:34:11.002] debug3: create_mmap_buf: loaded file
`/var/slurm/spool/ctld/hash.3/job.3679903/script` as buf_t
[2024-01-28T17:42:27.724] _slurm_rpc_kill_job: REQUEST_KILL_JOB
JobId=3679903 uid 5875902
[2024-01-28T17:42:27.725] debug:  email msg to sg1526: Slurm
Job_id=3679903 Name=sjob_1246 Ended, Run time 00:08:17, CANCELLED,
ExitCode 0
[2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job:
JobId=3679903 action:normal
[2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job:
removed JobId=3679903 from part rtx8000 row 0
[2024-01-28T17:42:27.726] job_signal: 9 of running JobId=3679903
successful 0x8004
[2024-01-28T17:43:19.000] Resending TERMINATE_JOB request JobId=3679903
Nodelist=rtx-06
[2024-01-28T17:44:20.000] Resending TERMINATE_JOB request JobId=3679903
Nodelist=rtx-06
[2024-01-28T17:45:20.000] Resending TERMINATE_JOB request JobId=3679903
Nodelist=rtx-06
[2024-01-28T17:46:20.000] Resending TERMINATE_JOB request JobId=3679903
Nodelist=rtx-06
[2024-01-28T17:47:20.000] Resending TERMINATE_JOB request JobId=3679903
Nodelist=rtx-06
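To spot this pattern across many jobs rather than by eye, the resend messages can be counted per job. The following is only a minimal sketch: the function names and the threshold of three resends are assumptions made for illustration, and the sample text simply reuses lines from the slurmctld log above.

```python
import re
from collections import Counter

# Matches the slurmctld message quoted above. The job ID is on the same
# line as "Resending" even when the mail client has wrapped the log line.
RESEND_RE = re.compile(r"Resending TERMINATE_JOB request JobId=(\d+)")


def count_terminate_resends(log_text):
    """Map each job ID to the number of TERMINATE_JOB resends logged for it."""
    return Counter(int(m.group(1)) for m in RESEND_RE.finditer(log_text))


def jobs_stuck_completing(log_text, min_resends=3):
    """Return job IDs with at least min_resends resends.

    The threshold is a hypothetical choice, not a Slurm setting.
    """
    counts = count_terminate_resends(log_text)
    return sorted(job for job, n in counts.items() if n >= min_resends)


sample = """\
[2024-01-28T17:42:27.726] job_signal: 9 of running JobId=3679903 successful 0x8004
[2024-01-28T17:43:19.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:44:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:45:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
"""

print(jobs_stuck_completing(sample))  # prints [3679903]
```

In practice `log_text` would be the contents of the slurmctld log file; its location depends on the site's configuration.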
So at 17:42 the user must have done an scancel.  In the slurmd log on the
node I see:
[2024-01-28T17:42:27.727] debug:  _rpc_terminate_job: uid = 1150
JobId=3679903
[2024-01-28T17:42:27.728] debug:  credential for job 3679903 revoked
[2024-01-28T17:42:27.728] debug:  _rpc_terminate_job: sent SUCCESS for
3679903, waiting for prolog to finish
[2024-01-28T17:42:27.728] debug:  Waiting for job 3679903's prolog to
complete
[2024-01-28T17:43:19.002] debug:  _rpc_terminate_job: uid = 1150
JobId=3679903
[2024-01-28T17:44:20.001] debug:  _rpc_terminate_job: uid = 1150
JobId=3679903
[2024-01-28T17:45:20.002] debug:  _rpc_terminate_job: uid = 1150
JobId=3679903
[2024-01-28T17:46:20.001] debug:  _rpc_terminate_job: uid = 1150
JobId=3679903
[2024-01-28T17:47:20.002] debug:  _rpc_terminate_job: uid = 1150
JobId=3679903
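In the slurmd log above, job 3679903 has a "Waiting for job ...'s prolog to complete" message with no matching "Finished wait" message. A rough way to list such jobs from a slurmd log is sketched below; the function name is hypothetical, and the patterns are taken only from the debug messages quoted in this thread, so they may not match other Slurm versions or log levels.

```python
import re

# Patterns copied from the slurmd debug messages quoted in this thread.
WAIT_RE = re.compile(r"Waiting for job (\d+)'s prolog")
DONE_RE = re.compile(r"Finished wait for job (\d+)'s prolog")


def jobs_waiting_on_prolog(slurmd_log_text):
    """Return job IDs that started waiting for a prolog and never finished."""
    waiting = {int(j) for j in WAIT_RE.findall(slurmd_log_text)}
    finished = {int(j) for j in DONE_RE.findall(slurmd_log_text)}
    return sorted(waiting - finished)


sample = """\
[2024-01-28T17:33:58.757] debug:  Waiting for job 3679888's prolog to complete
[2024-01-28T17:33:58.757] debug:  Finished wait for job 3679888's prolog to complete
[2024-01-28T17:42:27.728] debug:  _rpc_terminate_job: sent SUCCESS for 3679903, waiting for prolog to finish
[2024-01-28T17:42:27.728] debug:  Waiting for job 3679903's prolog to complete
"""

print(jobs_waiting_on_prolog(sample))  # prints [3679903]
```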
Strange that a prolog is being called on job cancel.
slurmd seems to be receiving the repeated terminate requests from slurmctld,
but the job is never actually terminated.  Also, the process table has
[root@rtx-06 ~]# ps auxw | grep slurmd
root      161784  0.0  0.0 436748 21720 ?        Ssl  13:44   0:00
/usr/sbin/slurmd --systemd
root      190494  0.0  0.0      0     0 ?        Zs   17:34   0:00
[slurmd] <defunct>
where there is now a zombie slurmd process that I cannot kill, even with
kill -9. If I do a 'systemctl stop slurmd', it takes a long time but
eventually stops slurmd and gets rid of the zombie process, but it also
kills the "good" running jobs with NODE_FAIL.
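A zombie has already exited, so signals (including SIGKILL) have no effect on it; it only disappears once its parent reaps it or the parent itself exits. To see which parent is holding a defunct slurmd, the output of `ps -eo pid,ppid,stat,comm` can be filtered as in the sketch below. The function name is hypothetical, and the PPID values in the sample are made up for illustration, since the listing above does not show them.

```python
def find_zombies(ps_output, name="slurmd"):
    """Return (pid, ppid) pairs for zombie processes with the given command name.

    Expects the text produced by `ps -eo pid,ppid,stat,comm`, header included.
    """
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, stat, comm = fields
        # Zombies have a STAT starting with "Z"; ps shows "<defunct>" after the name.
        if stat.startswith("Z") and comm.split()[0] == name:
            zombies.append((int(pid), int(ppid)))
    return zombies


# PIDs are from the listing above; the PPID column is assumed for illustration.
sample = """\
    PID    PPID STAT COMMAND
 161784       1 Ssl  slurmd
 190494  161784 Zs   slurmd <defunct>
 185763       1 S    slurm_script
"""

print(find_zombies(sample))  # prints [(190494, 161784)]
```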
Another case is where a job is cancelled and SLURM acts as if it
is cancelled (it no longer shows up in squeue), but the process keeps
running on the box.
# pstree -u sg1526 -p | grep ^slurm
slurm_script(185763)---python(185796)-+-{python}(185797)
# strings /proc/185763/environ | grep JOB_ID
SLURM_JOB_ID=3679888
# squeue -j 3679888
slurm_load_jobs error: Invalid job id specified
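The manual check above (read SLURM_JOB_ID from a process's environment, then ask squeue about it) can be repeated for every process on a node. This is a sketch under stated assumptions: the function names are hypothetical, the list of active job IDs has to be supplied by the caller (for example parsed from `squeue -h -o %A`), and reading other users' /proc/<pid>/environ normally requires root.

```python
import os
from typing import Dict, Iterable, List, Optional


def job_id_from_environ(raw: bytes) -> Optional[int]:
    """Extract SLURM_JOB_ID from the NUL-separated contents of /proc/<pid>/environ."""
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if key == b"SLURM_JOB_ID" and sep and value.isdigit():
            return int(value)
    return None


def find_stray_pids(environs: Dict[int, bytes], active_jobs: Iterable[int]) -> List[int]:
    """Return PIDs whose SLURM_JOB_ID is not among the active job IDs."""
    active = set(active_jobs)
    stray = []
    for pid, raw in sorted(environs.items()):
        job_id = job_id_from_environ(raw)
        if job_id is not None and job_id not in active:
            stray.append(pid)
    return stray


def read_proc_environs() -> Dict[int, bytes]:
    """Read /proc/<pid>/environ for every readable process (Linux only)."""
    environs = {}
    for name in os.listdir("/proc"):
        if name.isdigit():
            try:
                with open(f"/proc/{name}/environ", "rb") as fh:
                    environs[int(name)] = fh.read()
            except OSError:
                continue  # process exited or permission denied
    return environs


# PIDs and job IDs are from this thread; the other variables are made up.
sample = {
    185763: b"HOME=/home/sg1526\0SLURM_JOB_ID=3679888\0",
    161784: b"PATH=/usr/sbin\0",
}

print(find_stray_pids(sample, active_jobs=[3679903]))  # prints [185763]
```

On a real node, `find_stray_pids(read_proc_environs(), active_jobs)` would do the same over all processes.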
sacct shows that job as cancelled.  In the slurmd log we see:
[2024-01-28T17:33:58.757] debug:  _rpc_terminate_job: uid = 1150
JobId=3679888
[2024-01-28T17:33:58.757] debug:  credential for job 3679888 revoked
[2024-01-28T17:33:58.757] debug:  _step_connect: connect() failed for
/var/slurm/spool/d/rtx-06_3679888.4294967292: Connection refused
[2024-01-28T17:33:58.757] debug:  Cleaned up stray socket
/var/slurm/spool/d/rtx-06_3679888.4294967292
[2024-01-28T17:33:58.757] debug:  signal for nonexistent
StepId=3679888.extern stepd_connect failed: Connection refused
[2024-01-28T17:33:58.757] debug:  _step_connect: connect() failed for
/var/slurm/spool/d/rtx-06_3679888.4294967291: Connection refused
[2024-01-28T17:33:58.757] debug:  Cleaned up stray socket
/var/slurm/spool/d/rtx-06_3679888.4294967291
[2024-01-28T17:33:58.757] _handle_stray_script: Purging vestigial job
script /var/slurm/spool/d/job3679888/slurm_script
[2024-01-28T17:33:58.757] debug:  signal for nonexistent
StepId=3679888.batch stepd_connect failed: Connection refused
[2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 were able to
be signaled with 18
[2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 to send signal
15
[2024-01-28T17:33:58.757] debug2: set revoke expiration for jobid 3679888
to 1706481358 UTS
[2024-01-28T17:33:58.757] debug:  Waiting for job 3679888's prolog to
complete
[2024-01-28T17:33:58.757] debug:  Finished wait for job 3679888's prolog
to complete
[2024-01-28T17:33:58.771] debug:  completed epilog for jobid 3679888
[2024-01-28T17:33:58.774] debug:  JobId=3679888: sent epilog complete msg:
rc = 0
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
-- 
Fokke Dijkstra f.dijkstra@rug.nl
Team High Performance Computing
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA  Groningen, The Netherlands
