<div dir="ltr">We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in a completing state and slurmd daemons can't be killed because they are left in a CLOSE-WAIT state. See my previous mail to the mailing list for the details. And also <a href="https://bugs.schedmd.com/show_bug.cgi?id=18561">https://bugs.schedmd.com/show_bug.cgi?id=18561</a> for another site having issues.<div>We've now downgraded the clients (slurmd and login nodes) to 23.02.7 which gets rid of most issues. If possible, I would try to also downgrade slurmctld to an earlier release, but this requires getting rid of all running and queued jobs.</div><div><br></div><div>Kind regards,</div><div><br></div><div>Fokke</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Op ma 29 jan 2024 om 01:00 schreef Paul Raines <<a href="mailto:raines@nmr.mgh.harvard.edu" target="_blank">raines@nmr.mgh.harvard.edu</a>>:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Some more info on what I am seeing after the 23.11.3 upgrade.<br>
<br>
Here is a case where a job is cancelled but seems permanently<br>
stuck in 'CG' state in squeue<br>
<br>
[2024-01-28T17:34:11.002] debug3: sched: JobId=3679903 initiated
[2024-01-28T17:34:11.002] sched: Allocate JobId=3679903 NodeList=rtx-06 #CPUs=4 Partition=rtx8000
[2024-01-28T17:34:11.002] debug3: create_mmap_buf: loaded file `/var/slurm/spool/ctld/hash.3/job.3679903/script` as buf_t
[2024-01-28T17:42:27.724] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=3679903 uid 5875902
[2024-01-28T17:42:27.725] debug: email msg to sg1526: Slurm Job_id=3679903 Name=sjob_1246 Ended, Run time 00:08:17, CANCELLED, ExitCode 0
[2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job: JobId=3679903 action:normal
[2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job: removed JobId=3679903 from part rtx8000 row 0
[2024-01-28T17:42:27.726] job_signal: 9 of running JobId=3679903 successful 0x8004
[2024-01-28T17:43:19.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:44:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:45:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:46:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:47:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06

So at 17:42 the user must have done an scancel. In the slurmd log on the node I see:

[2024-01-28T17:42:27.727] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:42:27.728] debug: credential for job 3679903 revoked
[2024-01-28T17:42:27.728] debug: _rpc_terminate_job: sent SUCCESS for 3679903, waiting for prolog to finish
[2024-01-28T17:42:27.728] debug: Waiting for job 3679903's prolog to complete
[2024-01-28T17:43:19.002] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:44:20.001] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:45:20.002] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:46:20.001] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:47:20.002] debug: _rpc_terminate_job: uid = 1150 JobId=3679903

Strange that a prolog is being called on a job cancel.

slurmd seems to be getting the repeated terminate requests from slurmctld, but the job is never actually terminated. Also the process table has:

[root@rtx-06 ~]# ps auxw | grep slurmd
root 161784 0.0 0.0 436748 21720 ? Ssl 13:44 0:00 /usr/sbin/slurmd --systemd
root 190494 0.0 0.0 0 0 ? Zs 17:34 0:00 [slurmd] <defunct>

where there is now a zombie slurmd process I cannot kill even with kill -9.
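
In case it helps anyone debug this further, roughly the following should show whether the defunct process is still parented by the main slurmd and whether slurmd is holding TCP connections stuck in CLOSE-WAIT (an untested sketch; the PIDs are taken from the ps output above):

ps -o pid,ppid,stat,cmd -p 190494       # is the <defunct> slurmd a child of the main slurmd (PID 161784)?
ss -tnp state close-wait | grep slurmd  # any slurmd sockets left in CLOSE-WAIT?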

If I do a 'systemctl stop slurmd', it takes a long time but eventually stops slurmd and gets rid of the zombie process; however, it also kills the "good" running jobs with NODE_FAIL.
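
To get an idea of how many jobs are stuck like this across the cluster, something like the following should list everything in the completing state (just a sketch; adjust the format string to taste):

squeue --states=COMPLETING --format="%.12i %.10P %.10u %.12M %R"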

Another case is where a job is cancelled and SLURM acts as if it is cancelled, with the job no longer showing up in squeue, but the processes keep running on the node:

# pstree -u sg1526 -p | grep ^slurm
slurm_script(185763)---python(185796)-+-{python}(185797)
# strings /proc/185763/environ | grep JOB_ID
SLURM_JOB_ID=3679888
# squeue -j 3679888
slurm_load_jobs error: Invalid job id specified
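
Two more checks that might be useful on the node (a sketch; I'm not certain what listpids reports once the stepds are gone, but it should at least show whether slurmd still tracks anything for the job):

scontrol listpids 3679888    # does slurmd still know about any steps for this job?
cat /proc/185763/cgroup      # is the leftover slurm_script still inside the job's cgroup?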

sacct shows that job as CANCELLED. In the slurmd log we see:

[2024-01-28T17:33:58.757] debug: _rpc_terminate_job: uid = 1150 JobId=3679888
[2024-01-28T17:33:58.757] debug: credential for job 3679888 revoked
[2024-01-28T17:33:58.757] debug: _step_connect: connect() failed for /var/slurm/spool/d/rtx-06_3679888.4294967292: Connection refused
[2024-01-28T17:33:58.757] debug: Cleaned up stray socket /var/slurm/spool/d/rtx-06_3679888.4294967292
[2024-01-28T17:33:58.757] debug: signal for nonexistent StepId=3679888.extern stepd_connect failed: Connection refused
[2024-01-28T17:33:58.757] debug: _step_connect: connect() failed for /var/slurm/spool/d/rtx-06_3679888.4294967291: Connection refused
[2024-01-28T17:33:58.757] debug: Cleaned up stray socket /var/slurm/spool/d/rtx-06_3679888.4294967291
[2024-01-28T17:33:58.757] _handle_stray_script: Purging vestigial job script /var/slurm/spool/d/job3679888/slurm_script
[2024-01-28T17:33:58.757] debug: signal for nonexistent StepId=3679888.batch stepd_connect failed: Connection refused
[2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 were able to be signaled with 18
[2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 to send signal 15
[2024-01-28T17:33:58.757] debug2: set revoke expiration for jobid 3679888 to 1706481358 UTS
[2024-01-28T17:33:58.757] debug: Waiting for job 3679888's prolog to complete
[2024-01-28T17:33:58.757] debug: Finished wait for job 3679888's prolog to complete
[2024-01-28T17:33:58.771] debug: completed epilog for jobid 3679888
[2024-01-28T17:33:58.774] debug: JobId=3679888: sent epilog complete msg: rc = 0
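
In case anyone wants to sweep a node for more of these leftovers, here is a rough sketch (assuming, as above, that the stray processes still carry SLURM_JOB_ID in their environment; sg1526 is just the user from this example):

for pid in $(pgrep -u sg1526); do
  jid=$(tr '\0' '\n' < /proc/$pid/environ 2>/dev/null | sed -n 's/^SLURM_JOB_ID=//p')
  # flag processes whose job id is no longer known to squeue
  if [ -n "$jid" ] && ! squeue -h -j "$jid" 2>/dev/null | grep -q .; then
    echo "orphaned: pid=$pid job=$jid"
  fi
done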

-- Paul Raines (http://help.nmr.mgh.harvard.edu)

--
Fokke Dijkstra <f.dijkstra@rug.nl>
Team High Performance Computing
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA Groningen, The Netherlands