<div dir="ltr">We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in a completing state and slurmd daemons can't be killed because they are left in a CLOSE-WAIT state. See my previous mail to the mailing list for the details. And also <a href="https://bugs.schedmd.com/show_bug.cgi?id=18561">https://bugs.schedmd.com/show_bug.cgi?id=18561</a> for another site having issues.<div>We've now downgraded the clients (slurmd and login nodes) to 23.02.7 which gets rid of most issues. If possible, I would try to also downgrade slurmctld to an earlier release, but this requires getting rid of all running and queued jobs.</div><div><br></div><div>Kind regards,</div><div><br></div><div>Fokke</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Op ma 29 jan 2024 om 01:00 schreef Paul Raines <<a href="mailto:raines@nmr.mgh.harvard.edu" target="_blank">raines@nmr.mgh.harvard.edu</a>>:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Some more info on what I am seeing after the 23.11.3 upgrade.<br>
<br>
Here is a case where a job is cancelled but seems permanently<br>
stuck in 'CG' state in squeue<br>
<br>
[2024-01-28T17:34:11.002] debug3: sched: JobId=3679903 initiated
[2024-01-28T17:34:11.002] sched: Allocate JobId=3679903 NodeList=rtx-06 #CPUs=4 Partition=rtx8000
[2024-01-28T17:34:11.002] debug3: create_mmap_buf: loaded file `/var/slurm/spool/ctld/hash.3/job.3679903/script` as buf_t
[2024-01-28T17:42:27.724] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=3679903 uid 5875902
[2024-01-28T17:42:27.725] debug: email msg to sg1526: Slurm Job_id=3679903 Name=sjob_1246 Ended, Run time 00:08:17, CANCELLED, ExitCode 0
[2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job: JobId=3679903 action:normal
[2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job: removed JobId=3679903 from part rtx8000 row 0
[2024-01-28T17:42:27.726] job_signal: 9 of running JobId=3679903 successful 0x8004
[2024-01-28T17:43:19.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:44:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:45:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:46:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:47:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06

So at 17:42 the user must have done an scancel. In the slurmd log on the node I see:

[2024-01-28T17:42:27.727] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:42:27.728] debug: credential for job 3679903 revoked
[2024-01-28T17:42:27.728] debug: _rpc_terminate_job: sent SUCCESS for 3679903, waiting for prolog to finish
[2024-01-28T17:42:27.728] debug: Waiting for job 3679903's prolog to complete
[2024-01-28T17:43:19.002] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:44:20.001] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:45:20.002] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:46:20.001] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:47:20.002] debug: _rpc_terminate_job: uid = 1150 JobId=3679903

Strange that a prolog is being called on a job cancel.

slurmd seems to be getting the repeated terminate requests from slurmctld, but the job is never actually terminated. Also the process table has:

[root@rtx-06 ~]# ps auxw | grep slurmd
root 161784 0.0 0.0 436748 21720 ? Ssl 13:44 0:00 /usr/sbin/slurmd --systemd
root 190494 0.0 0.0 0 0 ? Zs 17:34 0:00 [slurmd] <defunct>

where there is now a zombie slurmd process I cannot kill even with kill -9.
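
In case it helps anyone debug this further, roughly the following should show whether the defunct process is still parented by the main slurmd and whether slurmd is holding TCP connections stuck in CLOSE-WAIT (an untested sketch; the PIDs are taken from the ps output above):

ps -o pid,ppid,stat,cmd -p 190494       # is the <defunct> slurmd a child of the main slurmd (PID 161784)?
ss -tnp state close-wait | grep slurmd  # any slurmd sockets left in CLOSE-WAIT?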

If I do a 'systemctl stop slurmd', it takes a long time but eventually stops slurmd and gets rid of the zombie process; however, it also kills the "good" running jobs with NODE_FAIL.
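
To get an idea of how many jobs are stuck like this across the cluster, something like the following should list everything in the completing state (just a sketch; adjust the format string to taste):

squeue --states=COMPLETING --format="%.12i %.10P %.10u %.12M %R"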

Another case is where a job is cancelled and SLURM acts as if it is cancelled, with the job no longer showing up in squeue, but the processes keep running on the node:

# pstree -u sg1526 -p | grep ^slurm
slurm_script(185763)---python(185796)-+-{python}(185797)
# strings /proc/185763/environ | grep JOB_ID
SLURM_JOB_ID=3679888
# squeue -j 3679888
slurm_load_jobs error: Invalid job id specified
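
Two more checks that might be useful on the node (a sketch; I'm not certain what listpids reports once the stepds are gone, but it should at least show whether slurmd still tracks anything for the job):

scontrol listpids 3679888    # does slurmd still know about any steps for this job?
cat /proc/185763/cgroup      # is the leftover slurm_script still inside the job's cgroup?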

sacct shows that job as CANCELLED. In the slurmd log we see:

[2024-01-28T17:33:58.757] debug: _rpc_terminate_job: uid = 1150 JobId=3679888
[2024-01-28T17:33:58.757] debug: credential for job 3679888 revoked
[2024-01-28T17:33:58.757] debug: _step_connect: connect() failed for /var/slurm/spool/d/rtx-06_3679888.4294967292: Connection refused
[2024-01-28T17:33:58.757] debug: Cleaned up stray socket /var/slurm/spool/d/rtx-06_3679888.4294967292
[2024-01-28T17:33:58.757] debug: signal for nonexistent StepId=3679888.extern stepd_connect failed: Connection refused
[2024-01-28T17:33:58.757] debug: _step_connect: connect() failed for /var/slurm/spool/d/rtx-06_3679888.4294967291: Connection refused
[2024-01-28T17:33:58.757] debug: Cleaned up stray socket /var/slurm/spool/d/rtx-06_3679888.4294967291
[2024-01-28T17:33:58.757] _handle_stray_script: Purging vestigial job script /var/slurm/spool/d/job3679888/slurm_script
[2024-01-28T17:33:58.757] debug: signal for nonexistent StepId=3679888.batch stepd_connect failed: Connection refused
[2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 were able to be signaled with 18
[2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 to send signal 15
[2024-01-28T17:33:58.757] debug2: set revoke expiration for jobid 3679888 to 1706481358 UTS
[2024-01-28T17:33:58.757] debug: Waiting for job 3679888's prolog to complete
[2024-01-28T17:33:58.757] debug: Finished wait for job 3679888's prolog to complete
[2024-01-28T17:33:58.771] debug: completed epilog for jobid 3679888
[2024-01-28T17:33:58.774] debug: JobId=3679888: sent epilog complete msg: rc = 0
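
In case anyone wants to sweep a node for more of these leftovers, here is a rough sketch (assuming, as above, that the stray processes still carry SLURM_JOB_ID in their environment; sg1526 is just the user from this example):

for pid in $(pgrep -u sg1526); do
  jid=$(tr '\0' '\n' < /proc/$pid/environ 2>/dev/null | sed -n 's/^SLURM_JOB_ID=//p')
  # flag processes whose job id is no longer known to squeue
  if [ -n "$jid" ] && ! squeue -h -j "$jid" 2>/dev/null | grep -q .; then
    echo "orphaned: pid=$pid job=$jid"
  fi
done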

-- Paul Raines (http://help.nmr.mgh.harvard.edu)

--
Fokke Dijkstra <f.dijkstra@rug.nl>
Team High Performance Computing
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA Groningen, The Netherlands