[slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

Paul Raines raines at nmr.mgh.harvard.edu
Tue Jan 30 14:01:59 UTC 2024


I built 23.02.7 and tried that and had the same problems.

BTW, I am using the slurm.spec rpm build method (built on Rocky 8 boxes 
with NVIDIA 535.54.03 proprietary drivers installed).
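
For reference, the build is just the standard spec-file build straight
from the release tarball, something like the following (the tarball
version shown is only an example):

    # standard slurm.spec RPM build; NVML/GPU support gets picked up from
    # whatever NVIDIA driver + CUDA libraries are on the build host
    rpmbuild -ta slurm-23.11.3.tar.bz2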

The behavior I was seeing was: a user would start a GPU job. It was fine 
at first, but at some point the slurmstepd processes for it would 
crash/die.  Sometimes the user processes would die too, sometimes not.  In 
an interactive job you would sometimes see a final line about "malloc" and 
"invalid value", and the terminal would hang until the job was 
'scancel'ed.   'ps' would show a 'slurmd <defunct>' process that was
unkillable (killing the main slurmd process would get rid of it).
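
(The zombie and its parent are easy to spot with something along these
lines; the parent slurmd is what has to be restarted to reap it:)

    # list slurmd processes with their parent PID and state ('Z' = zombie)
    ps -eo pid,ppid,stat,cmd | grep '[s]lurmd'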

How the slurm controller saw the job seemed random.  Sometimes it saw it 
as a crashed job and it would be reported like that in the system. 
Sometimes it was stuck permanently as a CG (completing) job.  Sometimes 
slurmctld did not notice anything wrong and the job just stayed in a 
seemingly perfect running state (I never waited for the TimeLimit to 
hit to see what happened then, but did scancel it).

When I scancelled a "failed" job in the CG or R state, it would not
actually kill the user processes on the node, but it would clear the job
from squeue.
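
A quick way to see that mismatch is to compare the controller's view with
what is actually left on the node (the job id below is just a placeholder,
and rtx-06 is one of my nodes):

    squeue -j <jobid> -o "%i %T %N"    # controller's view: id, state, node
    ssh rtx-06 "ps -eo pid,stat,user,cmd | grep -E '[s]lurmstepd|[s]lurm_script'"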

Jobs on my non-GPU Rocky 8 nodes and on my Ubuntu GPU nodes (slurm 23.11.3 
installs built separately on those boxes) have all been working fine so 
far.  Another difference on the Ubuntu GPU boxes is that they are still 
using NVIDIA 470 drivers.

I tried downgrading a Rocky 8 GPU box to NVIDIA 470, rebuilding
slurm 23.11.3 there, and installing it to see if that fixed things.
It did not.

I then tried installing old 22.05.6 RPMs I had built on my Rocky 8
box back in Nov 2022 on all the Rocky 8 GPU boxes.  This seems to
have fixed the problem and jobs are no longer showing the issue.
Not an ideal solution, but good enough for now.
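
(For anyone wanting to do the same kind of rollback with locally built
RPMs, it amounts to roughly the following; the package file name pattern
is illustrative:)

    # let rpm install packages older than what is currently installed
    rpm -Uvh --oldpackage slurm-*22.05.6*.rpm
    systemctl restart slurmd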

Both the 22.05.6 RPMs and the 23.11.3 RPMs were built on the
same Rocky 8 GPU box, so the differences are:

    1) slurm version, obviously
    2) built using different gcc/library versions, most likely
       due to OS updates between Nov 2022 and now
    3) built with a different NVIDIA driver/CUDA installed
       between then and now (though I am not sure what I had
       in Nov 2022)

I highly suspect #2 or #3 as the underlying issue here, and I wonder if the 
NVML library present at build time is the key (though, like I said, I tried
rebuilding with NVIDIA 470 and that still had the issue).
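
If anyone wants to compare builds, one thing worth checking is which
libnvidia-ml the NVML GPU plugin ended up tied to versus the driver
currently on the node (the plugin path assumes the default RPM layout
here, and ldd only shows it if the plugin is dynamically linked to NVML):

    ldd /usr/lib64/slurm/gpu_nvml.so | grep -i nvidia-ml
    nvidia-smi --query-gpu=driver_version --format=csv,noheader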


-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Tue, 30 Jan 2024 3:36am, Fokke Dijkstra wrote:

>
> We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in
> a completing state and slurmd daemons can't be killed because they are left
> in a CLOSE-WAIT state. See my previous mail to the mailing list for the
> details. And also https://bugs.schedmd.com/show_bug.cgi?id=18561 for
> another site having issues.
> We've now downgraded the clients (slurmd and login nodes) to 23.02.7 which
> gets rid of most issues. If possible, I would try to also downgrade
> slurmctld to an earlier release, but this requires getting rid of all
> running and queued jobs.
>
> Kind regards,
>
> Fokke
>
> On Mon, 29 Jan 2024 at 01:00, Paul Raines <raines at nmr.mgh.harvard.edu> wrote:
>
>> Some more info on what I am seeing after the 23.11.3 upgrade.
>>
>> Here is a case where a job is cancelled but seems permanently
>> stuck in 'CG' state in squeue
>>
>> [2024-01-28T17:34:11.002] debug3: sched: JobId=3679903 initiated
>> [2024-01-28T17:34:11.002] sched: Allocate JobId=3679903 NodeList=rtx-06
>> #CPUs=4 Partition=rtx8000
>> [2024-01-28T17:34:11.002] debug3: create_mmap_buf: loaded file
>> `/var/slurm/spool/ctld/hash.3/job.3679903/script` as buf_t
>> [2024-01-28T17:42:27.724] _slurm_rpc_kill_job: REQUEST_KILL_JOB
>> JobId=3679903 uid 5875902
>> [2024-01-28T17:42:27.725] debug:  email msg to sg1526: Slurm
>> Job_id=3679903 Name=sjob_1246 Ended, Run time 00:08:17, CANCELLED,
>> ExitCode 0
>> [2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job:
>> JobId=3679903 action:normal
>> [2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job:
>> removed JobId=3679903 from part rtx8000 row 0
>> [2024-01-28T17:42:27.726] job_signal: 9 of running JobId=3679903
>> successful 0x8004
>> [2024-01-28T17:43:19.000] Resending TERMINATE_JOB request JobId=3679903
>> Nodelist=rtx-06
>> [2024-01-28T17:44:20.000] Resending TERMINATE_JOB request JobId=3679903
>> Nodelist=rtx-06
>> [2024-01-28T17:45:20.000] Resending TERMINATE_JOB request JobId=3679903
>> Nodelist=rtx-06
>> [2024-01-28T17:46:20.000] Resending TERMINATE_JOB request JobId=3679903
>> Nodelist=rtx-06
>> [2024-01-28T17:47:20.000] Resending TERMINATE_JOB request JobId=3679903
>> Nodelist=rtx-06
>>
>>
>> So at 17:42 the user must have done an scancel.  In the slurmd log on
>> the node I see:
>>
>> [2024-01-28T17:42:27.727] debug:  _rpc_terminate_job: uid = 1150
>> JobId=3679903
>> [2024-01-28T17:42:27.728] debug:  credential for job 3679903 revoked
>> [2024-01-28T17:42:27.728] debug:  _rpc_terminate_job: sent SUCCESS for
>> 3679903, waiting for prolog to finish
>> [2024-01-28T17:42:27.728] debug:  Waiting for job 3679903's prolog to
>> complete
>> [2024-01-28T17:43:19.002] debug:  _rpc_terminate_job: uid = 1150
>> JobId=3679903
>> [2024-01-28T17:44:20.001] debug:  _rpc_terminate_job: uid = 1150
>> JobId=3679903
>> [2024-01-28T17:45:20.002] debug:  _rpc_terminate_job: uid = 1150
>> JobId=3679903
>> [2024-01-28T17:46:20.001] debug:  _rpc_terminate_job: uid = 1150
>> JobId=3679903
>> [2024-01-28T17:47:20.002] debug:  _rpc_terminate_job: uid = 1150
>> JobId=3679903
>>
>> Strange that a prolog is being called on job cancel
>>
>> slurmd seems to be getting the repeated calls from slurmctld to
>> terminate the job, but the termination is not happening.  Also the
>> process table has
>>
>> [root at rtx-06 ~]# ps auxw | grep slurmd
>> root      161784  0.0  0.0 436748 21720 ?        Ssl  13:44   0:00
>> /usr/sbin/slurmd --systemd
>> root      190494  0.0  0.0      0     0 ?        Zs   17:34   0:00
>> [slurmd] <defunct>
>>
>> where there is now a zombie slurmd process I cannot kill even with kill -9
>>
>> If I do a 'systemctl stop slurmd', it takes a long time but eventually
>> stops slurmd and gets rid of the zombie process; however, it also kills
>> the "good" running jobs with NODE_FAIL.
>>
>> Another case is where a job is cancelled and SLURM acts like it is
>> cancelled, with it no longer showing up in squeue, but the processes
>> keep running on the box.
>>
>> # pstree -u sg1526 -p | grep ^slurm
>> slurm_script(185763)---python(185796)-+-{python}(185797)
>> # strings /proc/185763/environ | grep JOB_ID
>> SLURM_JOB_ID=3679888
>> # squeue -j 3679888
>> slurm_load_jobs error: Invalid job id specified
>>
>> sacct shows that job as cancelled.  In the slurmd log we see:
>>
>> [2024-01-28T17:33:58.757] debug:  _rpc_terminate_job: uid = 1150
>> JobId=3679888
>> [2024-01-28T17:33:58.757] debug:  credential for job 3679888 revoked
>> [2024-01-28T17:33:58.757] debug:  _step_connect: connect() failed for
>> /var/slurm/spool/d/rtx-06_3679888.4294967292: Connection refused
>> [2024-01-28T17:33:58.757] debug:  Cleaned up stray socket
>> /var/slurm/spool/d/rtx-06_3679888.4294967292
>> [2024-01-28T17:33:58.757] debug:  signal for nonexistent
>> StepId=3679888.extern stepd_connect failed: Connection refused
>> [2024-01-28T17:33:58.757] debug:  _step_connect: connect() failed for
>> /var/slurm/spool/d/rtx-06_3679888.4294967291: Connection refused
>> [2024-01-28T17:33:58.757] debug:  Cleaned up stray socket
>> /var/slurm/spool/d/rtx-06_3679888.4294967291
>> [2024-01-28T17:33:58.757] _handle_stray_script: Purging vestigial job
>> script /var/slurm/spool/d/job3679888/slurm_script
>> [2024-01-28T17:33:58.757] debug:  signal for nonexistent
>> StepId=3679888.batch stepd_connect failed: Connection refused
>> [2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 were able to
>> be signaled with 18
>> [2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 to send signal
>> 15
>> [2024-01-28T17:33:58.757] debug2: set revoke expiration for jobid 3679888
>> to 1706481358 UTS
>> [2024-01-28T17:33:58.757] debug:  Waiting for job 3679888's prolog to
>> complete
>> [2024-01-28T17:33:58.757] debug:  Finished wait for job 3679888's prolog
>> to complete
>> [2024-01-28T17:33:58.771] debug:  completed epilog for jobid 3679888
>> [2024-01-28T17:33:58.774] debug:  JobId=3679888: sent epilog complete msg:
>> rc = 0
>>
>>
>> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>
>>
>>
>
> -- 
> Fokke Dijkstra <f.dijkstra at rug.nl>
> Team High Performance Computing
> Center for Information Technology, University of Groningen
> Postbus 11044, 9700 CA  Groningen, The Netherlands