[slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

Paul Raines raines at nmr.mgh.harvard.edu
Tue Jan 30 17:19:11 UTC 2024


This is definitely an NVML issue crashing slurmstepd.  Here is what I see
when stracing the slurmstepd: [3681401.0] process at the point the crash
happens:

[pid 1132920] fcntl(10, F_SETFD, FD_CLOEXEC) = 0
[pid 1132920] read(10, "1132950 (bash) S 1132919 1132950"..., 511) = 339
[pid 1132920] openat(AT_FDCWD, "/proc/1132950/status", O_RDONLY) = 12
[pid 1132920] read(12, "Name:\tbash\nUmask:\t0002\nState:\tS "..., 4095) = 1431
[pid 1132920] close(12)                 = 0
[pid 1132920] close(10)                 = 0
[pid 1132920] getpid()                  = 1132919
[pid 1132920] getpid()                  = 1132919
[pid 1132920] getpid()                  = 1132919
[pid 1132920] getpid()                  = 1132919
[pid 1132920] ioctl(16, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x14dc6db683f0) = 0
[pid 1132920] getpid()                  = 1132919
[pid 1132920] ioctl(16, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x14dc6db683f0) = 0
[pid 1132920] ioctl(16, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x14dc6db683f0) = 0
[pid 1132920] writev(2, [{iov_base="free(): invalid next size (fast)", iov_len=32}, {iov_base="\n", iov_len=1}], 2) = 33
[pid 1132924] <... poll resumed>)       = 1 ([{fd=13, revents=POLLIN}])
[pid 1132920] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 1132924] read(13, "free(): invalid next size (fast)"..., 3850) = 34
[pid 1132920] <... mmap resumed>)       = 0x14dc6da6d000
[pid 1132920] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  <unfinished ...>

>From "ls -l /proc/1132919/fd" before the crash happens I can tell
that FileDescriptor 16 is /dev/nvidiactl.  SO it is doing an ioctl
to /dev/nvidiactl write before this crash.  In some cases the error
is a "free(): invalid next size (fast)" and sometimes it is malloc()

Job submitted as:

srun -p rtx6000 -A sysadm -N 1 --ntasks-per-node=1 --mem=20G 
--time=1-10:00:00  --cpus-per-task=2 --gpus=8 --nodelist=rtx-02 --pty 
/bin/bash

Command run in interactive shell is:  gpuburn 30

In this case I am not getting the defunct slurmd process left behind,
but there is a stray 'sleep 100000000' process left behind that I have to kill:

[root at rtx-02 ~]# find $(find /sys/fs/cgroup/ -name job_3681401 ) -name 
cgroup.procs -exec cat {} \; | sort | uniq
1132916
[root at rtx-02 ~]# ps -f -p 1132916
UID          PID    PPID  C STIME TTY          TIME CMD
root     1132916       1  0 11:13 ?        00:00:00 sleep 100000000
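
To clean that up I just kill whatever is still listed in the job's cgroup,
something along the lines of:

   # substitute the job id of the stuck job
   for p in $(find $(find /sys/fs/cgroup/ -name job_3681401) -name cgroup.procs \
              -exec cat {} \; | sort -u); do kill $p; done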


The NVML library I have installed is

$ rpm -qa | grep NVML
nvidia-driver-NVML-535.54.03-1.el8.x86_64

on both the box where the SLURM binaries were built and on this box
where slurmstepd is crashing.  /usr/local/cuda points at CUDA 11.6.
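
It may also be worth double checking which libnvidia-ml library the gpu_nvml
plugin actually resolves to at run time on the node, something like:

   # paths assume the default RPM install locations; adjust for your prefix
   ldd /usr/lib64/slurm/gpu_nvml.so | grep -i nvidia-ml
   ls -l /usr/lib64/libnvidia-ml.so.1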

Okay, I just upgraded the NVIDIA driver on rtx-02 with

   dnf --enablerepo=cuda module reset nvidia-driver
   dnf --enablerepo=cuda module install nvidia-driver:535-dkms

I restarted everything, and in my initial couple of tests the problem
appears to have gone away.  I am going to need to have real users test
with real jobs to be sure.
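
For reference, the "restart everything" step was roughly just the following,
once the new nvidia kernel module was in place:

   # exact steps may differ for your setup
   systemctl restart slurmd
   scontrol update nodename=rtx-02 state=resume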

-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Tue, 30 Jan 2024 9:01am, Paul Raines wrote:

>
> I built 23.02.7 and tried that and had the same problems.
>
> BTW, I am using the slurm.spec rpm build method (built on Rocky 8 boxes with 
> NVIDIA 535.54.03 proprietary drivers installed).
>
> The behavior I was seeing was that a GPU job would start and run fine at 
> first, but at some point its slurmstepd processes would crash/die.  Sometimes 
> the user processes died too, sometimes not.  In an interactive job you would 
> sometimes see a final line mentioning "malloc" and "invalid value", and the 
> terminal would hang until the job was 'scancel'ed.  'ps' would show a 
> 'slurmd <defunct>' process that was unkillable (killing the main slurmd 
> process would get rid of it).
>
> How the Slurm controller saw the job seemed random.  Sometimes it saw it as a 
> crashed job and reported it as such.  Sometimes the job was stuck permanently 
> in the CG (completing) state.  Sometimes slurmctld noticed nothing wrong and 
> the job just stayed in a seemingly perfect running state (I never waited for 
> the TimeLimit to hit to see what happened then; I just scancelled it).
>
> When I scancelled a "failed" job in the CG or R state, it would not actually
> kill the user processes on the node, but it would clear the job from
> squeue.
>
> Jobs on my non-GPU Rocky 8 nodes and on my Ubuntu GPU nodes (slurm 23.11.3 
> built separately on those boxes) have all been working fine so far.  The 
> other difference on the Ubuntu GPU nodes is that they are still using 
> NVIDIA 470 drivers.
>
> I tried downgrading a Rocky 8 GPU box to NVIDIA 470, rebuilding
> slurm 23.11.3 there, and installing it to see if that fixed things.
> It did not.
>
> I then tried installing the old 22.05.6 RPMs I had built on my Rocky 8
> box back in Nov 2022 on all the Rocky 8 GPU boxes.  This seems to
> have fixed the problem and jobs are no longer showing the issue.
> Not an ideal solution, but good enough for now.
>
> Both those 22.05.6 RPMs and the 23.11.3 RPMs were built on the
> same Rocky 8 GPU box.  So the differences are:
>
>    1) slurm version, obviously
>    2) built using different gcc/lib versions, most likely
>       due to OS updates between Nov 2022 and now
>    3) built with a different NVIDIA driver/cuda installed
>       between then and now, though I am not sure what I had
>       in Nov 2022
>
> I highly suspect #2 or #3 as the underlying issue here, and I wonder if the 
> NVML library present at build time is the key (though, like I said, I tried
> rebuilding with NVIDIA 470 and that still had the issue).
>
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
>
>
> On Tue, 30 Jan 2024 3:36am, Fokke Dijkstra wrote:
>
>>
>>  We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in
>>  a completing state and slurmd daemons can't be killed because they are
>>  left
>>  in a CLOSE-WAIT state. See my previous mail to the mailing list for the
>>  details. And also https://bugs.schedmd.com/show_bug.cgi?id=18561 for
>>  another site having issues.
>>  We've now downgraded the clients (slurmd and login nodes) to 23.02.7 which
>>  gets rid of most issues. If possible, I would try to also downgrade
>>  slurmctld to an earlier release, but this requires getting rid of all
>>  running and queued jobs.
>>
>>  Kind regards,
>>
>>  Fokke
>>
>>  On Mon 29 Jan 2024 at 01:00, Paul Raines
>>  <raines at nmr.mgh.harvard.edu> wrote:
>>
>>>  Some more info on what I am seeing after the 23.11.3 upgrade.
>>>
>>>  Here is a case where a job is cancelled but seems permanently
>>>  stuck in 'CG' state in squeue
>>>
>>>  [2024-01-28T17:34:11.002] debug3: sched: JobId=3679903 initiated
>>>  [2024-01-28T17:34:11.002] sched: Allocate JobId=3679903 NodeList=rtx-06 #CPUs=4 Partition=rtx8000
>>>  [2024-01-28T17:34:11.002] debug3: create_mmap_buf: loaded file `/var/slurm/spool/ctld/hash.3/job.3679903/script` as buf_t
>>>  [2024-01-28T17:42:27.724] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=3679903 uid 5875902
>>>  [2024-01-28T17:42:27.725] debug:  email msg to sg1526: Slurm Job_id=3679903 Name=sjob_1246 Ended, Run time 00:08:17, CANCELLED, ExitCode 0
>>>  [2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job: JobId=3679903 action:normal
>>>  [2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job: removed JobId=3679903 from part rtx8000 row 0
>>>  [2024-01-28T17:42:27.726] job_signal: 9 of running JobId=3679903 successful 0x8004
>>>  [2024-01-28T17:43:19.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
>>>  [2024-01-28T17:44:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
>>>  [2024-01-28T17:45:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
>>>  [2024-01-28T17:46:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
>>>  [2024-01-28T17:47:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
>>> 
>>>
>>>  So at 17:42 the user must have done an scancel.  In the slurmd log on the
>>>  node I see:
>>>
>>>  [2024-01-28T17:42:27.727] debug:  _rpc_terminate_job: uid = 1150 JobId=3679903
>>>  [2024-01-28T17:42:27.728] debug:  credential for job 3679903 revoked
>>>  [2024-01-28T17:42:27.728] debug:  _rpc_terminate_job: sent SUCCESS for 3679903, waiting for prolog to finish
>>>  [2024-01-28T17:42:27.728] debug:  Waiting for job 3679903's prolog to complete
>>>  [2024-01-28T17:43:19.002] debug:  _rpc_terminate_job: uid = 1150 JobId=3679903
>>>  [2024-01-28T17:44:20.001] debug:  _rpc_terminate_job: uid = 1150 JobId=3679903
>>>  [2024-01-28T17:45:20.002] debug:  _rpc_terminate_job: uid = 1150 JobId=3679903
>>>  [2024-01-28T17:46:20.001] debug:  _rpc_terminate_job: uid = 1150 JobId=3679903
>>>  [2024-01-28T17:47:20.002] debug:  _rpc_terminate_job: uid = 1150 JobId=3679903
>>>
>>>  Strange that a prolog is being called on job cancel.
>>>
>>>  slurmd seems to be getting the repeated calls from slurmctld to terminate
>>>  the job, but the termination never actually happens.  Also, the process
>>>  table shows:
>>>
>>>  [root at rtx-06 ~]# ps auxw | grep slurmd
>>>  root      161784  0.0  0.0 436748 21720 ?        Ssl  13:44   0:00 /usr/sbin/slurmd --systemd
>>>  root      190494  0.0  0.0      0     0 ?        Zs   17:34   0:00 [slurmd] <defunct>
>>>
>>>  where there is now a zombie slurmd process I cannot kill even with kill
>>>  -9
>>>
>>>  If I do a 'systemctl stop slurmd', it takes a long time but eventually
>>>  stops slurmd and gets rid of the zombie process.  However, it also kills
>>>  the "good" running jobs with NODE_FAIL.
>>>
>>>  Another case is where a job is cancelled and SLURM acts like it is
>>>  cancelled, with the job no longer showing up in squeue, but the processes
>>>  keep running on the box.
>>>
>>>  # pstree -u sg1526 -p | grep ^slurm
>>>  slurm_script(185763)---python(185796)-+-{python}(185797)
>>>  # strings /proc/185763/environ | grep JOB_ID
>>>  SLURM_JOB_ID=3679888
>>>  # squeue -j 3679888
>>>  slurm_load_jobs error: Invalid job id specified
>>>
>>>  sacct shows that job as cancelled.  In the slurmd log we see:
>>>
>>>  [2024-01-28T17:33:58.757] debug:  _rpc_terminate_job: uid = 1150 JobId=3679888
>>>  [2024-01-28T17:33:58.757] debug:  credential for job 3679888 revoked
>>>  [2024-01-28T17:33:58.757] debug:  _step_connect: connect() failed for /var/slurm/spool/d/rtx-06_3679888.4294967292: Connection refused
>>>  [2024-01-28T17:33:58.757] debug:  Cleaned up stray socket /var/slurm/spool/d/rtx-06_3679888.4294967292
>>>  [2024-01-28T17:33:58.757] debug:  signal for nonexistent StepId=3679888.extern stepd_connect failed: Connection refused
>>>  [2024-01-28T17:33:58.757] debug:  _step_connect: connect() failed for /var/slurm/spool/d/rtx-06_3679888.4294967291: Connection refused
>>>  [2024-01-28T17:33:58.757] debug:  Cleaned up stray socket /var/slurm/spool/d/rtx-06_3679888.4294967291
>>>  [2024-01-28T17:33:58.757] _handle_stray_script: Purging vestigial job script /var/slurm/spool/d/job3679888/slurm_script
>>>  [2024-01-28T17:33:58.757] debug:  signal for nonexistent StepId=3679888.batch stepd_connect failed: Connection refused
>>>  [2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 were able to be signaled with 18
>>>  [2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 to send signal 15
>>>  [2024-01-28T17:33:58.757] debug2: set revoke expiration for jobid 3679888 to 1706481358 UTS
>>>  [2024-01-28T17:33:58.757] debug:  Waiting for job 3679888's prolog to complete
>>>  [2024-01-28T17:33:58.757] debug:  Finished wait for job 3679888's prolog to complete
>>>  [2024-01-28T17:33:58.771] debug:  completed epilog for jobid 3679888
>>>  [2024-01-28T17:33:58.774] debug:  JobId=3679888: sent epilog complete msg: rc = 0
>>> 
>>>
>>>  -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>> 
>>> 
>>>
>>
>>  --
>>  Fokke Dijkstra <f.dijkstra at rug.nl> <f.dijkstra at rug.nl>
>>  Team High Performance Computing
>>  Center for Information Technology, University of Groningen
>>  Postbus 11044, 9700 CA  Groningen, The Netherlands



