[slurm-users] after upgrade to 23.11.1 nodes stuck in completion state
Paul Raines
raines at nmr.mgh.harvard.edu
Tue Jan 30 17:19:11 UTC 2024
This is definitely an NVML issue crashing slurmstepd. Here is what I see
when stracing the slurmstepd: [3681401.0] process at the point the
crash happens:
[pid 1132920] fcntl(10, F_SETFD, FD_CLOEXEC) = 0
[pid 1132920] read(10, "1132950 (bash) S 1132919 1132950"..., 511) = 339
[pid 1132920] openat(AT_FDCWD, "/proc/1132950/status", O_RDONLY) = 12
[pid 1132920] read(12, "Name:\tbash\nUmask:\t0002\nState:\tS "..., 4095) =
1431
[pid 1132920] close(12) = 0
[pid 1132920] close(10) = 0
[pid 1132920] getpid() = 1132919
[pid 1132920] getpid() = 1132919
[pid 1132920] getpid() = 1132919
[pid 1132920] getpid() = 1132919
[pid 1132920] ioctl(16, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20),
0x14dc6db683f0) = 0
[pid 1132920] getpid() = 1132919
[pid 1132920] ioctl(16, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20),
0x14dc6db683f0) = 0
[pid 1132920] ioctl(16, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20),
0x14dc6db683f0) = 0
[pid 1132920] writev(2, [{iov_base="free(): invalid next size (fast)",
iov_len=32}, {iov_base="\n", iov_len=1}], 2) = 33
[pid 1132924] <... poll resumed>) = 1 ([{fd=13, revents=POLLIN}])
[pid 1132920] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 1132924] read(13, "free(): invalid next size (fast)"..., 3850) = 34
[pid 1132920] <... mmap resumed>) = 0x14dc6da6d000
[pid 1132920] rt_sigprocmask(SIG_UNBLOCK, [ABRT], <unfinished ...>
>From "ls -l /proc/1132919/fd" before the crash happens I can tell
that FileDescriptor 16 is /dev/nvidiactl. SO it is doing an ioctl
to /dev/nvidiactl write before this crash. In some cases the error
is a "free(): invalid next size (fast)" and sometimes it is malloc()
Job submitted as:
srun -p rtx6000 -A sysadm -N 1 --ntasks-per-node=1 --mem=20G
--time=1-10:00:00 --cpus-per-task=2 --gpus=8 --nodelist=rtx-02 --pty
/bin/bash
The command run in the interactive shell was: gpuburn 30
In this case I am not getting the defunct slurmd process left behind,
but there is a strange 'sleep 100000000' process left behind that I have to kill:
[root at rtx-02 ~]# find $(find /sys/fs/cgroup/ -name job_3681401 ) -name
cgroup.procs -exec cat {} \; | sort | uniq
1132916
[root at rtx-02 ~]# ps -f -p 1132916
UID PID PPID C STIME TTY TIME CMD
root 1132916 1 0 11:13 ? 00:00:00 sleep 100000000
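Cleaning that up by hand is basically just the following (the PID and job id
are from this particular run; rerunning the cgroup check above should then
come back empty):

kill 1132916                   # the stray 'sleep 100000000' left behind
find $(find /sys/fs/cgroup/ -name job_3681401) -name cgroup.procs -exec cat {} \;   # should now print nothing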
The NVML library I have installed is
$ rpm -qa | grep NVML
nvidia-driver-NVML-535.54.03-1.el8.x86_64
on both the box where the Slurm binaries were built and on this box
where slurmstepd is crashing. /usr/local/cuda points at CUDA 11.6.
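To double-check which NVML runtime the Slurm GPU plugin actually resolves
on a node, something like this can help (a sketch; the paths assume the
default RPM install locations for the plugin and the driver library):

ldd /usr/lib64/slurm/gpu_nvml.so | grep -i nvidia-ml   # NVML library the plugin will load
ls -l /usr/lib64/libnvidia-ml.so.1                     # driver-provided NVML runtime on this box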
Okay, I just upgraded the NVIDIA driver on rtx-02 with
dnf --enablerepo=cuda module reset nvidia-driver
dnf --enablerepo=cuda module install nvidia-driver:535-dkms
I restarted everything, and with my initial couple of tests it appears
the problem has gone away. I am going to need to have real users
test with real jobs.
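For anyone following along, a quick sanity check after the driver swap
(a minimal sketch; the exact restart steps may differ on your setup):

nvidia-smi --query-gpu=driver_version --format=csv,noheader   # confirm the node is now on the 535 driver
systemctl restart slurmd                                      # pick up the new libnvidia-ml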
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Tue, 30 Jan 2024 9:01am, Paul Raines wrote:
>
> I built 23.02.7 and tried that and had the same problems.
>
> BTW, I am using the slurm.spec rpm build method (built on Rocky 8 boxes with
> NVIDIA 535.54.03 proprietary drivers installed).
>
> The behavior I was seeing: a user would start a GPU job. It was fine at first,
> but at some point its slurmstepd processes would crash/die. Sometimes the
> user processes would die too, sometimes not. In an interactive job you would
> sometimes see a final line about "malloc" and "invalid value" and the
> terminal would hang until the job was 'scancel'ed. 'ps' would show a
> 'slurmd <defunct>' process that was
> unkillable (killing the main slurmd process would get rid of it).
>
> How the slurm controller saw the job seemed random. Sometimes it saw it as a
> crashed job and it would be reported like that in the system. Sometimes it
> was stuck as a permanently CG (completing) job. Sometimes slurmctld did not
> notice anything wrong and the job just stayed there as a seemingly perfect
> running job (I never waited for the TimeLimit to be hit to see what happened
> then; I just scancel'ed it).
>
> When I scancelled a "failed" job in the CG or R state, it would not actually
> kill the user processes on the node, but it would clear the job from
> squeue.
>
> Jobs on my non-GPU Rocky 8 nodes and on my Ubuntu GPU nodes (slurm 23.11.3
> install built separately on those) have all been working fine so far.
> Another difference is that the Ubuntu GPU boxes are still using NVIDIA 470
> drivers.
>
> I tried downgrading a Rocky 8 GPU box to NVIDIA 470, rebuilding
> slurm 23.11.3 there, and installing it to see if that
> fixed things. It did not.
>
> I then tried installing old 22.05.6 RPMs, which I had built on my Rocky 8
> box back in Nov 2022, on all the Rocky 8 GPU boxes. This seems to
> have fixed the problem and jobs are no longer showing the issue.
> Not an ideal solution but good enough for now.
>
> Both those 22.05.6 RPMs and the 23.11.3 RPMs are built on the
> same Rocky 8 GPU box. So the differences are:
>
> 1) slurm version obviously
> 2) built using different gcc/lib versions, most likely
> due to OS updates between Nov 2022 and now
> 3) built with a different NVIDIA driver/cuda installed
> between then and now but I am not sure what I had
> in Nov 2022
>
> I highly suspect #2 or #3 as the underlying issue here, and I wonder if the
> NVML library present at build time is the key (though, like I said, I tried
> rebuilding with NVIDIA 470 and that still had the issue).
>
>
> -- Paul Raines
> (http://help.nmr.mgh.harvard.edu)
>
>
>
> On Tue, 30 Jan 2024 3:36am, Fokke Dijkstra wrote:
>
>>
>> We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in
>> a completing state and slurmd daemons can't be killed because they are left
>> in a CLOSE-WAIT state. See my previous mail to the mailing list for the
>> details. And also https://bugs.schedmd.com/show_bug.cgi?id=18561 for
>> another site having issues.
>> We've now downgraded the clients (slurmd and login nodes) to 23.02.7 which
>> gets rid of most issues. If possible, I would try to also downgrade
>> slurmctld to an earlier release, but this requires getting rid of all
>> running and queued jobs.
>>
>> Kind regards,
>>
>> Fokke
>>
>> On Mon 29 Jan 2024 at 01:00, Paul Raines
>> <raines at nmr.mgh.harvard.edu> wrote:
>>
>>> Some more info on what I am seeing after the 23.11.3 upgrade.
>>>
>>> Here is a case where a job is cancelled but seems permanently
>>> stuck in the 'CG' state in squeue:
>>>
>>> [2024-01-28T17:34:11.002] debug3: sched: JobId=3679903 initiated
>>> [2024-01-28T17:34:11.002] sched: Allocate JobId=3679903 NodeList=rtx-06
>>> #CPUs=4 Partition=rtx8000
>>> [2024-01-28T17:34:11.002] debug3: create_mmap_buf: loaded file
>>> `/var/slurm/spool/ctld/hash.3/job.3679903/script` as buf_t
>>> [2024-01-28T17:42:27.724] _slurm_rpc_kill_job: REQUEST_KILL_JOB
>>> JobId=3679903 uid 5875902
>>> [2024-01-28T17:42:27.725] debug: email msg to sg1526: Slurm
>>> Job_id=3679903 Name=sjob_1246 Ended, Run time 00:08:17, CANCELLED,
>>> ExitCode 0
>>> [2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job:
>>> JobId=3679903 action:normal
>>> [2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job:
>>> removed JobId=3679903 from part rtx8000 row 0
>>> [2024-01-28T17:42:27.726] job_signal: 9 of running JobId=3679903
>>> successful 0x8004
>>> [2024-01-28T17:43:19.000] Resending TERMINATE_JOB request JobId=3679903
>>> Nodelist=rtx-06
>>> [2024-01-28T17:44:20.000] Resending TERMINATE_JOB request JobId=3679903
>>> Nodelist=rtx-06
>>> [2024-01-28T17:45:20.000] Resending TERMINATE_JOB request JobId=3679903
>>> Nodelist=rtx-06
>>> [2024-01-28T17:46:20.000] Resending TERMINATE_JOB request JobId=3679903
>>> Nodelist=rtx-06
>>> [2024-01-28T17:47:20.000] Resending TERMINATE_JOB request JobId=3679903
>>> Nodelist=rtx-06
>>>
>>>
>>> So at 17:42 the user must have done an scancel. In the slurmd log on the
>>> node I see:
>>>
>>> [2024-01-28T17:42:27.727] debug: _rpc_terminate_job: uid = 1150
>>> JobId=3679903
>>> [2024-01-28T17:42:27.728] debug: credential for job 3679903 revoked
>>> [2024-01-28T17:42:27.728] debug: _rpc_terminate_job: sent SUCCESS for
>>> 3679903, waiting for prolog to finish
>>> [2024-01-28T17:42:27.728] debug: Waiting for job 3679903's prolog to
>>> complete
>>> [2024-01-28T17:43:19.002] debug: _rpc_terminate_job: uid = 1150
>>> JobId=3679903
>>> [2024-01-28T17:44:20.001] debug: _rpc_terminate_job: uid = 1150
>>> JobId=3679903
>>> [2024-01-28T17:45:20.002] debug: _rpc_terminate_job: uid = 1150
>>> JobId=3679903
>>> [2024-01-28T17:46:20.001] debug: _rpc_terminate_job: uid = 1150
>>> JobId=3679903
>>> [2024-01-28T17:47:20.002] debug: _rpc_terminate_job: uid = 1150
>>> JobId=3679903
>>>
>>> It is strange that a prolog is being waited on at all on job cancel.
>>>
>>> slurmd seems to be getting the repeated requests from slurmctld to
>>> terminate the job, but the termination never happens. Also, the process
>>> table shows:
>>>
>>> [root at rtx-06 ~]# ps auxw | grep slurmd
>>> root 161784 0.0 0.0 436748 21720 ? Ssl 13:44 0:00
>>> /usr/sbin/slurmd --systemd
>>> root 190494 0.0 0.0 0 0 ? Zs 17:34 0:00
>>> [slurmd] <defunct>
>>>
>>> where there is now a zombie slurmd process that I cannot kill, even with
>>> kill -9.
>>>
>>> If I do a 'systemctl stop slurmd', it takes a long time but eventually
>>> stops slurmd and gets rid of the zombie process; however, it also kills
>>> the "good" running jobs with NODE_FAIL.
>>>
>>> Another case is where a job is cancelled and SLURM acts as if it is
>>> cancelled, with it no longer showing up in squeue, but the processes keep
>>> running
>>> on the box.
>>>
>>> # pstree -u sg1526 -p | grep ^slurm
>>> slurm_script(185763)---python(185796)-+-{python}(185797)
>>> # strings /proc/185763/environ | grep JOB_ID
>>> SLURM_JOB_ID=3679888
>>> # squeue -j 3679888
>>> slurm_load_jobs error: Invalid job id specified
>>>
>>> sacct shows that job as cancelled. In the slurmd log we see:
>>>
>>> [2024-01-28T17:33:58.757] debug: _rpc_terminate_job: uid = 1150
>>> JobId=3679888
>>> [2024-01-28T17:33:58.757] debug: credential for job 3679888 revoked
>>> [2024-01-28T17:33:58.757] debug: _step_connect: connect() failed for
>>> /var/slurm/spool/d/rtx-06_3679888.4294967292: Connection refused
>>> [2024-01-28T17:33:58.757] debug: Cleaned up stray socket
>>> /var/slurm/spool/d/rtx-06_3679888.4294967292
>>> [2024-01-28T17:33:58.757] debug: signal for nonexistent
>>> StepId=3679888.extern stepd_connect failed: Connection refused
>>> [2024-01-28T17:33:58.757] debug: _step_connect: connect() failed for
>>> /var/slurm/spool/d/rtx-06_3679888.4294967291: Connection refused
>>> [2024-01-28T17:33:58.757] debug: Cleaned up stray socket
>>> /var/slurm/spool/d/rtx-06_3679888.4294967291
>>> [2024-01-28T17:33:58.757] _handle_stray_script: Purging vestigial job
>>> script /var/slurm/spool/d/job3679888/slurm_script
>>> [2024-01-28T17:33:58.757] debug: signal for nonexistent
>>> StepId=3679888.batch stepd_connect failed: Connection refused
>>> [2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 were able to
>>> be signaled with 18
>>> [2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 to send
>>> signal
>>> 15
>>> [2024-01-28T17:33:58.757] debug2: set revoke expiration for jobid 3679888
>>> to 1706481358 UTS
>>> [2024-01-28T17:33:58.757] debug: Waiting for job 3679888's prolog to
>>> complete
>>> [2024-01-28T17:33:58.757] debug: Finished wait for job 3679888's prolog
>>> to complete
>>> [2024-01-28T17:33:58.771] debug: completed epilog for jobid 3679888
>>> [2024-01-28T17:33:58.774] debug: JobId=3679888: sent epilog complete
>>> msg:
>>> rc = 0
>>>
>>>
>>> -- Paul Raines
>>> (http://help.nmr.mgh.harvard.edu)
>>>
>>
>> --
>> Fokke Dijkstra <f.dijkstra at rug.nl>
>> Team High Performance Computing
>> Center for Information Technology, University of Groningen
>> Postbus 11044, 9700 CA Groningen, The Netherlands