I finally had downtime on our cluster running 20.11.3 and decided to upgrade SLURM. All daemons were stopped on the nodes and the master. The Rocky 8 Linux OS was updated, but the configuration was not changed in any way.
On the master, when I first installed 23.11.1 and tried to run slurmdbd -D -vvv at the command line, it balked at the more-than-two-major-version jump. I then installed 22.05.6 and ran slurmdbd -D -vvv, which took a long time but completed fine. I reinstalled 23.11.1 and ran slurmdbd -D -vvv, which went quickly. I stopped it and then started slurmdbd normally with systemctl.
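In case it helps anyone else planning the same jump, the sequence boiled down to roughly the following (a sketch from memory; the package install/remove steps between the conversions are omitted):

systemctl stop slurmdbd slurmctld
# install the intermediate 22.05.x packages, then convert the accounting DB
slurmdbd -D -vvv     # wait for the conversion to finish and the daemon to go idle, then Ctrl-C
# install the 23.11.x packages and convert again
slurmdbd -D -vvv     # much quicker this time, then Ctrl-C
systemctl start slurmdbd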
When I ran slurmctld it complained about incompatible state files. I didn't care about losing state, so I ran it with the -i option to ignore them. It seemed happy.
I then started up slurmd on my various GPU and non-GPU nodes. I ran a test interactive job just to see if I would get a shell with the expected SLURM environment on the nodes, and it worked, except on the GPU nodes I got errors about not being able to get/set GPU frequencies. I removed GpuFreqDef=medium from slurm.conf and that went away.
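For completeness, after taking GpuFreqDef out of slurm.conf and syncing the file to the nodes, a plain reconfigure should be enough to push the change out (a sketch; I have not double-checked whether 23.11 wants a slurmd restart for this particular option):

scontrol reconfigure
scontrol show config | grep -i gpufreq   # sanity check that the old setting is gone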
I had my people start running jobs last night.
This morning I came in and saw that a lot of nodes are stuck in the completing state. All of these are GPU nodes.
For example, on node 'sevilla', squeue reports:
  JOBID   PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
  3679433 lcna100   sjob_123 sg1526 CG 8:29 1     sevilla
  3679432 lcna100   sjob_123 sg1526 CG 8:35 1     sevilla
  3679431 lcna100   sjob_123 sg1526 CG 8:42 1     sevilla
In the slurmd.log on sevilla I see the following lines every minute:
[2024-01-28T09:37:52.003] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug: _rpc_terminate_job: uid = 1150 JobId=3679432
[2024-01-28T09:37:52.004] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug: _rpc_terminate_job: uid = 1150 JobId=3679431
[2024-01-28T09:37:52.004] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug: _rpc_terminate_job: uid = 1150 JobId=3679433
[2024-01-28T09:37:52.004] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
Back at the time 3679433 was cancelled by the user, I see:
[2024-01-28T00:27:48.658] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T00:27:48.658] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T00:27:48.658] debug: _rpc_terminate_job: uid = 1150 JobId=3679433
[2024-01-28T00:27:48.658] debug: credential for job 3679433 revoked
[2024-01-28T00:27:48.658] debug: _rpc_terminate_job: sent SUCCESS for 3679433, waiting for prolog to finish
[2024-01-28T00:27:48.658] debug: Waiting for job 3679433's prolog to complete
On sevilla itself there are no processes currently running for user sg1526, so there is no sign of these jobs. There ARE processes running for user mu40, though:
[root@sevilla ~]# pgrep -f slurm_script | xargs -n1 pstree -p | grep ^slurm
slurm_script(30691)---starter-suid(30764)-+-python3(30784)-+-{python3}(30817)
slurm_script(30861)---starter-suid(30873)-+-python3(30888)-+-{python3}(30915)
slurm_script(30963)---starter-suid(30975)-+-python3(30992)-+-{python3}(31020)
[root@sevilla ~]# strings /proc/30691/cmdline
/bin/sh
/var/slurm/spool/d/job3679220/slurm_script
/usr/bin/sacct -p --jobs=3679220 --format=JobID,User,ReqTRES,NodeList,Start,End,Elapsed,CPUTime,State,ExitCode
JobID|User|ReqTRES|NodeList|Start|End|Elapsed|CPUTime|State|ExitCode|
3679220|mu40|billing=6,cpu=3,gres/gpu=1,mem=32G,node=1|sevilla|2024-01-27T22:33:27|2024-01-27T23:26:39|00:53:12|02:39:36|NODE_FAIL|1:0|
3679220.batch|||sevilla|2024-01-27T22:33:27|2024-01-27T23:26:39|00:53:12|02:39:36|CANCELLED||
3679220.extern|||sevilla|2024-01-27T22:33:27|2024-01-27T23:26:39|00:53:12|02:39:36|CANCELLED||
At the time it started, the slurmd.log has lines like:
[2024-01-27T22:33:28.679] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30691, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.092] [3679220.extern] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30650, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.166] [3679220.batch] debug2: profile signaling type Task
[2024-01-27T22:33:58.168] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30691, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.170] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30764, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.172] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30784, probably hasn't started yet or has already finished
...
[2024-01-27T22:46:39.018] debug: _step_connect: connect() failed for /var/slurm/spool/d/sevilla_3679220.4294967292: Connection refused
[2024-01-27T22:46:39.018] debug: Cleaned up stray socket /var/slurm/spool/d/sevilla_3679220.4294967292
[2024-01-27T22:46:39.018] debug: _step_connect: connect() failed for /var/slurm/spool/d/sevilla_3679220.4294967291: Connection refused
[2024-01-27T22:46:39.018] debug: Cleaned up stray socket /var/slurm/spool/d/sevilla_3679220.4294967291
[2024-01-27T22:46:39.018] _handle_stray_script: Purging vestigial job script /var/slurm/spool/d/job3679220/slurm_script
Then later, at the time of termination:
[2024-01-27T23:26:39.042] debug: _rpc_terminate_job: uid = 1150 JobId=3679220
[2024-01-27T23:26:39.042] debug: credential for job 3679220 revoked
[2024-01-27T23:26:39.042] debug2: No steps in jobid 3679220 to send signal 998
[2024-01-27T23:26:39.043] debug2: No steps in jobid 3679220 to send signal 18
[2024-01-27T23:26:39.043] debug2: No steps in jobid 3679220 to send signal 15
[2024-01-27T23:26:39.044] debug2: set revoke expiration for jobid 3679220 to 1706416119 UTS
[2024-01-27T23:26:39.044] debug: Waiting for job 3679220's prolog to complete
[2024-01-27T23:26:39.044] debug: Finished wait for job 3679220's prolog to complete
[2024-01-27T23:26:39.068] debug: completed epilog for jobid 3679220
[2024-01-27T23:26:39.071] debug: JobId=3679220: sent epilog complete msg: rc = 0
So I have a node stuck in the 'comp' state that claims to have jobs completing that are definitely NOT running on the box, BUT there are jobs running on the box that SLURM thinks are done.
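The only workaround I can think of for nodes wedged like this, short of rebooting them, is the usual down/resume dance, which of course would also take out the mu40 jobs that really are running there (a sketch, not something I have verified is safe under 23.11):

scontrol update NodeName=sevilla State=DOWN Reason="stuck CG after 23.11 upgrade"
scontrol update NodeName=sevilla State=RESUME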
---------------------------------------------------------------
Paul Raines                     http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street     Charlestown, MA 02129            USA
Some more info on what I am seeing after the 23.11.3 upgrade.
Here is a case where a job is cancelled but seems permanently stuck in the 'CG' state in squeue:
[2024-01-28T17:34:11.002] debug3: sched: JobId=3679903 initiated
[2024-01-28T17:34:11.002] sched: Allocate JobId=3679903 NodeList=rtx-06 #CPUs=4 Partition=rtx8000
[2024-01-28T17:34:11.002] debug3: create_mmap_buf: loaded file `/var/slurm/spool/ctld/hash.3/job.3679903/script` as buf_t
[2024-01-28T17:42:27.724] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=3679903 uid 5875902
[2024-01-28T17:42:27.725] debug: email msg to sg1526: Slurm Job_id=3679903 Name=sjob_1246 Ended, Run time 00:08:17, CANCELLED, ExitCode 0
[2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job: JobId=3679903 action:normal
[2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job: removed JobId=3679903 from part rtx8000 row 0
[2024-01-28T17:42:27.726] job_signal: 9 of running JobId=3679903 successful 0x8004
[2024-01-28T17:43:19.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:44:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:45:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:46:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:47:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
So at 17:42 the user must have done an scancel. In the slurmd log on the node I see:
[2024-01-28T17:42:27.727] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:42:27.728] debug: credential for job 3679903 revoked
[2024-01-28T17:42:27.728] debug: _rpc_terminate_job: sent SUCCESS for 3679903, waiting for prolog to finish
[2024-01-28T17:42:27.728] debug: Waiting for job 3679903's prolog to complete
[2024-01-28T17:43:19.002] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:44:20.001] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:45:20.002] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:46:20.001] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:47:20.002] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
It is strange that a prolog is being called on job cancel.
slurmd seems to be getting the repeated calls from slurmctld to terminate the job, but it is not happening. Also, the process table has:
[root@rtx-06 ~]# ps auxw | grep slurmd
root      161784  0.0  0.0 436748 21720 ?  Ssl  13:44   0:00 /usr/sbin/slurmd --systemd
root      190494  0.0  0.0      0     0 ?  Zs   17:34   0:00 [slurmd] <defunct>
where there is now a zombie slurmd process that I cannot kill even with kill -9.
If I do a 'systemctl stop slurmd' it takes a long time, but it eventually stops slurmd and gets rid of the zombie process. However, it also kills the "good" running jobs with NODE_FAIL.
Another case is where a job is cancelled and SLURM acts like it is cancelled, with it not showing up in squeue, but the processes keep running on the box.
# pstree -u sg1526 -p | grep ^slurm
slurm_script(185763)---python(185796)-+-{python}(185797)
# strings /proc/185763/environ | grep JOB_ID
SLURM_JOB_ID=3679888
# squeue -j 3679888
slurm_load_jobs error: Invalid job id specified
sacct shows that job as cancelled. In the slurmd log we see:
[2024-01-28T17:33:58.757] debug: _rpc_terminate_job: uid = 1150 JobId=3679888
[2024-01-28T17:33:58.757] debug: credential for job 3679888 revoked
[2024-01-28T17:33:58.757] debug: _step_connect: connect() failed for /var/slurm/spool/d/rtx-06_3679888.4294967292: Connection refused
[2024-01-28T17:33:58.757] debug: Cleaned up stray socket /var/slurm/spool/d/rtx-06_3679888.4294967292
[2024-01-28T17:33:58.757] debug: signal for nonexistent StepId=3679888.extern stepd_connect failed: Connection refused
[2024-01-28T17:33:58.757] debug: _step_connect: connect() failed for /var/slurm/spool/d/rtx-06_3679888.4294967291: Connection refused
[2024-01-28T17:33:58.757] debug: Cleaned up stray socket /var/slurm/spool/d/rtx-06_3679888.4294967291
[2024-01-28T17:33:58.757] _handle_stray_script: Purging vestigial job script /var/slurm/spool/d/job3679888/slurm_script
[2024-01-28T17:33:58.757] debug: signal for nonexistent StepId=3679888.batch stepd_connect failed: Connection refused
[2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 were able to be signaled with 18
[2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 to send signal 15
[2024-01-28T17:33:58.757] debug2: set revoke expiration for jobid 3679888 to 1706481358 UTS
[2024-01-28T17:33:58.757] debug: Waiting for job 3679888's prolog to complete
[2024-01-28T17:33:58.757] debug: Finished wait for job 3679888's prolog to complete
[2024-01-28T17:33:58.771] debug: completed epilog for jobid 3679888
[2024-01-28T17:33:58.774] debug: JobId=3679888: sent epilog complete msg: rc = 0
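One way to clean up the orphaned processes by hand would be to walk the job's leftover cgroup rather than guessing from pstree (a sketch; the exact cgroup layout differs between cgroup v1 and v2):

find /sys/fs/cgroup -path '*job_3679888*' -name cgroup.procs \
    -exec cat {} \; 2>/dev/null | sort -u | xargs -r kill -9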
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in a completing state and slurmd daemons can't be killed because they are left in a CLOSE-WAIT state. See my previous mail to the mailing list for the details. And also https://bugs.schedmd.com/show_bug.cgi?id=18561 for another site having issues. We've now downgraded the clients (slurmd and login nodes) to 23.02.7 which gets rid of most issues. If possible, I would try to also downgrade slurmctld to an earlier release, but this requires getting rid of all running and queued jobs.
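A quick way to spot an affected node is to look for slurmd/slurmstepd sockets left in CLOSE-WAIT, for example with something like (a rough check, not a diagnosis):

ss -tanp | grep -i close-wait | grep -E 'slurmd|slurmstepd'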
Kind regards,
Fokke
Bug 18561 was submitted by a user with no support contract, so it's unlikely that SchedMD will look into it.
I guess many sites are considering the upgrade to 23.11, and if there is an issue as reported, a site with a valid support contract needs to open a support case. I'm very interested in hearing about any progress with 23.11!
Thanks, Ole
This is scary news. I just updated to 23.11.1, but couldn't confirm the problems described so far. I'll do some more extensive and intensive tests. In case of disaster: does anyone know how to roll back the DB, as some new DB 'object' attributes are introduced in 23.11.1? I never had the chance to do this before :-0 As we have a support contract I would open a ticket.
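I guess the minimum precaution is a plain dump of the accounting database taken while slurmdbd is stopped, so a "rollback" would mean reinstalling the old slurmdbd and restoring that dump rather than trying to undo the schema changes in place. Roughly, assuming the default database name slurm_acct_db:

systemctl stop slurmdbd
mysqldump --single-transaction slurm_acct_db > slurm_acct_db-pre-23.11.sql
systemctl start slurmdbd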
I built 23.02.7 and tried that and had the same problems.
BTW, I am using the slurm.spec RPM build method (built on Rocky 8 boxes with the NVIDIA 535.54.03 proprietary drivers installed).
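The build itself is just the stock spec, roughly the line below; as far as I understand, the NVML-enabled gpu plugin only gets built if the NVIDIA NVML headers/library are present on the build box at that moment, which is why the driver version at build time matters:

rpmbuild -ta slurm-23.11.3.tar.bz2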
The behavior I was seeing was that one would start a GPU job. It was fine at first, but at some point the slurmstepd processes for it would crash/die. Sometimes the user process for it would die too, sometimes not. In an interactive job you would sometimes see some final line about "malloc" and "invalid value", and the terminal would hang until the job was 'scancel'ed. 'ps' would show a 'slurmd <defunct>' process that was unkillable (killing the main slurmd process would get rid of it).
How the slurm controller saw the job seemed random. Sometimes it saw it as a crashed job and it would be reported like that in the system. Sometimes it was stuck as a permanently CG (completing) job. Sometimes it did not notice anything wrong and the job just stayed as a seemingly perfect running job according to slurmctld (I never waited for the TimeLimit to hit to see what happened then, but did scancel).
I scancelled a "failed" job in the CG or R state, it would not actually kill the user processes on the node but it would clear the job from squeue.
Jobs on my non-GPU Rocky 8 nodes and on my Ubuntu GPU nodes (slurm 23.11.3 built separately on those) have all been working fine so far. The other difference on the Ubuntu GPU boxes is that they are still using NVIDIA 470 drivers.
I tried downgrading a Rocky 8 GPU box to NVIDIA 470 and rebuilding slurm 23.11.3 there and installing it to see if that worked to fix things. It did not.
I then tried installing old 22.05.6 RPMs I had built on my Rocky 8 box back in Nov 2022 on all the Rocky 8 GPU boxes. This seems to have fixed the problem and jobs are no longer showing the issue. Not an ideal solution, but good enough for now.
Both those 22.05.6 RPMs and the 23.11.3 RPMs are built on the same Rocky 8 GPU box. So the differences are:
1) slurm version, obviously
2) built using different gcc/lib versions, most likely due to OS updates between Nov 2022 and now
3) built with a different NVIDIA driver/CUDA installed between then and now, but I am not sure what I had in Nov 2022
I highly suspect #2 or #3 as the underlying issue here, and I wonder if the NVML library at the time of build is the key (though like I said, I tried rebuilding with NVIDIA 470 and that still had the issue).
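A quick way to check which libnvidia-ml the gpu_nvml plugin actually ends up with on a given box would be something like the following (assuming the plugin links the library dynamically rather than dlopen()ing it, and the default RPM plugin path):

ldd /usr/lib64/slurm/gpu_nvml.so | grep -i nvidia-ml
rpm -qf $(readlink -f /usr/lib64/libnvidia-ml.so.1)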
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
This is definitely an NVML thing crashing slurmstepd. Here is what I find doing an strace of the slurmstepd: [3681401.0] process at the point the crash happens:
[pid 1132920] fcntl(10, F_SETFD, FD_CLOEXEC) = 0
[pid 1132920] read(10, "1132950 (bash) S 1132919 1132950"..., 511) = 339
[pid 1132920] openat(AT_FDCWD, "/proc/1132950/status", O_RDONLY) = 12
[pid 1132920] read(12, "Name:\tbash\nUmask:\t0002\nState:\tS "..., 4095) = 1431
[pid 1132920] close(12) = 0
[pid 1132920] close(10) = 0
[pid 1132920] getpid() = 1132919
[pid 1132920] getpid() = 1132919
[pid 1132920] getpid() = 1132919
[pid 1132920] getpid() = 1132919
[pid 1132920] ioctl(16, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x14dc6db683f0) = 0
[pid 1132920] getpid() = 1132919
[pid 1132920] ioctl(16, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x14dc6db683f0) = 0
[pid 1132920] ioctl(16, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x14dc6db683f0) = 0
[pid 1132920] writev(2, [{iov_base="free(): invalid next size (fast)", iov_len=32}, {iov_base="\n", iov_len=1}], 2) = 33
[pid 1132924] <... poll resumed>) = 1 ([{fd=13, revents=POLLIN}])
[pid 1132920] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 1132924] read(13, "free(): invalid next size (fast)"..., 3850) = 34
[pid 1132920] <... mmap resumed>) = 0x14dc6da6d000
[pid 1132920] rt_sigprocmask(SIG_UNBLOCK, [ABRT], <unfinished ...>
From "ls -l /proc/1132919/fd" before the crash happens I can tell
that FileDescriptor 16 is /dev/nvidiactl. SO it is doing an ioctl to /dev/nvidiactl write before this crash. In some cases the error is a "free(): invalid next size (fast)" and sometimes it is malloc()
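For the record, this was just strace attached to the running step daemon, roughly like the following (the pgrep pattern is an illustration, assuming the process title shows up in ps as 'slurmstepd: [3681401.0]'):

strace -f -tt -p $(pgrep -f 'slurmstepd: \[3681401\.0\]') -o /tmp/slurmstepd-3681401.trace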
Job submitted as:
srun -p rtx6000 -A sysadm -N 1 --ntasks-per-node=1 --mem=20G --time=1-10:00:00 --cpus-per-task=2 --gpus=8 --nodelist=rtx-02 --pty /bin/bash
Command run in interactive shell is: gpuburn 30
In this case I am not getting the defunct slurmd process left behind, but there is a strange 'sleep 100000000' left behind that I have to kill:
[root@rtx-02 ~]# find $(find /sys/fs/cgroup/ -name job_3681401) -name cgroup.procs -exec cat {} \; | sort | uniq
1132916
[root@rtx-02 ~]# ps -f -p 1132916
UID          PID    PPID  C STIME TTY          TIME CMD
root     1132916       1  0 11:13 ?        00:00:00 sleep 100000000
The NVML library I have installed is
$ rpm -qa | grep NVML
nvidia-driver-NVML-535.54.03-1.el8.x86_64
on both the box where the SLURM binaries were built and on this box where slurmstepd is crashing. /usr/local/cuda has CUDA 11.6 in it.
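To rule out a mismatch between the kernel driver actually loaded and the NVML userspace package, comparing the two on each box is quick (package name per the RPM above):

nvidia-smi --query-gpu=driver_version --format=csv,noheader   # what the running driver reports
rpm -q --queryformat '%{VERSION}\n' nvidia-driver-NVML        # what the NVML package claims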
Okay, I just upgraded the NVIDIA driver on rtx-02 with
dnf --enablerepo=cuda module reset nvidia-driver
dnf --enablerepo=cuda module install nvidia-driver:535-dkms
Restarted everything, and with my initial couple of tests it appears the problem has gone away. I am going to need to have real users test with real jobs.
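A reasonable sanity check before handing the node back would be a short GPU job plus a scan of the slurmd log for the heap-corruption messages (the log path is whatever SlurmdLogFile points at; /var/log/slurm/slurmd.log below is an assumption):

srun -p rtx6000 -A sysadm --gpus=1 --mem=4G --time=10:00 nvidia-smi
grep -E 'free\(\): invalid|malloc|_get_gpuutil' /var/log/slurm/slurmd.log | tail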
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
Several jobs have run on the node with the upgraded NVIDIA driver and have gone through fine.
This issue may be a bug in the NVML library itself for the nvidia-driver-NVML-535.54.03-1.el8.x86_64 driver, or in SLURM 23.11's new NVML code that fetches GPU smUtil (_get_gpuutil) in gpu_nvml.c.
When I used SLURM 22.05.6, which does not have that NVML code, the crash in slurmstepd went away. But it also went away in 23.11.3 after updating the NVIDIA driver to nvidia-driver-NVML-535.154.05-1.el8.x86_64. That could just be "tickling" the stack enough to avoid the invalid malloc/free caused by a bug in the gpu_nvml.c code.
So it is not truly clear where the underlying issue really lies, but it seems most likely to be a bug in the older version of NVML I had installed.
Ideally, though, SLURM would handle crashing slurmstepd processes more gracefully.
-- Paul Raines (http://help.nmr.mgh.harvard.edu)