[slurm-users] Issues with orphaned jobs after update
Jeffrey McDonald
jmcdonal at umn.edu
Wed Dec 6 14:26:57 UTC 2023
Hi,
Yesterday, an upgrade of Slurm from 22.05.4 to 23.11.0 went sideways and I
ended up losing a number of jobs on the compute nodes. Ultimately the
installation seems to have been successful, but it appears I am now left with
some job remnants. About once per minute (per job), the slurmctld daemon
logs:
[2023-12-06T08:16:32.505] error: slurm_receive_msg [146.57.133.18:39104]: Zero Bytes were transmitted or received
[2023-12-06T08:16:32.505] error: slurm_receive_msg [146.57.133.18:39106]: Zero Bytes were transmitted or received
[2023-12-06T08:16:32.792] error: slurm_receive_msg [146.57.133.38:54722]: Zero Bytes were transmitted or received
[2023-12-06T08:16:34.189] error: slurm_receive_msg [146.57.133.49:59058]: Zero Bytes were transmitted or received
[2023-12-06T08:16:34.197] error: slurm_receive_msg [146.57.133.49:58232]: Zero Bytes were transmitted or received
[2023-12-06T08:16:35.757] error: slurm_receive_msg [146.57.133.39:48856]: Zero Bytes were transmitted or received
[2023-12-06T08:16:35.757] error: slurm_receive_msg [146.57.133.39:48860]: Zero Bytes were transmitted or received
[2023-12-06T08:16:36.329] error: slurm_receive_msg [146.57.133.46:50848]: Zero Bytes were transmitted or received
[2023-12-06T08:16:59.827] error: slurm_receive_msg [146.57.133.14:60328]: Zero Bytes were transmitted or received
[2023-12-06T08:16:59.828] error: slurm_receive_msg [146.57.133.37:37734]: Zero Bytes were transmitted or received
[2023-12-06T08:17:03.285] error: slurm_receive_msg [146.57.133.35:41426]: Zero Bytes were transmitted or received
[2023-12-06T08:17:13.244] error: slurm_receive_msg [146.57.133.105:34416]: Zero Bytes were transmitted or received
[2023-12-06T08:17:13.726] error: slurm_receive_msg [146.57.133.15:60164]: Zero Bytes were transmitted or received
The controller also shows orphaned jobs:
[2023-12-06T07:47:42.010] error: Orphan StepId=9050.extern reported on node amd03
[2023-12-06T07:47:42.010] error: Orphan StepId=9055.extern reported on node amd03
[2023-12-06T07:47:42.011] error: Orphan StepId=8862.extern reported on node amd12
[2023-12-06T07:47:42.011] error: Orphan StepId=9065.extern reported on node amd07
[2023-12-06T07:47:42.011] error: Orphan StepId=9066.extern reported on node amd07
[2023-12-06T07:47:42.011] error: Orphan StepId=8987.extern reported on node amd09
[2023-12-06T07:47:42.012] error: Orphan StepId=9068.extern reported on node amd08
[2023-12-06T07:47:42.012] error: Orphan StepId=8862.extern reported on node amd13
[2023-12-06T07:47:42.012] error: Orphan StepId=8774.extern reported on node amd10
[2023-12-06T07:47:42.012] error: Orphan StepId=9051.extern reported on node amd10
[2023-12-06T07:49:22.009] error: Orphan StepId=9071.extern reported on node aslab01
[2023-12-06T07:49:22.010] error: Orphan StepId=8699.extern reported on node gpu05
On the compute nodes, I see a corresponding error message like this one:
[2023-12-06T08:18:03.292] [9052.extern] error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
[2023-12-06T08:18:03.292] [9052.extern] error: slurm_send_node_msg: hash_g_compute: REQUEST_STEP_COMPLETE has error
The error always seems to reference a job that was canceled, e.g. 9052:
# sacct -j 9052
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
9052         sys/dashb+     a40gpu                    24  CANCELLED      0:0
9052.batch        batch                               24  CANCELLED      0:0
9052.extern      extern                               24  CANCELLED      0:0
These jobs were running at the start of the update but were subsequently
canceled because of the slurmd/slurmctld timeouts during the update.
How can I clean this up? I've tried canceling the jobs, but nothing seems
to remove them.
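For what it's worth, to collect the affected job IDs for scancel I've been
pulling them out of the controller log with a one-liner along these lines
(a couple of sample lines are inlined here so the snippet is self-contained;
on the controller you would pipe in the actual slurmctld log, whose path is
site-specific):

```shell
# Sample slurmctld log lines standing in for the real log file.
log_sample='[2023-12-06T07:47:42.010] error: Orphan StepId=9050.extern reported on node amd03
[2023-12-06T07:47:42.010] error: Orphan StepId=9055.extern reported on node amd03
[2023-12-06T07:47:42.011] error: Orphan StepId=9050.extern reported on node amd12'

# Extract the job ID from each "Orphan StepId=<jobid>.extern" line and
# de-duplicate, yielding one job ID per line for scancel.
printf '%s\n' "$log_sample" \
  | sed -n 's/.*Orphan StepId=\([0-9]*\)\..*/\1/p' \
  | sort -un
```

On a real system the output of this pipeline can be fed to scancel
(e.g. `... | xargs -r scancel`), though as noted above that has not made
the remnants go away for me.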
Thanks in advance,
Jeff