[slurm-users] Issues with orphaned jobs after update

Jeffrey McDonald jmcdonal at umn.edu
Thu Dec 7 15:37:26 UTC 2023


Hi,

As an update, I was able to clear out the orphaned/cancelled jobs by
rebooting the compute nodes that had cancelled jobs. The error messages
have ceased.
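
For anyone who hits the same thing and wants to drive the reboots through
Slurm itself, it would be something along these lines (the node list here
is illustrative; substitute whichever nodes are reporting orphaned steps):

# scontrol reboot nextstate=resume reason="clear orphaned steps" amd03,amd07,gpu05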

Regards,
Jeff

On Wed, Dec 6, 2023 at 8:26 AM Jeffrey McDonald <jmcdonal at umn.edu> wrote:

> Hi,
> Yesterday, an upgrade to Slurm from 22.05.4 to 23.11.0 went sideways and I
> ended up losing a number of jobs on the compute nodes. Ultimately, the
> installation seems to have been successful, but it appears I now have some
> lingering job remnants. About once per minute (per job), the slurmctld
> daemon is logging:
>
> [2023-12-06T08:16:32.505] error: slurm_receive_msg [146.57.133.18:39104]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:32.505] error: slurm_receive_msg [146.57.133.18:39106]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:32.792] error: slurm_receive_msg [146.57.133.38:54722]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:34.189] error: slurm_receive_msg [146.57.133.49:59058]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:34.197] error: slurm_receive_msg [146.57.133.49:58232]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:35.757] error: slurm_receive_msg [146.57.133.39:48856]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:35.757] error: slurm_receive_msg [146.57.133.39:48860]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:36.329] error: slurm_receive_msg [146.57.133.46:50848]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:59.827] error: slurm_receive_msg [146.57.133.14:60328]: Zero Bytes were transmitted or received
> [2023-12-06T08:16:59.828] error: slurm_receive_msg [146.57.133.37:37734]: Zero Bytes were transmitted or received
> [2023-12-06T08:17:03.285] error: slurm_receive_msg [146.57.133.35:41426]: Zero Bytes were transmitted or received
> [2023-12-06T08:17:13.244] error: slurm_receive_msg [146.57.133.105:34416]: Zero Bytes were transmitted or received
> [2023-12-06T08:17:13.726] error: slurm_receive_msg [146.57.133.15:60164]: Zero Bytes were transmitted or received
>
> The controller also shows orphaned jobs:
>
> [2023-12-06T07:47:42.010] error: Orphan StepId=9050.extern reported on node amd03
> [2023-12-06T07:47:42.010] error: Orphan StepId=9055.extern reported on node amd03
> [2023-12-06T07:47:42.011] error: Orphan StepId=8862.extern reported on node amd12
> [2023-12-06T07:47:42.011] error: Orphan StepId=9065.extern reported on node amd07
> [2023-12-06T07:47:42.011] error: Orphan StepId=9066.extern reported on node amd07
> [2023-12-06T07:47:42.011] error: Orphan StepId=8987.extern reported on node amd09
> [2023-12-06T07:47:42.012] error: Orphan StepId=9068.extern reported on node amd08
> [2023-12-06T07:47:42.012] error: Orphan StepId=8862.extern reported on node amd13
> [2023-12-06T07:47:42.012] error: Orphan StepId=8774.extern reported on node amd10
> [2023-12-06T07:47:42.012] error: Orphan StepId=9051.extern reported on node amd10
> [2023-12-06T07:49:22.009] error: Orphan StepId=9071.extern reported on node aslab01
> [2023-12-06T07:49:22.010] error: Orphan StepId=8699.extern reported on node gpu05
>
>
> On the compute nodes, I see a corresponding error message like this one:
>
> [2023-12-06T08:18:03.292] [9052.extern] error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
> [2023-12-06T08:18:03.292] [9052.extern] error: slurm_send_node_msg: hash_g_compute: REQUEST_STEP_COMPLETE has error
>
> The errors always seem to reference a job that was canceled, e.g.
> 9052:
>
> # sacct -j 9052
> JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 9052         sys/dashb+     a40gpu                    24  CANCELLED      0:0
> 9052.batch        batch                               24  CANCELLED      0:0
> 9052.extern      extern                               24  CANCELLED      0:0
>
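> The other step ids from the orphan messages show the same CANCELLED state.
> Something like this should list all of the affected jobs (start time set
> to just before the upgrade):
>
> # sacct -X --state=CANCELLED --starttime=2023-12-05
>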
> These jobs were running at the start of the update but were subsequently
> canceled because of the slurmd or slurmctld timeouts during the update.
> How can I clean this up? I've tried canceling the jobs, but nothing seems
> to remove them.
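>
> For reference, my attempts were variations on scancel against the dead job
> ids, e.g. (exact flags from memory):
>
> # scancel 9052
> # scancel --full --signal=KILL 9052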
>
> Thanks in advance,
> Jeff
>
>