[slurm-users] SLURM upgrade from 20.11.3 to 20.11.9 misidentification of job steps

John DeSantis desantis at usf.edu
Wed May 18 13:45:00 UTC 2022


Hello,

Due to the recent CVE posted by Tim, we upgraded from SLURM 20.11.3 to 20.11.9.

Today, I received a ticket from a user whose output files are populated with the "slurmstepd: error: Exceeded job memory limit" message.  However, the jobs are still running, and it seems that the controller is misidentifying the job and/or step ID.  Please see below.

# slurmd log

> [2022-05-18T09:33:31.279] Job 7733409 exceeded memory limit (7973>5120), cancelling it
> [2022-05-18T09:33:31.291] debug:  _rpc_job_notify, uid = 65536, JobId=7733409
> [2022-05-18T09:33:31.291] [7733409.0] debug:  Handling REQUEST_STEP_UID
> [2022-05-18T09:33:31.300] send notification to StepId=7733409.batch
> [2022-05-18T09:33:31.300] [7733409.batch] debug:  Handling REQUEST_JOB_NOTIFY
> [2022-05-18T09:33:31.302] [7733409.batch] error: Exceeded job memory limit

# controller log

> [2022-05-18T09:33:31.293] debug2: Processing RPC: REQUEST_CANCEL_JOB_STEP from UID=0
> [2022-05-18T09:33:31.293] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7733409+0
> [2022-05-18T09:33:31.293] kill_job_step: invalid JobId=4367416
> [2022-05-18T09:33:31.293] debug2: slurm_send_timeout: Socket no longer there
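
To make the mismatch concrete, the numbers can be pulled out of the slurmd line and the controller line and compared. This is only a rough sketch: the regular expressions are guessed from the two excerpts above rather than taken from any official Slurm log grammar, and the names (SLURMD_RE, CTLD_RE, compare) are hypothetical. The trailing "+0" is captured but otherwise treated as an opaque suffix.

```python
import re

# Patterns guessed from the log excerpts above; NOT an official Slurm format.
SLURMD_RE = re.compile(r"Job (\d+) exceeded memory limit")
CTLD_RE = re.compile(r"REQUEST_CANCEL_JOB_STEP StepId=(\d+)\.(\d+)\+(\d+)")

# Lines copied from the slurmd and controller logs in this message.
slurmd_line = (
    "[2022-05-18T09:33:31.279] Job 7733409 exceeded memory limit "
    "(7973>5120), cancelling it"
)
ctld_line = (
    "[2022-05-18T09:33:31.293] STEPS: Processing RPC details: "
    "REQUEST_CANCEL_JOB_STEP StepId=4367416.7733409+0"
)


def compare(slurmd_line, ctld_line):
    """Return the IDs seen on each side and whether the slurmd job ID
    shows up in the second (step) field of the controller's StepId."""
    job = int(SLURMD_RE.search(slurmd_line).group(1))
    match = CTLD_RE.search(ctld_line)
    first, second = int(match.group(1)), int(match.group(2))
    return {
        "slurmd_job": job,
        "ctld_first_field": first,
        "ctld_second_field": second,
        "job_in_second_field": job == second,
    }


result = compare(slurmd_line, ctld_line)
print(result)
```

For the pair of lines above, the job ID reported by slurmd (7733409) appears after the dot in the controller's StepId, while the number before the dot (4367416) is the one the controller rejects as an invalid JobId.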

A restart of the controller doesn't help either, as there are a lot of misidentified jobs (truncated):

> [2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731668+0
> [2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731684+0
> [2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731625+0
> [2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731634+0
> [2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731629+0
> [2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
> [2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
> [2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731632+0
> [2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724375+0
> [2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731650+0
> [2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7728855+0
> [2022-05-18T09:41:27.130] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731681+0
> [2022-05-18T09:41:27.130] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731651+0
> [2022-05-18T09:41:27.131] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
> [2022-05-18T09:41:27.131] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7728855+0
> [2022-05-18T09:41:27.133] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724378+0
> [2022-05-18T09:41:27.133] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
> [2022-05-18T09:41:27.134] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724378+0
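
For anyone wanting to check their own controller log for the same pattern, a small tally of the two numeric fields in these lines may help. This is a sketch under assumptions: the regular expression is inferred from the excerpt above, the function name tally is hypothetical, and the sample holds only four of the lines quoted in this message.

```python
import re
from collections import Counter

# Pattern inferred from the controller log excerpt; NOT an official format.
STEP_RE = re.compile(r"REQUEST_CANCEL_JOB_STEP StepId=(\d+)\.(\d+)\+\d+")


def tally(lines):
    """Count how often each value appears before and after the dot."""
    first_field = Counter()
    second_field = Counter()
    for line in lines:
        match = STEP_RE.search(line)
        if match is None:
            continue  # not a REQUEST_CANCEL_JOB_STEP detail line
        first_field[match.group(1)] += 1
        second_field[match.group(2)] += 1
    return first_field, second_field


# Four lines copied from the controller log above.
sample = [
    "[2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731668+0",
    "[2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0",
    "[2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0",
    "[2022-05-18T09:41:27.133] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724378+0",
]

first_field, second_field = tally(sample)
print("before the dot:", dict(first_field))
print("after the dot: ", dict(second_field))
```

In this sample every line carries the same value before the dot (4367416), while the values after the dot vary and repeat, which is the pattern visible in the full excerpt.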

These jobs were started after the upgrade, too.

Has anyone else seen this?

Thank you,
John DeSantis

