[slurm-users] SLURM upgrade from 20.11.3 to 20.11.9 misidentification of job steps
John DeSantis
desantis at usf.edu
Thu May 19 21:07:22 UTC 2022
Luke,
>> We ran into a similar issue a while ago (not sure what versions were involved though). Can't guarantee it's relevant, but have a look adding:
>>
>> JobAcctGatherParams=NoShared,UsePSS
Thanks for your response!
I did adjust the values to include those listed above. Unfortunately, the problem remained. Long story short, I upgraded to the latest stable version this morning and the issue appears resolved.
Thanks!
John DeSantis
On 5/19/22 05:41, Luke Sudbery wrote:
> We ran into a similar issue a while ago (not sure what versions were involved though). Can't guarantee it's relevant, but have a look adding:
>
> JobAcctGatherParams=NoShared,UsePSS
>
> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugs.schedmd.com%2Fshow_bug.cgi%3Fid%3D9010%23c16&data=05%7C01%7Cdesantis%40usf.edu%7Ce6f55a62938f4533f18108da397c162e%7C741bf7dee2e546df8d6782607df9deaa%7C0%7C0%7C637885502376090312%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=Ke2a48S60O32Oxi2K2yBshQFkmjkqFU3eXOkRCf1WVM%3D&reserved=0
>
> By default multiple processes using shared memory get counted multiple times - but it's not actually using that much RAM.
>
> Cheers,
>
> Luke
>
> --
> Luke Sudbery
> Principal Engineer (HPC and Storage).
> Architecture, Infrastructure and Systems
> Advanced Research Computing, IT Services
> Room 132, Computer Centre G5, Elms Road
>
> Please note I don’t work on Monday.
>
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of John DeSantis
> Sent: 18 May 2022 15:39
> To: slurm-users at lists.schedmd.com
> Subject: Re: [slurm-users] SLURM upgrade from 20.11.3 to 20.11.9 misidentification of job steps
>
> Hello,
>
> It also appears that random jobs are being identified as using too much memory, despite being well within limits.
>
> For example, a job is running that requested 2048 MB per CPU and all processes are within the limit. But, the job is identified as being over limit when it isn't. Please see below:
>
>> [2022-05-18T10:26:45.020] Job 7724384 exceeded memory limit (13140>2056), cancelling it
>> [2022-05-18T10:26:53.736] Job 7724384 exceeded memory limit (13140>2056), cancelling it
>
>> PID PSR RSS VSZ ELAPSED COMMAND
>> 16893 1 844172 1337540 15:53 vasp
>> 16885 0 842844 1337424 15:53 vasp
>> 16892 14 841216 1337652 15:53 vasp
>> 16888 6 840848 1337540 15:53 vasp
>> 16896 7 840756 1337536 15:53 vasp
>> 16891 12 840664 1337544 15:53 vasp
>> 16898 11 840556 1337540 15:53 vasp
>> 16900 15 840512 1337656 15:53 vasp
>> 16890 10 840500 1337540 15:53 vasp
>> 16889 8 840468 1337536 15:53 vasp
>> 16895 5 840416 1337544 15:53 vasp
>> 16886 2 840356 1337540 15:53 vasp
>> 16887 4 840348 1337540 15:53 vasp
>> 16897 9 840288 1337544 15:53 vasp
>> 16894 3 840128 1337536 15:53 vasp
>> 16899 13 839952 1337544 15:53 vasp
>
>> ps -U USER -o pid,psr,rss,vsz,etime,args h --sort=-rss|awk '{rss+=$3} END {printf("%.2f\n",rss/1024)}'
>> 13140.52
>
> This job isn't hasn't exceeded its memory request on a per CPU basis, yet it has been identified as having done so. Unfortunately, this will cause concern with users because their output logs indicate that the job exceeded its memory limit.
>
> I ran a few test jobs requesting 1K of memory and they were all terminated and identified correctly, so this new behavior is random and frequent.
>
> The commonality between both properly and improperly misidentified jobs is that they are not terminated and their output files log the message from slurmstepd that the memory limit has been exceeded.
>
> Are any other users on the 20.11.9 branch experiencing this issue? Are any users on the 21.08.8 branch experiencing this issue? We use JobAcctGatherParams=OverMemoryKill in our environment to monitor and kill jobs when the physical memory limit has been exceeded.
>
> Thank you,
> John DeSantis
>
> On 5/18/22 09:45, John DeSantis wrote:
>> Hello,
>>
>> Due to the recent CVE posted by Tim, we did upgrade from SLURM 20.11.3 to 20.11.9.
>>
>> Today, I received a ticket from a user with their output files populated with the "slurmstepd: error: Exceeded job memory limit" message. But, the jobs are still running and it seems that the controller is misidentifying the job and/or step ID. Please see below.
>>
>> # slurmd log
>>
>>> [2022-05-18T09:33:31.279] Job 7733409 exceeded memory limit (7973>5120), cancelling it
>>> [2022-05-18T09:33:31.291] debug: _rpc_job_notify, uid = 65536, JobId=7733409
>>> [2022-05-18T09:33:31.291] [7733409.0] debug: Handling REQUEST_STEP_UID
>>> [2022-05-18T09:33:31.300] send notification to StepId=7733409.batch
>>> [2022-05-18T09:33:31.300] [7733409.batch] debug: Handling REQUEST_JOB_NOTIFY
>>> [2022-05-18T09:33:31.302] [7733409.batch] error: Exceeded job memory limit
>>
>> # controller log
>>
>>> [2022-05-18T09:33:31.293] debug2: Processing RPC: REQUEST_CANCEL_JOB_STEP from UID=0
>>> [2022-05-18T09:33:31.293] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7733409+0
>>> [2022-05-18T09:33:31.293] kill_job_step: invalid JobId=4367416
>>> [2022-05-18T09:33:31.293] debug2: slurm_send_timeout: Socket no longer there
>>
>> A restart of the controller doesn't help either, as there are a log of misidentified jobs (truncated):
>>
>>> [2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731668+0
>>> [2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731684+0
>>> [2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731625+0
>>> [2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731634+0
>>> [2022-05-18T09:41:27.128] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731629+0
>>> [2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
>>> [2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
>>> [2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731632+0
>>> [2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724375+0
>>> [2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731650+0
>>> [2022-05-18T09:41:27.129] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7728855+0
>>> [2022-05-18T09:41:27.130] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731681+0
>>> [2022-05-18T09:41:27.130] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7731651+0
>>> [2022-05-18T09:41:27.131] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
>>> [2022-05-18T09:41:27.131] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7728855+0
>>> [2022-05-18T09:41:27.133] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724378+0
>>> [2022-05-18T09:41:27.133] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724380+0
>>> [2022-05-18T09:41:27.134] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=4367416.7724378+0
>>
>> These jobs were started post upgrade, too.
>>
>> Has anyone else seen this?
>>
>> Thank you,
>> John DeSantis
> [EXTERNAL EMAIL] DO NOT CLICK links or attachments unless you recognize the sender and know the content is safe.
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OpenPGP_signature
Type: application/pgp-signature
Size: 840 bytes
Desc: OpenPGP digital signature
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20220519/9bb77ce9/attachment.sig>
More information about the slurm-users
mailing list