[slurm-users] Why does Slurm kill one particular user's jobs after a few seconds?

Thu Apr 15 07:07:24 UTC 2021

Hi Thomas,

I wonder if your problem is related to the one reported in this list thread?
https://lists.schedmd.com/pipermail/slurm-users/2021-April/007107.html

You could try restarting the slurmctld service, and also make sure your
configuration (slurm.conf etc.) has been pushed correctly to the slurmd nodes.
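
In case it helps, here is a minimal sketch of how the configuration check
could be done by comparing file digests. The helper names, the
/etc/slurm/slurm.conf path and the systemd unit name are assumptions on my
part, so adjust them to your site:

```shell
#!/bin/sh
# Hypothetical helpers for checking that two copies of slurm.conf match.

# Print only the SHA-256 digest of the file given as $1.
conf_digest() {
    sha256sum "$1" | awk '{print $1}'
}

# Return 0 if both files have the same digest, non-zero otherwise.
same_conf() {
    [ "$(conf_digest "$1")" = "$(conf_digest "$2")" ]
}

# Example use on the controller (node name and paths are placeholders):
#   scp <node>:/etc/slurm/slurm.conf /tmp/slurm.conf.node
#   same_conf /etc/slurm/slurm.conf /tmp/slurm.conf.node \
#       || echo "slurm.conf differs on <node>"
#   sudo systemctl restart slurmctld   # assumes a systemd unit named slurmctld
```

If the digests differ on any node, push the configuration again before
restarting the daemons.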

/Ole

On 4/14/21 9:53 AM, Thomas Arildsen wrote:
> Oh, and I forgot to mention that we are using Slurm version 20.11.3.
> Best,
> 
> Thomas
> 
> On Wed, 14 Apr 2021 at 09:23 +0200, Thomas Arildsen wrote:
>> I administer a Slurm cluster with many users, and operation of the
>> cluster currently appears "totally normal" for all users except one.
>> Every command this user tries to run through Slurm is killed after
>> 20-25 seconds (I think the cause is another job rather than the
>> elapsed time; see further down).
>> The following minimal example reproduces the error:
>>
>>      $ sudo -u <the_user> srun --pty sleep 25
>>      srun: job 110962 queued and waiting for resources
>>      srun: job 110962 has been allocated resources
>>      srun: Force Terminated job 110962
>>      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>      slurmstepd: error: *** STEP 110962.0 ON <node> CANCELLED AT 2021-04-09T16:33:35 ***
>>      srun: error: <node>: task 0: Terminated
>>
>> When this happens, I find this line in the slurmctld log:
>>
>>      _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=110962 uid <the_users_uid>
>>
>> It only happens for '<the_user>' and not for any other user that I know
>> of. This very similar but shorter-running example works fine:
>>
>>      $ sudo -u <the_user> srun --pty sleep 20
>>      srun: job 110963 queued and waiting for resources
>>      srun: job 110963 has been allocated resources
>>
>> Note that when I run srun --pty sleep 20 as myself, srun does not
>> output the two srun: job... lines. This seems to me to be an additional
>> indication that srun is subject to some different settings for
>> '<the_user>'.
>> All settings that I have been able to inspect appear identical for
>> '<the_user>' and for other users. I have checked, and 'MaxWall' is not
>> set for this user or for any other user. Other users belonging to the
>> same Slurm account do not experience this problem.
>>
>> When this unfortunate user's jobs get allocated, I see messages like
>> this in '/var/log/slurm/slurmctld.log':
>>
>>      sched: _slurm_rpc_allocate_resources JobId=111855 NodeList=<node>
>>
>> and shortly after, I see this message:
>>
>>      select/cons_tres: common_job_test: no job_resources info for JobId=110722_* rc=0
>>
>> Job 110722_* is an array job by another user that is pending due
>> to 'QOSMaxGRESPerUser'. One pending task of this array job (110722_57)
>> eventually ends up taking over job 111855's CPU cores when 111855 gets
>> killed. This leads me to believe that 110722_57 causes 111855 to be
>> killed. However, 110722_57 remains pending afterwards.
>> Some of the things I fail to understand here are:
>>    - Why does a pending job kill another job, yet remain pending
>> afterwards?
>>    - Why does the pending job even have privileges to kill another job
>> in the first place?
>>    - Why does this only affect '<the_user>'s jobs but not those of other
>> users?
>>
>> None of this is intended to happen. I am guessing it must be caused by
>> some settings specific to '<the_user>', but I cannot figure out what
>> they are and they are not supposed to be like this. If these are
>> settings we admins somehow caused, it was unintended.
>>
>> NB: some details have been anonymized as <something> above.
>>
>> I hope someone has a clue what is going on here. Thanks in advance,
>>
>> Thomas Arildsen
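
The REQUEST_KILL_JOB line quoted above records the uid that asked for the
kill, so it may be worth listing every such request in the controller log.
A minimal sketch, assuming the log line format quoted above with a numeric
uid; the function name kill_requests is made up for illustration:

```shell
#!/bin/sh
# List "JobId uid" pairs for every kill request found in a slurmctld log.
# Assumes lines shaped like the one quoted above, e.g.:
#   _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=110962 uid 1234
# Any leading timestamp on the line is ignored.
kill_requests() {
    grep 'REQUEST_KILL_JOB' "$1" |
        sed -n 's/.*JobId=\([^ ]*\) uid \([0-9][0-9]*\).*/\1 \2/p'
}
```

For example, `kill_requests /var/log/slurm/slurmctld.log` prints one
"JobId uid" pair per request, and `getent passwd <uid>` maps a uid back to
a user name.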