[slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

Robert Kudyba rkudyba at fordham.edu
Wed Dec 2 15:19:01 UTC 2020


>
> I've been having the same issue with BCM: CentOS 8.2, BCM 9.0, Slurm 20.02.3. It
> seems to have started when I enabled proctrack/cgroup and changed
> select/linear to select/cons_tres.
>
Our slurm.conf has the same setting:
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
SchedulerTimeSlice=60
EnforcePartLimits=YES

We enabled MPS too. Not sure if that's relevant.
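
To compare what the running controller actually reports against the files, a small filter over the output of `scontrol show config` can help. This is only a sketch: the function name kill_related_settings and the choice of keys are my own, and it assumes the usual "Key = value" layout of that output.

```shell
#!/bin/bash
# Keep only the settings discussed in this thread from
# `scontrol show config` output read on stdin.
# The function name and key list are illustrative choices.
kill_related_settings() {
    grep -E '^(UnkillableStepTimeout|UnkillableStepProgram|KillWait|ProctrackType|SelectType|SelectTypeParameters)[[:space:]]*='
}

# Typical use on a node with the Slurm client tools installed:
#   scontrol show config | kill_related_settings
```

Running it on the head node and on an affected compute node would show whether both see the same values.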


> Are you using cgroup process tracking and have you manipulated the
> cgroup.conf file?
>
Here's what we have in ours:
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=no
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
TaskAffinity=no
ConstrainCores=no
ConstrainRAMSpace=no
ConstrainSwapSpace=no
ConstrainDevices=no
ConstrainKmemSpace=yes
AllowedRamSpace=100
AllowedSwapSpace=0
MinKmemSpace=30
MaxKmemPercent=100
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30

> Do jobs complete correctly when not cancelled?


Yes, they do, and canceling doesn't always result in a node draining.

So would this be a Slurm issue or a Bright issue? I'm telling users to add
'sleep 60' as the last line of their sbatch scripts.
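
For illustration, a job script with that workaround might look like the sketch below. The job name, the resource request, and the run_workload function are hypothetical placeholders, and the SLURM_JOB_ID guard is only there so that nothing runs when the file is sourced outside of Slurm. The one part taken from this thread is the final 60-second sleep.

```shell
#!/bin/bash
#SBATCH --job-name=sleep-demo   # hypothetical job name
#SBATCH --ntasks=1              # hypothetical resource request

# Hypothetical stand-in for the real workload; replace with your own command.
run_workload() {
    echo "workload done"
}

# Run the workload, then idle before the batch script exits.
# The 60-second default mirrors the 'sleep 60' workaround above;
# FINAL_SLEEP is an assumed override for quick dry runs.
main() {
    run_workload
    sleep "${FINAL_SLEEP:-60}"
}

# Run automatically only inside a Slurm job (sbatch sets SLURM_JOB_ID).
if [ -n "${SLURM_JOB_ID:-}" ]; then
    main
fi
```

A plain script with the workload command followed by a literal `sleep 60` on the last line is equivalent; the function wrapper just makes it easy to try without waiting.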