[slurm-users] Nodes randomly set to drain state with cgroup error and reason "batch job complete failure"

Martin Hamant martin.hamant at cc.in2p3.fr
Thu Apr 14 12:49:30 UTC 2022


Hi,

We maintain a cluster of about 250 nodes running Slurm version 21.08.6. The output of "scontrol show config" is attached in the paste below.
 
Here is what we observed about the issue:
- The affected job's batch script doesn't start at all; the job terminates immediately (we guess because the initial cgroup setup fails).
- It happens about once a day.
- It can put several nodes into DRAIN state at the same time. When that happens, tracing back the related job IDs always leads to one single user, though it is not the same user each time the issue occurs.
- It happens for both array and regular jobs.
- Looking at the worker node's log, we don't find any obvious correlation with other jobs that start or complete at the same time.
- We also cannot find an obvious correlation with the CPU load of the nodes or the controller.

However, we were able to capture the error with slurmd running in debug mode. Please see the following log paste:

https://postit.hadoly.fr/?ed77c43716cefbc8#3a4Ukb9aFut93gDYZW6JpJJdgYzCCBYizUeAgh6G5rfT

I also attached it as a text file (if the list allows it) to this post for future reference.

The drain reason set by Slurm on the node is "batch job complete failure".

The cgroup error is:

  error: common_cgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/cpuset/slurm/uid_43197/job_4302857' : No such file or directory
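
For reference, here is a minimal sketch (not Slurm code; the path is simply the one from the error above) of the check we run on a drained node to see which component of that cgroup path is actually missing:

  # Minimal sketch (not part of Slurm): walk the path from the error message
  # and report the first component that does not exist on the node.
  import os

  # Path copied from the slurmd error above.
  path = "/sys/fs/cgroup/cpuset/slurm/uid_43197/job_4302857"

  current = "/"
  for part in path.strip("/").split("/"):
      current = os.path.join(current, part)
      if not os.path.isdir(current):
          print("first missing component:", current)
          break
  else:
      print("full path exists")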

The log entry "sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status:0" reports either error 4020 or 4014.

From the user's perspective, the job is shown as COMPLETED (which seems counterintuitive).

Our current reading of the logs makes us think that some directory entries are randomly missing at the time the cgroup setup happens.
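
To illustrate what we mean (a hypothetical sketch, not Slurm's actual code): if the per-user directory disappears, for example through cleanup of the same user's previous job, between the moment slurmd considers it present and the moment it creates the per-job directory, the mkdir fails with the same "No such file or directory" error:

  # Hypothetical illustration only (directory names mimic the slurmd error,
  # the logic is not Slurm's): creating the job-level directory fails with
  # ENOENT once the user-level parent has been removed underneath it.
  import os, shutil, tempfile

  base = tempfile.mkdtemp()                    # stands in for .../cpuset/slurm
  uid_dir = os.path.join(base, "uid_43197")
  os.makedirs(uid_dir)                         # user directory exists at first

  shutil.rmtree(uid_dir)                       # concurrent cleanup removes it

  try:
      os.mkdir(os.path.join(uid_dir, "job_4302857"))  # job-level mkdir
  except FileNotFoundError as exc:
      print("No such file or directory:", exc)
  finally:
      shutil.rmtree(base, ignore_errors=True)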

Do you think this could be a bug in Slurm (maybe a race condition)? If so, we could go ahead and file a bug report.

It looks a bit like https://bugs.schedmd.com/show_bug.cgi?id=13136, except that we are not using multiple-slurmd and the cgroup error in our case concerns '/sys/fs/cgroup/cpuset', not '/sys/fs/cgroup/memory'.

Thank you for reading; any input is welcome!

Martin
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: issue_cgroup_slurm_users.txt
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20220414/eaca325b/attachment-0001.txt>