[slurm-users] Sharing a node with non-gres and gres jobs
steinbac at mpi-cbg.de
Tue Mar 19 12:31:50 UTC 2019
we are struggling with a Slurm 18.08.5 installation of ours. We are in a
situation where our GPU nodes have a considerable number of cores but
"only" 2 GPUs each. While people run jobs using the GPUs, non-GPU jobs
can enter the node just fine. However, we found out the hard way that
the inverse is not true.
For example, let's say I have a 4-core GPU node called gpu1. A non-GPU job
$ sbatch --wrap="sleep 10 && hostname" -c 3
comes in and starts running on gpu1.
We observed that the job produced by the following command, targeting
the same node:
$ sbatch --wrap="hostname" -c 1 --gres=gpu:1 -w gpu1
will wait indefinitely for available resources until the non-GPU job
has finished. This is not something we want.
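For the record, this is how we looked at what the pending GPU job is
waiting for (the job id is illustrative; these are standard scontrol
and squeue invocations):

```shell
# Why is the GPU job pending? (replace 1234 with the actual job id)
scontrol show job 1234 | grep -E 'JobState|Reason'

# What has the non-GPU job actually been allocated on gpu1?
scontrol show node gpu1 | grep -E 'CPUAlloc|Gres'

# Overview of everything queued or running on gpu1:
# %i=jobid %t=state %C=cpus %b=gres %R=reason/nodelist
squeue -w gpu1 -o '%i %t %C %b %R'
```

The GPU job shows Reason=Resources even though, by our count, one core
and one GPU should still be free on the node.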
The sample gres.conf and slurm.conf from a Docker-based Slurm cluster
in which I was able to reproduce the issue are available here:
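In short, the setup boils down to something like the following
(device paths, node names, and partition names are illustrative, not
our exact files):

```
# gres.conf on gpu1
NodeName=gpu1 Name=gpu File=/dev/nvidia0
NodeName=gpu1 Name=gpu File=/dev/nvidia1

# relevant slurm.conf lines
GresTypes=gpu
SelectType=select/cons_res
SelectTypeParameters=CR_Core
NodeName=gpu1 CPUs=4 Gres=gpu:2 State=UNKNOWN
PartitionName=main Nodes=gpu1 Default=YES State=UP
```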
We are not sure how to handle this situation, as we would like both
jobs to enter the GPU node and run at the same time, to maximize the
utility of our hardware for our users.
Any hints or ideas are highly appreciated.
Thanks for your help,