[slurm-users] Sharing a node with non-gres and gres jobs
steinbac at mpi-cbg.de
Tue Mar 19 12:31:50 UTC 2019
we are struggling with a Slurm 18.08.5 installation of ours. We are in a
situation where our GPU nodes have a considerable number of cores but
"only" 2 GPUs each. While people run jobs using the GPUs, non-GPU jobs
can enter the node just fine. However, we found out the hard way that
the inverse is not true.
For example, let's say I have a 4-core GPU node called gpu1. A non-GPU job
$ sbatch --wrap="sleep 10 && hostname" -c 3
comes in and starts running on gpu1.
We observed that the job produced by the following command, targeting
the same node:
$ sbatch --wrap="hostname" -c 1 --gres=gpu:1 -w gpu1
will wait indefinitely for available resources until the non-GPU job
has finished. This is not something we want.
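For the record, this is how we looked at what the pending GPU job is
waiting for (the job id is illustrative; these are standard scontrol
and squeue invocations):

```shell
# Why is the GPU job pending? (replace 1234 with the actual job id)
scontrol show job 1234 | grep -E 'JobState|Reason'

# What has the non-GPU job actually been allocated on gpu1?
scontrol show node gpu1 | grep -E 'CPUAlloc|Gres'

# Overview of everything queued or running on gpu1:
# %i=jobid %t=state %C=cpus %b=gres %R=reason/nodelist
squeue -w gpu1 -o '%i %t %C %b %R'
```

The GPU job shows Reason=Resources even though, by our count, one core
and one GPU should still be free on the node.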
The sample gres.conf and slurm.conf from a Docker-based Slurm cluster
in which I was able to reproduce the issue are available here:
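In short, the setup boils down to something like the following
(device paths, node names, and partition names are illustrative, not
our exact files):

```
# gres.conf on gpu1
NodeName=gpu1 Name=gpu File=/dev/nvidia0
NodeName=gpu1 Name=gpu File=/dev/nvidia1

# relevant slurm.conf lines
GresTypes=gpu
SelectType=select/cons_res
SelectTypeParameters=CR_Core
NodeName=gpu1 CPUs=4 Gres=gpu:2 State=UNKNOWN
PartitionName=main Nodes=gpu1 Default=YES State=UP
```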
We are not sure how to handle this situation, as we would like both
jobs to enter the GPU node and run at the same time, to maximize the
utility of our hardware for our users.
Any hints or ideas are highly appreciated.
Thanks for your help,