[slurm-users] Limit the number of GPUS per user per partition
Theis, Thomas
Thomas.Theis at Teledyne.com
Thu Apr 23 17:15:49 UTC 2020
Hi everyone,
First message here. I am trying to find a good way, or multiple ways, to limit the number of jobs per node, or the number of GPUs in use per node, without blocking a user from submitting jobs.
Example: we have 10 nodes, each with 4 GPUs, in a partition. We allow a team of 6 people to submit jobs to any or all of the nodes. With one job per GPU, the partition can run a total of 40 jobs concurrently.
At the moment, each user usually submits 50-100 jobs at once, taking up all of the GPUs, so every other user has to wait in pending.
What I am trying to set up is to allow all users to submit as many jobs as they wish, but have each of them run on only 1 of the 4 GPUs per node, or on some number out of the 40 GPUs across the entire partition. We are using Slurm 18.08.3.
This is roughly our Slurm script:
#SBATCH --job-name=Name # Job name
#SBATCH --mem=5gb # Job memory request
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=PART1
#SBATCH --time=200:00:00 # Time limit hrs:min:sec
#SBATCH --output=job_%j.log # Standard output and error log
#SBATCH --nodes=1
#SBATCH --qos=high
# Run the container on the GPU that Slurm assigned to this job
srun -n1 --gres=gpu:1 --exclusive --export=ALL bash -c "NV_GPU=$SLURM_JOB_GPUS nvidia-docker run --rm -e SLURM_JOB_ID=$SLURM_JOB_ID -e SLURM_OUTPUT=$SLURM_OUTPUT --name $SLURM_JOB_ID do_job.sh"
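For what it's worth, the direction I have been wondering about is a per-user GPU cap enforced through the QOS (the script above already uses a QOS named high). Below is a minimal sketch of what I mean, assuming accounting/associations are enabled and GPUs are tracked as a TRES; the cap of 10 GPUs is only an illustrative number:

# slurm.conf: track GPUs as a TRES so limits on gres/gpu can be enforced
AccountingStorageTRES=gres/gpu

# Cap how many GPUs a single user can have allocated at once under this QOS
sacctmgr modify qos high set MaxTRESPerUser=gres/gpu=10

Would that be the right mechanism to keep one user from filling all 40 GPUs in the partition, or is there a better way to express this kind of limit?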
Thomas Theis