[slurm-users] GPUs as resources which SLURM can control

Renfro, Michael Renfro at tntech.edu
Thu Mar 21 01:53:39 UTC 2019


I think all you’re looking for is Generic Resource (GRES) scheduling, documented starting at https://slurm.schedmd.com/gres.html. If you’ve already seen that, then more details would be helpful.
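As a rough sketch (not taken from your cluster), GRES for a 4-GPU node on 17.11 usually needs entries in two files. The node name "gpunode01" and the CPUs value are placeholders, and /dev/nvidia[0-3] assumes the standard NVIDIA device files. The cons_res lines matter: with the default select/linear plugin, each job is allocated the whole node, so only one single-GPU job would run at a time. After changing these files, slurmctld and slurmd typically need a restart.

```
# --- slurm.conf (same copy on controller and nodes) ---
GresTypes=gpu
# Allocate cores rather than whole nodes, so several jobs can share the GPU node
SelectType=select/cons_res
SelectTypeParameters=CR_Core
# Placeholder node line: keep your existing parameters and add Gres=gpu:4
NodeName=gpunode01 CPUs=32 Gres=gpu:4 State=UNKNOWN

# --- gres.conf (on gpunode01) ---
# Map the 4 "gpu" GRES to the four NVIDIA device files
Name=gpu File=/dev/nvidia[0-3]
```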

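If everything is configured correctly, then ‘sbatch --gres=gpu:1 scriptname’ (a plain ‘--gres=gpu’ also defaults to a count of 1) should run up to 4 of those jobs at once on that node and leave the rest pending (PD) until a GPU frees up.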
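A minimal batch script along those lines might look like the following. This assumes GRES is set up as above. The file name one_gpu.sh and the application name my_gpu_app are hypothetical. The #SBATCH lines are ordinary comments to bash, so the script also runs outside Slurm, where it simply reports no GPU.

```shell
#!/bin/bash
# one_gpu.sh: hypothetical batch script for a single-GPU application.
#SBATCH --job-name=one-gpu
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1

# Inside an allocation, Slurm normally exports CUDA_VISIBLE_DEVICES for the
# GPU it assigned to this job; outside Slurm it is unset, so fall back to "none".
gpu="${CUDA_VISIBLE_DEVICES:-none}"
echo "job ${SLURM_JOB_ID:-local} using GPU(s): ${gpu}"

# Launch the real single-GPU program here (hypothetical name):
# ./my_gpu_app
```

You could submit six copies with ‘for i in 1 2 3 4 5 6; do sbatch one_gpu.sh; done’ or as a job array with ‘sbatch --array=1-6 one_gpu.sh’. ‘squeue’ should then show four jobs running (R) and two pending (PD) until a GPU is released.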

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601     / Tennessee Tech University

> On Mar 20, 2019, at 6:05 PM, Nicholas Yue <yue.nicholas at gmail.com> wrote:
> 
> Hi,
> 
>   I am new to SLURM.
> 
>   I have access to a cluster where one of the nodes has 4 GPUs.
> 
>   We are running SLURM version 17.11.12.
> 
>   Is there an SBATCH token=value pair I can use to submit jobs (each running an application that can only use 1 GPU) so that if I submit 6 copies, 4 are dispatched and the remaining 2 stay in a state such as PD until a GPU frees up?
> 
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |===============================+======================+======================|
> |   0  Tesla P100-PCIE...  On   | 00000000:25:00.0 Off |                    0 |
> | N/A   29C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   1  Tesla P100-PCIE...  On   | 00000000:59:00.0 Off |                    0 |
> | N/A   26C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   2  Tesla P100-PCIE...  On   | 00000000:6D:00.0 Off |                    0 |
> | N/A   27C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   3  Tesla P100-PCIE...  On   | 00000000:99:00.0 Off |                    0 |
> | N/A   31C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> 
> 
> Cheers
> -- 
> Nicholas Yue
> Graphics - Arnold, Alembic, RenderMan, OpenGL, HDF5
> Custom Dev - C++ porting, OSX, Linux, Windows
> http://au.linkedin.com/in/nicholasyue
> https://vimeo.com/channels/naiadtools


