Hi Jason,
We use the Slurm tool "pestat" (Processor Element status) available from [1] for all kinds of cluster monitoring, including GPU usage. An example usage is:
$ pestat -G -p a100
GPU GRES (Generic Resource) is printed after each JobID
Print only nodes in partition a100
Hostname   Partition  Node  Num_CPU  CPUload  Memsize  Freemem  GRES/node    Joblist
                      State Use/Tot  (15min)     (MB)     (MB)               JobID(JobArrayID) User GRES/job ...
sd651      a100+      mix    38 128    4.01*   512000   407942  gpu:A100:4   8467943 user1 gpu:a100=1  8480327 user1 gpu:a100=1  8480325 user1 gpu:a100=1  8488029 user2 gpu:a100=1
sd652      a100+      mix    98 128    4.00*   512000   275860  gpu:A100:4   8467942 user1 gpu:a100=1  8488442 user2 gpu:a100=1  8489252 user2 gpu:a100=1  8489253 user2 gpu:a100=1
sd653      a100       mix     8 128    4.00    512000   487001  gpu:A100:4   8480330 user1 gpu:a100=1  8480329 user1 gpu:a100=1  8480328 user1 gpu:a100=1  8480326 user1 gpu:a100=1
sd654      a100       mix    38 128    4.05*   512000   365431  gpu:A100:4   8496110 user3 gpu:a100=1  8480331 user1 gpu:a100=1  8480332 user1 gpu:a100=1  8480333 user1 gpu:a100=1
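In case it helps, here is a rough install sketch (the target directory is just my choice; the README in [1] has the authoritative steps). pestat is a single bash script, so roughly:

$ git clone https://github.com/OleHolmNielsen/Slurm_tools.git
$ cp Slurm_tools/pestat/pestat /usr/local/bin/   # or any other directory in your PATH
$ pestat -G -p a100                              # the GPU GRES view shown above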
If you want to find out the GPU usage of a specific job, the "psjob" command from [2] is really handy. An example output is:
$ psjob 8496110
  JOBID PARTITION NODES TASKS  USER           START_TIME     TIME TIME_LIMIT TRES_ALLOC
8496110      a100     1    32 user3 2025-04-02T03:43:49  12:22:40 2-00:00:00 cpu=32,mem=112000M,node=1,billing=128,gres/gpu=1,gres/gpu:a100=1
NODELIST: sd654
====================================================
Process list from 'ps' on each node in the job:
--------------- sd654 ---------------
    PID NLWP S USER    STARTED     TIME %CPU     RSS COMMAND
 603039    1 S user3  03:43:49 00:00:00  0.0    4304 /bin/bash /var/spool/slurmd/job8496110/slurm_sc
 603061    1 S user3  03:43:50 00:00:00  0.0    4152 bash config.sh
 603064    5 R user3  03:43:50 12:03:54 97.4 5907868 /home/cat/user3/MACE/mace_env/bin/python3 /ho
Total: 3 processes and 7 threads
Uptime:  16:06:30 up 12 days, 2:08, 0 users, load average: 4.28, 4.09, 4.02
====================================================
Nodes in this job with GPU Generic Resources (Gres):
sd654 gpu:A100:4
Running GPU tasks:
Node:  GPU GPU-type               | Temp  GPU% |   Mem / Tot      | user:process/PID(Mem)
sd654: [2] NVIDIA A100-SXM4-40GB  | 33°C, 41 % | 24060 / 40960 MB | user3:python3/603064(24050M)
====================================================
Scratch disk usage for JobID 8496110:
Node:  Usage  Scratch folder
sd654: 8.0K   /scratch/8496110
Scratch disks on JobID 8496110 compute nodes:
Node:  Size  Used  Avail  Use%  Mounted on
sd654: 1.7T   12G  1.7T     1%  /scratch
====================================================
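To get a job ID to feed to psjob, plain squeue is enough. A quick sketch (the partition name is of course site-specific):

$ squeue -p a100 -t RUNNING -o "%.10A %.9u %.12b %.10M"   # JobID, user, requested GRES, run time
$ psjob <JobID>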
The psjob command's prerequisites are listed in the README.md file in [2], namely the "gpustat" and "ClusterShell" tools.
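If you don't already have those installed, a rough sketch (assuming pip installs are acceptable at your site; distro packages exist as well):

$ pip install gpustat        # per-GPU utilization/memory, needed on the GPU nodes
$ pip install ClusterShell   # provides clush for running commands across the job's nodes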
Best regards,
Ole
[1] https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
[2] https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs
On 4/2/25 15:17, Jason Simms via slurm-users wrote:
Apologies for the basic question, but is there a straightforward, best-accepted method for using Slurm to report on which GPUs are currently in use? I've done some searching and people recommend all sorts of methods, including parsing the output of nvidia-smi (seems inefficient, especially across multiple GPU nodes), as well as using other tools such as Grafana, XDMoD, etc.
We do track GPUs as a resource, so I'd expect I could get at the info with sreport or something like that, but before trying to craft my own from scratch, I'm hoping someone has something working already. Ultimately I'd like to see either which cards are available by node, or the reverse (which are in use by node). I know recent versions of Slurm supposedly added tighter integration in some way with NVIDIA cards, but I can't seem to find definitive docs on what, exactly, changed or what is now possible as a result.