I have a bash script that grabs current statistics from sinfo to ship into a time series database to use for Grafana dashboards.
We recently began using shards with our GPUs, and I’m seeing some unexpected behavior with the data reported from sinfo.
$ sinfo -h -O "NodeHost:5 ,GresUsed:100 ,Gres:100" | grep gpu03
gpu03 gpu:p40:0(IDX:N/A),gpu:rtx:0(IDX:N/A),shard:p40:1(IDX:1),shard:rtx:0(IDX:N/A) gpu:p40:6(S:0),gpu:rtx:2(S:0),shard:p40:36(S:0),shard:rtx:12(S:0)
$ scontrol show node gpu03
NodeName=gpu03 Arch=x86_64 CoresPerSocket=22
   CPUAlloc=52 CPUEfctv=88 CPUTot=88 CPULoad=43.67
   AvailableFeatures=avx,avx512,largeMemory,matlab
   ActiveFeatures=avx,avx512,largeMemory,matlab
   Gres=gpu:p40:6(S:0),gpu:rtx:2(S:0),shard:p40:36(S:0),shard:rtx:12(S:0)
   NodeAddr=gpu03 NodeHostName=gpu03 Version=22.05.8
   OS=Linux 5.4.0-164-generic #181-Ubuntu SMP Fri Sep 1 13:41:22 UTC 2023
   RealMemory=768000 AllocMem=425984 FreeMem=116850 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu
   BootTime=2024-02-02T14:17:07 SlurmdStartTime=2024-02-05T12:33:50
   LastBusyTime=2024-02-07T09:48:50
   CfgTRES=cpu=88,mem=750G,billing=88,gres/gpu=8,gres/gpu:p40=6,gres/gpu:rtx=2,gres/shard=48
   AllocTRES=cpu=52,mem=416G,gres/shard=21
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Specifically, there are 48 shards in total on this node: 36 of type p40 and 12 of type rtx. From scontrol, I can see that 21 shards are in use, and from nvidia-smi I can see that the breakdown of shards per GPU index is:
[0] p40: 4/6
[1] p40: 6/6
[2] rtx: 4/6
[3] rtx: 3/6
[4] p40: 4/6
[5] p40: 0/6
[6] p40: 0/6
[7] p40: 0/6
So, I can tell that what “GresUsed” reports for shards is actually the number of entire GPUs’ worth of shards used, along with the index of each fully consumed GPU. What it doesn’t show is the number of individual gres/shard actually in use; in other words, I can’t derive from it that there are 21 shards in use. Right now my script says 1 shard out of 48 is used, because I’m scraping for the number of shards in use, which actually reports the number of entire GPUs consumed by shards, and anything less than 100% is rounded down to 0.
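To make the mismatch concrete, here is a minimal sketch of the kind of scraping described above, run against the two sinfo fields for gpu03. The helper name sum_gres is hypothetical, and the sketch assumes every entry is typed (name:type:count followed by a parenthesized suffix), as in the output shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum the counts for one GRES name (e.g. "shard")
# in a sinfo Gres or GresUsed field.
# Assumes typed entries of the form name:type:count(...), comma-separated.
sum_gres() {
    local name="$1" field="$2"
    # One entry per line; keep entries whose first colon-separated part
    # matches the name, strip the "(...)" suffix from the count, and add up.
    printf '%s\n' "$field" | tr ',' '\n' | awk -F: -v name="$name" '
        $1 == name { sub(/\(.*/, "", $3); total += $3 }
        END { print total + 0 }'
}

# Field values copied from the sinfo output for gpu03.
gres_used='gpu:p40:0(IDX:N/A),gpu:rtx:0(IDX:N/A),shard:p40:1(IDX:1),shard:rtx:0(IDX:N/A)'
gres_cfg='gpu:p40:6(S:0),gpu:rtx:2(S:0),shard:p40:36(S:0),shard:rtx:12(S:0)'

echo "shards used:  $(sum_gres shard "$gres_used")"   # prints 1, not 21
echo "shards total: $(sum_gres shard "$gres_cfg")"    # prints 48
```

The total comes out right because Gres counts individual shards, while the used figure is only 1 because GresUsed counts shards in the way described above.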
So, given that I’m on 22.05.8, maybe it’s better in 23.02.X, which I’m hoping to move to within the next month. However, I can’t seem to find anything in the release notes for 23.02 or 23.11 that would imply that sinfo reports (my) expected value of the actual count of shards used.
Does anyone have any ideas for how I might achieve what I’m looking for using sinfo? Or should I instead use sinfo’s JSON output, which has a “tres_used” field that doesn’t appear to be accessible outside of the JSON output?
{
  "architecture": "x86_64",
  "burstbuffer_network_address": "",
  "boards": 1,
  "boot_time": 1706901428,
  "comment": "",
  "cores": 22,
  "cpu_binding": 0,
  [SNIP]
  "tres": "cpu=88,mem=750G,billing=88,gres/gpu=8,gres/gpu:p40=6,gres/gpu:rtx=2,gres/shard=48",
  "slurmd_version": "22.05.8",
  "alloc_memory": 284380,
  "alloc_cpus": 62,
  "idle_cpus": 26,
  "tres_used": "cpu=62,mem=284380M,gres/gpu=7,gres/gpu:p40=5,gres/gpu:rtx=2,gres/shard=17",
  "tres_weighted": 62
},
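If the JSON route is the way to go, the remaining work is pulling one key out of a TRES string. The following is only a sketch of that step: tres_value is a hypothetical helper name, and it assumes the string is a plain comma-separated list of key=value pairs, as in the "tres" and "tres_used" values above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one key from a comma-separated
# TRES string such as "cpu=62,mem=284380M,gres/shard=17".
# Prints 0 when the key is absent (an assumption that suits a usage metric).
tres_value() {
    local key="$1" tres="$2"
    # One key=value pair per line; print the value of the first exact match.
    printf '%s\n' "$tres" | tr ',' '\n' | awk -F= -v key="$key" '
        $1 == key { print $2; found = 1; exit }
        END { if (!found) print 0 }'
}

# Strings copied from the JSON output above.
tres='cpu=88,mem=750G,billing=88,gres/gpu=8,gres/gpu:p40=6,gres/gpu:rtx=2,gres/shard=48'
tres_used='cpu=62,mem=284380M,gres/gpu=7,gres/gpu:p40=5,gres/gpu:rtx=2,gres/shard=17'

echo "shards used:  $(tres_value gres/shard "$tres_used")"   # prints 17
echo "shards total: $(tres_value gres/shard "$tres")"        # prints 48
```

The same helper would also work on the AllocTRES and CfgTRES strings from scontrol. Getting the strings out of the JSON could presumably be done with something like sinfo --json | jq -r '.nodes[] | [.name, .tres_used] | @tsv', but the top-level nodes array and the name field are assumptions about the layout, since they fall outside the snippet above.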
Any ideas appreciated.

Thanks,
Reed