How do you guys track which GPU is used by which job ?


Sylvain MARET

16 Oct 2024

6:10 a.m.

Hey guys!

I'm looking to improve GPU monitoring on our cluster. I want to install https://github.com/NVIDIA/dcgm-exporter and saw in the README that it can track job IDs: https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job-...

However, I haven't been able to find any examples of how to do it, and Slurm doesn't seem to expose this information by default. Does anyone here do this? If so, do you have any examples I could follow? If you have advice on best practices for GPU monitoring, I'd be happy to hear it!

Regards, Sylvain Maret


Brian Andrus

16 Oct

9:03 a.m.

Looks like there is a step you would need to do to create the required job mapping files:

/The DCGM-exporter can include High-Performance Computing (HPC) job information into its metric labels. To achieve this, HPC environment administrators must configure their HPC environment to generate files that map GPUs to HPC jobs./

It does go on to show the conventions/format of the files.

I imagine you could add a few lines to a prolog script that create those files when the job starts on the node, and then point dcgm-exporter at that directory.

Brian Andrus
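A minimal sketch of what such a prolog fragment might look like. It assumes the file convention described in the dcgm-exporter README is one file per GPU, named after the GPU index and holding one job ID per line, and that the job's GPU indices arrive as a comma-separated list. The function names and the directory used in the usage note are hypothetical; check the README for the exact option that points dcgm-exporter at the mapping directory.

```shell
#!/bin/sh
# Hypothetical helpers for a Slurm prolog/epilog pair; names and paths are illustrative.

# add_job_mapping DIR JOBID GPULIST
# Append JOBID to one file per GPU index in the comma-separated GPULIST.
# The body runs in a subshell so the IFS and set -f changes stay local.
add_job_mapping() (
    dir=$1 jobid=$2 gpulist=$3
    mkdir -p "$dir" || exit 1
    set -f          # no filename expansion while splitting the list
    IFS=','
    for gpu in $gpulist; do
        echo "$jobid" >> "$dir/$gpu"
    done
)

# remove_job_mapping DIR JOBID GPULIST
# Remove JOBID from each GPU file again (epilog side), keeping other jobs' lines.
remove_job_mapping() (
    dir=$1 jobid=$2 gpulist=$3
    set -f
    IFS=','
    for gpu in $gpulist; do
        [ -f "$dir/$gpu" ] || continue
        # grep exits non-zero when no lines remain; that is acceptable here
        grep -vxF -- "$jobid" "$dir/$gpu" > "$dir/$gpu.tmp" || true
        mv "$dir/$gpu.tmp" "$dir/$gpu"
    done
)
```

In a prolog this might be called as `add_job_mapping /run/dcgm-job-map "$SLURM_JOB_ID" "$SLURM_JOB_GPUS"`, with the matching `remove_job_mapping` call in the epilog. Appending rather than overwriting keeps the files usable when several jobs share a GPU.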


Sylvain MARET

17 Oct

7:32 a.m.

I've started testing this in a prolog, and you're right! Before going any further I wanted to see whether there was an established best practice.

Regards, Sylvain Maret


Pierre-Antoine Schnell

1:45 a.m.

New subject: [EXTERN] How do you guys track which GPU is used by which job ?

Hello,

we recently started monitoring GPU usage with NVIDIA's DCGM: https://developer.nvidia.com/blog/job-statistics-nvidia-data-center-gpu-mana...

We create a new dcgmi group for each job and start the statistics retrieval for it in a prolog script.

Then we stop the retrieval, save the dcgmi verbose stats output and delete the dcgmi group in an epilog script.

The output presents JobID, GPU IDs, runtime, energy consumed, and SM utilization, among other things.

We load the relevant data into a database and hope to use the analysis to advise our users on better practices.

Best wishes, Pierre-Antoine Schnell


-- Pierre-Antoine Schnell Medizinische Universität Wien IT-Dienste & Strategisches Informationsmanagement Enterprise Technology & Infrastructure High Performance Computing 1090 Wien, Spitalgasse 23 Bauteil 88, Ebene 00, Büro 611 +43 1 40160-21304 pierre-antoine.schnell@meduniwien.ac.at
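For the step of getting the saved report into a database, a small sketch of one way to flatten it first. It assumes the saved dcgmi output is laid out as pipe-delimited `| name | value |` rows between border lines, which is an assumption about the format and should be verified against a real report. The function name `rows_to_tsv` is hypothetical.

```shell
#!/bin/sh
# rows_to_tsv: read a report on stdin and print "name<TAB>value" for every
# two-column "| name | value |" row. Border lines and single-column header
# rows are skipped. Assumes values do not themselves contain a "|".
rows_to_tsv() {
    awk -F'|' '
        NF == 4 {
            name = $2; value = $3
            # trim surrounding blanks from both cells
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name != "" && value != "") printf "%s\t%s\n", name, value
        }'
}
```

The tab-separated output can then be fed to whatever loader the database provides, for example by redirecting a saved report into `rows_to_tsv` from the epilog.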

Sylvain MARET

7:33 a.m.


Interesting solution; I didn't know this was possible. I'll try testing it as well!

Sylvain


-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com

Paul Raines

7:41 a.m.


We do the same thing. Our prolog has

==============
# setup DCGMI job stats
if [ -n "$CUDA_VISIBLE_DEVICES" ] ; then
  if [ -d /var/slurm/gpu_stats.run ] ; then
    if pgrep -f nv-hostengine >/dev/null 2>&1 ; then

      groupstr=$(/usr/bin/dcgmi group -c J$SLURM_JOB_ID -a $CUDA_VISIBLE_DEVICES)
      groupid=$(echo $groupstr | awk '{print $10}')

      /usr/bin/dcgmi stats -e
      /usr/bin/dcgmi stats -g $groupid -s $SLURM_JOB_ID

      echo $groupid > /var/slurm/gpu_stats.run/J$SLURM_JOB_ID
    fi
  fi
fi
======================

And our epilog has

======================
if [ -n "$CUDA_VISIBLE_DEVICES" ] ; then
  if [ -f /var/slurm/gpu_stats.run/J$SLURM_JOB_ID ] ; then
    if pgrep -f nv-hostengine >/dev/null 2>&1 ; then

      groupid=$(cat /var/slurm/gpu_stats.run/J$SLURM_JOB_ID)

      /usr/bin/dcgmi stats -v -j $SLURM_JOBID > /var/slurm/gpu_stats/$SLURM_JOBID
      if [ $? -eq 0 ] ; then
        /bin/rsync -a /var/slurm/gpu_stats/$SLURM_JOBID /cluster/batch/GPU/
        /bin/rm -rf /tmp/gpuprocess.out
        # put the data in MYSQL database with perl script
        /cluster/batch/ADMIN/SCRIPTS/gpuprocess.pl $SLURM_JOB_ID > /tmp/gpuprocess.out 2>&1
        if [ -s /tmp/gpuprocess.out ] ; then
          cat /tmp/gpuprocess.out | mail -s GPU_stat_process_error alert@nmr.mgh.harvard.edu
        fi
      fi

      /usr/bin/dcgmi stats -x $SLURM_JOBID

      /usr/bin/dcgmi group -d $groupid

      /bin/rm /var/slurm/gpu_stats.run/J$SLURM_JOB_ID
    fi
  fi
fi
=======================

We also have a cron job on each GPU node that runs every 10 minutes, queries dcgmi stats, and writes snapshot data for each GPU to the MySQL database.

If you are on RHEL-based boxes, the RPM you need from the NVIDIA repos is datacenter-gpu-manager.
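The prolog above picks the group ID out of the `dcgmi group -c` output by word position (`awk '{print $10}'`). A slightly more defensive sketch is shown here, assuming the tool reports the new group in a sentence containing `group ID of N`, which is what the `$10` position suggests. The function name `parse_group_id` is hypothetical.

```shell
#!/bin/sh
# parse_group_id TEXT
# Print the number that follows "group ID of" in TEXT.
# Returns non-zero and prints nothing if no such number is found.
# The body runs in a subshell so the helper variable stays local.
parse_group_id() (
    id=$(printf '%s\n' "$1" \
        | sed -n 's/.*group ID of \([0-9][0-9]*\).*/\1/p' \
        | head -n 1)
    [ -n "$id" ] && printf '%s\n' "$id"
)
```

A prolog could then call `groupid=$(parse_group_id "$groupstr") || exit 0` and skip the stats setup cleanly when group creation failed, instead of passing an empty or wrong ID to later commands.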



Markus Kötter

11:27 p.m.


Hi,

As their example was limited to "allgpus", I posted my take on this on the NVIDIA developer blog.

It is basically the same approach, but it looks up the group ID from the dcgmi group JSON output using jp instead of storing it in a file.

https://developer.nvidia.com/blog/job-statistics-nvidia-data-center-gpu-mana...

prolog

...

group=$(sudo -u $SLURM_JOB_USER dcgmi group -c j$SLURM_JOB_ID)
if [ $? -eq 0 ]; then
  groupid=$(echo $group | awk '{print $10}')
  sudo -u $SLURM_JOB_USER dcgmi group --group $groupid --add $SLURM_JOB_GPUS
  sudo -u $SLURM_JOB_USER dcgmi stats --group $groupid --enable
  sudo -u $SLURM_JOB_USER dcgmi stats --group $groupid --jstart $SLURM_JOBID
fi

epilog

...

OUTPUTDIR=/tmp/
sudo -u $SLURM_JOB_USER dcgmi stats --jstop $SLURM_JOBID
sudo -u $SLURM_JOB_USER dcgmi stats --verbose --job $SLURM_JOBID | sudo -u $SLURM_JOB_USER tee $OUTPUTDIR/dcgm-gpu-stats-$HOSTNAME-$SLURM_JOBID.out

groupid=$(sudo -u $SLURM_JOB_USER dcgmi group -l --json | jp "body.Groups.children.[*][0][?children.\"Group Name\".value=='j$SLURM_JOBID'].children.\"Group ID\".value | [0]" | sed s/\"//g)

sudo -u $SLURM_JOB_USER dcgmi group --delete $groupid

Best regards

-- Markus Kötter, +49 681 870832434 30159 Hannover, Lange Laube 6 Helmholtz Center for Information Security
