Thanks. I traced it to a MaxMemPerCPU=16384 setting on the pubgpu partition.
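For anyone finding this thread later, here is a rough sketch of why that setting could keep the job pending. It assumes, as I understand the slurm.conf documentation, that when a job's memory per CPU exceeds MaxMemPerCPU, Slurm raises the job's CPU count until the ratio fits. It also assumes the CORES column quoted below for leo means allocated/total. The helper name cpus_needed is made up for illustration and is not part of Slurm.

```python
import math

def cpus_needed(mem_mb: int, requested_cpus: int, max_mem_per_cpu_mb: int) -> int:
    """CPUs required so that memory per CPU stays within the partition limit.

    Hypothetical helper: it mirrors the documented idea that the CPU count
    is raised when mem/cpu exceeds MaxMemPerCPU, not Slurm's actual code.
    """
    return max(requested_cpus, math.ceil(mem_mb / max_mem_per_cpu_mb))

job_mem_mb = 400 * 1024       # mem=400G from the pending job
job_cpus = 5                  # cpu=5 from the pending job
max_mem_per_cpu_mb = 16384    # MaxMemPerCPU on the pubgpu partition

needed = cpus_needed(job_mem_mb, job_cpus, max_mem_per_cpu_mb)

# Assumption: leo's CORES column (48 of 64) is allocated/total, so 16 are free.
free_cores_on_leo = 64 - 48

print(needed, free_cores_on_leo, needed <= free_cores_on_leo)
```

Under those assumptions the job effectively needs 25 CPUs on pubgpu rather than 5, which is more than the 16 cores that appear to be free on leo.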
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Tue, 9 Jul 2024 2:39pm, Timony, Mick wrote:
Hi Paul,
There could be multiple reasons why the job isn't running, from the user's QOS to your cluster hitting MaxJobCount. This page might help:
https://slurm.schedmd.com/high_throughput.html
The output of the following command might help:
scontrol show job 4650727
Regards
Mick Timony
Senior DevOps Engineer
Harvard Medical School
From: Paul Raines via slurm-users <slurm-users@lists.schedmd.com>
Sent: Tuesday, July 9, 2024 9:24 AM
To: slurm-users <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Job submitted to multiple partitions not running when any partition is full
I have a job 4650727 submitted to multiple partitions (rtx6000,rtx8000,pubgpu):
JOBID    PARTITION  PENDING  PRIORITY    TRES_ALLOC|REASON
4650727  rtx6000    47970    0.00367972  cpu=5,mem=400G,node=1,gpu=1|Priority
4650727  rtx8000    47970    0.00367972  cpu=5,mem=400G,node=1,gpu=1|Priority
4650727  pubgpu     47970    0.00367972  cpu=5,mem=400G,node=1,gpu=1|Priority
4646926  rtx6000    487048   0.00121987  cpu=10,mem=32G,node=1,gpu=1|Priority,Resources
4650186  rtx8000    56979    0.00000000  cpu=4,mem=10G,node=1,gpu=1|Priority,Resources
The two partitions rtx6000 and rtx8000 are full, and two other jobs are at the top of the queue waiting to run on them. But partition pubgpu is NOT full, and the node leo has the resources to run job 4650727:
HOST  PARTITION  CORES  MEMORY         GPUS
leo   pubgpu     48/64  12288/1030994  0/1
leo   pubcpu     48/64  12288/1030994  0/1
The node leo is NOT part of the rtx6000 or rtx8000 partitions, and there are no other pending jobs waiting on either the pubgpu or pubcpu partition that leo is part of.
So why is 4650727 not running on the pubgpu partition?
Paul Raines
http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street
Charlestown, MA 02129 USA
The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Mass General Brigham Compliance HelpLine at https://www.massgeneralbrigham.org/complianceline. Please note that this e-mail is not secure (encrypted). If you do not wish to continue communication over unencrypted e-mail, please notify the sender of this message immediately. Continuing to send or respond to e-mail after receiving this message means you understand and accept this risk and wish to continue to communicate over unencrypted e-mail.