Hi Paul,

There could be multiple reasons why the job isn't running, from the user's QOS to your cluster hitting MaxJobCount. This page might help:

https://slurm.schedmd.com/high_throughput.html

The output of the following command might help:

scontrol show job 4650727
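
A few related commands may also be worth running alongside that one. This is only a sketch: it assumes the job ID is 4650727 (the ID shown in the queue listing in the original message), that the standard Slurm client tools are on the PATH, and that sprio is usable on your cluster (it needs the multifactor priority plugin). The function name diagnose_job is just a placeholder.

```shell
# Sketch only: JOBID is taken from the queue listing in the original message.
JOBID=4650727

diagnose_job() {
    # Full job record, including JobState, Reason, Partition and requested TRES
    scontrol show job "$1"
    # Job id, partition(s), state and pending reason
    squeue -j "$1" -o "%i %P %T %r"
    # Breakdown of the priority factors for the pending job
    sprio -j "$1"
    # Limits and state of the partition that appears to have free resources
    scontrol show partition pubgpu
}

# Only run where the Slurm client tools are installed
if command -v scontrol >/dev/null 2>&1; then
    diagnose_job "$JOBID"
fi
```

The Reason field in the scontrol output and the partition limits are probably the first things to compare.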

Regards
-- 
Mick Timony
Senior DevOps Engineer
Harvard Medical School
--


From: Paul Raines via slurm-users <slurm-users@lists.schedmd.com>
Sent: Tuesday, July 9, 2024 9:24 AM
To: slurm-users <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Job submitted to multiple partitions not running when any partition is full
 

I have a job 4650727 submitted to multiple partitions (rtx6000,rtx8000,pubgpu):

   JOBID PARTITION  PENDING PRIORITY   TRES_ALLOC|REASON
4650727 rtx6000      47970 0.00367972 cpu=5,mem=400G,node=1,gpu=1|Priority
4650727 rtx8000      47970 0.00367972 cpu=5,mem=400G,node=1,gpu=1|Priority
4650727 pubgpu       47970 0.00367972 cpu=5,mem=400G,node=1,gpu=1|Priority
4646926 rtx6000     487048 0.00121987 cpu=10,mem=32G,node=1,gpu=1|Priority,Resources
4650186 rtx8000      56979 0.00000000 cpu=4,mem=10G,node=1,gpu=1|Priority,Resources

We see that the two partitions rtx6000 and rtx8000 are full, and two other
jobs are at the top of the queue waiting to run on those.  But partition
pubgpu is NOT full, and you can see here a node leo with enough free
resources to run the 4650727 job:

HOST       PARTITION         CORES       MEMORY       GPUS
leo        pubgpu           48/ 64    12288/1030994   0/ 1
leo        pubcpu           48/ 64    12288/1030994   0/ 1
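
As a rough check of that claim, the numbers from the two listings can be compared directly. This is a sketch under stated assumptions: the CORES, MEMORY and GPUS columns are read as used/total, memory is taken to be in MB, and mem=400G is treated as 400 * 1024 MB. The variable names are only illustrative.

```shell
# Request copied from the queue listing: cpu=5,mem=400G,gpu=1
req_cpu=5
req_mem_mb=$((400 * 1024))   # assumption: 400G means 400 * 1024 MB
req_gpu=1

# Node leo, assuming each column is used/total and memory is in MB
free_cpu=$((64 - 48))
free_mem_mb=$((1030994 - 12288))
free_gpu=$((1 - 0))

if [ "$req_cpu" -le "$free_cpu" ] &&
   [ "$req_mem_mb" -le "$free_mem_mb" ] &&
   [ "$req_gpu" -le "$free_gpu" ]; then
    fits=yes
else
    fits=no
fi

echo "free: cpu=$free_cpu mem=${free_mem_mb}MB gpu=$free_gpu -> fits=$fits"
# prints: free: cpu=16 mem=1018706MB gpu=1 -> fits=yes
```

Under those assumptions the request fits on leo with room to spare, so a plain lack of resources on pubgpu does not appear to explain the pending state.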

The node leo is NOT part of the rtx6000 or rtx8000 partitions, and
there are no other pending jobs waiting on either the pubgpu or
pubcpu partition that leo is part of.

So why is 4650727 not running on the pubgpu partition?

---------------------------------------------------------------
Paul Raines                     http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street     Charlestown, MA 02129            USA




