SLURM GRES reservation not working properly on 24.05.1
Hello,

*Issue 1:*
I am using Slurm version 24.05.1. My slurmd runs on a single node, to which I attach multiple GRES by enabling the oversubscribe feature. I can get advance reservation of a GRES to work only by GRES name (tres=gres/gpu:SYSTEM12). That is, during the reservation period, if another user submits a job that names SYSTEM12, Slurm places the job in the queue:

    user1@host$ srun --gres=gpu:SYSTEM12:1 hostname
    srun: job 333 queued and waiting for resources

But when another user submits a job without any system name, the job runs on that GRES immediately, even though it is reserved:

    user1@host$ srun --gres=gpu:1 hostname
    mylinux.wbi.com

I can also see GresUsed reported as busy in "scontrol show node -d", which means the job is running on the GRES/GPU and not just on CPUs.

In the same way, a job submitted by Feature ("rev1" in my case) also goes through, even though the feature is reserved for other users in a multi-partition Slurm setup.

Snippet of the slurm.conf file:

    NodeName=cluster01 NodeAddr=cluster Port=6002 CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 Feature="rev1" Gres=gpu:SYSTEM12:1 RealMemory=64171 State=IDLE

*Issue 2:*
During execution, Slurm prints some extra error lines in the srun output:

    user1@host$ srun --gres=gpu:1 hostname
    srun: error: extract_net_cred: net_cred not provided
    srun: error: Malformed RPC of type RESPONSE_NODE_ALIAS_ADDRS(3017) received
    srun: error: slurm_unpack_received_msg: [mylinux.wbi.com]:41242] Header lengths are longer than data received
    mylinux.wbi.com

Regards,
MS
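For anyone trying to reproduce Issue 1, the checks described above can be collected into a short script. This is only a sketch under stated assumptions: the reservation name `gres_resv` is hypothetical (the original post does not give one), and the node name `cluster01` is taken from the slurm.conf snippet. The `run` helper is an illustrative wrapper, not part of Slurm. It echoes each command and executes it only when the binary is installed.

```shell
#!/bin/sh
# Hypothetical reservation name (not from the original post); override via env.
RESV_NAME="${RESV_NAME:-gres_resv}"
# Node name as it appears in the slurm.conf snippet.
NODE="${NODE:-cluster01}"

# Echo the command, then run it only if its binary is on PATH.
run() {
    echo "+ $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    fi
}

# What the reservation actually covers (nodes, TRES, users).
run scontrol show reservation "$RESV_NAME"

# Detailed node view, including the GresUsed field mentioned above.
run scontrol -d show node "$NODE"

# Jobs currently associated with the reservation.
run squeue --reservation="$RESV_NAME"
```

Comparing the TRES field of the reservation with GresUsed on the node while an untyped `--gres=gpu:1` job is running should show whether the job landed on the reserved GRES.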
Would appreciate any leads on the above query. Thanks in advance.

On Fri, 20 Sept 2024 at 14:31, Minulakshmi S <minulakshmi.s@gmail.com> wrote:
> Hello,
> [...]