[slurm-users] Weird scheduling behaviour

Badorreck, Holger <H.Badorreck at lzh.de>
Tue May 23 12:38:21 UTC 2023


Hello,
I am observing weird behaviour with my Slurm installation (23.02.2). Some jobs take several hours to be scheduled (apparently on one specific node); the pending-state reason is "Resources", although resources are free.
I have experimented a bit and can reproduce the behaviour with salloc:
"salloc --ntasks=4 --mem-per-cpu=3500M --gres=gpu:1" is waiting for resources, while
"salloc --ntasks=4 --mem-per-cpu=3700M --gres=gpu:1" is scheduled directly (while the command above is still waiting)

I have already restarted the slurmd daemon on that node as well as slurmctld, but this did not change the behaviour.

This is the node configuration:

NodeName=node6 NodeHostname=cluster-node6 Port=17002 CPUs=64 RealMemory=254000 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:a10:3 Weight=2 State=UNKNOWN

Gres.conf:
AutoDetect=off
Name=gpu Type=a10       File=/dev/nvidia0
Name=gpu Type=a10       File=/dev/nvidia1
Name=gpu Type=a10       File=/dev/nvidia2
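
For completeness, the GRES configuration as the node daemon actually sees it can be dumped on the node itself (assuming slurmd is in the PATH); it should list the three a10 GPUs:

  slurmd -G
  scontrol show node node6 | grep -i gres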

What could be the issue here?

Regards,
Holger