Still learning about SLURM, so please forgive me if I ask a naïve question.
I like to use Anders Halager’s gnodes command to visualise the state of our nodes. I’ve noticed lately that we fairly often see things like this (apologies for line wrap):
+- core - 46 cores & 186GB -----------------------------------------+-------------------------------------------------------------------+-------------------------------------------------------------------+
| seskscpn301 0G ___OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO | seskscpn309 172G ..........................................!!!! | seskscpn317 0G OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO |
Now, you can see in this that nodes 301 and 317 are more or less fully loaded. This is great. But 309 is in an interesting state: four overloaded cores, all other cores unused, and plenty of RAM available.
And yet SLURM is not scheduling any more work to that node. Right now there are more than 2000 jobs pending, many of which could run on that node. But SLURM is not scheduling them, and I don’t know why.
One thing I’ve seen cause this is a job trying to use more CPUs than it has been allocated. The cgroup stops this being a real problem of course, but it does cause the load average to go high. Is this what’s causing SLURM to stop sending anything to the node? Is there a configuration change that might help in this situation?
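To make the state concrete, this is roughly what the node record looks like. The values below are a mock-up with illustrative numbers, not a paste from our cluster; on the cluster itself the commands in the comments would show the real figures:

```shell
# On the cluster the real commands would be:
#   scontrol show node seskscpn309          # State, CPUAlloc, CPULoad, Reason
#   squeue --state=PENDING -o '%r' | sort | uniq -c   # why jobs are waiting
# Mock node record (illustrative values only):
cat <<'EOF' > node309.txt
NodeName=seskscpn309 CPUAlloc=4 CPUTot=46 CPULoad=45.80
   State=ALLOCATED RealMemory=190000 AllocMem=8000 FreeMem=176000
EOF

# The fields that matter here: only 4 CPUs allocated, but the load
# average is close to the full core count of the node.
grep -oE '(CPUAlloc|CPUTot|CPULoad)=[0-9.]+' node309.txt
```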
Thanks in advance,
Tim

--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca
Find out more about R&D IT Data, Analytics & AI and how we can support you by visiting our Service Catalogue: https://azcollaboration.sharepoint.com/sites/CMU993
________________________________
AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA.
This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at https://www.astrazeneca.com
You probably want to look at "scontrol show node" and "scontrol show job" for that node and the jobs on it.
Compare those and you may find that someone requested most or all of the resources but is not actually using them properly. Look at the job itself to see what it is trying to do.
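As a rough sketch of that comparison (the record below is illustrative; on a real cluster you would feed in the actual "scontrol show node" output):

```shell
# Illustrative scontrol-style record; substitute real output from:
#   scontrol show node seskscpn309
cat <<'EOF' > node.txt
NodeName=seskscpn309 CPUAlloc=4 CPUTot=46 CPULoad=45.80
EOF

alloc=$(grep -oE 'CPUAlloc=[0-9]+' node.txt | cut -d= -f2)
load=$(grep -oE 'CPULoad=[0-9.]+' node.txt | cut -d= -f2)

# A load average well beyond the allocation suggests a job spawning
# more threads than the CPUs it requested (the cgroup confines them
# to the allocated cores, but the load average still climbs).
if awk -v l="$load" -v a="$alloc" 'BEGIN { exit !(l > 2 * a) }'; then
    echo "possible CPU oversubscription: load=$load alloc=$alloc"
fi
```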
Brian Andrus
On 7/11/2024 7:48 AM, Cutts, Tim via slurm-users wrote: