[slurm-users] Problems with multiple constraints
nathanh at graphcore.ai
Mon Aug 1 10:55:03 UTC 2022
I have what feels like a bug, but I’m keen to validate my configuration first. We tag hosts with features so that we can request specific hosts through job constraints. The configuration looks like:
# COMPUTE NODES
NodeName=hosts-[1-4] RealMemory=500000 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN features=rack17
NodeName=hosts-[5-8] RealMemory=500000 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN features=rack18
NodeName=hosts-[9-12] RealMemory=500000 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN features=rack19
NodeName=hosts-[13-16] RealMemory=500000 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN features=racklr20
We’re looking for a particular job setup where we get four nodes from one rack, then four nodes from any other rack. We use the constraints:
Most of the time the above works, but sometimes a job has landed on four nodes from one rack, three from another and one from a third, which violates our constraints. We took the constraint syntax from here: https://slurm.schedmd.com/srun.html#OPT_Brackets
Can anyone see what the issue might be?
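To make the failure easier to spot in job records, the intended placement rule can be checked mechanically. The following is only a minimal sketch in Python, not part of our setup: the node-to-rack mapping is copied by hand from the NodeName lines above, and the names RACKS, NODE_TO_RACK, rack_counts and is_four_plus_four are hypothetical helpers. It assumes the job's node list has already been expanded to individual hostnames (for example with `scontrol show hostnames`).

```python
from collections import Counter

# Assumption: feature tags and host ranges copied from the NodeName lines above.
RACKS = {
    "rack17": range(1, 5),
    "rack18": range(5, 9),
    "rack19": range(9, 13),
    "racklr20": range(13, 17),
}

# Flatten to a per-host lookup, e.g. "hosts-6" -> "rack18".
NODE_TO_RACK = {
    f"hosts-{i}": rack for rack, ids in RACKS.items() for i in ids
}


def rack_counts(nodes):
    """Count how many of the allocated nodes sit in each rack."""
    return Counter(NODE_TO_RACK[n] for n in nodes)


def is_four_plus_four(nodes):
    """True only if the allocation is exactly 4 nodes in each of 2 racks."""
    return sorted(rack_counts(nodes).values()) == [4, 4]


# An allocation that satisfies the rule: hosts-1..4 and hosts-5..8.
good = [f"hosts-{i}" for i in range(1, 9)]

# The kind of split we have seen: 4 + 3 + 1 across three racks.
bad = [f"hosts-{i}" for i in (1, 2, 3, 4, 5, 6, 7, 9)]

print(is_four_plus_four(good))
print(is_four_plus_four(bad), dict(rack_counts(bad)))
```

Running a check like this over the node lists of completed jobs would show how often the 4 + 3 + 1 split occurs and which racks are involved.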