[slurm-users] Slurm node weights
Sarlo, Jeffrey S
JSarlo at Central.UH.EDU
Thu Jul 25 14:36:50 UTC 2019
I think it would be the slurm-slurmctld rpm.
I'm not sure about the timing of updating and restarting. We noticed the issue while we were testing 18.08.01, so we didn't have any users/jobs at the time and just modified the code and rebuilt.
Jeff
From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of David Baker
Sent: Thursday, July 25, 2019 8:30 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Slurm node weights
Hi Jeff,
Thank you for these details. So far we have never applied any Slurm fixes. I suspect the node weights feature is quite important and useful, and it's probably worth my investigating this fix. Could you please advise me on this?
If I use the fix to regenerate the "slurm-slurmd" rpm can I then stop the slurmctld processes on the servers, re-install the revised rpm and finally restart the slurmctld processes? Most importantly, can this replacement/fix be done on a live system that is running jobs, etc? That's assuming that we regard/announce the system to be at risk. Or alternatively, do we need to arrange downtime, etc?
Best regards,
David
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Sarlo, Jeffrey S <JSarlo at Central.UH.EDU>
Sent: 25 July 2019 13:04
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights
This is the fix if you want to modify the code and rebuild
https://github.com/SchedMD/slurm/commit/f66a2a3e2064
I think 18.08.04 and later have it fixed.
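A quick way to compare a running version against that threshold could look like the sketch below. The helper name `version_at_least` is hypothetical, it relies on GNU `sort -V`, and the 18.08.4 cutoff is only the unconfirmed guess stated above.

```shell
# Hypothetical helper: succeed if version $1 is at least version $2.
# Relies on GNU sort's -V (version sort) option.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Typical use on a cluster (assumes scontrol is on the PATH and prints
# something like "slurm 18.08.1"; the 18.08.4 threshold is an assumption):
#   ver=$(scontrol --version | awk '{print $2}')
#   version_at_least "$ver" 18.08.4 && echo "weight fix should be present"
```

If the check fails, the options discussed in this thread are to upgrade or to apply the commit and rebuild.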
Jeff
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of David Baker <D.J.Baker at soton.ac.uk>
Sent: Thursday, July 25, 2019 6:53 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights
Hello,
Thank you for the replies. We're running an early version of Slurm 18.08, and it does appear that the node weights are being ignored because of the bug.
We're experimenting with Slurm 19*; however, we don't expect to deploy that new version for quite a while. In the meantime, does anyone know if there is any fix or alternative strategy that might help us to achieve the same result?
Best regards,
David
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Sarlo, Jeffrey S <JSarlo at Central.UH.EDU>
Sent: 25 July 2019 12:26
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights
Which version of Slurm are you running? I know some of the earlier versions of 18.08 had a bug and node weights were not working.
Jeff
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of David Baker <D.J.Baker at soton.ac.uk>
Sent: Thursday, July 25, 2019 6:09 AM
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights
Hello,
As an update, I have tried restarting slurmctld, but that doesn't help.
Best regards,
David
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of David Baker <D.J.Baker at soton.ac.uk>
Sent: 25 July 2019 11:47:35
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: [slurm-users] Slurm node weights
Hello,
I'm experimenting with node weights and I'm very puzzled by what I see. Looking at the documentation, I gathered that jobs will be allocated to the nodes with the lowest weight that satisfy their requirements. I have 3 nodes in a partition and I have defined the nodes like so:
NodeName=orange01 Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN Weight=50
NodeName=orange[02-03] Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN
So, given that the default weight is 1, I would expect jobs to be allocated to orange02 and orange03 first. I find, however, that my test job is always allocated to orange01, the node with the higher weight. Have I overlooked something? I would appreciate your advice, please.
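To see which weights the controller has actually loaded, and the order in which nodes ought to be preferred, something like the following sketch may help. The helper name `by_weight` is hypothetical; it only sorts "node weight" pairs, and the commented usage assumes `sinfo` is on the PATH (its `%w` format field prints the scheduling weight).

```shell
# Hypothetical helper: sort "node weight" lines so that the nodes Slurm
# should prefer (lowest weight) come first, with ties ordered by node name.
by_weight() {
    sort -k2,2n -k1,1
}

# Typical use on a cluster:
#   sinfo -N -h -o "%N %w" | by_weight
# -N gives one line per node, -h suppresses the header,
# %N is the node name and %w is the node's scheduling weight.
```

With the definitions above, orange02 and orange03 would be expected at the top of that list and orange01 at the bottom.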