[slurm-users] Slurm node weights

David Baker D.J.Baker at soton.ac.uk
Thu Jul 25 13:29:37 UTC 2019


Hi Jeff,


Thank you for these details. So far we have never applied any Slurm fixes ourselves. I suspect the node weights feature is quite important and useful, so it's probably worth my investigating this fix. Could you please advise me on this?


If I use the fix to regenerate the "slurm-slurmd" rpm, can I then stop the slurmctld processes on the servers, re-install the revised rpm and finally restart the slurmctld processes? Most importantly, can this replacement be done on a live system that is running jobs, assuming that we announce the system to be at risk? Or do we need to arrange downtime?


Best regards,

David




________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Sarlo, Jeffrey S <JSarlo at Central.UH.EDU>
Sent: 25 July 2019 13:04
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights


This is the fix if you want to modify the code and rebuild


https://github.com/SchedMD/slurm/commit/f66a2a3e2064

I think 18.08.04 and later have it fixed.

Jeff
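
As a rough illustration of the version check implied above, here is a minimal Python sketch. It assumes the version string has the form "slurm 18.08.3" (for example, as printed by `sinfo --version`), and it takes the 18.08.4 threshold from Jeff's "I think" estimate rather than from any confirmed release note. The function names are hypothetical.

```python
import re

# Assumption: threshold taken from the "18.08.04 and later" estimate in the
# message above; it is not confirmed against the Slurm release notes.
FIXED_IN = (18, 8, 4)


def parse_slurm_version(text):
    """Extract (major, minor, micro) from a string such as 'slurm 18.08.3'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def probably_has_weight_fix(text):
    """Return True if the version is at or beyond the assumed fixed release."""
    return parse_slurm_version(text) >= FIXED_IN


if __name__ == "__main__":
    for sample in ("slurm 18.08.3", "slurm 18.08.7"):
        print(sample, probably_has_weight_fix(sample))
```

If the installed version falls below the threshold, the choice is between rebuilding with the commit above and upgrading to a later 18.08 release.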


________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of David Baker <D.J.Baker at soton.ac.uk>
Sent: Thursday, July 25, 2019 6:53 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights


Hello,


Thank you for the replies. We're running an early version of Slurm 18.08 and it does appear that the node weights are being ignored because of that bug.


We're experimenting with Slurm 19*; however, we don't expect to deploy that new version for quite a while. In the meantime, does anyone know if there is any fix or alternative strategy that might help us to achieve the same result?


Best regards,

David

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Sarlo, Jeffrey S <JSarlo at Central.UH.EDU>
Sent: 25 July 2019 12:26
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights


Which version of Slurm are you running?  I know some of the earlier versions of 18.08 had a bug and node weights were not working.


Jeff


________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of David Baker <D.J.Baker at soton.ac.uk>
Sent: Thursday, July 25, 2019 6:09 AM
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Slurm node weights


Hello,


As an update, I have tried restarting slurmctld; however, that doesn't help.


Best regards,

David

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of David Baker <D.J.Baker at soton.ac.uk>
Sent: 25 July 2019 11:47:35
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: [slurm-users] Slurm node weights


Hello,


I'm experimenting with node weights and I'm very puzzled by what I see. Looking at the documentation, I gathered that jobs will be allocated to the nodes with the lowest weight that satisfy their requirements. I have 3 nodes in a partition and I have defined the nodes like so:


NodeName=orange01 Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN Weight=50
NodeName=orange[02-03] Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN


So, given that the default weight is 1, I would expect jobs to be allocated to orange02 and orange03 first. I find, however, that my test job is always allocated to orange01, which has the higher weight. Have I overlooked something? I would appreciate your advice, please.
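
The expectation described here can be sketched in a few lines of Python. This is only a simplified model of "lowest weight first" among nodes that satisfy the job's requirements, not Slurm's actual selection code; the `Node` class and `expected_order` function are hypothetical names introduced for illustration, and only a CPU count is considered as the requirement.

```python
from dataclasses import dataclass

DEFAULT_WEIGHT = 1  # default weight as stated in the message above


@dataclass
class Node:
    name: str
    cpus: int
    weight: int = DEFAULT_WEIGHT


def expected_order(nodes, cpus_needed):
    """Return names of eligible nodes, lowest weight first (ties by name)."""
    eligible = [n for n in nodes if n.cpus >= cpus_needed]
    return [n.name for n in sorted(eligible, key=lambda n: (n.weight, n.name))]


# Mirrors the three node definitions quoted above.
nodes = [
    Node("orange01", 48, weight=50),
    Node("orange02", 48),
    Node("orange03", 48),
]

if __name__ == "__main__":
    print(expected_order(nodes, cpus_needed=1))
    # prints ['orange02', 'orange03', 'orange01']
```

Under this model orange01 is considered last, which is the opposite of the observed behaviour.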

