[slurm-users] number of nodes varies for no reason?

Marcus Wagner wagner at itc.rwth-aachen.de
Fri Mar 29 12:14:32 UTC 2019


Hi Noam,

if you set the RealMemory parameter for your nodes, Slurm will drain
any node that reports less memory than configured. Thus:
1. you would have seen much earlier that something was wrong with the node
2. no job would have been scheduled on that node, since it would have
been drained
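
For example, a minimal slurm.conf sketch (node names, core counts and
memory sizes below are invented; use the values slurmd itself reports,
e.g. via "slurmd -C", for your own hardware):

    # slurm.conf -- example values only
    # RealMemory is in megabytes; leave a little headroom below the
    # physical total so normal OS overhead does not drain the node
    NodeName=node[01-16] Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=190000

A node that registers with less memory than its configured RealMemory
is then set to drain (with a reason like "Low RealMemory"), which shows
up in "sinfo -R" or "scontrol show node <name>".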


best
Marcus

On 3/28/19 2:32 PM, Noam Bernstein wrote:
>> On Mar 27, 2019, at 9:32 PM, Chris Samuel <chris at csamuel.org> wrote:
>>
>> On 27/3/19 2:43 pm, Noam Bernstein wrote:
>>
>>> Hi fellow slurm users - I’ve been using slurm happily for a few months, but now I feel like it’s gone crazy, and I’m wondering if anyone can explain what’s going on.  I have a trivial batch script which I submit multiple times, and which ends up with different numbers of nodes allocated.  Does anyone have any idea why?
>> You would need to share the output of "scontrol show nodes" to get an idea of what resources Slurm thinks each node has.
> Thanks for the pointer.  I believe this revealed the problem.  Systematically going over the "scontrol show nodes" output showed that while the number of cores was the same on each node, the memory was not, because one node had a badly socketed DIMM.  Even though I wasn’t explicitly requesting memory, the partition defaults to x * total_mem/n_cores per task, where x =~ 0.9, so it must have realized that the node was short of memory.  I fixed the underlying memory issue, and now I can no longer reproduce the weird behavior - now it always gets 2 nodes.
>
> 									Noam
>
>
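
PS: to put numbers to Noam's explanation (values invented, just to
illustrate the mechanism): with 192 GB, 32-core nodes and a per-task
default of about 0.9 * 192 GB / 32 cores = 5.4 GB, a healthy node fits
all 32 tasks (32 * 5.4 GB = 172.8 GB <= 192 GB). A node that has lost
a 32 GB DIMM has only 160 GB, which fits just floor(160 / 5.4) = 29
tasks, so a 64-task job that lands on the degraded node spills onto a
third node instead of the usual two.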

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de



