[slurm-users] number of nodes varies for no reason?
noam.bernstein at nrl.navy.mil
Thu Mar 28 13:32:04 UTC 2019
> On Mar 27, 2019, at 9:32 PM, Chris Samuel <chris at csamuel.org> wrote:
> On 27/3/19 2:43 pm, Noam Bernstein wrote:
>> Hi fellow slurm users - I’ve been using slurm happily for a few months, but now I feel like it’s gone crazy, and I’m wondering if anyone can explain what’s going on. I have a trivial batch script which I submit multiple times, and which ends up with different numbers of nodes allocated. Does anyone have any idea why?
> You would need to share the output of "scontrol show nodes" to get an idea of what resources Slurm thinks each node has.
Thanks for the pointer. I believe this revealed the problem. Systematically going over the "scontrol show nodes" output showed that while the number of cores was the same on each node, the memory was not, because one node had a badly socketed DIMM. Even though I wasn’t explicitly requesting memory, the partition defaults to x * total_mem/n_cores per task, where x is about 0.9, so Slurm must have determined that the node with less memory could not hold a full set of tasks. I fixed the underlying memory issue, and now I can no longer reproduce the weird behavior - the job now always gets 2 nodes.
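
To illustrate the arithmetic, here is a minimal Python sketch of how a per-task memory default can change the node count. Everything numeric in it is an assumption made up for illustration: the node names, the 32 cores, the 192000 MB and 160000 MB memory sizes, and the 64-task job are not from the actual cluster. The key=value text only imitates the NodeName, CPUTot and RealMemory fields of "scontrol show nodes", and the fill-nodes-in-order loop is a deliberate simplification, not Slurm's real selection logic.

```python
import re

# Hypothetical excerpt imitating `scontrol show nodes` fields; values are invented.
SAMPLE = """\
NodeName=node01 CPUTot=32 RealMemory=192000
NodeName=node02 CPUTot=32 RealMemory=192000
NodeName=node03 CPUTot=32 RealMemory=160000
"""


def parse_nodes(text):
    """Return {node_name: (cpus, real_memory_mb)} from key=value lines."""
    nodes = {}
    for record in text.strip().splitlines():
        fields = dict(re.findall(r"(\w+)=(\S+)", record))
        nodes[fields["NodeName"]] = (int(fields["CPUTot"]), int(fields["RealMemory"]))
    return nodes


def tasks_that_fit(cpus, real_memory_mb, mem_per_task_mb):
    """Tasks a node can host if each task needs one core and mem_per_task_mb."""
    return min(cpus, real_memory_mb // mem_per_task_mb)


def nodes_needed(capacities, ntasks):
    """Count nodes used when filling them in the given order (simplified model)."""
    used = 0
    remaining = ntasks
    for capacity in capacities:
        if remaining <= 0:
            break
        remaining -= capacity
        used += 1
    return used


nodes = parse_nodes(SAMPLE)

# Assumed partition default: 0.9 * total_mem / n_cores of a healthy node,
# computed with integer arithmetic (9/10 stands in for x = 0.9).
mem_per_task = 192000 * 9 // (10 * 32)

capacity = {name: tasks_that_fit(cpus, mem, mem_per_task)
            for name, (cpus, mem) in nodes.items()}
for name, tasks in capacity.items():
    print(name, "fits", tasks, "tasks")

# A hypothetical 64-task job: 2 nodes if both are healthy,
# 3 nodes if the short-memory node happens to be picked first.
healthy_first = [capacity["node01"], capacity["node02"], capacity["node03"]]
degraded_first = [capacity["node03"], capacity["node01"], capacity["node02"]]
print("healthy nodes first:", nodes_needed(healthy_first, 64), "nodes")
print("degraded node first:", nodes_needed(degraded_first, 64), "nodes")
```

In this toy model the node with less memory can hold only 29 tasks instead of 32, so whether the job lands on 2 or 3 nodes depends on which nodes are free when it starts, which is consistent with the varying node counts described above.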