[slurm-users] number of nodes varies for no reason?

Noam Bernstein noam.bernstein at nrl.navy.mil
Wed Mar 27 21:43:10 UTC 2019


Hi fellow slurm users - I’ve been using slurm happily for a few months, but now I feel like it’s gone crazy, and I’m wondering if anyone can explain what’s going on. I have a trivial batch script which I submit multiple times, and the resulting jobs end up with different numbers of nodes allocated. Does anyone have any idea why?

Here’s the script and what happens when I submit it three times:

tin 2028 : cat t
#!/bin/bash
#SBATCH --ntasks=72
#SBATCH --exclusive
#SBATCH --partition=n2019
#SBATCH --ntasks-per-core=1
#SBATCH --time=00:10:00

echo test
sleep 600

tin 2029 : sbatch t
Submitted batch job 407758
tin 2030 : sbatch t
Submitted batch job 407759
tin 2030 : sbatch t
Submitted batch job 407760

tin 2030 : squeue -l -u bernstei
Wed Mar 27 17:30:51 2019
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
            407760     n2019        t bernstei  RUNNING       0:03     10:00      3 compute-4-[16-18]
            407758     n2019        t bernstei  RUNNING       0:06     10:00      2 compute-4-[29-30]
            407759     n2019        t bernstei  RUNNING       0:06     10:00      2 compute-4-[21,28]

All the compute-4-* nodes have 36 physical cores, 72 hyperthreads.
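
(In case it’s relevant, something like the following should show what slurm itself thinks the topology of those nodes is - CPUs, sockets, cores per socket, and threads per core. I’m writing these from memory, so treat them as a sketch rather than exact commands/output:)

sinfo -N -p n2019 -o "%N %c %X %Y %Z"
scontrol show node compute-4-16 | grep -E 'CPUTot|Sockets|CoresPerSocket|ThreadsPerCore'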


If I look at the SLURM_* variables, all the jobs show 
SLURM_NPROCS=72
SLURM_NTASKS=72
SLURM_CPUS_ON_NODE=72
SLURM_NTASKS_PER_CORE=1
but for some reason the job that ended up on 3 nodes, and only that one, shows
SLURM_JOB_CPUS_PER_NODE=72(x3)
SLURM_TASKS_PER_NODE=24(x3)
while the others show the expected
SLURM_JOB_CPUS_PER_NODE=72(x2)
SLURM_TASKS_PER_NODE=36(x2)
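
(If it helps, the thing I’d compare next is what scontrol reports for each allocation - something along these lines, using the job IDs above; again just a sketch, not actual output:)

scontrol show job 407758 | grep -E 'NumNodes|NumCPUs|NumTasks|CPUs/Task|TRES'
scontrol show job 407760 | grep -E 'NumNodes|NumCPUs|NumTasks|CPUs/Task|TRES'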

I’m using CentOS 7 (via NPACI Rocks) and slurm 18.08.0 via the rocks slurm roll.
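
Presumably I can work around this by pinning the layout explicitly, e.g. by adding --ntasks-per-node to the script above (untested sketch below), but I’d still like to understand why the scheduler sometimes picks three nodes on its own:

#!/bin/bash
#SBATCH --ntasks=72
# assumption: forcing 36 tasks per node so that 72 tasks land on exactly 2 nodes
#SBATCH --ntasks-per-node=36
#SBATCH --exclusive
#SBATCH --partition=n2019
#SBATCH --ntasks-per-core=1
#SBATCH --time=00:10:00

echo test
sleep 600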

										thanks,
										Noam