[slurm-users] number of nodes varies for no reason?
Noam Bernstein
noam.bernstein at nrl.navy.mil
Wed Mar 27 21:43:10 UTC 2019
Hi fellow slurm users - I’ve been using slurm happily for a few months, but now I feel like it’s gone crazy, and I’m wondering if anyone can explain what’s going on. I have a trivial batch script which I submit multiple times, and it ends up with a different number of nodes allocated each time. Does anyone have any idea why?
Here’s the output:
tin 2028 : cat t
#!/bin/bash
#SBATCH --ntasks=72
#SBATCH --exclusive
#SBATCH --partition=n2019
#SBATCH --ntasks-per-core=1
#SBATCH --time=00:10:00
echo test
sleep 600
tin 2029 : sbatch t
Submitted batch job 407758
tin 2030 : sbatch t
Submitted batch job 407759
tin 2030 : sbatch t
Submitted batch job 407760
tin 2030 : squeue -l -u bernstei
Wed Mar 27 17:30:51 2019
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
407760 n2019 t bernstei RUNNING 0:03 10:00 3 compute-4-[16-18]
407758 n2019 t bernstei RUNNING 0:06 10:00 2 compute-4-[29-30]
407759 n2019 t bernstei RUNNING 0:06 10:00 2 compute-4-[21,28]
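If it helps with diagnosis, the same layout should be visible from scontrol while the jobs are running (407760 is the 3-node job, 407758 one of the 2-node ones); I'm just grepping out the fields that look relevant:

# compare how slurm laid out the 3-node job vs. one of the 2-node jobs
scontrol show job 407760 | grep -E 'NumNodes|NumCPUs|NumTasks|NodeList'
scontrol show job 407758 | grep -E 'NumNodes|NumCPUs|NumTasks|NodeList'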
All the compute-4-* nodes have 36 physical cores, 72 hyperthreads.
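That's the physical layout; the way slurm itself has the nodes configured (CPUs per node and sockets:cores:threads geometry) can be double-checked with something like:

# how slurm sees the n2019 nodes: name, CPUs (%c), S:C:T geometry (%z)
sinfo -p n2019 -N -o '%N %c %z'
# or per node:
scontrol show node compute-4-16 | grep -E 'CPUTot|Sockets|CoresPerSocket|ThreadsPerCore'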
If I look at the SLURM_* variables, all the jobs show
SLURM_NPROCS=72
SLURM_NTASKS=72
SLURM_CPUS_ON_NODE=72
SLURM_NTASKS_PER_CORE=1
but for some reason the job that ends up on 3 nodes, and only that one, shows
SLURM_JOB_CPUS_PER_NODE=72(x3)
SLURM_TASKS_PER_NODE=24(x3)
while the others show the expected
SLURM_JOB_CPUS_PER_NODE=72(x2)
SLURM_TASKS_PER_NODE=36(x2)
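I assume I could work around this by pinning the layout explicitly, something along the lines of the script below (the --nodes/--ntasks-per-node values are just what I'd expect for 36-core nodes, not anything slurm told me to use), but I'd still like to understand why the un-pinned version varies:

#!/bin/bash
#SBATCH --ntasks=72
# pin the allocation to exactly 2 nodes, one task per physical core
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=36
#SBATCH --exclusive
#SBATCH --partition=n2019
#SBATCH --ntasks-per-core=1
#SBATCH --time=00:10:00
# dump the SLURM_* variables so the runs can be compared
env | grep '^SLURM_' | sort
sleep 600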
I’m using CentOS 7 (via NPACI Rocks) and slurm 18.08.0 via the Rocks slurm roll.
thanks,
Noam