[slurm-users] [EXT] Strange sbatch error with 21.08.2&5
Wayne Hendricks
waynehendricks at gmail.com
Sat Jan 15 16:07:21 UTC 2022
I've also noticed that the behavior only crops up in sbatch when
multiples of whole nodes are requested; a single node runs fine. For
example, on 8-GPU systems, 16/24/32-GPU jobs fail, whereas 15/23/31-GPU
jobs run fine. A manual srun command has no trouble requesting any of
these configurations.
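[Editorial note: the whole-node boundary lines up with the arithmetic in the node config posted further down the thread (DefCpuPerGPU=10 on 80-CPU, 8-GPU nodes): 8 GPUs x 10 CPUs/GPU consumes every CPU on a node, so whole-node GPU multiples leave zero CPUs free, which is speculatively consistent with "step_layout_create: no usable CPUs". A sketch of that arithmetic, assuming the posted defaults:]

```shell
#!/usr/bin/env bash
# Back-of-the-envelope CPU accounting per the posted config (an assumption
# that DefCpuPerGPU drives the step layout, not a confirmed diagnosis):
GPUS_PER_NODE=8
CPUS_PER_NODE=80
DEF_CPU_PER_GPU=10

for req in 15 16; do
  full_nodes=$(( req / GPUS_PER_NODE ))   # nodes with all 8 GPUs allocated
  remainder=$(( req % GPUS_PER_NODE ))    # GPUs on the final, partial node
  if [ "$remainder" -gt 0 ]; then
    echo "request=$req: $full_nodes full node(s), plus one node using $(( remainder * DEF_CPU_PER_GPU ))/$CPUS_PER_NODE CPUs"
  else
    echo "request=$req: $full_nodes full node(s), every CPU on each node allocated"
  fi
done
```

A -G 15 request leaves 10 CPUs free on the partial node; -G 16 allocates 80/80 CPUs on both nodes, matching the failing cases exactly.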
On Sat, Jan 15, 2022 at 10:32 AM Wayne Hendricks
<waynehendricks at gmail.com> wrote:
>
> The only thing that jumps out on the ctl logs is:
> error: step_layout_create: no usable CPUs
> The node logs were unremarkable.
>
> It doesn't make much sense to me that the same job works with srun,
> or with an odd number of GPUs in sbatch. I suspect some CPU count
> isn't adding up right somewhere.
>
> On Sat, Jan 15, 2022 at 12:56 AM Sean Crosby <scrosby at unimelb.edu.au> wrote:
> >
> > Any error in slurmd.log on the node or slurmctld.log on the ctl?
> >
> > Sean
> > ________________________________
> > From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Wayne Hendricks <waynehendricks at gmail.com>
> > Sent: Saturday, 15 January 2022 16:04
> > To: slurm-users at schedmd.com <slurm-users at schedmd.com>
> > Subject: [EXT] [slurm-users] Strange sbatch error with 21.08.2&5
> >
> >
> > Running test job with srun works:
> > wayneh at login:~$ srun -G16 -p v100 /home/wayne.hendricks/job.sh
> > 179851
> > Linux dgx1-1 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC
> > 2022 x86_64 x86_64 x86_64 GNU/Linux
> > 179851
> > Linux dgx1-2 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC
> > 2022 x86_64 x86_64 x86_64 GNU/Linux
> >
> > Submitting the same with sbatch does not:
> > wayneh at login:~$ sbatch test.sh
> > Submitted batch job 179850
> > wayneh at login:~$ cat test.out
> > srun: error: Unable to create step for job 179850: Unspecified error
> > wayneh at login:~$ cat test.sh
> > #!/usr/bin/env bash
> > #SBATCH -J testing
> > #SBATCH -e /home/wayne.hendricks/test.out
> > #SBATCH -o /home/wayne.hendricks/test.out
> > #SBATCH -G 16
> > #SBATCH --partition v100
> > srun uname -a
> >
> > Any idea why srun and sbatch wouldn't run the same way? It seems to
> > run correctly when I use an odd number of GPUs in sbatch. (#SBATCH -G
> > 15)
> >
> > Node config:
> > NodeName=dgx1-[1-10] CPUs=80 Sockets=2 CoresPerSocket=20
> > ThreadsPerCore=2 RealMemory=490000 Gres=gpu:8 State=UNKNOWN
> > PartitionName=v100 Nodes=dgx1-[1-10] OverSubscribe=FORCE:8
> > DefCpuPerGPU=10 DefMemPerGPU=61250 MaxTime=INFINITE State=UP
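[Editorial note: some untested variations of test.sh that might sidestep the failing batch step. The flags below are standard Slurm options, but whether any of them avoids this particular error is an assumption, not a confirmed fix; this fragment cannot be verified off-cluster:]

```shell
#!/usr/bin/env bash
#SBATCH -J testing
#SBATCH -e /home/wayne.hendricks/test.out
#SBATCH -o /home/wayne.hendricks/test.out
#SBATCH -G 16
#SBATCH --partition v100
# Variation 1: state the GPU request on the step explicitly rather than
# letting the batch step inherit it (assumption: inheritance is what breaks).
srun -G 16 uname -a
# Variation 2 (commented out): pin the step to one task per node so the
# step layout is driven by task count rather than the GPU-derived CPU count.
# srun --ntasks-per-node=1 uname -a
```

If either variation succeeds where the original fails, that would help narrow whether the step's CPU layout or its GRES inheritance is at fault.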
> >