[slurm-users] [EXT] Strange sbatch error with 21.08.2&5

Wayne Hendricks waynehendricks at gmail.com
Sat Jan 15 15:32:03 UTC 2022


The only thing that jumps out in slurmctld.log on the controller is:
error: step_layout_create: no usable CPUs
The node logs were unremarkable.

It doesn't make much sense to me that the same job works with srun, or
with an odd number of GPUs in sbatch, but fails in sbatch with -G 16. I
suspect something isn't adding up right somewhere.
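
As a rough check of the numbers, here is a small sketch of the CPU
arithmetic implied by the node and partition config quoted below. It
assumes the GPUs are packed onto as few nodes as possible and that
DefCpuPerGPU is applied to every requested GPU; the function name
spare_cpus is made up for illustration. It is only arithmetic on the
config, not a diagnosis, and it does not explain why srun on its own
works.

```shell
#!/usr/bin/env bash
# Values copied from the quoted slurm.conf lines.
cpus_per_node=80     # CPUs=80
gpus_per_node=8      # Gres=gpu:8
def_cpu_per_gpu=10   # DefCpuPerGPU=10

# Hypothetical helper: CPUs left over on the allocated nodes after
# giving every GPU its default CPU share.
# Assumption: GPUs are packed onto the minimum number of nodes.
spare_cpus() {
    local gpus=$1
    local nodes=$(( (gpus + gpus_per_node - 1) / gpus_per_node ))  # round up
    echo $(( nodes * cpus_per_node - gpus * def_cpu_per_gpu ))
}

for gpus in 15 16; do
    echo "-G $gpus: $(spare_cpus "$gpus") CPUs left over"
done
```

With -G 16 the default CPU request covers every CPU on both nodes,
while -G 15 leaves 10 spare. That at least lines up with the "no usable
CPUs" message, though I can't say that it is the cause.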

On Sat, Jan 15, 2022 at 12:56 AM Sean Crosby <scrosby at unimelb.edu.au> wrote:
>
> Any error in slurmd.log on the node or slurmctld.log on the ctl?
>
> Sean
> ________________________________
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Wayne Hendricks <waynehendricks at gmail.com>
> Sent: Saturday, 15 January 2022 16:04
> To: slurm-users at schedmd.com <slurm-users at schedmd.com>
> Subject: [EXT] [slurm-users] Strange sbatch error with 21.08.2&5
>
> Running test job with srun works:
> wayneh at login:~$ srun -G16 -p v100 /home/wayne.hendricks/job.sh
> 179851
> Linux dgx1-1 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
> 179851
> Linux dgx1-2 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
>
> Submitting the same with sbatch does not:
> wayneh at login:~$ sbatch test.sh
> Submitted batch job 179850
> wayneh at login:~$ cat test.out
> srun: error: Unable to create step for job 179850: Unspecified error
> wayneh at login:~$ cat test.sh
> #!/usr/bin/env bash
> #SBATCH -J testing
> #SBATCH -e /home/wayne.hendricks/test.out
> #SBATCH -o /home/wayne.hendricks/test.out
> #SBATCH -G 16
> #SBATCH --partition v100
> srun uname -a
>
> Any idea why srun and sbatch wouldn't run the same way? It seems to
> run correctly when I use an odd number of GPUs in sbatch
> (#SBATCH -G 15).
>
> Node config:
> NodeName=dgx1-[1-10] CPUs=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=490000 Gres=gpu:8 State=UNKNOWN
> PartitionName=v100 Nodes=dgx1-[1-10] OverSubscribe=FORCE:8 DefCpuPerGPU=10 DefMemPerGPU=61250 MaxTime=INFINITE State=UP
>


