[slurm-users] [EXT] Strange sbatch error with 21.08.2&5

Sat Jan 15 05:53:55 UTC 2022

Are there any errors in slurmd.log on the compute node or in slurmctld.log on the controller?

Sean
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Wayne Hendricks <waynehendricks at gmail.com>
Sent: Saturday, 15 January 2022 16:04
To: slurm-users at schedmd.com <slurm-users at schedmd.com>
Subject: [EXT] [slurm-users] Strange sbatch error with 21.08.2&5

Running test job with srun works:
wayneh at login:~$ srun -G16 -p v100 /home/wayne.hendricks/job.sh
179851
Linux dgx1-1 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
179851
Linux dgx1-2 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Submitting the same with sbatch does not:
wayneh at login:~$ sbatch test.sh
Submitted batch job 179850
wayneh at login:~$ cat test.out
srun: error: Unable to create step for job 179850: Unspecified error
wayneh at login:~$ cat test.sh
#!/usr/bin/env bash
#SBATCH -J testing
#SBATCH -e /home/wayne.hendricks/test.out
#SBATCH -o /home/wayne.hendricks/test.out
#SBATCH -G 16
#SBATCH --partition v100
srun uname -a

Any idea why srun and sbatch wouldn't behave the same way? The sbatch
job seems to run correctly when I request an odd number of GPUs (for
example, #SBATCH -G 15).

Node config:
NodeName=dgx1-[1-10] CPUs=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=490000 Gres=gpu:8 State=UNKNOWN
PartitionName=v100 Nodes=dgx1-[1-10] OverSubscribe=FORCE:8 DefCpuPerGPU=10 DefMemPerGPU=61250 MaxTime=INFINITE State=UP
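
For reference, here is a rough arithmetic check of what the per-GPU defaults in the config above add up to, sketched in bash. The numbers are copied from the config. The function name and the idea of packing GPUs onto as few nodes as possible are assumptions made only for illustration, and the scheduler may place a job differently. This does not by itself explain the step-creation error. It only shows that -G 16 would claim every CPU and all RealMemory on two nodes, while -G 15 would leave some headroom.

```shell
#!/usr/bin/env bash
# Values copied from the NodeName/PartitionName lines in this message.
cpus_per_node=80
real_memory_mb=490000
gpus_per_node=8
def_cpu_per_gpu=10
def_mem_per_gpu_mb=61250

# footprint is a hypothetical helper, not a Slurm command.
# It prints the default CPU and memory totals for a job-wide GPU count,
# ASSUMING the GPUs are packed onto the fewest possible nodes.
footprint() {
  local gpus=$1
  # Round up to whole nodes.
  local nodes=$(( (gpus + gpus_per_node - 1) / gpus_per_node ))
  local cpus=$(( gpus * def_cpu_per_gpu ))
  local mem=$(( gpus * def_mem_per_gpu_mb ))
  echo "gpus=$gpus nodes=$nodes cpus=$cpus/$(( nodes * cpus_per_node )) mem_mb=$mem/$(( nodes * real_memory_mb ))"
}

footprint 16
footprint 15
```

Each output line shows requested/available totals, so a ratio such as cpus=160/160 means the defaults leave nothing unallocated on the nodes involved.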
