[slurm-users] Strange sbatch error with 21.08.2&5

Wayne Hendricks waynehendricks at gmail.com
Sat Jan 15 05:04:00 UTC 2022


Running test job with srun works:
wayneh@login:~$ srun -G16 -p v100 /home/wayne.hendricks/job.sh
179851
Linux dgx1-1 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
179851
Linux dgx1-2 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Submitting the same with sbatch does not:
wayneh@login:~$ sbatch test.sh
Submitted batch job 179850
wayneh@login:~$ cat test.out
srun: error: Unable to create step for job 179850: Unspecified error
wayneh@login:~$ cat test.sh
#!/usr/bin/env bash
#SBATCH -J testing
#SBATCH -e /home/wayne.hendricks/test.out
#SBATCH -o /home/wayne.hendricks/test.out
#SBATCH -G 16
#SBATCH --partition v100
srun uname -a

Any idea why the same request works with srun directly but fails when
srun is launched from inside an sbatch job? The batch job does run
correctly when I request an odd number of GPUs (for example,
#SBATCH -G 15).
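
One way to narrow this down might be to have the batch script print what the allocation looks like before the step is launched. The following is only a sketch, not a known fix: the helper name dump_slurm_env is made up for illustration, the list of variables is a guess at what is useful (some of them are only set when the matching option was given), and the Slurm commands are guarded so the script also runs harmlessly outside a job.

```bash
#!/usr/bin/env bash
#SBATCH -J testing
#SBATCH -e /home/wayne.hendricks/test.out
#SBATCH -o /home/wayne.hendricks/test.out
#SBATCH -G 16
#SBATCH --partition v100

# Hypothetical helper: print selected Slurm variables.
# Variables that Slurm did not set are shown as "unset".
dump_slurm_env() {
  local name
  for name in SLURM_JOB_ID SLURM_JOB_NUM_NODES SLURM_JOB_NODELIST \
              SLURM_JOB_CPUS_PER_NODE SLURM_GPUS SLURM_MEM_PER_NODE; do
    echo "${name}=${!name:-unset}"
  done
}

dump_slurm_env

# Show the detailed job record, but only inside a job with scontrol on PATH.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
  scontrol show job -d "$SLURM_JOB_ID"
fi

# Same step as in test.sh, with extra srun verbosity.
if command -v srun >/dev/null 2>&1; then
  srun -vv uname -a
fi
```

Comparing this output for -G 16 and -G 15 should at least show whether the two allocations differ in node count, CPUs, or memory before the step is created.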

Node config:
NodeName=dgx1-[1-10] CPUs=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=490000 Gres=gpu:8 State=UNKNOWN
PartitionName=v100 Nodes=dgx1-[1-10] OverSubscribe=FORCE:8 DefCpuPerGPU=10 DefMemPerGPU=61250 MaxTime=INFINITE State=UP
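
For reference, the per-GPU defaults above can be multiplied out to see what a given -G value implies. This is only arithmetic on the numbers in the config, as a rough sketch: it assumes the job lands on the minimum number of 8-GPU nodes, which Slurm is not obliged to do, and the function name implied_totals is made up for illustration. Whether this has anything to do with the step creation error is an open question.

```bash
#!/usr/bin/env bash
# Values copied from the node and partition config above.
gpus_per_node=8
cpus_per_node=80
real_memory_mb=490000
def_cpu_per_gpu=10
def_mem_per_gpu_mb=61250

# Hypothetical helper: print the CPUs and memory implied by the per-GPU
# defaults, next to the capacity of the smallest set of nodes that can
# hold the requested GPUs (assumption: minimum node count is used).
implied_totals() {
  local gpus=$1
  local nodes=$(( (gpus + gpus_per_node - 1) / gpus_per_node ))
  local cpus=$(( gpus * def_cpu_per_gpu ))
  local mem=$(( gpus * def_mem_per_gpu_mb ))
  echo "gpus=$gpus nodes=$nodes cpus=$cpus/$(( nodes * cpus_per_node )) mem_mb=$mem/$(( nodes * real_memory_mb ))"
}

implied_totals 16
implied_totals 15
```

With these numbers, -G 16 works out to 160 of 160 CPUs and 980000 of 980000 MB on two nodes, so the defaults claim every CPU and all RealMemory, while -G 15 works out to 150 of 160 CPUs and 918750 of 980000 MB.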