[slurm-users] srun/sbatch dependency not working

Darin Gowan dgowan777 at gmail.com
Wed Apr 14 18:58:40 UTC 2021


Dear distinguished list,

I am new to SLURM.  I recently installed SLURM 20.11.3 on two separate
three-node clusters.  The first cluster is for testing and uses three
small RHEL 7.7 VMs (8 cores, 8 GB RAM).  After a successful installation
and some sbatch testing, I proceeded to the second cluster.

The production cluster is running on three RHEL 7.7 physical servers, two
sockets, 24 cores each, 2 threads per core and 1TB RAM.  This installation
was also successful.

Yesterday, a user brought an issue to my attention.  They reported that
when submitting a job via srun using the dependency option (-d
afterany:aaaa:bbbb:cccc:dddd...), the dependency was not being honored.

I began by testing the srun -d option in my test cluster, which worked like
a charm.  The srun job went into the Pending state, waiting for resources.
Once jobid 1296 completed, the srun executed.

$ srun /wks01/data/slurm-jobs/clean-up.sh -d afterany:1296
srun: job 1301 queued and waiting for resources

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1301     slurm clean-up zzgowand PD       0:00      1 (Resources)
              1291     slurm hostname zzgowand  R       0:56      1 r7slurm01
              1292     slurm hostname zzgowand  R       0:56      1 r7slurm01
              1293     slurm hostname zzgowand  R       0:56      1 r7slurm01
              1294     slurm hostname zzgowand  R       0:55      1 r7slurm01
              1295     slurm hostname zzgowand  R       0:55      1 r7slurm02
              1296     slurm hostname zzgowand  R       0:53      1 r7slurm02
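
One thing that may be worth checking in the test above: srun stops parsing
its own options at the first non-option argument, so with '-d afterany:1296'
placed after the script path, the option is handed to clean-up.sh rather than
to srun.  That would fit the pending reason being (Resources) rather than
(Dependency).  A minimal sketch of the intended form follows.  The helper
name build_dependency is hypothetical and only illustrates how the
afterany:id:id... string is assembled; the command is echoed, not run.

```shell
#!/bin/sh
# Hypothetical helper: join a dependency type and job IDs into the
# "type:id:id..." form that the -d/--dependency option takes.
build_dependency() {
    spec=$1
    shift
    for jobid in "$@"; do
        spec="$spec:$jobid"
    done
    printf '%s\n' "$spec"
}

dep=$(build_dependency afterany 1296)

# srun options go BEFORE the executable; anything after the script path
# is passed to the script as its own arguments.
# The command is only printed here so the sketch runs without Slurm.
echo srun -d "$dep" /wks01/data/slurm-jobs/clean-up.sh
```

With several job IDs, build_dependency afterany 1291 1292 1293 yields
afterany:1291:1292:1293, matching the form the user reported using.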


However, when I ran the exact same test on the production cluster, it
behaved as if the '-d' option had not been supplied.  The job went into the
Running state but never actually executed.  It sat in this state for
several minutes, when it should have finished in a few seconds.  I finally
aborted the foreground srun, which produced the following message:

srun launch/slurm: launch_p_step_launch: ... aborted before step completely launched.

Has anyone experienced this before?

Thank you.

Darin Gowan