[slurm-users] Problems with srun and TaskProlog

Putnam, Harry Harry.Putnam at ucsf.edu
Thu Feb 10 23:49:09 UTC 2022


Greetings All

I am struggling a bit with how TaskProlog works with srun. We have our TaskProlog set up to create a temporary directory on local compute node scratch space and export its path in a variable called $TMPDIR. Our TaskEpilog deletes that directory. This is working great for jobs submitted with sbatch. If I start an srun job with --ntasks=1, everything works the same as with sbatch: the $TMPDIR variable is set and the directory is created on local scratch. However, if I use --ntasks=n where n > 1, the $TMPDIR variable is still set but the directory itself does not exist. Key files and examples:

slurm.conf (relevant entries):

#Prolog=/opt/slurm/prolog.bash
#PrologFlags=Alloc,NoHold
#Epilog=/opt/slurm/epilog.bash
#SrunProlog=/opt/slurm/srun_prolog
#SrunEpilog=/opt/slurm/srun_epilog
TaskProlog=/opt/slurm/task_prolog
TaskEpilog=/opt/slurm/task_epilog


/opt/slurm/task_prolog:

#!/bin/bash
mytmpdir=/scratch/$SLURM_JOB_USER/$SLURM_JOB_ID
mkdir -p $mytmpdir
echo export TMPDIR=$mytmpdir
exit;

/opt/slurm/task_epilog:

#!/bin/bash
mytmpdir=/scratch/$SLURM_JOB_USER/$SLURM_JOB_ID
rm -Rf $mytmpdir
exit;
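
One way to check how often these two scripts actually run for a given srun would be to log every invocation. The following is only a minimal sketch: the function name log_task_event, the TASK_LOG variable, and the default log path are hypothetical, and it assumes that SLURM_JOB_ID and SLURM_PROCID are present in the environment of both scripts. It writes to a file rather than stdout so that it does not interfere with the "export" line that task_prolog prints.

```shell
#!/bin/bash
# Hypothetical diagnostic helper for task_prolog / task_epilog.
# Assumption: SLURM_JOB_ID and SLURM_PROCID are set in the script's
# environment; "unknown" is recorded when they are not.
log_task_event() {
    local event=$1
    # Hypothetical log location; override by setting TASK_LOG.
    local logfile=${TASK_LOG:-/tmp/task_events.log}
    # Append one line per invocation: event, job ID, task rank, node name.
    printf '%s job=%s procid=%s host=%s\n' \
        "$event" \
        "${SLURM_JOB_ID:-unknown}" \
        "${SLURM_PROCID:-unknown}" \
        "$(uname -n)" >> "$logfile"
}
```

Calling log_task_event prolog near the top of task_prolog and log_task_event epilog near the top of task_epilog, then comparing the log after an --ntasks=1 run and an --ntasks=2 run, should show whether the scripts run once per srun or once per task, and in what order.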

Run Example --ntasks=1:

$ srun --pty --mem=16g --ntasks=1 --time 0-08:00 --gres=scratch:20g --partition=cbc --nodelist=c4-n13  $SHELL
[hputnam@c4-n13:job=421362 ~]$ echo $TMPDIR
/scratch/hputnam/421362
[hputnam@c4-n13:job=421362 ~]$ ls $TMPDIR
[hputnam@c4-n13:job=421362 ~]$

Run Example --ntasks=2 (the $TMPDIR variable is set but the directory is not created):

$ srun --pty --mem=16g --ntasks=2 --time 0-08:00 --gres=scratch:20g --partition=cbc --nodelist=c4-n13  $SHELL
[hputnam@c4-n13:job=421370 ~]$ echo $TMPDIR
/scratch/hputnam/421370
[hputnam@c4-n13:job=421370 ~]$ ls $TMPDIR
ls: cannot access /scratch/hputnam/421370: No such file or directory


I am quite confused by this. I read https://slurm.schedmd.com/prolog_epilog.html, which says TaskProlog is run by the user executing srun prior to launching the job step. I am not sure I understand what constitutes a job step. I do see a stepd process launched on the compute node each time I execute srun. That seems independent of --ntasks: I get one process per srun regardless of what --ntasks is set to.

Thanks in advance.

-Harry