[slurm-users] Problems with srun and TaskProlog
Putnam, Harry
Harry.Putnam at ucsf.edu
Thu Feb 10 23:49:09 UTC 2022
Greetings All
I am struggling a bit with how TaskProlog works with srun. Our TaskProlog creates a temporary directory on local compute node scratch space and exports its path in the $TMPDIR variable. Our TaskEpilog deletes that directory. This works great for jobs submitted with sbatch. If I start an srun job with --ntasks=1, everything works the same as with sbatch: the $TMPDIR variable is set and the directory is created on local scratch. However, if I use --ntasks=n where n > 1, the $TMPDIR variable is still set but the directory itself does not exist.
Key files and examples:
slurm.conf (relevant entries):
#Prolog=/opt/slurm/prolog.bash
#PrologFlags=Alloc,NoHold
#Epilog=/opt/slurm/epilog.bash
#SrunProlog=/opt/slurm/srun_prolog
#SrunEpilog=/opt/slurm/srun_epilog
TaskProlog=/opt/slurm/task_prolog
TaskEpilog=/opt/slurm/task_epilog
/opt/slurm/task_prolog:
#!/bin/bash
mytmpdir=/scratch/$SLURM_JOB_USER/$SLURM_JOB_ID
mkdir -p $mytmpdir
echo export TMPDIR=$mytmpdir
exit;
/opt/slurm/task_epilog:
#!/bin/bash
mytmpdir=/scratch/$SLURM_JOB_USER/$SLURM_JOB_ID
rm -Rf $mytmpdir
exit;
Run example with --ntasks=1:
$ srun --pty --mem=16g --ntasks=1 --time 0-08:00 --gres=scratch:20g --partition=cbc --nodelist=c4-n13 $SHELL
[hputnam@c4-n13:job=421362 ~]$ echo $TMPDIR
/scratch/hputnam/421362
[hputnam@c4-n13:job=421362 ~]$ ls $TMPDIR
[hputnam@c4-n13:job=421362 ~]$
Run example with --ntasks=2 (the $TMPDIR variable is set but the directory is not created):
$ srun --pty --mem=16g --ntasks=2 --time 0-08:00 --gres=scratch:20g --partition=cbc --nodelist=c4-n13 $SHELL
[hputnam@c4-n13:job=421370 ~]$ echo $TMPDIR
/scratch/hputnam/421370
[hputnam@c4-n13:job=421370 ~]$ ls $TMPDIR
ls: cannot access /scratch/hputnam/421370: No such file or directory
I am quite confused by this. According to https://slurm.schedmd.com/prolog_epilog.html, TaskProlog is run by the user executing srun prior to launching the job step. I am not sure I understand what constitutes a job step. I do see a stepd process launched on the compute node each time I execute srun. That seems independent of --ntasks: I get one process per srun regardless of what --ntasks is set to.
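One way to narrow this down would be to record every TaskProlog and TaskEpilog invocation. The slurm.conf documentation describes both as running once per task, so with --ntasks=2 the log should show two prolog entries and, eventually, two epilog entries for the same job ID. The following is only a minimal sketch: the function name log_task_event, the TASK_DEBUG_LOG variable, and the /tmp log path are hypothetical, and it assumes SLURM_JOB_ID, SLURM_STEP_ID, and SLURM_PROCID are present in the task environment.

```shell
#!/bin/bash
# Debugging sketch, not the production task_prolog shown above.
# log_task_event and TASK_DEBUG_LOG are hypothetical names.

log_task_event() {
    # $1 = event label ("prolog" or "epilog"), $2 = log file to append to
    local event=$1 logfile=$2
    # Fall back to "unset" so a missing variable is visible in the log
    printf '%s job=%s step=%s task=%s\n' \
        "$event" "${SLURM_JOB_ID:-unset}" "${SLURM_STEP_ID:-unset}" \
        "${SLURM_PROCID:-unset}" >> "$logfile"
}

# Assumed log location; override by setting TASK_DEBUG_LOG.
log_task_event prolog "${TASK_DEBUG_LOG:-/tmp/task_prolog_debug.log}"
```

Calling the same function with the label epilog from task_epilog would show when each task's epilog fires. If an epilog entry for one task appears while the interactive shell is still open, that would suggest the shared /scratch/$SLURM_JOB_USER/$SLURM_JOB_ID directory is being removed by a task that has already exited.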
Thanks in advance.
-Harry