[slurm-users] SLURM_JOB_NODELIST not available in prolog / epilog scripts
Dan Jordan
ddj116 at gmail.com
Sun Mar 4 18:12:08 MST 2018
Sorry, you are right, the documentation is clear about it being available
only in EpilogSlurmctld. I'm quite new to SLURM and I've read some of the
documentation, but obviously I haven't grasped it all. I don't quite
understand the difference between --epilog vs. --task-epilog, or Epilog
vs. EpilogSlurmctld, so this is probably a case of me just doing it wrong!
Here's my example use case and how I'm trying to run and clean things up:
srun --nodes 3 --epilog path/to/cleanup.sh myProgram.exe
- *myProgram.exe* reads the SLURM_JOB_NODELIST and spawns "slave
processes" on those machines accordingly. It's all done through ssh
communication internally within the Trick simulation architecture
<https://github.com/nasa/trick/wiki/UserGuide-Monte-Carlo>. I point
this out because that means I'm *not* launching these slave processes
with srun or any other SLURM mechanism once Trick "takes control". I simply
"ask SLURM" for 3 nodes and Trick handles the distribution of processes
based on SLURM_JOB_NODELIST information.
- *cleanup.sh* is designed to read the same SLURM_JOB_NODELIST and then
clean up things on those machines in the event something kills the program
early (scancel for example). This way, the epilog script is only cleaning
up on the machines that *myProgram.exe* used during its execution.
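To make the intent of cleanup.sh concrete, here is a minimal sketch of what such a script might look like. It assumes SLURM_JOB_NODELIST is set in the environment where the script runs (which, as discussed in this thread, is not the case for a script passed to srun --epilog) and that passwordless ssh to the allocated nodes works. The function names and the pkill command are hypothetical placeholders, not anything taken from Trick or SLURM; `scontrol show hostnames` is the standard way to expand a compact nodelist.

```shell
#!/bin/bash
# Hypothetical cleanup.sh sketch: run one command on every node of the job.
# Assumes SLURM_JOB_NODELIST is set where this runs and that passwordless
# ssh to the nodes works.

# Expand a compact nodelist such as "node[01-03]" into one hostname per line.
expand_nodelist() {
    scontrol show hostnames "$1"
}

# run_on_nodes NODELIST CMD [ARGS...]: run CMD on each node via ssh.
run_on_nodes() {
    local nodelist="$1" host
    shift
    for host in $(expand_nodelist "$nodelist"); do
        # BatchMode makes ssh fail instead of prompting for a password.
        ssh -o BatchMode=yes "$host" "$@" ||
            echo "cleanup command returned non-zero on $host" >&2
    done
}

if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    # Placeholder cleanup: kill leftover processes named myProgram.exe
    # owned by the current user (assumes the same username on every node).
    run_on_nodes "$SLURM_JOB_NODELIST" pkill -u "$(id -un)" -x myProgram.exe
else
    echo "SLURM_JOB_NODELIST is not set; nothing to clean up" >&2
fi
```

Passing the remote command as arguments keeps the node loop separate from whatever "cleaning up" actually means for a given program.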
Calling srun with --epilog cleanup.sh works just fine, but that environment
variable doesn't exist so it can't go clean up the other nodes, and it
doesn't appear to be executed once per node, which was another thought I'd
had. Indeed, cleanup.sh appears to run *on the machine I launched srun
from*, not any of the machines given to me in my SLURM_JOB_NODELIST!
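One way to check this observation is to have the epilog script record the host it runs on and whether the variable is visible. The following is only a diagnostic sketch; the function name, the EPILOG_DEBUG_LOG variable, and the default log path are all made up for illustration.

```shell
#!/bin/bash
# Diagnostic sketch: record where the script runs and what it can see.
report_context() {
    printf 'host=%s nodelist=%s\n' "$(uname -n)" "${SLURM_JOB_NODELIST:-unset}"
}

# EPILOG_DEBUG_LOG is a hypothetical variable; the default path is arbitrary.
report_context >> "${EPILOG_DEBUG_LOG:-/tmp/epilog-debug.log}"
```

Running the same script as an srun --epilog and as a task inside the allocation, then comparing the logged lines, shows which host each invocation ran on.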
What is the *correct* way to clean up processes across the nodes given to
my program by SLURM_JOB_NODELIST?
Thanks!
P.S. I had originally attempted to use sbatch instead of srun, but found
that it doesn't support an --epilog switch at all.
On Sun, Mar 4, 2018 at 6:01 PM, Christopher Samuel <chris at csamuel.org>
wrote:
> On 05/03/18 10:16, Dan Jordan wrote:
>
>> In my particular case, I need SLURM_JOB_NODELIST, which should be
>> available but it is not.
>>
>
> This is only available in PrologSlurmctld, not Prolog, according to
> those docs. Does that match what you're trying?
>
> cheers,
> Chris
>
>
--
Dan Jordan