[slurm-users] How to debug a prolog script?
Davide DelVento
davide.quantum at gmail.com
Fri Sep 16 12:43:02 UTC 2022
Thanks to both of you.
> Permissions on the file itself (and the directories in the path to it)
Does it need the execute permission? Is it sufficient for root alone?
> Existence of the script on the nodes (prologue is run on the nodes, not the head)
Yes, it's in a shared filesystem.
> Not sure your error is the prologue script itself. Does everything run fine with no prologue configured?
Yes, everything has been working fine for months and still does as
soon as I take the prolog out of slurm.conf.
> > 2. How to debug the issue?
> I'd try capturing all stdout and stderr from the script into a file on the compute
> node, for instance like this:
>
> exec &> /root/prolog_slurmd.$$
> set -x # To print out all commands
Do you mean INSIDE the prologue script itself? Yes, this is what I'd
have done if it weren't so disruptive to all my production jobs;
I had to turn the prolog off before it wreaked too much havoc.
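
One way the capture suggested above might be limited to a single test
account, so that production jobs never run the traced code, is sketched
here. It assumes the prolog is a bash script and that Slurm sets
SLURM_JOB_USER and SLURM_JOB_ID in the prolog environment. DEBUG_USER,
LOG_DIR, prolog_main and run_prolog are hypothetical names chosen for
illustration, not Slurm settings.

```shell
#!/bin/bash
# Sketch only: trace the prolog for one designated test user.
# DEBUG_USER and LOG_DIR are hypothetical knobs, not Slurm options.
DEBUG_USER="${DEBUG_USER:-testuser}"
LOG_DIR="${LOG_DIR:-/tmp}"

prolog_main() {
    # Placeholder for the prolog logic under test.
    echo "prolog running for job ${SLURM_JOB_ID:-unknown}"
}

run_prolog() {
    if [ "${SLURM_JOB_USER:-}" = "$DEBUG_USER" ]; then
        # Test user: send stdout, stderr and the command trace to a
        # per-job file on the compute node. The subshell keeps the
        # redirection and 'set -x' from leaking into the caller.
        (
            exec >"$LOG_DIR/prolog_slurmd.${SLURM_JOB_ID:-$$}" 2>&1
            set -x
            prolog_main
        )
    else
        # Everyone else: skip the code under test and report success,
        # so a bug in it cannot fail (and drain nodes for) other jobs.
        return 0
    fi
}

# The script's exit status is that of run_prolog.
run_prolog
```

With this arrangement only jobs submitted by the test account execute
prolog_main, and a failure there leaves a trace file behind rather than
just the one-line message in slurmctld.log. The node running the test job
may still be drained if prolog_main fails.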
> > Even increasing the debug level the
> > slurmctld.log contains simply a "error: validate_node_specs: Prolog or
> > job env setup failure on node xxx, draining the node" message, without
> > even a line number or anything.
>
> Slurm only executes the prolog script. It doesn't parse it or evaluate
> it itself, so it has no way of knowing what fails inside the script.
Sure, but even when "just executing" it, there is stdout and stderr,
which could be captured and logged rather than thrown away, forcing one
to do the above.
> > 3. And more generally, how to debug a prolog (and epilog) script
> > without disrupting all production jobs? Unfortunately we can't have
> > another slurm install for testing, is there a sbatch option to force
> > utilizing a prolog script which would not be executed for all the
> > other jobs? Or perhaps making a dedicated queue?
>
> I tend to reserve a node, install the updated prolog scripts there, and
> run test jobs asking for that reservation.
How do you "install the prolog scripts there"? Isn't the prolog
setting in slurm.conf global?
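
The Prolog setting in slurm.conf names a path, so one possible
arrangement (not necessarily what the previous poster does) is to point
it at a small dispatcher that prefers a node-local script when one
exists. The sketch below assumes bash. Both paths and the function names
choose_prolog and run_chosen_prolog are hypothetical.

```shell
#!/bin/bash
# Sketch only: a dispatcher that the global Prolog path could point to.
# Both paths below are illustrative assumptions.
DEFAULT_PROLOG="${DEFAULT_PROLOG:-/shared/slurm/prolog.sh}"
OVERRIDE_PROLOG="${OVERRIDE_PROLOG:-/etc/slurm/prolog.local.sh}"

choose_prolog() {
    # Prefer an executable node-local override if this node has one.
    if [ -x "$OVERRIDE_PROLOG" ]; then
        printf '%s\n' "$OVERRIDE_PROLOG"
    else
        printf '%s\n' "$DEFAULT_PROLOG"
    fi
}

run_chosen_prolog() {
    local script
    script="$(choose_prolog)"
    # Design choice for this sketch: a missing script means
    # "nothing to do" rather than a prolog failure.
    [ -x "$script" ] || return 0
    "$script"
}

# The script's exit status is that of the chosen prolog.
run_chosen_prolog
```

Under these assumptions, the updated script would be copied to the
override path on the reserved node only, and test jobs submitted with
sbatch --reservation=... would be the only ones to run it. All other
nodes keep using the default script.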
> (Otherwise one could always
> set up a small cluster of VMs and use that for simpler testing.)
Yes, but I would need to request that cluster of VMs from IT, have the
same OS installed and configured (and to be 100% identical, it needs to
be RHEL, so a paid license), and keep everything sync'ed with the actual
cluster... I know it'd be very useful, but sadly we don't have the
resources to do that, so unfortunately this is not an option for me.
Thanks again.