[slurm-users] How to debug a prolog script?

Bjørn-Helge Mevik b.h.mevik at usit.uio.no
Fri Sep 16 13:32:24 UTC 2022


Davide DelVento <davide.quantum at gmail.com> writes:

> Does it need the execution permission? Is root alone sufficient?

slurmd runs as root, so it only needs exec permissions for root.
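
For example (the path below is just illustrative; use wherever your
prolog actually lives):

    # slurmd runs the prolog as root, so root-only permissions suffice
    chmod 700 /etc/slurm/prolog.sh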

>> > 2. How to debug the issue?
>> I'd try capturing all stdout and stderr from the script into a file on the compute
>> node, for instance like this:
>>
>> exec &> /root/prolog_slurmd.$$
>> set -x # To print out all commands
>
> Do you mean INSIDE the prologue script itself?

Yes, inside the prolog script itself.

> Yes, this is what I'd have done if it weren't so disruptive to all my
> production jobs; I had to turn it off before it wreaked too much
> havoc.

I'm curious: What kind of disruption did it cause for your production
jobs?

We use this in our slurmd prologs (and similar in epilogs) on all our
production clusters, and have not seen any disruption due to it.  (We do
have things like

    ## Remove log file if we got this far:
    rm -f /root/prolog_slurmd.$$

at the bottom of the scripts, though, so as to remove the log file when
the prolog succeeded.)
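
Roughly, a prolog with this debugging wired in might look like the
following sketch (the path and placeholder contents are just an example,
not our exact script):

    #!/bin/bash
    # Capture all stdout/stderr from this prolog in a per-invocation file
    exec &> /root/prolog_slurmd.$$
    set -x   # print every command as it is executed

    # ... the actual prolog work goes here ...

    ## Remove log file if we got this far:
    rm -f /root/prolog_slurmd.$$
    exit 0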

> Sure, but even when "just executing", there are stdout and stderr which
> could be captured and logged rather than thrown away, forcing one to do
> the above.

True.  But slurmd doesn't, so...

> How do you "install the prolog scripts there"? Isn't the prolog
> setting in slurm.conf global?

I just overwrite the prolog script file itself on the node.  We
don't have them on a shared file system, though.  If you have the
prologs on a shared file system, you'd have to override the slurm config
on the compute node itself.  This can be done in several ways, for
instance by starting slurmd with the "-f <modified slurm.conf file>"
option.
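
For instance, something along these lines on the compute node (the paths
are illustrative, and the exact way you stop/start slurmd depends on
your setup):

    # Make a node-local copy of the config with the Prolog line changed
    cp /etc/slurm/slurm.conf /root/slurm.conf.debug
    sed -i 's|^Prolog=.*|Prolog=/root/prolog_debug.sh|' /root/slurm.conf.debug

    # Restart slurmd on this node only, using the modified config
    systemctl stop slurmd
    slurmd -f /root/slurm.conf.debug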

>> (Otherwise one could always
>> set up a small cluster of VMs and use that for simpler testing.)
>
> Yes, but I would need to request that cluster of VMs from IT, have the
> same OS installed and configured (to be 100% identical it would need to
> be RHEL, so licenses paid for), and keep everything synced with the
> actual cluster... I know it'd be very useful, but sadly we don't have
> the resources to do that, so unfortunately this is not an option for me.

I totally agree that VMs instead of a physical test cluster are never
going to be 100 % the same, but some things can be tested even though
the setups are not identical (for instance, in my experience, CentOS
and Rocky are close enough to RHEL for most Slurm-related things).
One takes what one has. :)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
