[slurm-users] How to debug a prolog script?
Bjørn-Helge Mevik
b.h.mevik at usit.uio.no
Fri Sep 16 07:30:19 UTC 2022
Davide DelVento <davide.quantum at gmail.com> writes:
> 2. How to debug the issue?
I'd try capturing all stdout and stderr from the script into a file on the compute
node, for instance like this:
  exec &> /root/prolog_slurmd.$$
  set -x   # to print out all commands as they run
before any other commands in the script. The file "prolog_slurmd.<pid>" will
then contain a trace of every command the script runs, along with all
of its output (stdout and stderr). If there is no "prolog_slurmd.<pid>"
file after the job has been scheduled, then, as others have pointed
out, Slurm wasn't able to execute the prolog at all.
> Even increasing the debug level the
> slurmctld.log contains simply a "error: validate_node_specs: Prolog or
> job env setup failure on node xxx, draining the node" message, without
> even a line number or anything.
Slurm only executes the prolog script; it doesn't parse or interpret the
script itself, so it has no way of knowing what failed inside it.
> 3. And more generally, how to debug a prolog (and epilog) script
> without disrupting all production jobs? Unfortunately we can't have
> another slurm install for testing, is there a sbatch option to force
> utilizing a prolog script which would not be executed for all the
> other jobs? Or perhaps making a dedicated queue?
I tend to reserve a node, install the updated prolog script there, and
run test jobs that request that reservation. (Alternatively, one could
set up a small cluster of VMs and use that for simpler testing.)
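For instance, something along these lines (the reservation, node and user
names are just placeholders):

  # Reserve one node for prolog testing
  scontrol create reservation ReservationName=prologtest \
      Nodes=c1 Users=myuser StartTime=now Duration=UNLIMITED

  # Submit a trivial test job into that reservation
  sbatch --reservation=prologtest --wrap="hostname"

Only jobs submitted with --reservation=prologtest will land on that node,
so the rest of production is unaffected while you iterate on the script.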
--
B/H