[slurm-users] Problem with job allocation

Wed Mar 30 13:29:43 UTC 2022

Hi;

The Slurm log says that your prolog did not finish within 300 seconds.

The only possible cause that I can see is the line starting with "sudo 
/usr/bin/beeond start -F -P -b /usr/bin/pdsh".

You can put a timeout command at the beginning of the sudo line to test:

timeout 150 sudo /usr/bin/beeond start -F -P -b /usr/bin/pdsh ......
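
To tell a real timeout apart from an ordinary failure, you can look at
the exit status. This is only a rough sketch, assuming GNU coreutils
timeout, which exits with status 124 when it has to kill the command.
The function name run_with_limit is my own invention for illustration,
not something from Slurm or BeeOND:

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm or BeeOND): run a command under
# a time limit and log to stderr when the limit was hit.
run_with_limit() {
    local limit="$1"
    shift
    timeout "$limit" "$@"
    local rc=$?
    # GNU timeout returns 124 when the command was killed for running too long.
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# In the prolog, the sudo line would then be wrapped like this
# (arguments left elided, as in the line above):
# run_with_limit 150 sudo /usr/bin/beeond start -F -P -b /usr/bin/pdsh ......
```

If the "timed out" message shows up in the prolog output, the beeond
line is the one that hangs.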

If the timeout command resolves the problem, you should check that the 
sudoers permissions are set correctly for password-less sudo. You can 
check this by running the same sudo line as the slurm user.
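
A check that does not actually start beeond could look like the sketch
below. It assumes you run it in a shell as the slurm user, and it relies
on two standard sudo options: -n (never prompt for a password) and -l
with a command (report whether that command is permitted). The function
name check_nopasswd is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: report whether the current user may run a command
# through sudo without being asked for a password.
check_nopasswd() {
    local cmd="$1"
    # -n: fail instead of prompting; -l CMD: succeed only if CMD is allowed.
    if sudo -n -l "$cmd" >/dev/null 2>&1; then
        echo "OK: $cmd can be run via sudo without a password"
    else
        echo "FAIL: $cmd is not allowed, or sudo wants a password" >&2
        return 1
    fi
}

# Example, run as the slurm user:
# check_nopasswd /usr/bin/beeond
```

If this reports FAIL, the sudoers entry for the slurm user is the first
thing to look at.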

If the sudoers permissions are correct but the command simply takes too 
long, you can increase the 300-second threshold.

Regards,

Ahmet M.

On 30.03.2022 15:59, Nicolas Sonoda wrote:
> Hi!
>
> I'm getting the following error with the prolog when I try to allocate 
> more than 2 nodes with sbatch:
>
> [2022-03-28T07:40:17.016] backfill: Started JobId=19825 in intel_large on n[01-05]
> [2022-03-28T07:45:17.310] _run_prolog: timeout after 300s: killing pgid 45004
> [2022-03-28T07:45:17.310] error: prolog_slurmctld JobId=19825 prolog exit status 0:9
>
> I have this configuration for my queue:
>
> PartitionName=intel_large Nodes=n[01-10] Default=NO MaxTime=72:00:00 MaxNodes=5 OverSubscribe=EXCLUSIVE State=UP
>
> And I'm attaching my slurmctld.prolog
>
> Can you help me with that?
>
> Thanks!