[slurm-users] Problem with job allocation
Nicolas Sonoda
nicolas.sonoda at versatushpc.com.br
Wed Mar 30 15:22:39 UTC 2022
Dear Ahmet M.,
I've tried your recommendation, but unfortunately it didn't work.
However, I realized that when I restart slurmctld.service the job starts, though I don't know why.
Before, the job was stuck in CF, but when I restart slurmctld it changes to R.
Do you have any ideas?
Thanks!
________________________________
From: mercan <ahmet.mercan at uhem.itu.edu.tr>
Sent: Wednesday, March 30, 2022 10:29
To: Slurm User Community List <slurm-users at lists.schedmd.com>; Nicolas Sonoda <nicolas.sonoda at versatushpc.com.br>; slurm-users at schedmd.com <slurm-users at schedmd.com>
Subject: Re: [slurm-users] Problem with job allocation
Hi;
The Slurm log says that your prolog did not finish within 300 seconds.
The only possible cause I can see is the line starting with "sudo
/usr/bin/beeond start -F -P -b /usr/bin/pdsh".
You can put a timeout command at the beginning of the sudo line to test:
timeout 150 sudo /usr/bin/beeond start -F -P -b /usr/bin/pdsh ......
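A minimal sketch of how that test could look inside the prolog script (the trailing "......" stands for the rest of your original arguments, and the log message is only an example):

timeout 150 sudo /usr/bin/beeond start -F -P -b /usr/bin/pdsh ......
if [ $? -eq 124 ]; then
    # timeout exits with status 124 when it had to kill the command
    logger "slurmctld prolog: beeond start timed out after 150s"
    exit 1
fi

That way the prolog fails quickly on its own instead of being killed by slurmctld at 300 seconds.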
If the problem is solved by the timeout command, you should check
that the sudoers permission is correctly set for a password-less sudo command.
You can check the permission by executing this sudo line as the slurm user.
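For example, something like this (assuming the Slurm daemons run as a user named "slurm"; the -n flag makes sudo fail immediately instead of hanging at a password prompt):

sudo -u slurm sudo -n /usr/bin/beeond start -F -P -b /usr/bin/pdsh ......

If that complains that a password is required, the sudoers entry for the slurm user is missing or wrong, and the prolog would hang waiting for a password.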
If the sudoers permission is correct but the command simply takes too
long, you can increase this 300-second threshold.
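If I remember correctly, this limit is controlled by the PrologEpilogTimeout parameter in slurm.conf (please check the slurm.conf man page for your version). For example, with an illustrative value of 600 seconds:

PrologEpilogTimeout=600

followed by an "scontrol reconfigure" (or a restart of slurmctld) so the change takes effect.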
Regards,
Ahmet M.
On 30.03.2022 15:59, Nicolas Sonoda wrote:
> Hi!
>
> I'm getting the following error from the prolog when I try to allocate more
> than 2 nodes with sbatch:
>
> [2022-03-28T07:40:17.016] backfill: Started JobId=19825 in intel_large
> on n[01-05]
> [2022-03-28T07:45:17.310] _run_prolog: timeout after 300s: killing
> pgid 45004
> [2022-03-28T07:45:17.310] error: prolog_slurmctld JobId=19825 prolog
> exit status 0:9
>
> I have this configuration for my queue:
>
> PartitionName=intel_large Nodes=n[01-10] Default=NO MaxTime=72:00:00
> MaxNodes=5 OverSubscribe=EXCLUSIVE State=UP
>
> And I'm attaching my slurmctld.prolog
>
> Can you help me with that?
>
> Thanks!