[slurm-users] good practices

Nigella Sanders nigella.sanders at gmail.com
Tue Nov 26 10:17:27 UTC 2019


Thank you all for such interesting replies.

The --dependency option is quite useful, but in practice it has some
drawbacks. Firstly, all 20 jobs are *instantly queued*, which some users
may interpret as an abusive use of shared resources. Even worse, if a job
fails, the remaining ones stay queued forever (?), the first tagged as
"DependencyNeverSatisfied" and the rest just as "Dependency".

PS: Yarom, by queue time I meant the total run time allowed. In my case,
after a job starts running it will be killed if it takes more than 10 hours
of execution time. If the partition time limit were, say, 10 days, I guess
I could use a single sbatch to launch a script containing the 20 executions
as steps with srun.
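
Something like this (a rough sketch; run_step.sh is a made-up name for
whatever does the actual work, and the wall time assumes such a generous
partition limit existed):

#!/bin/bash
#SBATCH --time=7-00:00:00        # 20 x 8 h is just under 7 days
# (plus the usual #SBATCH [options])

# Run the 20 executions one after another as job steps inside a single
# allocation; stop as soon as a step fails.
for i in {1..20}; do
    srun ./run_step.sh "$i" || exit 1
done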

Regards,
Nigella







On Mon, Nov 25, 2019 at 15:08, Yair Yarom (<irush at cs.huji.ac.il>)
wrote:

> Hi,
>
> I'm not sure what a queue time limit of 10 hours means. If you can't have
> jobs waiting for more than 10 hours, then that seems very tight for 8-hour
> jobs.
> Generally, a few options:
> a. The --dependency option (either afterok or singleton)
> b. The --array option of sbatch with a limit of 1 job running at a time
> (instead of the for loop): sbatch --array=1-20%1
> c. At the end of each job's script, call the sbatch line for the next job
> (this is probably the only option if I indeed understood the queue time
> limit correctly); see the sketch just below.
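>
> A rough, untested sketch of (c), with placeholder names (job_step.sh,
> do_work.sh) just to illustrate the idea:
>
> #!/bin/bash
> #SBATCH --time=08:10:00
> # (other #SBATCH [options] as in the original jobs)
> # job_step.sh -- STEP is passed via --export and defaults to 1 on the
> # very first submission.
>
> step=${STEP:-1}
>
> # Do the 8-hour workload for this step, then submit the next step from
> # inside the job, but only on success and only if this was not the last.
> if ./do_work.sh "$step" && [ "$step" -lt 20 ]; then
>     sbatch --export=ALL,STEP=$(( step + 1 )) job_step.sh
> fi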
>
> And indeed, srun should probably be reserved for strictly interactive jobs.
>
> Regards,
>     Yair.
>
> On Mon, Nov 25, 2019 at 11:21 AM Nigella Sanders <
> nigella.sanders at gmail.com> wrote:
>
>>
>> Hi all,
>>
>> I guess this is a simple matter but I still find it confusing.
>>
>> I have to run 20 jobs on our supercomputer.
>> Each job takes about 8 hours, and each one needs the previous one to be
>> completed.
>> The queue time limit for jobs is 10 hours.
>>
>> So my first approach was to launch them serially in a loop using srun:
>>
>>
>> #!/bin/bash
>> for i in {1..20}; do
>>     srun --time 08:10:00 [options]
>> done
>>
>> However, SLURM literature keeps saying that 'srun' should only be used
>> for short command-line tests, so some sysadmins would consider this a bad
>> practice (see this
>> <https://stackoverflow.com/questions/43767866/slurm-srun-vs-sbatch-and-their-parameters>
>> ).
>>
>> My second approach switched to sbatch:
>>
>> #!/bin/bash
>> for i in {1..20}; do
>>     sbatch --time 08:10:00 [options]
>>     [poll the queue until the job is done]
>> done
>>
>> But since sbatch returns the prompt immediately, I had to add code to
>> check for job termination. The polling uses the sleep command and is prone
>> to race conditions, so sysadmins don't like it either.
>>
>> I guess there must be a --wait option in some recent versions of SLURM
>> (see this <https://bugs.schedmd.com/show_bug.cgi?id=1685>), but it is not
>> yet available on our system.
>>
>> Is there a preferable/canonical/friendly way to do this?
>> Any thoughts would be really appreciated,
>>
>> Regards,
>> Nigella.
>>
>>
>>
>