[slurm-users] good practices

Eli V eliventer at gmail.com
Tue Nov 26 13:03:07 UTC 2019


Inline below

On Tue, Nov 26, 2019 at 5:50 AM Loris Bennett
<loris.bennett at fu-berlin.de> wrote:
>
> Hi Nigella,
>
> Nigella Sanders <nigella.sanders at gmail.com> writes:
>
> > Thank you all for such interesting replies.
> >
> > The --dependency option is quite useful, but in practice it has some
> > inconveniences. Firstly, all 20 jobs are queued at once, which some
> > users may interpret as an abusive use of common resources.
>
> This doesn't seem a problem to me, since no common resources are being
> used by jobs in the queue.  It only becomes a problem if a single person
> can queue enough jobs to consume all the resources *and* you are not using
> any form of fairshare.  Otherwise a job submitted later, but with a higher
> priority, will start earlier if the resources become available.
>
> That is not to say that users won't *think* that a large number of jobs
> belonging to other users automatically means that their later jobs will
> be disadvantaged.  However, that is more an issue of educating your users.
>
> > Even worse, if a job fails, the rest will stay queued forever (?),
> > the first tagged as "DependencyNeverSatisfied", and the rest
> > just as "Dependency".
>
> This is just a consequence of your requirement that "each job ... needs
> the previous one to be completed", but it also isn't a problem, because
> pending jobs don't consume resources for which users compete.
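>
> For reference, a minimal sketch of such a chain (job.sh stands in for
> the actual job script; --parsable just makes sbatch print the job id):
>
>     # submit the first job, then chain the remaining 19 onto it
>     prev=$(sbatch --parsable --time=08:10:00 job.sh)
>     for i in {2..20}; do
>         prev=$(sbatch --parsable --time=08:10:00 \
>                --dependency=afterok:$prev job.sh)
>     done
>
> Each job starts only if the previous one completed successfully, which
> is exactly what produces the "DependencyNeverSatisfied" state on failure.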

Also, using kill_invalid_depend in your slurm.conf's
SchedulerParameters will automatically remove the jobs from the queue
once their dependency can't be satisfied.
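
For example, a minimal slurm.conf sketch (merge this with whatever
SchedulerParameters you already set):

    # slurm.conf
    SchedulerParameters=kill_invalid_depend

followed by an 'scontrol reconfigure' to pick up the change.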


>
> Regards
>
> Loris
>
> > PS: Yarom, with queue time I meant the total run time allowed. In my
> > case, after a job starts running it will be killed if it takes more
> > than 10 hours of execution time. If the partition queue time limit
> > were 10 days, for instance, I guess I could use a single sbatch to
> > launch a script containing the 20 executions as steps with srun.
> >
> > Regards,
> > Nigella
> >
> > On Mon, 25 Nov 2019 at 15:08, Yair Yarom (<irush at cs.huji.ac.il>) wrote:
> >
> >  Hi,
> >
> >  I'm not sure what a queue time limit of 10 hours means. If you can't have jobs waiting for more than 10 hours, then it seems very short for 8-hour jobs.
> >  Generally, a few options:
> >  a. The --dependency option (either afterok or singleton)
> >  b. The --array option of sbatch with a limit of 1 job at a time (instead of the for loop): sbatch --array=1-20%1 (see the sketch below)
> >  c. At the end of the script of each job, call the sbatch line of the next job (this is probably the only option, if indeed I understood the queue time limit correctly).
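> >
> >  A minimal sketch of option (b), assuming your job script is job.sh
> >  (placeholder name):
> >
> >      sbatch --time=08:10:00 --array=1-20%1 job.sh
> >
> >  The %1 throttle means at most one array task runs at a time, and
> >  inside job.sh the variable $SLURM_ARRAY_TASK_ID tells you which of
> >  the 20 steps you are.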
> >
> >  And indeed, srun should probably be reserved for strictly interactive jobs.
> >
> >  Regards,
> >      Yair.
> >
> >  On Mon, Nov 25, 2019 at 11:21 AM Nigella Sanders <nigella.sanders at gmail.com> wrote:
> >
> >  Hi all,
> >
> >  I guess this is a simple matter but I still find it confusing.
> >
> >  I have to run 20 jobs on our supercomputer.
> >  Each job takes about 8 hours, and each one needs the previous one to be completed.
> >  The queue time limit for jobs is 10 hours.
> >
> >  So my first approach is serially launching them in a loop using srun:
> >
> >  #!/bin/bash
> >  for i in {1..20};do
> >      srun  --time 08:10:00  [options]
> >  done
> >
> >  However, the Slurm literature keeps saying that 'srun' should only be used for short command-line tests, so some sysadmins would consider this bad practice (see this).
> >
> >  My second approach switched to sbatch:
> >
> >  #!/bin/bash
> >  for i in {1..20};do
> >      sbatch  --time 08:10:00 [options]
> >      [poll the queue to see if the job is done]
> >  done
> >
> >  But since sbatch returns the prompt immediately, I had to add code to check for job termination. Polling makes use of the sleep command and is prone to race conditions, so sysadmins don't like it either.
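> >
> >  Roughly, the polling looks like this (a sketch; job.sh is a
> >  placeholder, and --parsable makes sbatch print just the job id):
> >
> >      jobid=$(sbatch --parsable --time=08:10:00 job.sh)
> >      # keep sleeping while the job still shows up in the queue
> >      while squeue -h -j "$jobid" 2>/dev/null | grep -q .; do
> >          sleep 60
> >      done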
> >
> >  I guess there must be a --wait option for sbatch in some recent versions of Slurm (see this). It's not yet available on our system, though.
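> >
> >  With --wait, the loop would simplify to something like this (sketch;
> >  sbatch --wait blocks until the job terminates and returns its exit
> >  code):
> >
> >      for i in {1..20}; do
> >          sbatch --wait --time=08:10:00 job.sh || break
> >      done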
> >
> >  Is there any preferable/canonical/friendly way to do this?
> >  Any thoughts would be really appreciated,
> >
> >  Regards,
> >  Nigella.
> >
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de
>


