[slurm-users] Weirdness with partitions

David drhey at umich.edu
Thu Sep 21 14:38:22 UTC 2023


 That's not at all how I interpreted this man page description.  By "If the
job can use more than..." I thought it was completely obvious (although
perhaps wrong, if your interpretation is correct, but it never crossed my
mind) that it referred to whether the _submitting user_ is OK with it using
more than one partition. The partition where the user is forbidden (because
of the partition's allowed account) should just be _not_ the earliest
initiation (because it'll never initiate there), and therefore not run
there, but still be able to run on the other partitions listed in the batch
script.

> That's fair. I was considering this only in light of the fact that we
already know the user doesn't have access to a partition (that isn't the
surprise here) and that Slurm communicates that as the reason pretty
clearly. I can see how a user submitting against multiple partitions might
hope that, if the job couldn't run in one of them, the scheduler would
consider all of the others *before* dying outright at the first rejection.
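
For anyone who wants the "consider the rest of the list" behavior, here is a
rough sketch of the filtering idea only. It is plain Python, not Slurm code:
a real site would put equivalent logic in its job submit plugin. The function
name, the b1-b3 partition names, the account names, and the ALLOWED table are
all made up for illustration; only b4 comes from this thread.

```python
def filter_partitions(requested, account, allowed_accounts):
    """Drop partitions from a comma-separated list that `account` cannot use.

    `allowed_accounts` is a hypothetical lookup table mapping a partition
    name to a set of permitted accounts, or to None when the partition has
    no account restriction. Partitions missing from the table are passed
    through unchanged so the scheduler itself can accept or reject them.
    """
    kept = []
    for name in requested.split(","):
        name = name.strip()
        if not name:
            continue  # ignore empty entries such as a trailing comma
        allowed = allowed_accounts.get(name)
        if allowed is None or account in allowed:
            kept.append(name)
    return ",".join(kept)


# Hypothetical site layout: only b4 is restricted to one account.
ALLOWED = {"b1": None, "b2": None, "b3": None, "b4": {"special"}}

if __name__ == "__main__":
    print(filter_partitions("b1,b2,b3,b4", "general", ALLOWED))
    print(filter_partitions("b1,b2,b3,b4", "special", ALLOWED))
```

With a filter like this in front of submission, a job listing every partition
would be left with only the ones its account may use, instead of being
rejected because of the one it may not.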

On Thu, Sep 21, 2023 at 10:28 AM Bernstein, Noam CIV USN NRL (6393)
Washington DC (USA) <noam.bernstein at nrl.navy.mil> wrote:

> On Sep 21, 2023, at 9:46 AM, David <drhey at umich.edu> wrote:
>
> Slurm is working as it should. From your own examples you proved that; by
> not submitting to b4 the job works. However, looking at man sbatch:
>
>        -p, --partition=<partition_names>
>               Request a specific partition for the resource allocation.
>               If not specified, the default behavior is to allow the
>               slurm controller to select the default partition as
>               designated by the system administrator. If the job can
>               use more than one partition, specify their names in a
>               comma separate list and the one offering earliest
>               initiation will be used with no regard given to the
>               partition name ordering (although higher priority
>               partitions will be considered first).  When the job is
>               initiated, the name of the partition used will be placed
>               first in the job record partition string.
>
> In your example, the job can NOT use more than one partition (given the
> restrictions defined on the partition itself precluding certain accounts
> from using it). This, to me, seems either like a user education issue (i.e.
> don't have them submit to every partition), or you can try the job submit
> lua route - or perhaps the hidden partition route (which I've not tested).
>
>
> That's not at all how I interpreted this man page description.  By "If the
> job can use more than..." I thought it was completely obvious (although
> perhaps wrong, if your interpretation is correct, but it never crossed my
> mind) that it referred to whether the _submitting user_ is OK with it using
> more than one partition. The partition where the user is forbidden (because
> of the partition's allowed account) should just be _not_ the earliest
> initiation (because it'll never initiate there), and therefore not run
> there, but still be able to run on the other partitions listed in the batch
> script.
>
> I think it's completely counter-intuitive that submitting saying it's OK
> to run on one of a few partitions, and one partition happening to be
> forbidden to the submitting user, means that it won't run at all.  What if
> you list multiple partitions, and increase the number of nodes so that
> there aren't enough in one of the partitions, but not realize this
> problem?  Would you expect that to prevent the job from ever running on any
> partition?
>
> Noam
>


-- 
David Rhey
---------------
Advanced Research Computing
University of Michigan