[slurm-users] Weirdness with partitions
David
drhey at umich.edu
Thu Sep 21 14:38:22 UTC 2023
> That's not at all how I interpreted this man page description. By "If the
> job can use more than..." I thought it was completely obvious (although
> perhaps wrong, if your interpretation is correct, but it never crossed my
> mind) that it referred to whether the _submitting user_ is OK with it using
> more than one partition. The partition where the user is forbidden (because
> of the partition's allowed account) should just be _not_ the earliest
> initiation (because it'll never initiate there), and therefore not run
> there, but still be able to run on the other partitions listed in the batch
> script.

That's fair. I was considering this only in light of the fact that we know the
user doesn't have access to a partition (that isn't the surprise here) and
that Slurm communicates that as the reason pretty clearly. I can see how a
user submitting against multiple partitions might hope that, if a job
couldn't run in a given partition, the scheduler would consider all of the
others provided *before* dying outright at the first rejection.
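
Until the scheduler behaves that way, one possible workaround is to trim the
partition list before submitting. The following is only a rough sketch, not
something tested on a real cluster: it assumes the text comes from
`scontrol show partition -o` (one partition per line, KEY=VALUE fields), that
the AllowAccounts and DenyAccounts fields are present in that output, and it
ignores other restrictions such as AllowGroups and AllowQos. The function
names and the example partition and account names are hypothetical.

```python
import re


def parse_partitions(scontrol_output):
    """Parse one-line-per-partition KEY=VALUE text into {name: {key: value}}.

    Assumes the format printed by `scontrol show partition -o`.
    """
    partitions = {}
    for line in scontrol_output.splitlines():
        fields = dict(re.findall(r"(\w+)=(\S+)", line))
        name = fields.get("PartitionName")
        if name:
            partitions[name] = fields
    return partitions


def usable_partitions(requested, account, partitions):
    """Return the requested partitions that do not exclude `account`.

    Only AllowAccounts / DenyAccounts are checked (an assumption of this
    sketch); unknown partitions are dropped.
    """
    usable = []
    for name in requested:
        fields = partitions.get(name)
        if fields is None:
            continue
        allowed = fields.get("AllowAccounts", "ALL")
        denied = fields.get("DenyAccounts", "")
        if allowed != "ALL" and account not in allowed.split(","):
            continue
        if account in denied.split(","):
            continue
        usable.append(name)
    return usable


# Hypothetical data: partition b4 is restricted to a different account.
sample = (
    "PartitionName=b1 AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL\n"
    "PartitionName=b4 AllowGroups=ALL AllowAccounts=other_acct AllowQos=ALL\n"
)
parts = parse_partitions(sample)
partition_arg = ",".join(usable_partitions(["b1", "b4"], "proj_a", parts))
```

The resulting comma-separated string could then be passed to `sbatch -p`. The
same filtering could presumably live in a job submit plugin instead, so users
would not have to do it themselves.
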
On Thu, Sep 21, 2023 at 10:28 AM Bernstein, Noam CIV USN NRL (6393)
Washington DC (USA) <noam.bernstein at nrl.navy.mil> wrote:
> On Sep 21, 2023, at 9:46 AM, David <drhey at umich.edu> wrote:
>
> Slurm is working as it should. From your own examples you proved that; by
> not submitting to b4 the job works. However, looking at man sbatch:
>
> -p, --partition=<partition_names>
>     Request a specific partition for the resource allocation. If not
>     specified, the default behavior is to allow the slurm controller to
>     select the default partition as designated by the system
>     administrator. If the job can use more than one partition, specify
>     their names in a comma separate list and the one offering earliest
>     initiation will be used with no regard given to the partition name
>     ordering (although higher priority partitions will be considered
>     first). When the job is initiated, the name of the partition used
>     will be placed first in the job record partition string.
>
> In your example, the job can NOT use more than one partition (given the
> restrictions defined on the partition itself precluding certain accounts
> from using it). This, to me, seems either like a user education issue (i.e.
> don't have them submit to every partition), or you can try the job submit
> lua route - or perhaps the hidden partition route (which I've not tested).
>
>
> That's not at all how I interpreted this man page description. By "If the
> job can use more than..." I thought it was completely obvious (although
> perhaps wrong, if your interpretation is correct, but it never crossed my
> mind) that it referred to whether the _submitting user_ is OK with it using
> more than one partition. The partition where the user is forbidden (because
> of the partition's allowed account) should just be _not_ the earliest
> initiation (because it'll never initiate there), and therefore not run
> there, but still be able to run on the other partitions listed in the batch
> script.
>
> I think it's completely counter-intuitive that submitting saying it's OK
> to run on one of a few partitions, and one partition happening to be
> forbidden to the submitting user, means that it won't run at all. What if
> you list multiple partitions, and increase the number of nodes so that
> there aren't enough in one of the partitions, but not realize this
> problem? Would you expect that to prevent the job from ever running on any
> partition?
>
> Noam
>
--
David Rhey
---------------
Advanced Research Computing
University of Michigan