[slurm-users] Advice on setting a partition QOS

David Baker D.J.Baker at soton.ac.uk
Wed Sep 25 15:58:38 UTC 2019


Dear Jürgen,

Thank you for your reply. In response to your suggestion I submitted a batch of jobs, each asking for 2 cpus. Again I was able to get 32 jobs running at once. I presume this is a weird interaction with the normal QOS. In that respect, would it be best to redefine the normal QOS simply in terms of cpus/user, rather than in terms of both cpus/user and nodes/user?
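
If I understand the sacctmgr syntax correctly, that change would amount to something along these lines (untested on my side, and assuming that node=-1 clears the existing per-user node limit as described in the sacctmgr documentation):

    sacctmgr modify qos normal set maxtresperuser=cpu=1280,node=-1

i.e. keep the per-user cpu cap at 1280 and drop the per-user node cap, so that only cpu counts are enforced.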

Best regards,
David

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Juergen Salk <juergen.salk at uni-ulm.de>
Sent: 25 September 2019 14:52
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Advice on setting a partition QOS

Dear David,

It seems that Slurm counts allocated nodes on a per-job basis,
i.e. every individual one-core job counts as an additional node
even if they all run on one and the same node.

Can you allocate 64 CPUs at the same time when requesting 2 CPUs
per job?

We've also seen this (somewhat strange) behaviour with Moab and
therefore implemented per-user limits based on processor counts
rather than node counts. This is obviously not an issue for
exclusive node scheduling, but for non-exclusive nodes it is
(or at least may be).
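
(As a quick sanity check, the per-user TRES limits actually attached
to a QOS can be listed with something like

    sacctmgr show qos normal format=name,maxtresperuser

assuming the QOS in question is called "normal"; a node= entry there
is what triggers the QOSMaxNodePerUserLimit pending reason.)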

Best regards
Jürgen

--
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471




* David Baker <D.J.Baker at soton.ac.uk> [190925 12:12]:
> Hello,
>
> I have defined a partition and corresponding QOS in Slurm. This is
> the serial queue to which we route jobs that require up to (and
> including) 20 cpus. The nodes controlled by serial are shared. I've
> set the QOS like so:
>
> [djb1 at cyan53 slurm]$ sacctmgr show qos serial format=name,maxtresperuser
>       Name     MaxTRESPU
> ---------- -------------
>     serial       cpu=120
>
> The max cpus/user is set high to try to ensure (as often as
> possible) that the nodes are all busy and not in mixed states.
> Obviously this cannot be the case all the time -- depending upon
> memory requirements, etc.
>
> I noticed that a number of jobs were pending with the reason
> QOSMaxNodePerUserLimit. I've tried firing test jobs to the queue
> myself and noticed that I can never have more than 32 jobs running
> (each requesting 1 cpu) and the rest are pending as per the reason
> above. Since the QOS cpu/user limit is set to 120 I would expect to
> be able to run more jobs -- given that some serial nodes are still
> not fully occupied. Furthermore, I note that other users appear not
> to be able to use more than 32 cpus in the queue.
>
> The 32 limit does make a degree of sense. The "normal" QOS is set to
> cpus/user=1280, nodes/user=32. It's almost as if the 32 single-cpu jobs
> in the serial queue are each being counted as a separate node -- as per
> the pending reason.
>
> Could someone please help me understand this issue and how to avoid it?
>
> Best regards,
> David

