[slurm-users] Running job using our serial queue

Marcus Wagner wagner at itc.rwth-aachen.de
Wed Nov 6 09:53:33 UTC 2019


Hi David,

if I remember right (we have disabled swap for years now), swapping out 
processes seems to slow down the whole system.
But I do know that when the oom_killer does its job (killing over-memory 
processes), the whole system is stalled until it has finished its work. 
This might be the issue your users are seeing.
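
If you want to check whether that is what happens on your nodes, the 
kernel log is the first place to look (just a sketch, the exact messages 
differ between kernel versions):

   # recent OOM-killer activity in the kernel ring buffer
   dmesg -T | grep -i -e 'out of memory' -e 'oom-killer'
   # or via the journal on systemd-based nodes
   journalctl -k | grep -i oom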

hwloc should at least help the scheduler decide where to place 
processes, but if I remember right, Slurm has to be built with hwloc 
support (meaning at least hwloc-devel has to be installed on the build 
host).
But this part is more guessing than knowing.
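
To see whether an existing slurmd was built against hwloc, something 
like this should do (again only a sketch, paths depend on your 
installation):

   # is slurmd linked against the hwloc library?
   ldd $(which slurmd) | grep -i hwloc
   # is the devel package there for the next build? (RHEL-family example)
   rpm -q hwloc hwloc-devel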

Best
Marcus

On 11/5/19 11:58 AM, David Baker wrote:
> Hello,
>
> Thank you for your replies. I double-checked that the "task" in, for 
> example, TaskPlugin=task/affinity is optional. In this respect it is 
> good to know that we have the correct cgroups setup. So in theory 
> users should only disturb themselves; however, in reality we find that 
> there is often a knock-on effect on other users' jobs. For example, 
> users have complained that their jobs sometimes stall. I can only 
> vaguely think that something odd is going on at the kernel level.
>
> One additional thing that I need to ask is: should we have hwloc 
> installed on our compute nodes? Does that help? Whenever I check which 
> processes are not being constrained by cgroups, I only ever find a 
> small group of system processes.
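>
> By "check" I mean essentially a loop of this kind (a rough sketch, 
> assuming a cgroup v1 hierarchy):
>
> # list processes that are not in any slurm cgroup
> for d in /proc/[0-9]*; do
>     pid=${d#/proc/}
>     if ! grep -q slurm "$d/cgroup" 2>/dev/null; then
>         ps -p "$pid" -o pid=,user=,comm= 2>/dev/null
>     fi
> done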
>
> Best regards,
> David
>
>
>
>
> ------------------------------------------------------------------------
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf 
> of Marcus Wagner <wagner at itc.rwth-aachen.de>
> Sent: 05 November 2019 07:47
> To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] Running job using our serial queue
> Hi David,
>
> The way you are doing it is the same way we do it.
>
> When the Matlab job asks for one CPU, it only gets one CPU this way. 
> That means that all of its processes are bound to this one CPU. So 
> (theoretically) the user is just disturbing himself if he uses more.
>
> But with Matlab especially, there is more to do. It does not 
> suffice to add '-singleCompThread' to the command line. Matlab is not 
> the only tool that tries to use all the cores it finds on the node; 
> the same is true for CPLEX and Gurobi, both often used from Matlab. 
> So even if the user sets '-singleCompThread' for Matlab, that does 
> not mean at all that the job is only using one CPU.
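>
> As an illustration, a batch script for such a job needs something 
> like the following (only a sketch -- 'myscript.m' is a placeholder, 
> and the solver-specific limits, e.g. Gurobi's 'Threads' parameter, 
> still have to be set inside the Matlab code itself):
>
> #!/bin/bash
> #SBATCH --ntasks=1 --cpus-per-task=1 --mem=4G
> # keep OpenMP/MKL-based libraries to a single thread as well
> export OMP_NUM_THREADS=1
> export MKL_NUM_THREADS=1
> matlab -nodisplay -singleCompThread -r "run('myscript.m'); exit"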
>
>
> Best
> Marcus
>
> On 11/4/19 4:14 PM, David Baker wrote:
>> Hello,
>>
>> We decided to route all jobs requesting from 1 to 20 cores to our 
>> serial queue. Furthermore, the nodes controlled by the serial queue 
>> are shared by multiple users. We did this to try to reduce the level 
>> of fragmentation across the cluster -- our default "batch" queue 
>> provides exclusive access to compute nodes.
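>>
>> For context, the partition layout is roughly of this shape (purely 
>> illustrative -- the names and node lists below are placeholders, not 
>> our actual slurm.conf):
>>
>> # shared nodes for small (1-20 core) jobs
>> PartitionName=serial Nodes=node[001-020] OverSubscribe=YES
>> # default queue, whole nodes per job
>> PartitionName=batch Nodes=node[021-400] Default=YES OverSubscribe=EXCLUSIVE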
>>
>> It looks like the downside of the serial queue is that jobs from 
>> different users can interact quite badly. To some extent this is an 
>> education issue -- for example, Matlab users need to be told to add 
>> the "-singleCompThread" option to their command line. On the other 
>> hand, I wonder if our cgroups setup is optimal for the serial queue. 
>> Our cgroup.conf contains...
>>
>> CgroupAutomount=yes
>> CgroupReleaseAgentDir="/etc/slurm/cgroup"
>>
>> ConstrainCores=yes
>> ConstrainRAMSpace=yes
>> ConstrainDevices=yes
>> TaskAffinity=no
>>
>> CgroupMountpoint=/sys/fs/cgroup
>>
>> The relevant cgroup configuration in the slurm.conf is...
>> ProctrackType=proctrack/cgroup
>> TaskPlugin=affinity,cgroup
>>
>> Could someone please advise us on the required/recommended cgroup 
>> setup for the above scenario? For example, should we really set 
>> "TaskAffinity=yes"? I assume the interaction between jobs (jobs can 
>> sometimes get stalled) is due to context switching at the kernel 
>> level, but, apart from educating users, how can we minimise that 
>> switching on the serial nodes?
>>
>> Best regards,
>> David
>>
>
> -- 
> Marcus Wagner, Dipl.-Inf.
>
> IT Center
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wagner at itc.rwth-aachen.de
> www.itc.rwth-aachen.de

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de
