[slurm-users] swap size

A andrealphus at gmail.com
Sat Sep 22 00:01:46 MDT 2018


Hi John! Thanks for the reply, lots to think about.

In terms of suspending/resuming, my situation might be a bit different from
most people's. As I mentioned, this is an install on a single-node
workstation, which is also my daily office machine. I run a lot of Python
processing scripts that have low CPU needs but many iterations. I found it
easier to manage these in Slurm than to write MPI/parallel processing
routines in Python directly.

Given this, I might sometimes submit a Slurm array with 10K jobs, which
might take a week to run, but I still sometimes need to do work during the
day that requires more CPU power. In those cases I suspend the background
array, crank through whatever I need to do, and then resume it in the
evening when I go home. Sometimes I can wait for jobs to finish; sometimes
I have to break in the middle of running jobs.
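
For anyone curious, the mechanics are roughly the following (the script
name and job ID are placeholders, and the %4 throttle is just an example):

    # submit the background array, at most 4 tasks running at once
    sbatch --array=1-10000%4 process_chunk.sh

    # during the day: pause it (running tasks receive SIGSTOP)
    scontrol suspend <jobid>

    # in the evening: let it carry on again (SIGCONT)
    scontrol resume <jobid>

(Individual array tasks can usually be addressed as <jobid>_<index>.)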

On Fri, Sep 21, 2018, 10:07 PM John Hearns <hearnsj at googlemail.com> wrote:

> Ashton, on a compute node with 256 GB of RAM I would not configure any
> swap at all. None.
> I managed an SGI UV1 machine at an F1 team which had 1 TB of RAM - and
> no swap.
> Also, our ICE clusters were diskless - SGI very smartly configured swap
> over iSCSI - but we disabled this, the reason being that if one node in
> a job starts swapping, the likelihood is that all the nodes are
> swapping, and things turn to treacle from there.
> Also, as another issue, if you have lots of RAM you need to look at
> the vm tunings for the dirty ratio, background ratio and writeback
> centisecs. Linux will aggressively cache data which is written to disk
> - you can get a situation where your processes THINK data has been
> written to disk but it is still only cached, and then what happens if
> there is a power loss? So get those caches flushed often.
>
> https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
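>
> For illustration, the knobs in question are the vm.dirty_* sysctls;
> something like the following (the numbers are only examples, not a
> recommendation - tune them for your workload) makes the kernel start
> background writeback earlier and flush more often:
>
>     # start background writeback at 5% dirty pages, block writers at 10%
>     sysctl -w vm.dirty_background_ratio=5
>     sysctl -w vm.dirty_ratio=10
>     # wake the flusher threads every 5 seconds (the value is in centisecs)
>     sysctl -w vm.dirty_writeback_centisecs=500
>
> Persist them via a file in /etc/sysctl.d/ once you are happy with them.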
>
> Oh, and my other tip. In the past vm.min_free_kbytes was ridiculously
> small on default Linux systems. I call this the 'wriggle room' for when
> a system is short on RAM. Think of it like those square sliding-letter
> puzzles - min_free_kbytes is the empty square which permits the letter
> tiles to move.
> So look at your min_free_kbytes and increase it (if I'm not wrong, on
> RHEL 7 and CentOS 7 systems it is a reasonable value already):
> https://bbs.archlinux.org/viewtopic.php?id=184655
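>
> To check it and bump it up, something along these lines works (the
> value here, 256 MB, is only an example):
>
>     cat /proc/sys/vm/min_free_kbytes
>     sysctl -w vm.min_free_kbytes=262144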
>
> Oh, and it is good to keep a terminal open with 'watch cat
> /proc/meminfo'. I have spent many a happy hour staring at that when
> looking at NFS performance, etc.
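>
> If you only care about the writeback side of things, a narrower view is
> something like:
>
>     watch -n 2 'grep -E "^(MemFree|Dirty|Writeback):" /proc/meminfo'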
>
> Back to your specific case. My point is that for HPC work you should
> never go into swap (with a normally running process, i.e. no job
> pre-emption). I find that the 20 percent rule is out of date. Yes, you
> should probably have some swap on a workstation. And yes, disk space
> is cheap these days.
>
>
> However, you do talk about job pre-emption and suspending/resuming
> jobs. I have never actually seen that being used in production.
> At this point I would be grateful for some education from the choir -
> is this commonly used and am I just hopelessly out of date?
> Honestly, anywhere I have managed systems, lower priority jobs are
> either allowed to finish, or in the case of F1 we checkpointed and
> killed low priority jobs manually if there was a super high priority
> job to run.
>
> On Fri, 21 Sep 2018 at 22:34, A <andrealphus at gmail.com> wrote:
> >
> > I have a single-node Slurm config on my workstation (18 cores, 256 GB
> > RAM, 40 TB disk space). I recently extended the array size to its
> > current config and am reconfiguring my LVM logical volumes.
> >
> > I'm curious about people's thoughts on swap sizes for a node. Red Hat
> > these days recommends up to 20% of RAM size for swap, but no less
> > than 4 GB.
> >
> > But... according to the Slurm FAQ:
> > "Suspending and resuming a job makes use of the SIGSTOP and SIGCONT
> > signals respectively, so swap and disk space should be sufficient to
> > accommodate all jobs allocated to a node, either running or
> > suspended."
> >
> > So I'm wondering if 20% is enough, or whether it should scale with
> > the number of jobs I might be running at any one time. E.g. if I'm
> > running 10 jobs that each use 20 GB of RAM, and I suspend them, do I
> > need 200 GB of swap?
> >
> > Any thoughts?
> >
> > -ashton
>
>