[slurm-users] swap size

John Hearns hearnsj at googlemail.com
Fri Sep 21 23:04:41 MDT 2018


Ashton, on a compute node with 256 GB of RAM I would not
configure any swap at all. None.
I managed an SGI UV1 machine at an F1 team which had 1 TB of RAM -
and no swap.
Also our ICE clusters were diskless - SGI very smartly configured swap
over iSCSI - but we disabled it, the reasoning being that if one node
in a job starts swapping, the likelihood is that all the nodes are
swapping, and things turn to treacle from there.
Also, as a separate issue, if you have lots of RAM you need to look at
the vm tunings for dirty_ratio, dirty_background_ratio and the writeback
centisecs. Linux will aggressively cache data which is being written to
disk - you can get a situation where your processes THINK the data is on
disk but it is still sitting in the page cache, and then what happens if
there is a power loss? So get those caches flushed often.
https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
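
For what it's worth, here is a sketch of the sort of sysctl settings I
mean - the values are purely illustrative, so tune them for your own
workload and RAM size:

    # /etc/sysctl.d/90-writeback.conf - illustrative values only
    # start background writeback once dirty pages reach ~1% of RAM
    vm.dirty_background_ratio = 1
    # block writers once dirty pages reach ~5% of RAM
    vm.dirty_ratio = 5
    # wake the flusher threads every 5 seconds (value is in centisecs)
    vm.dirty_writeback_centisecs = 500

Apply it with 'sysctl -p /etc/sysctl.d/90-writeback.conf' (or a reboot)
and check the result with 'sysctl -a | grep dirty'.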

Oh, and my other tip. In the past vm.min_free_kbytes was ridiculously
small on default Linux systems. I call this the 'wriggle room' when a
system is short on RAM. Think of it like those square sliding-letter
puzzles - min_free_kbytes is the empty square which permits the letter
tiles to move.
So look at your min_free_kbytes and increase it (if I'm not wrong, on
RHEL 7 and CentOS 7 systems it is a reasonable value already):
https://bbs.archlinux.org/viewtopic.php?id=184655
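
To check and raise it, something along these lines (the 1 GB figure is
just an example - on a 256 GB box you may want more or less):

    # see what you have now
    cat /proc/sys/vm/min_free_kbytes
    # give yourself e.g. 1 GB of wriggle room (value is in kB)
    echo 'vm.min_free_kbytes = 1048576' > /etc/sysctl.d/90-minfree.conf
    sysctl -p /etc/sysctl.d/90-minfree.conf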

Oh, and it is good to keep a terminal open with 'watch cat
/proc/meminfo'. I have spent many a happy hour staring at that when
looking at NFS performance and the like.
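
If the full meminfo output is too noisy, a filtered watch does nicely
(the field names are the stock kernel ones):

    watch -n 2 'grep -E "MemFree|Cached|Dirty|Writeback|Swap" /proc/meminfo'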

Back to your specific case. My point is that for HPC work you should
never go into swap (with a normally running process, i.e. no job
pre-emption). I find that the 20 percent rule is out of date. Yes, you
should probably have some swap on a workstation. And yes, disk space is
cheap these days.


However, you do talk about job pre-emption and suspending/resuming
jobs. I have never actually seen that used in production.
At this point I would be grateful for some education from the choir -
is this commonly used, and am I just hopelessly out of date?
Honestly, anywhere I have managed systems, lower-priority jobs are
either allowed to finish, or, in the case of F1, we checkpointed and
killed low-priority jobs manually if there was a super-high-priority
job to run.
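
For reference, and purely as a sketch I have not run in anger myself,
suspend-based pre-emption in slurm.conf looks roughly like the lines
below. It is this SUSPEND mode that drives the FAQ advice that swap
must be able to hold all suspended jobs; REQUEUE or CANCEL modes avoid
the swap question entirely.

    # slurm.conf fragment - illustrative partition names and priorities
    PreemptType=preempt/partition_prio
    PreemptMode=SUSPEND,GANG
    PartitionName=low   Nodes=node01 Priority=1  Default=YES
    PartitionName=high  Nodes=node01 Priority=10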


On Fri, 21 Sep 2018 at 22:34, A <andrealphus at gmail.com> wrote:
>
> I have a single node slurm config on my workstation (18 cores, 256 gb ram, 40 Tb disk space). I recently just extended the array size to its current config and am reconfiguring my LVM logical volumes.
>
> I'm curious on people's thoughts on swap sizes for a node. Redhat these days recommends up to 20% of ram size for swap size, but no less than 4 gb.
>
> But......according to slurm faq;
> "Suspending and resuming a job makes use of the SIGSTOP and SIGCONT signals respectively, so swap and disk space should be sufficient to accommodate all jobs allocated to a node, either running or suspended."
>
> So I'm wondering if 20% is enough, or whether it should scale by the number of single jobs I might be running at any one time. E.g. if I'm running 10 jobs that all use 20 gb of ram, and I suspend, should I need 200 gb of swap?
>
> any thoughts?
>
> -ashton
