<div dir="auto">Hi John! Thanks for the reply, lots to think about.<div dir="auto"><br></div><div dir="auto">In terms of suspending/resuming, my situation might be a bit different from other people's. As I mentioned, this is a single-node workstation install. This is my daily office machine. I run a lot of python processing scripts that have low CPU need but lots of iterations. I found it easier to manage these in slurm, as opposed to writing mpi/parallel processing routines in python directly.</div><div dir="auto"><br></div><div dir="auto">Given this, sometimes I might submit a slurm array with 10K jobs that might take a week to run, but I sometimes still need to do work during the day that requires more CPU power. In those cases I suspend the background array, crank through whatever I need to do, and then resume in the evening when I go home. Sometimes I can wait for running jobs to finish; sometimes I have to break in the middle of running jobs.</div></div><br><div class="gmail_quote"><div dir="ltr">On Fri, Sep 21, 2018, 10:07 PM John Hearns <<a href="mailto:hearnsj@googlemail.com">hearnsj@googlemail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Ashton, on a compute node with 256Gbytes of RAM I would not<br>
configure any swap at all. None.<br>
I managed an SGI UV1 machine at an F1 team which had 1Tbyte of RAM -<br>
and no swap.<br>
Also our ICE clusters were diskless - SGI very smartly configured swap<br>
over iSCSI - but we disabled this, the reason being that if one node<br>
in a job starts swapping the likelihood is that all the nodes are<br>
swapping, and things turn to treacle from there.<br>
Also, as another issue, if you have lots of RAM you need to look at<br>
the vm tunings for dirty ratio, background ratio and centisecs. Linux<br>
will aggressively cache data which is written to disk - you can get a<br>
situation where your processes THINK data is written to disk but it is<br>
cached; then what happens if there is a power loss? So get those<br>
caches flushed often.<br>
<a href="https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/" rel="noreferrer noreferrer" target="_blank">https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/</a><br>
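For example, those three knobs can be inspected and tightened like this (the specific values below are illustrative assumptions, not recommendations; tune them for your own workload, and the writes need root):

```shell
# Inspect the current writeback tunables (read-only, works as any user):
cat /proc/sys/vm/dirty_ratio \
    /proc/sys/vm/dirty_background_ratio \
    /proc/sys/vm/dirty_writeback_centisecs

# Tighten them so dirty pages are flushed sooner (example values only,
# needs root; put them in /etc/sysctl.d/ to persist across reboots):
# sysctl -w vm.dirty_background_ratio=5      # start background writeback at 5% of RAM dirty
# sysctl -w vm.dirty_ratio=10                # block writers at 10% of RAM dirty
# sysctl -w vm.dirty_writeback_centisecs=100 # flusher wakes every 1 s (unit is centiseconds)
```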
<br>
Oh, and my other tip. In the past vm.min_free_kbytes was ridiculously<br>
small on default Linux systems. I call this the 'wriggle room' when a<br>
system is short on RAM. Think of it like those square sliding letters<br>
puzzles - min_free_kbytes is the empty square which permits the letter<br>
tiles to move.<br>
So look at your min_free_kbytes and increase it (if I'm not wrong, on<br>
RHEL 7 and CentOS 7 systems it is already a reasonable value).<br>
<a href="https://bbs.archlinux.org/viewtopic.php?id=184655" rel="noreferrer noreferrer" target="_blank">https://bbs.archlinux.org/viewtopic.php?id=184655</a><br>
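To see where you stand (the 512 MB figure below is just an illustrative assumption, not a recommendation):

```shell
# Current wriggle room, in kB (read-only):
cat /proc/sys/vm/min_free_kbytes

# Raise it, e.g. to 512 MB (illustrative value, needs root;
# persist it in /etc/sysctl.d/ to survive reboots):
# sysctl -w vm.min_free_kbytes=524288
```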
<br>
Oh, and it is good to keep a terminal open with 'watch cat<br>
/proc/meminfo'. I have spent many a happy hour staring at that when<br>
looking at NFS performance etc. etc.<br>
<br>
Back to your specific case. My point is that for HPC work you should<br>
never go into swap (with a normally running process, i.e. no job<br>
pre-emption). I find that the 20 percent rule is out of date. Yes,<br>
probably you should have some swap on a workstation. And yes disk<br>
space is cheap these days.<br>
<br>
<br>
However, you do talk about job pre-emption and suspending/resuming<br>
jobs. I have never actually seen that being used in production.<br>
At this point I would be grateful for some education from the choir -<br>
is this commonly used and am I just hopelessly out of date?<br>
Honestly, anywhere I have managed systems, lower priority jobs are<br>
either allowed to finish, or in the case of F1 we checkpointed and<br>
killed low priority jobs manually if there was a super high priority<br>
job to run.<br>
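For the record, Slurm's suspend/resume is just SIGSTOP/SIGCONT under the hood, per the FAQ quoted below. A quick sketch of what that does to a process (using `sleep` as a stand-in for a compute job; `scontrol suspend <jobid>` / `scontrol resume <jobid>` deliver the same signals to every process in the job):

```shell
# Stand-in for a compute job:
sleep 60 &
pid=$!

# Suspend boils down to SIGSTOP: the process stops running but keeps
# its memory allocated (resident or in swap), which is exactly why
# swap sizing matters for this workflow.
kill -STOP "$pid"
ps -o stat= -p "$pid"    # state starts with 'T' (stopped)

# Resume boils down to SIGCONT:
kill -CONT "$pid"
ps -o stat= -p "$pid"    # back to 'S' (sleeping)

kill "$pid" 2>/dev/null  # clean up
```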
<br>
On Fri, 21 Sep 2018 at 22:34, A <<a href="mailto:andrealphus@gmail.com" target="_blank" rel="noreferrer">andrealphus@gmail.com</a>> wrote:<br>
><br>
> I have a single node slurm config on my workstation (18 cores, 256 GB RAM, 40 TB disk space). I recently extended the array size to its current config and am reconfiguring my LVM logical volumes.<br>
><br>
> I'm curious on people's thoughts on swap sizes for a node. Red Hat these days recommends up to 20% of RAM size for swap size, but no less than 4 GB.<br>
><br>
> But......according to slurm faq;<br>
> "Suspending and resuming a job makes use of the SIGSTOP and SIGCONT signals respectively, so swap and disk space should be sufficient to accommodate all jobs allocated to a node, either running or suspended."<br>
><br>
> So I'm wondering if 20% is enough, or whether it should scale by the number of single jobs I might be running at any one time. E.g. if I'm running 10 jobs that all use 20 GB of RAM, and I suspend, should I need 200 GB of swap?<br>
><br>
> any thoughts?<br>
><br>
> -ashton<br>
<br>
</blockquote></div>