[slurm-users] ulimit in sbatch script
Bill Barth
bbarth at tacc.utexas.edu
Sun Apr 15 12:31:08 MDT 2018
Are you using pam_limits.so in any of your /etc/pam.d/ configuration files? That would enforce /etc/security/limits.conf for all users, and the limits there are usually unlimited for root. Root’s almost always allowed to do things bad enough to crash the machine or run it out of resources. If the /etc/pam.d/sshd file has pam_limits.so in it, that’s probably where the unlimited setting for root is coming from.
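As a quick way to check this, here is a minimal Python sketch that lists the active (uncommented) lines mentioning pam_limits.so under /etc/pam.d. The helper name find_pam_limits is made up for illustration, and the script assumes the PAM files are readable by whoever runs it; it only reports where the module is referenced, not which limits end up applied.

```python
from pathlib import Path


def find_pam_limits(pam_dir="/etc/pam.d"):
    """Return (file name, line) pairs for active lines that mention pam_limits.so.

    Hypothetical helper for illustration; it does a plain substring match
    and ignores lines that are commented out with '#'.
    """
    hits = []
    for path in sorted(Path(pam_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="replace")
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
        for line in text.splitlines():
            stripped = line.strip()
            if stripped.startswith("#"):
                continue  # commented-out entries are not enforced
            if "pam_limits.so" in stripped:
                hits.append((path.name, stripped))
    return hits


if __name__ == "__main__":
    # Only scan when the directory exists (it may not on non-PAM systems).
    if Path("/etc/pam.d").is_dir():
        for name, line in find_pam_limits():
            print(f"{name}: {line}")
```

If sshd shows up in the output, limits from /etc/security/limits.conf are being applied to ssh logins on that node.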
Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bbarth at tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
On 4/15/18, 1:26 PM, "slurm-users on behalf of Mahmood Naderan" <slurm-users-bounces at lists.schedmd.com on behalf of mahmood.nt at gmail.com> wrote:
I have actually disabled the swap partition (!), since with swap the system
gets so bad that, in my experience, I have to enter the room and
reset the affected machine (!). Otherwise I have to wait a long
time for it to get back to normal.
When I ssh to the node as root, ulimit -a reports unlimited
virtual memory. So it seems that root has an unlimited value while
regular users have a limited one.
Regards,
Mahmood
On Sun, Apr 15, 2018 at 10:26 PM, Ole Holm Nielsen
<Ole.H.Nielsen at fysik.dtu.dk> wrote:
> Hi Mahmood,
>
> It seems your compute node is configured with this limit:
>
> virtual memory (kbytes, -v) 72089600
>
> So when the batch job tries to set a higher limit (ulimit -v 82089600) than
> permitted by the system (72089600), this must surely get rejected, as you
> have discovered!
>
> You may want to reconfigure your compute nodes' limits, for example by
> setting the virtual memory limit to "unlimited" in your configuration. If
> the nodes have very little RAM + swap space, you might encounter
> Out Of Memory errors...
>
> /Ole
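To make the rejection Ole describes concrete, here is a small Python sketch using the standard resource module. It relies on the usual rule that an unprivileged process may lower its limits, or raise the soft limit up to the hard limit, but may not go above the hard limit. The function name fits_hard_limit is invented for this example, and the two values are the ones quoted in this thread; the sketch only reads the current limit and does not change it.

```python
import resource

KIB = 1024  # ulimit -v works in kbytes, RLIMIT_AS in bytes


def fits_hard_limit(requested_kb, hard_bytes):
    """Return True if a limit of requested_kb kbytes stays within hard_bytes.

    Hypothetical helper for illustration: it mirrors the check that makes
    a non-root `ulimit -v` request above the hard limit fail.
    """
    if hard_bytes == resource.RLIM_INFINITY:
        return True  # an unlimited hard limit accepts any request
    return requested_kb * KIB <= hard_bytes


if __name__ == "__main__":
    # Read (do not modify) the address-space limit of this process.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    for requested_kb in (72089600, 82089600):  # values quoted in this thread
        verdict = "allowed" if fits_hard_limit(requested_kb, hard) else "rejected"
        print(f"ulimit -v {requested_kb}: {verdict}")
```

Run as a regular user inside a batch job on the node in question, the second request would presumably be reported as rejected, matching the error from the sbatch script; run as root over ssh, with an unlimited hard limit, both would be allowed.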