[slurm-users] R jobs crashing when run in parallel

Simon Andrews simon.andrews at babraham.ac.uk
Tue Mar 30 10:39:43 UTC 2021


Interesting idea, thanks.  I don't think that's the likely cause, though:

# lsof | wc -l
20675

# cat /proc/sys/fs/file-max
52325451

This is on one of the nodes which had failures.  The number of open files is tiny compared to the system-wide limit.  I know there's also a per-process limit, but since the jobs are all identical, I'd expect them to fail consistently if that were the cause.
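
For anyone who wants to rule out the per-process limit as well, here is a minimal sketch in Python (Linux only, relying on /proc).  The helper names nofile_limits, open_fd_count and system_file_usage are hypothetical ones made up for illustration, not part of Slurm or R.  It reports the limit the kernel actually enforces per process (RLIMIT_NOFILE) alongside the system-wide counters from /proc/sys/fs/file-nr, which is what the kernel compares against file-max.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE values for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count open file descriptors for a process via /proc (Linux only).

    Returns None if /proc is unavailable or the process cannot be inspected.
    """
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


def system_file_usage():
    """Return (allocated, maximum) file handles from /proc/sys/fs/file-nr.

    Returns None when the file cannot be read (e.g. on non-Linux systems).
    """
    try:
        with open("/proc/sys/fs/file-nr") as fh:
            # The file holds three fields: allocated, unused, maximum.
            allocated, _unused, maximum = fh.read().split()
    except (OSError, ValueError):
        return None
    return int(allocated), int(maximum)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"per-process limit: soft={soft} hard={hard}")
    print(f"open fds in this process: {open_fd_count()}")
    print(f"system-wide (allocated, max): {system_file_usage()}")
```

Run from inside a job step, this shows the limits the job really inherited, which may differ from those of an interactive root shell on the same node.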

Simon.

From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of William Brown
Sent: 29 March 2021 19:13
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] R jobs crashing when run in parallel

Maybe you have run out of file handles.

William

On Mon, 29 Mar 2021, 17:36 Patrick Goetz <pgoetz at math.utexas.edu> wrote:
Could this be a function of the R script you're trying to run, or are
you saying you get this error running the same script which works at
other times?

On 3/29/21 7:47 AM, Simon Andrews wrote:
> I've got a weird problem on our slurm cluster.  If I submit lots of R
> jobs to the queue then as soon as I've got more than about 7 of them
> running at the same time I start to get failures, saying:
>
> /bi/apps/R/4.0.4/lib64/R/bin/exec/R: error while loading shared
> libraries: libpcre2-8.so.0: cannot open shared object file: No such file
> or directory
>
> ...which makes no sense because that library is definitely there, and
> other jobs on the same nodes worked both before and after the failed
> jobs.  I recently ran 500 identical jobs and 152 of them failed in this way.
>
> There are no errors in the log files on the compute nodes where this
> failed and it happens across multiple nodes so it's not a single one
> being strange.  The R binary is on an isilon network share, but the
> libpcre2 library is on the local disk for the node.
>
> Anyone come across anything like this before?  Any suggestions for fixes?
>
> Thanks
>
> Simon.