[slurm-users] R jobs crashing when run in parallel

William Brown william at signalbox.org.uk
Mon Mar 29 18:12:54 UTC 2021


Maybe you have run out of file handles.
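
One rough way to check this is to print the limits from inside a job on an affected node. The sketch below is only an illustration, not a diagnosis: it assumes Linux compute nodes with /proc mounted and Python 3 available, and the function names are made up for this example.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count descriptors currently open in a process (Linux /proc only)."""
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None  # /proc not available, or the process is not readable


def system_file_nr(path="/proc/sys/fs/file-nr"):
    """Return (allocated, unused, maximum) system-wide file handles, or None."""
    try:
        with open(path) as fh:
            allocated, unused, maximum = (int(x) for x in fh.read().split())
    except (OSError, ValueError):
        return None  # file missing or not in the expected three-field format
    return allocated, unused, maximum


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"per-process open files: soft={soft} hard={hard}")
    print(f"descriptors open in this process: {open_fd_count()}")
    print(f"system-wide file-nr: {system_file_nr()}")
```

Run it from the job script itself rather than from a login shell, since the limits inside a Slurm job can differ from those in an interactive session. If the allocated count is close to the maximum while many R jobs are running, that would support the file-handle theory.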

William

On Mon, 29 Mar 2021, 17:36 Patrick Goetz, <pgoetz at math.utexas.edu> wrote:

> Could this be a function of the R script you're trying to run, or are
> you saying you get this error running the same script which works at
> other times?
>
> On 3/29/21 7:47 AM, Simon Andrews wrote:
> > I've got a weird problem on our slurm cluster.  If I submit lots of R
> > jobs to the queue then as soon as I've got more than about 7 of them
> > running at the same time I start to get failures, saying:
> >
> > /bi/apps/R/4.0.4/lib64/R/bin/exec/R: error while loading shared
> > libraries: libpcre2-8.so.0: cannot open shared object file: No such file
> > or directory
> >
> > ...which makes no sense because that library is definitely there, and
> > other jobs on the same nodes worked both before and after the failed
> > jobs.  I recently ran 500 identical jobs and 152 of them failed in this
> > way.
> >
> > There are no errors in the log files on the compute nodes where this
> > failed and it happens across multiple nodes so it's not a single one
> > being strange.  The R binary is on an isilon network share, but the
> > libpcre2 library is on the local disk for the node.
> >
> > Anyone come across anything like this before?  Any suggestions for fixes?
> >
> > Thanks
> >
> > Simon.
> >
> >
>
>