[slurm-users] R jobs crashing when run in parallel
Patrick Goetz
pgoetz at math.utexas.edu
Mon Mar 29 16:34:14 UTC 2021
Could this be a function of the R script you're trying to run, or are
you saying you get this error running the same script that works at
other times?
On 3/29/21 7:47 AM, Simon Andrews wrote:
> I've got a weird problem on our slurm cluster. If I submit lots of R
> jobs to the queue then as soon as I've got more than about 7 of them
> running at the same time I start to get failures, saying:
>
> /bi/apps/R/4.0.4/lib64/R/bin/exec/R: error while loading shared
> libraries: libpcre2-8.so.0: cannot open shared object file: No such file
> or directory
>
> ...which makes no sense because that library is definitely there, and
> other jobs on the same nodes worked both before and after the failed
> jobs. I recently ran 500 identical jobs and 152 of them failed in this way.
>
> There are no errors in the log files on the compute nodes where this
> failed and it happens across multiple nodes so it's not a single one
> being strange. The R binary is on an isilon network share, but the
> libpcre2 library is on the local disk for the node.
>
> Anyone come across anything like this before? Any suggestions for fixes?
>
> Thanks
>
> Simon.
>
>
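One way to narrow this down is to record what the dynamic loader sees at the moment a job starts, so the failing jobs can be compared with the succeeding ones. The sketch below is a hypothetical wrapper (not from the thread): `BIN` defaults to `/bin/ls` as a stand-in, but on the cluster it would point at the quoted `/bi/apps/R/4.0.4/lib64/R/bin/exec/R`, and `LOG` is an arbitrary per-job path.

```shell
#!/bin/sh
# Hypothetical diagnostic wrapper: run at job start to log whether the
# loader can resolve the binary's shared libraries at that instant.
BIN=${BIN:-/bin/ls}            # stand-in; set to the R binary on the cluster
LOG=${LOG:-/tmp/r_diag.$$.log} # per-process log; use $SLURM_JOB_ID under Slurm

{
  date
  echo "host: $(hostname)"
  echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
  ldd "$BIN" 2>&1              # any missing library shows up as "not found"
} > "$LOG" 2>&1

if grep -q "not found" "$LOG"; then
  echo "MISSING LIBS, see $LOG"
else
  echo "loader OK"
fi
```

If the wrapper logs "not found" only on the failing nodes or only under load, that would point at the library search path or the local filesystem going briefly unavailable rather than at R itself.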