[slurm-users] R jobs crashing when run in parallel

Patrick Goetz pgoetz at math.utexas.edu
Mon Mar 29 16:34:14 UTC 2021


Could this be a function of the R script you're trying to run, or are 
you saying that the same script fails with this error on some runs and 
works on others?
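
If it helps to separate the two cases, here is a rough sketch that could
be submitted as its own job alongside the R jobs. It is not taken from
your setup: the function name probe_library is made up for illustration,
and it assumes Python 3 is available on the compute nodes. It asks the
dynamic loader for the library through Python's ctypes and records the
hostname, so a failure here would point at the node or environment
rather than at the R script.

```python
import ctypes
import os
import socket


def probe_library(name="libpcre2-8.so.0"):
    """Try to load a shared library and report what happened on this host.

    Hypothetical helper: the default name is the library from the error
    message; nothing here is specific to R or Slurm.
    """
    report = {
        "host": socket.gethostname(),
        "library": name,
        # Recorded so that failing and passing jobs can be compared.
        "ld_library_path": os.environ.get("LD_LIBRARY_PATH", ""),
    }
    try:
        # ctypes.CDLL uses the system loader to find and open the library.
        ctypes.CDLL(name)
        report["loaded"] = True
        report["error"] = ""
    except OSError as exc:
        report["loaded"] = False
        report["error"] = str(exc)
    return report


if __name__ == "__main__":
    print(probe_library())
```

One caveat: R's launcher may set up its own library search path, so a
clean result from this probe does not rule out a loader problem that
only shows up inside R.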

On 3/29/21 7:47 AM, Simon Andrews wrote:
> I've got a weird problem on our Slurm cluster.  If I submit lots of R 
> jobs to the queue then as soon as I've got more than about 7 of them 
> running at the same time I start to get failures, saying:
> 
> /bi/apps/R/4.0.4/lib64/R/bin/exec/R: error while loading shared 
> libraries: libpcre2-8.so.0: cannot open shared object file: No such file 
> or directory
> 
> ...which makes no sense because that library is definitely there, and 
> other jobs on the same nodes worked both before and after the failed 
> jobs.  I recently ran 500 identical jobs and 152 of them failed in this way.
> 
> There are no errors in the log files on the compute nodes where this 
> failed and it happens across multiple nodes so it's not a single one 
> being strange.  The R binary is on an Isilon network share, but the 
> libpcre2 library is on the local disk for the node.
> 
> Anyone come across anything like this before?  Any suggestions for fixes?
> 
> Thanks
> 
> Simon.
> 


