[slurm-users] R jobs crashing when run in parallel

Simon Andrews simon.andrews at babraham.ac.uk
Mon Mar 29 12:47:01 UTC 2021


I've got a weird problem on our Slurm cluster.  If I submit lots of R jobs to the queue, then as soon as more than about 7 of them are running at the same time I start to get failures saying:

/bi/apps/R/4.0.4/lib64/R/bin/exec/R: error while loading shared libraries: libpcre2-8.so.0: cannot open shared object file: No such file or directory

...which makes no sense, because that library is definitely there, and other jobs on the same nodes worked both before and after the failed jobs.  I recently ran 500 identical jobs and 152 of them failed in this way.

There are no errors in the log files on the compute nodes where this failed, and it happens across multiple nodes, so it's not a single node being strange.  The R binary is on an Isilon network share, but the libpcre2 library is on each node's local disk.
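
One way to narrow this down might be to run a small probe at the start of each job and record whether the library can be loaded at that moment on that node. The following is only a minimal sketch, assuming Python 3 is available on the compute nodes; the function name probe_library and the fields it reports are made up for illustration. It asks the dynamic loader for the library through ctypes and prints one JSON line per job, so failures can later be grouped by host and time. The probe sees the job's own environment, which may differ from the one the R launcher sets up, so a clean result here would not rule out a problem on the R side.

```python
import ctypes
import json
import os
import socket
import time


def probe_library(name):
    """Try to load a shared library by soname and report what happened.

    Hypothetical helper for illustration; not part of Slurm or R.
    """
    result = {
        "host": socket.gethostname(),
        # SLURM_JOB_ID is set by Slurm inside a job; None when run elsewhere.
        "job_id": os.environ.get("SLURM_JOB_ID"),
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "library": name,
        "ld_library_path": os.environ.get("LD_LIBRARY_PATH", ""),
    }
    try:
        # CDLL goes through the system dynamic loader (dlopen on Linux).
        ctypes.CDLL(name)
        result["loadable"] = True
        result["error"] = None
    except OSError as exc:
        result["loadable"] = False
        result["error"] = str(exc)
    return result


if __name__ == "__main__":
    # One JSON line per job, easy to collect from the Slurm output files.
    print(json.dumps(probe_library("libpcre2-8.so.0")))
```

Running this as the first step of each job script, before R starts, would show whether the load failure is visible outside R as well, and whether the failing jobs share a host, a start time, or an unusual LD_LIBRARY_PATH.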

Anyone come across anything like this before?  Any suggestions for fixes?

Thanks

Simon.
