[slurm-users] Strange behaviour with dynamically linked binary in batch job
Sebastian Potthoff
s.potthoff at uni-muenster.de
Wed Mar 30 15:45:54 UTC 2022
Hi all,
I am observing some strange behaviour with a dynamically linked binary inside an sbatch job. Among other libraries, this binary is linked against MPICH, so when I run "ldd" on it I get
$ ldd /path/to/binary
linux-vdso.so.1 => (0x00007ffd817c5000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002ae4a3152000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ae4a3356000)
libmpi.so.12 => not found
libmpifort.so.12 => not found
libm.so.6 => /lib64/libm.so.6 (0x00002ae4a3572000)
librt.so.1 => /lib64/librt.so.1 (0x00002ae4a3874000)
libc.so.6 => /lib64/libc.so.6 (0x00002ae4a3a7c000)
/lib64/ld-linux-x86-64.so.2 (0x00002ae4a2f2e000)
which shows that it cannot find those shared objects, as I have not yet loaded any modules into my environment. (This is expected.)
Now, if I allocate some resources and start an interactive slurm session via e.g.
$ srun -N 1 -c 4 -t 10:00 --pty bash
and load the appropriate modules (Lmod, by the way) into my environment, e.g.
$ module load GCC/10.3.0
$ module load MPICH/3.4.2
and then again check the linked libraries, I get
$ ldd /path/to/binary
linux-vdso.so.1 => (0x00007fffe3d2c000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b4f58b6d000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b4f58d71000)
libmpi.so.12 => /Applic.HPC/Easybuild/skylake/2021a/software/MPICH/3.4.2-GCC-10.3.0/lib/libmpi.so.12 (0x00002b4f58f8d000)
libmpifort.so.12 => /Applic.HPC/Easybuild/skylake/2021a/software/MPICH/3.4.2-GCC-10.3.0/lib/libmpifort.so.12 (0x00002b4f58977000)
libm.so.6 => /lib64/libm.so.6 (0x00002b4f59ee4000)
librt.so.1 => /lib64/librt.so.1 (0x00002b4f5a1e6000)
libc.so.6 => /lib64/libc.so.6 (0x00002b4f5a3ee000)
/lib64/ld-linux-x86-64.so.2 (0x00002b4f58949000)
ldd now finds the correct paths to the libraries.
HOWEVER, I cannot reproduce this inside a submitted sbatch job. When the job checks the shared libraries via ldd, the MPI libraries are still reported as "not found". The job script looks more or less like this
####################################################
#!/bin/bash
#SBATCH --partition admin
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=10:00
module load GCC/10.3.0
module load MPICH/3.4.2
ldd /path/to/binary
####################################################
So nothing too complicated. I tested this with other, self-compiled binaries, which all seem to work just fine. Unfortunately, this one is a closed-source binary blob, so I cannot recompile it.
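To make the batch output easier to compare with the interactive one, the ldd output could be reduced to just the unresolved entries. This is only a sketch, assuming a POSIX shell with awk available; the helper name missing_libs is hypothetical and not part of Slurm or Lmod.

```shell
# Hypothetical helper: read ldd output on stdin and print only the names of
# libraries that the dynamic loader reported as "not found".
missing_libs() {
    awk '$2 == "=>" && $3 == "not" && $4 == "found" { print $1 }'
}

# Possible use inside the job script, after the "module load" lines:
#   ldd /path/to/binary | missing_libs
```

With the first ldd output shown above, this would print libmpi.so.12 and libmpifort.so.12, one per line.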
One interesting thing: when I do not load any environment modules but instead set the LD_LIBRARY_PATH variable directly to the correct path before calling ldd, i.e.
LD_LIBRARY_PATH=/Applic.HPC/Easybuild/skylake/2021a/software/MPICH/3.4.2-GCC-10.3.0/lib ldd /path/to/binary
it works as intended, also in a batch job.
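Since setting LD_LIBRARY_PATH by hand works, it may be worth checking what the variable actually contains after the "module load" lines inside the batch job. A minimal sketch, assuming a POSIX shell; find_lib_in_path is a hypothetical helper, not part of Slurm or Lmod, and it only inspects the search path it is given.

```shell
# Hypothetical helper: print every directory of a colon-separated search path
# that contains the given library file. The search path defaults to
# LD_LIBRARY_PATH; empty entries are skipped.
find_lib_in_path() {
    lib=$1
    search_path=${2-${LD_LIBRARY_PATH-}}
    printf '%s\n' "$search_path" | tr ':' '\n' | while IFS= read -r dir; do
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            printf '%s\n' "$dir"
        fi
    done
}

# Possible use inside the job script, after the "module load" lines:
#   echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
#   find_lib_in_path libmpi.so.12
```

If this prints nothing in the batch job but prints the MPICH lib directory in the interactive session, the two environments differ in what the module load actually exported.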
Can anyone make sense of this? Could something be hard-coded into the binary that prevents it from using an exported LD_LIBRARY_PATH? And why would it work interactively, but not in a batch job?
Many thanks
Sebastian