[slurm-users] Strange behaviour with dynamically linked binary in batch job

Sebastian Potthoff s.potthoff at uni-muenster.de
Wed Mar 30 15:45:54 UTC 2022


Hi all,

I am observing some strange behaviour with a dynamically linked binary inside an sbatch job. The binary is linked against, among other libraries, the MPICH library - so when I run "ldd" I get

$ ldd /path/to/binary

        linux-vdso.so.1 =>  (0x00007ffd817c5000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ae4a3152000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ae4a3356000)
        libmpi.so.12 => not found
        libmpifort.so.12 => not found
        libm.so.6 => /lib64/libm.so.6 (0x00002ae4a3572000)
        librt.so.1 => /lib64/librt.so.1 (0x00002ae4a3874000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ae4a3a7c000)
        /lib64/ld-linux-x86-64.so.2 (0x00002ae4a2f2e000)

showing that it cannot find those shared objects, as I have not loaded any modules into my environment yet. (This is expected.)

Now, if I allocate some resources and start an interactive slurm session via e.g. 

$ srun -N 1 -c 4 -t 10:00 --pty bash

and load the appropriate module (LMOD btw.) into my environment, e.g.

$ module load GCC/10.3.0
$ module load MPICH/3.4.2

and then again check the linked libraries, I get

$ ldd /path/to/binary

        linux-vdso.so.1 =>  (0x00007fffe3d2c000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002b4f58b6d000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b4f58d71000)
        libmpi.so.12 => /Applic.HPC/Easybuild/skylake/2021a/software/MPICH/3.4.2-GCC-10.3.0/lib/libmpi.so.12 (0x00002b4f58f8d000)
        libmpifort.so.12 => /Applic.HPC/Easybuild/skylake/2021a/software/MPICH/3.4.2-GCC-10.3.0/lib/libmpifort.so.12 (0x00002b4f58977000)
        libm.so.6 => /lib64/libm.so.6 (0x00002b4f59ee4000)
        librt.so.1 => /lib64/librt.so.1 (0x00002b4f5a1e6000)
        libc.so.6 => /lib64/libc.so.6 (0x00002b4f5a3ee000)
        /lib64/ld-linux-x86-64.so.2 (0x00002b4f58949000)

Now finding the correct paths to the libraries.

HOWEVER, I cannot reproduce this inside a submitted sbatch job. When the job script checks the shared libraries via ldd, the MPI libraries are again reported as not found. The job script looks more or less like this:

####################################################
#!/bin/bash
#SBATCH --partition admin
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=10:00

module load GCC/10.3.0
module load MPICH/3.4.2

ldd /path/to/binary 
####################################################
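
For debugging, a slightly extended variant of the script (just a sketch) should show whether the modules actually populate LD_LIBRARY_PATH inside the batch environment:

####################################################
#!/bin/bash
#SBATCH --partition admin
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=10:00

module load GCC/10.3.0
module load MPICH/3.4.2

# Check that the modules were really loaded in the batch shell
module list

# Check what the dynamic linker will actually see
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"

ldd /path/to/binary
####################################################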

So nothing too complicated. I tested this with other, self-compiled binaries, which all work just fine. Unfortunately this one is a closed-source binary blob, so I cannot recompile it.

One interesting thing: if I do not load any environment modules but instead set the LD_LIBRARY_PATH variable directly to the correct path before calling ldd, i.e.

LD_LIBRARY_PATH=/Applic.HPC/Easybuild/skylake/2021a/software/MPICH/3.4.2-GCC-10.3.0/lib ldd /path/to/binary

it works as intended - also inside a batch job.
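
So as a workaround I could presumably just export the variable at the top of the job script (just a sketch, using my local MPICH path):

####################################################
#!/bin/bash
#SBATCH --partition admin
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=10:00

# Workaround: set the library path explicitly instead of relying on the module system
export LD_LIBRARY_PATH=/Applic.HPC/Easybuild/skylake/2021a/software/MPICH/3.4.2-GCC-10.3.0/lib:${LD_LIBRARY_PATH}

ldd /path/to/binary
####################################################

but I would still like to understand why the module-based setup behaves differently.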


Can anyone make sense of this? Can there be something hard-coded into the binary that prevents it from using an exported LD_LIBRARY_PATH? And why would it work interactively, but not in a batch job?
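
(For what it's worth, I assume something like the following would reveal a hard-coded RPATH/RUNPATH in the dynamic section, if that is what interferes here:

####################################################
# Show DT_RPATH / DT_RUNPATH entries of the dynamic section, if any
readelf -d /path/to/binary | grep -E 'RPATH|RUNPATH'
####################################################
)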

Many thanks
Sebastian
