[slurm-users] Problem with Cuda program in multi-cluster

Feng Zhang prod.feng at gmail.com
Wed Jul 5 13:35:08 UTC 2023


Mohammed,

It seems you need to upgrade glibc on the GPU nodes of clusters A and C:
the error message says that srun needs newer glibc symbol versions than
those nodes provide. Alternatively, you can rebuild Slurm for clusters A/C
against glibc 2.27 (or older) so that its binaries also run on the Ubuntu
18.04 GPU nodes.
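
If it helps, a quick way to confirm the mismatch is to compare the glibc
symbol versions that srun and the shared libslurmfull.so require against the
glibc each node actually provides. A rough sketch (the libslurmfull.so path
is taken from your error output; I'm assuming srun is on the PATH of the
node where the step fails):

  # glibc version a node provides (run on a GPU node and on a 22.04 node)
  ldd --version | head -n1

  # glibc symbol versions required by srun and the shared libslurmfull.so
  objdump -T "$(command -v srun)" | grep -o 'GLIBC_[0-9.]*' | sort -Vu
  objdump -T /hpcshared/slurm_vm/usr/lib/slurm/libslurmfull.so | grep -o 'GLIBC_[0-9.]*' | sort -Vu

Anything newer than GLIBC_2.27 in that list will fail to load on the Ubuntu
18.04 GPU nodes, which matches the errors you posted.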

Best,

Feng


On Tue, Jul 4, 2023 at 2:46 PM mohammed shambakey <shambakey1 at gmail.com>
wrote:

> Hi
>
> I work on 3 clusters: A, B, and C. Each of clusters A and C has 3 compute
> nodes plus the head node, and in each of them one of the 3 compute nodes
> has an old GPU. All nodes, on all clusters, run Ubuntu 22.04 except for
> the 2 GPU nodes (both run Ubuntu 18.04 to suit the old GPU card). The
> installed Slurm version (on all clusters) is 23.11.0-0rc1.
>
> Cluster B has only 2 compute nodes and the head node. I tried to submit an
> sbatch script from cluster B (with a CUDA program) to be executed on either
> cluster A or C (where a GPU node resides). Previously this used to work,
> but after updating the system I get the following error:
>
> srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found
> (required by srun)
> srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found
> (required by srun)
> srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found
> (required by /hpcshared/slurm_vm/usr/lib/slurm/libslurmfull.so)
> srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found
> (required by /hpcshared/slurm_vm/usr/lib/slurm/libslurmfull.so)
> srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found
> (required by /hpcshared/slurm_vm/usr/lib/slurm/libslurmfull.so)
>
> The installed glibc is 2.35 on all nodes, except for the 2 GPU nodes
> (glibc 2.27). I tried to run the same sbatch script directly on each of
> clusters A and C, and it works fine. The problem happens only when using
> "sbatch -Mall" from cluster B. Just to be sure, I tried another sbatch
> script (with the multi-cluster option) that does NOT involve a CUDA
> program, and it worked fine.
>
> Should I install the same glibc on all nodes (2.32, 2.33, or 2.34), or
> what?
>
> Regards
>
> --
> Mohammed
>
