Hi Feng,

Thank you for replying.
It is the same binary on the same machine that fails.
If I ssh to a compute node on the second cluster, it works fine.
It fails when running in an interactive shell obtained with srun on that same compute node.
I agree that it seems like a runtime environment difference between the SSH shell and the srun-obtained shell.
This is the ldd output from within the srun-obtained shell (where the binary gives the error when run):
[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
        linux-vdso.so.1 (0x00007ffde81ee000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x0000154f732cc000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000154f732c7000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x0000154f73000000)
        librt.so.1 => /lib64/librt.so.1 (0x0000154f732c2000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000154f732bb000)
        libm.so.6 => /lib64/libm.so.6 (0x0000154f72f25000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000154f732a0000)
        libc.so.6 => /lib64/libc.so.6 (0x0000154f72c00000)
        /lib64/ld-linux-x86-64.so.2 (0x0000154f732f8000)
This is the ldd output from the exact same node within an SSH shell, where it runs fine:
[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
        linux-vdso.so.1 (0x00007fffa66ff000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x000014a9d82da000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x000014a9d82d5000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x000014a9d8000000)
        librt.so.1 => /lib64/librt.so.1 (0x000014a9d82d0000)
        libdl.so.2 => /lib64/libdl.so.2 (0x000014a9d82c9000)
        libm.so.6 => /lib64/libm.so.6 (0x000014a9d7f25000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000014a9d82ae000)
        libc.so.6 => /lib64/libc.so.6 (0x000014a9d7c00000)
        /lib64/ld-linux-x86-64.so.2 (0x000014a9d8306000)
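Since the library resolution looks identical in both shells, my next step is to diff the rest of the runtime environment between the SSH shell and the srun-obtained shell, roughly along these lines (the /tmp file names are just placeholders I picked, nothing ollama-specific):

[deej@moose66 ~]$ ulimit -a > /tmp/limits.ssh; env | sort > /tmp/env.ssh    # run from the SSH shell
[deej@moose66 ~]$ ulimit -a > /tmp/limits.srun; env | sort > /tmp/env.srun  # run from the srun shell
[deej@moose66 ~]$ diff /tmp/limits.ssh /tmp/limits.srun
[deej@moose66 ~]$ diff /tmp/env.ssh /tmp/env.srun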
-Dj
On 5/14/24 15:25, Feng Zhang via slurm-users wrote:
Looks more like a runtime environment issue.
Check the binaries:
ldd /mnt/local/ollama/ollama
on both clusters; comparing the output may give some hints.
Best,
Feng
On Tue, May 14, 2024 at 2:41 PM Dj Merrill via slurm-users <slurm-users@lists.schedmd.com> wrote:
I'm running into a strange issue and I'm hoping another set of brains looking at this might help. I would appreciate any feedback.
I have two Slurm Clusters. The first cluster is running Slurm 21.08.8 on Rocky Linux 8.9 machines. The second cluster is running Slurm 23.11.6 on Rocky Linux 9.4 machines.
This works perfectly fine on the first cluster:
$ srun --mem=32G --pty /bin/bash
srun: job 93911 queued and waiting for resources
srun: job 93911 has been allocated resources
and on the resulting shell on the compute node:
$ /mnt/local/ollama/ollama help
and the ollama help message appears as expected.
However, on the second cluster:
$ srun --mem=32G --pty /bin/bash
srun: job 3 queued and waiting for resources
srun: job 3 has been allocated resources
and on the resulting shell on the compute node:
$ /mnt/local/ollama/ollama help
fatal error: failed to reserve page summary memory

runtime stack:
runtime.throw({0x1240c66?, 0x154fa39a1008?})
        runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 pc=0x4605dc
runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
        runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 sp=0x7ffe6be32648 pc=0x456b7c
runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
        runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 pc=0x454565
runtime.(*mheap).init(0x127b47e0)
        runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 pc=0x451885
runtime.mallocinit()
        runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 pc=0x434f97
runtime.schedinit()
        runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 pc=0x464397
runtime.rt0_go()
        runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 pc=0x49421c
If I ssh directly to the same node on that second cluster (skipping Slurm entirely), and run the same "/mnt/local/ollama/ollama help" command, it works perfectly fine.
My first thought was that it might be related to cgroups. I switched the second cluster from cgroup v2 to v1 and tried again, but it made no difference. I also tried disabling cgroups on the second cluster by removing all cgroup references from the slurm.conf file, but that made no difference either.
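(For anyone who wants to double-check the same thing on their own setup, the srun shell's view of its own cgroup placement and resource limits can be dumped with something like the following; exact cgroup paths will vary with the configuration:)

$ cat /proc/self/cgroup
$ cat /proc/self/limits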
My guess is that something changed with regard to srun between these two Slurm versions, but I'm not sure what.
Any thoughts on what might be happening and/or a way to get this to work on the second cluster? Essentially I need a way to request an interactive shell through Slurm that is associated with the requested resources. Should we be using something other than srun for this?
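(For context, we have only tried srun for this so far. As I understand it, newer Slurm releases can also hand out an interactive shell via salloc, e.g. something like the following, assuming LaunchParameters=use_interactive_step is set in slurm.conf; I have not tested whether the ollama binary behaves any differently that way:)

$ salloc --mem=32G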
Thank you,
-Dj