[slurm-users] Re: srun weirdness

15 May 2024


Hi Dj,
this could be a memory-limit-related problem. What is the output of
ulimit -l -m -v -s
in both interactive job-shells?
You are using cgroups-v1 now, right?
In that case what is the respective content of
/sys/fs/cgroup/memory/slurm_*/uid_$(id -u)/job_*/memory.limit_in_bytes
in both shells?
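To collect both pieces of information in one go, something like the following could be run in each shell. It is only a sketch: job_mem_limits is a hypothetical helper, its optional argument (an alternative cgroup root) exists only so it can be tried against a scratch directory, and the glob is loosened to slurm* on the assumption that some installations name the directory plain "slurm".

```shell
#!/bin/bash
# Print the cgroup v1 memory limit of every Slurm job owned by the current
# user. The optional first argument overrides the cgroup root; this is a
# hypothetical parameter, not something Slurm itself defines.
job_mem_limits() {
    root="${1:-/sys/fs/cgroup/memory}"
    for f in "$root"/slurm*/uid_"$(id -u)"/job_*/memory.limit_in_bytes; do
        # An unmatched glob stays literal, so skip anything unreadable.
        [ -r "$f" ] || continue
        printf '%s %s\n' "$(cat "$f")" "$f"
    done
}

# Limits of the current shell: locked memory, resident set size,
# virtual memory, and stack size.
ulimit -l -m -v -s

# Per-job cgroup limits; prints nothing if no matching cgroup exists.
job_mem_limits
```

Running it once in each interactive shell and comparing the two outputs side by side should show whether any limit differs between the clusters.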
Regards,
Hemann
On 5/14/24 20:38, Dj Merrill via slurm-users wrote:
...
I'm running into a strange issue and I'm hoping another set of brains 
looking at this might help.  I would appreciate any feedback.
I have two Slurm clusters.  The first cluster is running Slurm 21.08.8
on Rocky Linux 8.9 machines.  The second cluster is running Slurm
23.11.6 on Rocky Linux 9.4 machines.
This works perfectly fine on the first cluster:
$ srun --mem=32G --pty /bin/bash
srun: job 93911 queued and waiting for resources
srun: job 93911 has been allocated resources
and on the resulting shell on the compute node:
$ /mnt/local/ollama/ollama help
and the ollama help message appears as expected.
However, on the second cluster:
$ srun --mem=32G --pty /bin/bash
srun: job 3 queued and waiting for resources
srun: job 3 has been allocated resources
and on the resulting shell on the compute node:
$ /mnt/local/ollama/ollama help
fatal error: failed to reserve page summary memory
runtime stack:
runtime.throw({0x1240c66?, 0x154fa39a1008?})
     runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 pc=0x4605dc
runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
     runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 sp=0x7ffe6be32648 pc=0x456b7c
runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
     runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 pc=0x454565
runtime.(*mheap).init(0x127b47e0)
     runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 pc=0x451885
runtime.mallocinit()
     runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 pc=0x434f97
runtime.schedinit()
     runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 pc=0x464397
runtime.rt0_go()
     runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 pc=0x49421c
If I ssh directly to the same node on that second cluster (skipping 
Slurm entirely), and run the same "/mnt/local/ollama/ollama help" 
command, it works perfectly fine.
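Since the same binary behaves differently under ssh and under srun on the same node, one way to narrow this down is to snapshot the kernel's view of the resource limits in each session and diff the two. The sketch below assumes Linux (/proc/self/limits); snapshot_limits and the /tmp file names are hypothetical. A difference in the "Max address space" row would be worth a look, since the Go runtime reserves virtual address space at startup, but that is a guess rather than a diagnosis.

```shell
#!/bin/bash
# Save the resource limits inherited from the current shell to the file
# named by the first argument. cat reads its own /proc entry, and its limits
# are inherited from the shell that starts it.
snapshot_limits() {
    cat /proc/self/limits > "$1"
}

# In the ssh session on the node:    snapshot_limits /tmp/limits.ssh
# In the srun session on that node:  snapshot_limits /tmp/limits.srun
# Then, from either session:         diff /tmp/limits.ssh /tmp/limits.srun
```

An empty diff would suggest looking beyond per-process limits, for example at cgroup settings or environment variables.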
My first thought was that it might be related to cgroups.  I switched 
the second cluster from cgroups v2 to v1 and tried again, no 
difference.  I tried disabling cgroups on the second cluster by removing 
all cgroups references in the slurm.conf file but that also made no 
difference.
My guess is that something changed with regard to srun between these two
Slurm versions, but I'm not sure what.
Any thoughts on what might be happening and/or a way to get this to work 
on the second cluster?  Essentially I need a way to request an 
interactive shell through Slurm that is associated with the requested 
resources.  Should we be using something other than srun for this?
Thank you,
-Dj
