I'm running into a strange issue and I'm hoping another set of brains looking at this might help. I would appreciate any feedback.
I have two Slurm Clusters. The first cluster is running Slurm 21.08.8 on Rocky Linux 8.9 machines. The second cluster is running Slurm 23.11.6 on Rocky Linux 9.4 machines.
This works perfectly fine on the first cluster:
$ srun --mem=32G --pty /bin/bash
srun: job 93911 queued and waiting for resources
srun: job 93911 has been allocated resources
and on the resulting shell on the compute node:
$ /mnt/local/ollama/ollama help
and the ollama help message appears as expected.
However, on the second cluster:
$ srun --mem=32G --pty /bin/bash
srun: job 3 queued and waiting for resources
srun: job 3 has been allocated resources
and on the resulting shell on the compute node:
$ /mnt/local/ollama/ollama help
fatal error: failed to reserve page summary memory

runtime stack:
runtime.throw({0x1240c66?, 0x154fa39a1008?})
        runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 pc=0x4605dc
runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
        runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 sp=0x7ffe6be32648 pc=0x456b7c
runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
        runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 pc=0x454565
runtime.(*mheap).init(0x127b47e0)
        runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 pc=0x451885
runtime.mallocinit()
        runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 pc=0x434f97
runtime.schedinit()
        runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 pc=0x464397
runtime.rt0_go()
        runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 pc=0x49421c
If I ssh directly to the same node on that second cluster (skipping Slurm entirely), and run the same "/mnt/local/ollama/ollama help" command, it works perfectly fine.
My first thought was that it might be related to cgroups. I switched the second cluster from cgroups v2 to v1 and tried again, no difference. I tried disabling cgroups on the second cluster by removing all cgroups references in the slurm.conf file but that also made no difference.
My guess is that something changed with regard to srun between these two Slurm versions, but I'm not sure what.
Any thoughts on what might be happening and/or a way to get this to work on the second cluster? Essentially I need a way to request an interactive shell through Slurm that is associated with the requested resources. Should we be using something other than srun for this?
Thank you,
-Dj
Looks more like a runtime environment issue.
Check the binaries:
ldd /mnt/local/ollama/ollama
on both clusters; comparing the output may give some hints.
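Since the same binary fails under srun but works over ssh, it may also help to diff the two environments' ldd output directly on the failing node (a sketch only; <nodename> is a placeholder for the compute node in question):

```
# Diff the libraries resolved via ssh vs. via an srun job step,
# stripping the load addresses since they change on every run
diff <(ssh <nodename> ldd /mnt/local/ollama/ollama | sed 's/ (0x[0-9a-f]*)$//') \
     <(srun -w <nodename> ldd /mnt/local/ollama/ollama | sed 's/ (0x[0-9a-f]*)$//')
```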
Best,
Feng
Hi Feng, Thank you for replying.
It is the same binary on the same machine that fails.
If I ssh to a compute node on the second cluster, it works fine.
It fails when running in an interactive shell obtained with srun on that same compute node.
I agree that it seems like a runtime environment difference between the SSH shell and the srun-obtained shell.
This is the ldd output from within the srun-obtained shell (where the binary gives the error when run):
[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
        linux-vdso.so.1 (0x00007ffde81ee000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x0000154f732cc000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000154f732c7000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x0000154f73000000)
        librt.so.1 => /lib64/librt.so.1 (0x0000154f732c2000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000154f732bb000)
        libm.so.6 => /lib64/libm.so.6 (0x0000154f72f25000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000154f732a0000)
        libc.so.6 => /lib64/libc.so.6 (0x0000154f72c00000)
        /lib64/ld-linux-x86-64.so.2 (0x0000154f732f8000)
This is the ldd output from the exact same node within an SSH shell, where it runs fine:
[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
        linux-vdso.so.1 (0x00007fffa66ff000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x000014a9d82da000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x000014a9d82d5000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x000014a9d8000000)
        librt.so.1 => /lib64/librt.so.1 (0x000014a9d82d0000)
        libdl.so.2 => /lib64/libdl.so.2 (0x000014a9d82c9000)
        libm.so.6 => /lib64/libm.so.6 (0x000014a9d7f25000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000014a9d82ae000)
        libc.so.6 => /lib64/libc.so.6 (0x000014a9d7c00000)
        /lib64/ld-linux-x86-64.so.2 (0x000014a9d8306000)
-Dj
Not sure, very strange, though the two linux-vdso.so.1 addresses do look different:
[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
        linux-vdso.so.1 (0x00007ffde81ee000)
[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
        linux-vdso.so.1 (0x00007fffa66ff000)
Best,
Feng
Do you have any container settings configured?
Hi Dj,
This could be a memory-limits-related problem. What is the output of
ulimit -l -m -v -s
in both interactive job-shells?
You are using cgroups-v1 now, right? In that case what is the respective content of
/sys/fs/cgroup/memory/slurm_*/uid_$(id -u)/job_*/memory.limit_in_bytes
in both shells?
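If it is easier to capture, the same checks can also be run as one-shot job steps instead of interactively (a sketch only, assuming cgroup v1 and the same --mem request as before; <nodename> is a placeholder):

```
# Limits and the cgroup memory cap as seen inside a Slurm job step
srun --mem=32G bash -c 'ulimit -l -m -v -s'
srun --mem=32G bash -c 'cat /sys/fs/cgroup/memory/slurm_*/uid_$(id -u)/job_*/memory.limit_in_bytes'

# For comparison, the limits in a plain ssh session on the compute node
ssh <nodename> 'ulimit -l -m -v -s'
```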
Regards, Hermann
Hi,
When we first migrated to Slurm from PBS, one of the strangest issues we hit was that ulimit settings are inherited from the submission host, which could explain the difference between ssh'ing into the machine (where the default ulimits are applied) and running a job via srun.
You could use:
srun --propagate=NONE --mem=32G --pty bash
I still find Slurm inheriting ulimit and environment variables from the submission host an odd default behaviour.
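One way to see this inheritance (and the effect of --propagate=NONE) is to compare the virtual-memory limit in the three contexts; a minimal sketch, assuming the same --mem request as above:

```
# On the login/submission host
ulimit -v

# Inside a job step: by default this mirrors the submission host, not the node
srun --mem=32G bash -c 'ulimit -v'

# With propagation disabled, the compute node's own defaults apply
srun --propagate=NONE --mem=32G bash -c 'ulimit -v'
```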
Tom
-- Thomas Green Senior Programmer ARCCA, Redwood Building, King Edward VII Avenue, Cardiff, CF10 3NB Tel: +44 (0)29 208 79269 Fax: +44 (0)29 208 70734 Email: greent10@cardiff.ac.uk Web: http://www.cardiff.ac.uk/arcca
Thomas Green Uwch Raglennydd ARCCA, Adeilad Redwood, King Edward VII Avenue, Caerdydd, CF10 3NB Ffôn: +44 (0)29 208 79269 Ffacs: +44 (0)29 208 70734 E-bost: greent10@caerdydd.ac.uk Gwefan: http://www.caerdydd.ac.uk/arcca
Thank you Hermann and Tom! That was it.
The new cluster has a virtual memory limit on the login host, and the old cluster did not.
It doesn't look like there is any way to set a default to override the srun behaviour of passing those resource limits to the shell, so I may consider removing those limits on the login host so folks don't have to manually specify this every time.
I really appreciate the help!
-Dj
PropagateResourceLimitsExcept won't do it?
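For example, something along these lines in slurm.conf might do it (a sketch only; which limits to exclude is a site decision, and AS is the address-space limit set by ulimit -v that caused the trouble here):

```
# Propagate submission-host ulimits to jobs, except the listed ones,
# which then fall back to the compute node's own settings
PropagateResourceLimitsExcept=AS,MEMLOCK,STACK
```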
Hi,
I wonder where this problem comes from; perhaps I am missing something, but we have never had such issues with limits, since we set them on the worker nodes in /etc/security/limits.d/99-cluster.conf:
```
* soft memlock 4086160    #Allow more Memory Locks for MPI
* hard memlock 4086160    #Allow more Memory Locks for MPI
* soft nofile  1048576    #Increase the Number of File Descriptors
* hard nofile  1048576    #Increase the Number of File Descriptors
* soft stack   unlimited  #Set soft to hard limit
* soft core    4194304    #Allow Core Files
```
and it sets up all limits we want without any problems, and there is no need to pass extra arguments to slurm commands or modify the config file.
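If you rely on node-side settings, it may also be worth checking the limits the running slurmd daemon itself has, since job steps typically inherit from it when nothing is propagated from the submission host (just a sketch):

```
# Limits in effect for the running slurmd on a compute node
cat /proc/$(pgrep -o slurmd)/limits
```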
Regards, Patryk.
On 24/05/15 02:26, Dj Merrill via slurm-users wrote:
I completely missed that, thank you!
-Dj
Laura Hild via slurm-users wrote:
PropagateResourceLimitsExcept won't do it?
Sarlo, Jeffrey S wrote:
You might look at the PropagateResourceLimits and PropagateResourceLimitsExcept settings in slurm.conf
Hi,
The problem arises if the login nodes (or submission hosts) have different ulimits – for example, if the submission hosts are VMs rather than physical servers. By default Slurm passes the ulimits from the submission host to the job's compute node, which can result in different settings being applied. If the login nodes have the same ulimit settings, you may not see a difference.
We happened to see a difference due to moving to a virtualised login node infrastructure which has slightly different settings applied.
Does that make sense?
I also missed that setting in slurm.conf so good to know it is possible to change the default behaviour.
Tom
We do have different limits on the submit host, and I believe that until we put the `limits.d/99-cluster.conf` file in place the limits were passed to jobs, but I can't tell for sure - it was a long time ago. Still, modifying `limits.d` on the cluster nodes may be a different approach and a solution to the aforementioned issue.
I wonder if anyone has an opinion on which way is better and why - whether to modify slurm.conf or the node system limits.
Patryk.