Jeff,
Dang, that's really old. I'm not sure I would run one that old, to be
honest; it's missing too many security fixes and later-added features.
It's never been that hard to do a 'git clone' and the normal
configure/make/make install process with Slurm.
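For what it's worth, that flow is roughly the following. I'm guessing at
the exact tag name for 21.08.5, so check 'git tag' in the clone first,
and pick whatever install prefix you prefer:
  git clone https://github.com/schedmd/slurm.git
  cd slurm
  git checkout -b slurm-21.08.5 slurm-21-08-5-1   # tag name assumed; confirm with 'git tag'
  ./configure --prefix=/opt/slurm/21.08.5
  make -j$(nproc)
  sudo make install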
Someone else made me aware of this, in case it's easier:
https://slurm.schedmd.com/quickstart_admin.html#debuild
Lloyd
On 5/15/24 08:57, Jeffrey Layton wrote:
> Lloyd,
>
> Good to hear from you! I was hoping to avoid the use of git but that
> may be the only way. The version is 21.08.5. I checked the "old"
> packages from SchedMD and they begin part way through 2024 so that
> won't work.
>
> I'm very surprised Ubuntu let a package through without a source
> package for it. I'm hoping I'm not missing the forest for the trees
> in finding that package.
>
> Thanks for the help!
>
> Jeff
>
>
> On Wed, May 15, 2024 at 10:54 AM Lloyd Brown via slurm-users
> <slurm-users(a)lists.schedmd.com> wrote:
>
> Jeff,
>
> I'm not sure what version is in the Ubuntu packages, as I don't
> think they're provided by SchedMD, and I'm having trouble finding
> the right one on packages.ubuntu.com. Having said that, SchedMD is
> pretty good about using tags in their GitHub repo
> (https://github.com/schedmd/slurm) to represent the releases.
> For example, the "slurm-23-11-6-1" tag corresponds to release
> 23.11.6. It's pretty straightforward to clone the repo, and do
> something like "git checkout -b MY_LOCAL_BRANCH_NAME TAG_NAME" to
> get the version you're after.
>
>
> Lloyd
>
> --
> Lloyd Brown
> HPC Systems Administrator
> Office of Research Computing
> Brigham Young University
> http://rc.byu.edu
>
> On 5/15/24 08:35, Jeffrey Layton via slurm-users wrote:
>> Good morning,
>>
>> I have an Ubuntu 22.04 server where I installed Slurm from the
>> Ubuntu packages. I now want to install pyxis but it says I need
>> the Slurm sources. In Ubuntu 22.04, is there a package that has
>> the source code? If not, how do I download the sources I need from GitHub?
>>
>> Thanks!
>>
>> Jeff
>
>
> --
> slurm-users mailing list -- slurm-users(a)lists.schedmd.com
> To unsubscribe send an email to slurm-users-leave(a)lists.schedmd.com
>
--
Lloyd Brown
HPC Systems Administrator
Office of Research Computing
Brigham Young University
http://rc.byu.edu
We installed slurm 23.11.5 and we are receiving "JobId=n has invalid
account" for every sbatch job.
We are not using the slurm accounting/user database; we are using uniform
UIDs and GIDs across the cluster.
The jobs run and complete; can these invalid account errors be ignored or
silenced?
Job Submission Environment:
id joteumer
uid=938401109(joteumer) gid=938400513(SPG) groups=938400513(SPG),27(sudo)
Slurm Worker Node:
id joteumer
uid=938401109(joteumer) gid=938400513(SPG) groups=938400513(SPG),27(sudo)
slurmctld log:
[2024-04-18T09:46:40.000] sched: JobId=18 has invalid account
scontrol show job 18
JobId=18 JobName=simplejob.sh
UserId=joteumer(938401109) GroupId=SPG(938400513) MCS_label=N/A
Priority=1 Nice=0 Account=(null) QOS=(null)
Submit another sbatch job and update the job to include an Account
scontrol update jobid=19 Account=joteumer
[2024-04-18T09:56:05.126] _slurm_rpc_submit_batch_job: JobId=19 InitPrio=1
usec=485
[2024-04-18T09:56:06.000] sched: JobId=19 has invalid account
[2024-04-18T09:56:17.000] debug: set_job_failed_assoc_qos_ptr: Filling in
assoc for JobId=19 Assoc=0
[2024-04-18T09:56:17.000] sched: JobId=19 has invalid account
[2024-04-18T09:56:17.588] debug: set_job_failed_assoc_qos_ptr: Filling in
assoc for JobId=19 Assoc=0
[2024-04-18T09:56:27.505] _slurm_rpc_update_job: complete JobId=19 uid=0
usec=110
[2024-04-18T09:56:28.000] sched: JobId=19 has invalid account
scontrol show job 19
JobId=19 JobName=simplejob.sh
UserId=joteumer(938401109) GroupId=SPG(938400513) MCS_label=N/A
Priority=1 Nice=0 Account=(null) QOS=(null)
JOBID  PARTITION  NAME       USER      STATE    TIME  TIME_LIMI  NODES  NODELIST(REASON)
19     SPG        simplejob  joteumer  PENDING  0:00  18:00:00   1      (InvalidAccount)
I am using the latest Slurm. It runs fine for scripts, but if I give it a
container, it kills the job as soon as I submit it. Is Slurm cleaning up
$XDG_RUNTIME_DIR before it should? This is the log:
[2024-05-15T08:00:35.143] [90.0] debug2: _generate_patterns: StepId=90.0
TaskId=-1
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
argv[0]=/bin/sh
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
argv[1]=-c
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
argv[2]=crun --rootless=true --root=/run/user/1000/ state
slurm2.acog.90.0.-1
[2024-05-15T08:00:35.167] [90.0] debug: _get_container_state: RunTimeQuery
rc:256 output:error opening file
`/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory
[2024-05-15T08:00:35.167] [90.0] error: _get_container_state: RunTimeQuery
failed rc:256 output:error opening file
`/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory
[2024-05-15T08:00:35.167] [90.0] debug: container already dead
[2024-05-15T08:00:35.167] [90.0] debug3: _generate_spooldir: task:0
pattern:%m/oci-job%j-%s/task-%t/ path:/var/spool/slurmd/oci-job90-0/task-0/
[2024-05-15T08:00:35.167] [90.0] debug2: _generate_patterns: StepId=90.0
TaskId=0
[2024-05-15T08:00:35.168] [90.0] debug3: _generate_spooldir: task:-1
pattern:%m/oci-job%j-%s/ path:/var/spool/slurmd/oci-job90-0/
[2024-05-15T08:00:35.168] [90.0] stepd_cleanup: done with step
(rc[0x100]:Unknown error 256, cleanup_rc[0x0]:No error)
[2024-05-15T08:00:35.275] debug3: in the service_connection
[2024-05-15T08:00:35.278] debug2: Start processing RPC:
REQUEST_TERMINATE_JOB
[2024-05-15T08:00:35.278] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-05-15T08:00:35.278] debug: _rpc_terminate_job: uid = 64030 JobId=90
[2024-05-15T08:00:35.278] debug: credential for job 90 revoked
We are pleased to announce the availability of Slurm release candidate
24.05.0rc1.
To highlight some new features coming in 24.05:
- (Optional) isolated Job Step management. Enabled on a job-by-job basis
with the --stepmgr option, or globally through
SlurmctldParameters=enable_stepmgr.
- Federation - Allow for client command operation while SlurmDBD is
unavailable.
- New MaxTRESRunMinsPerAccount and MaxTRESRunMinsPerUser QOS limits.
- New USER_DELETE reservation flag.
- New Flags=rebootless option on Features for node_features/helpers
which indicates the given feature can be enabled without rebooting the node.
- Cloud power management options: New "max_powered_nodes=<limit>" option
in SlurmctldParameters, and new SuspendExcNodes=<nodes>:<count> syntax
allowing for <count> nodes out of a given node list to be excluded.
- StdIn/StdOut/StdErr now stored in SlurmDBD accounting records for
batch jobs.
- New switch/nvidia_imex plugin for IMEX channel management on NVIDIA
systems.
- New RestrictedCoresPerGPU option at the Node level, designed to ensure
GPU workloads always have access to a certain number of CPUs even when
nodes are running non-GPU workloads concurrently.
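To illustrate a few of the slurmctld options above, a sketch with
placeholder node names and limits (see the 24.05 documentation for the
full syntax):
  SlurmctldParameters=enable_stepmgr,max_powered_nodes=64
  SuspendExcNodes=cloud[001-100]:2
The per-job form of step management is instead requested at submission
time with the --stepmgr option.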
This is the first release candidate of the upcoming 24.05 release
series; it represents the end of development for this release and the
finalization of the RPC and state file formats.
If any issues are identified with this release candidate, please report
them through https://bugs.schedmd.com against the 24.05.x version and we
will address them before the first production 24.05.0 release is made.
Please note that the release candidates are not intended for production use.
A preview of the updated documentation can be found at
https://slurm.schedmd.com/archive/slurm-master/ .
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
Hi,
If I understand it correctly, the MUNGE and SACK authentication modules naturally require that no one can get access to the key. This means we should not run any jobs on our normal workstations, to which our users have physical access, nor could our users use those workstations to submit jobs to the compute nodes. They would have to ssh to a specific submit node, and only then could they schedule their jobs.
Is there an elegant way to enable job submission from any computer (possibly requiring that users type their password for the submit node – or for their ssh key – at some point)? (All computers/users use the same LDAP server for logins.)
Best
/rike
I have installed Slurm and Podman. As per the documentation, I have replaced
Podman's default runtime with "slurm". The documentation says I need to choose
one oci.conf:
https://slurm.schedmd.com/containers.html#example
Which one should I use? runc? crun? nvidia?
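For reference, the rootless crun variant on that page looks roughly like
the following (quoted from memory, so please copy the exact example for
your Slurm version from the page above rather than from here):
  EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
  RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
  RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
  RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
  RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
  RunTimeRun="crun --rootless=true --root=/run/user/%U/ run --bundle %b %n.%u.%j.%s.%t"
The runc variant appears to follow the same structure with runc in place
of crun.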
We use Bright Cluster Manager with Slurm 23.02 on RHEL 9. I know about
pam_slurm_adopt https://slurm.schedmd.com/pam_slurm_adopt.html which does
not appear to come by default with the Bright 'cm' package of Slurm.
Currently ssh to a node gets:
Login not allowed: no running jobs and no WLM allocations
We have 8 GPUs per node, so when we drain a node, which can be running a
job of up to 5 days, no new jobs can run on it. And since we have 20+ TB
(yes, TB) local drives, researchers have their work and files on them to
retrieve.
Is there a way to use /etc/security/access.conf to work around this at
least temporarily until the reboot, after which we can revert?
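For reference, pam_access rules in /etc/security/access.conf take the
form 'permission : users/groups : origins', so I was imagining something
like the following (the group name is just an example), assuming
pam_access sits ahead of the Slurm check in the node's PAM stack - which
I have not verified on the Bright images:
  + : @researchers : ALL
  - : ALL : ALL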
Thanks!
Rob
Hello Slurm users,
Some of you may be interested in the new major version of Slurm-web, v3.0.0, an open source web dashboard for Slurm: https://slurm-web.com
Slurm-web provides a reactive & responsive web interface to track jobs, with intuitive insights and advanced visualizations to monitor the status of the HPC supercomputers in your organization. The software is released under GPLv3 [1].
This new version is based on the official Slurm REST API (slurmrestd) and adopts modern web technologies to provide many features:
- Instant jobs filtering and sorting
- Live jobs status update
- Advanced visualization of node status with racking topology
- Intuitive visualization of QOS and advanced reservations
- Multi-clusters support
- LDAP authentication
- Advanced RBAC permissions management
- Transparent caching
For the next releases, a roadmap is published with many feature ideas [2].
Quick start guide to install: http://docs.rackslab.io/slurm-web/install/quickstart.html
RPM and deb packages are published for easy installation and upgrade on the most popular Linux distributions.
I hope you will like it!
[1] https://github.com/rackslab/Slurm-web
[2] https://slurm-web.com/roadmap/
--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
Greetings Slurm gurus --
I've been having an issue where, very occasionally, an srun-launched OpenMPI job will die during startup within MPI_Init(), e.g. srun -N 8 --ntasks-per-node=1 ./hello_world_mpi. The same binary launched with mpirun does not experience the issue, e.g. mpirun -n 64 -H cn01,... ./hello_world_mpi. The failure rate seems to be in the 0.5% - 1.0% range when using srun for launch.
SW stack is self-built with:
* Dual socket AMD nodes
* RHEL 9.3 base system + tools
* Single 100 Gb card per host
* hwloc 2.9.3
* pmix 4.2.9 (5.0.2 also tried but continued to see the same issues)
* slurm 23.11.6 (started with 23.11.5 - update did not change the behavior)
* openmpi 5.0.3
The MPI code is a simple hello_world_mpi.c, but the specific code does not seem to matter - anything that goes through startup via srun can hit it. The application core dump looks like the following regardless of which test is running:
[cn04:1194785] *** Process received signal ***
[cn04:1194785] Signal: Segmentation fault (11)
[cn04:1194785] Signal code: Address not mapped (1)
[cn04:1194785] Failing at address: 0xe0
[cn04:1194785] [ 0] /lib64/libc.so.6(+0x54db0)[0x7f54e6254db0]
[cn04:1194785] [ 1] /share/openmpi/5.0.3/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_match+0x7d)[0x7f54e67eab3d]
[cn04:1194785] [ 2] /share/openmpi/5.0.3/lib/libopen-pal.so.80(+0xa7d8c)[0x7f54e6566d8c]
[cn04:1194785] [ 3] /lib64/libevent_core-2.1.so.7(+0x21b88)[0x7f54e649cb88]
[cn04:1194785] [ 4] /lib64/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f54e649e7a7]
[cn04:1194785] [ 5] /share/openmpi/5.0.3/lib/libopen-pal.so.80(+0x222af)[0x7f54e64e12af]
[cn04:1194785] [ 6] /share/openmpi/5.0.3/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f54e64e1365]
[cn04:1194785] [ 7] /share/openmpi/5.0.3/lib/libmpi.so.40(ompi_mpi_init+0x46d)[0x7f54e663ce7d]
[cn04:1194785] [ 8] /share/openmpi/5.0.3/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f54e66711ae]
[cn04:1194785] [ 9] /home/brent/bin/ior-3.0.1/ior[0x403780]
[cn04:1194785] [10] /lib64/libc.so.6(+0x3feb0)[0x7f54e623feb0]
[cn04:1194785] [11] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f54e623ff60]
[cn04:1194785] [12] /home/brent/bin/ior-3.0.1/ior[0x4069d5]
[cn04:1194785] *** End of error message ***
More than one rank can die with the same stack trace on a node when this happens - I've seen as many as 6. One other interesting note: if I change my srun command line to include strace (e.g. srun -N 8 --ntasks-per-node=8 strace <strace-options> ./hello_world_mpi), the issue appears to go away - 0 failures in ~2500 runs. Another thing that seems to help is disabling cgroups in the slurm.conf; after that change, I saw 0 failures in >6100 hello_world_mpi runs.
The changes in the slurm.conf were - original:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup
Changed to:
ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
JobAcctGatherType=jobacct_gather/linux
My cgroup.conf file contains:
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRamSpace=95
Curious if anyone has any thoughts on next steps to help figure out what might be going on and how to resolve it. Currently I'm planning to back down to the 23.02.7 release and see how that goes, but I'm open to other suggestions.
Thanks,
Brent