If you run "scontrol show jobid <jobid>" of your pending job with the "(Resources)" tag you may see more about what is unavailable to your job. Slurm default configs can cause an entire compute node of resources to be "allocated" to a running job regardless of whether it needs all of them or not so you may need to alter one or both of the following settings to allow more than one job to run on a single node at once. You'll find these in your slurm.conf. Don't forget to "scontrol reconf"…
[View More] and even potentially restart both "slurmctld" & "slurmd" on your nodes if you do end up making changes.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
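For instance (a minimal sketch; the grep pattern and the <jobid> placeholder are only illustrative), after editing slurm.conf you could push the change out and confirm the active values with:

scontrol reconfigure
scontrol show config | grep -E 'SelectType|SelectTypeParameters'
scontrol show job <jobid> | grep -iE 'Reason|TRES'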
I hope this helps.
Kind regards,
Jason
----
Jason Macklin
Manager Cyberinfrastructure, Research Cyberinfrastructure
860.837.2142 t | 860.202.7779 m
jason.macklin(a)jax.org
The Jackson Laboratory
Maine | Connecticut | California | Shanghai
www.jax.org
The Jackson Laboratory: Leading the search for tomorrow's cures
________________________________
From: slurm-users <slurm-users-bounces(a)lists.schedmd.com> on behalf of slurm-users-request(a)lists.schedmd.com <slurm-users-request(a)lists.schedmd.com>
Sent: Friday, January 19, 2024 9:24 AM
To: slurm-users(a)lists.schedmd.com <slurm-users(a)lists.schedmd.com>
Subject: [EXTERNAL]slurm-users Digest, Vol 75, Issue 31
Send slurm-users mailing list submissions to
slurm-users(a)lists.schedmd.com
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.schedmd.com/cgi-bin/mailman/listinfo/slurm-users
or, via email, send a message with subject or body 'help' to
slurm-users-request(a)lists.schedmd.com
You can reach the person managing the list at
slurm-users-owner(a)lists.schedmd.com
When replying, please edit your Subject line so it is more specific
than "Re: Contents of slurm-users digest..."
Today's Topics:
1. Re: Need help with running multiple instances/executions of a
batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)
(Marko Markoc)
2. Re: Need help with running multiple instances/executions of a
batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)
(Ümit Seren)
----------------------------------------------------------------------
Message: 1
Date: Fri, 19 Jan 2024 06:12:24 -0800
From: Marko Markoc <mmarkoc(a)pdx.edu>
To: Slurm User Community List <slurm-users(a)lists.schedmd.com>
Subject: Re: [slurm-users] Need help with running multiple
instances/executions of a batch script in parallel (with NVIDIA HGX
A100 GPU as a Gres)
Message-ID:
<CABnuMe4JTA0e6=VbO8D+To=8FGO+3Byv1dK_MC+OuRitzN5dXg(a)mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
+1 on checking the memory allocation.
Or add/check if you have any DefMemPerX set in your slurm.conf
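For example (the values below are placeholders, not recommendations), the relevant slurm.conf knobs would look something like:

DefMemPerCPU=4000        # default MB of RAM per allocated CPU
DefMemPerGPU=16000       # default MB of RAM per allocated GPU (use one or the other)

Without some default, a job that does not request memory may be granted all of a node's memory, which blocks other jobs from being scheduled there.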
On Fri, Jan 19, 2024 at 12:33 AM mohammed shambakey <shambakey1(a)gmail.com>
wrote:
> Hi
>
> I'm not an expert, but is it possible that the currently running job is
> consuming the whole node because it has been allocated the whole memory of the
> node (so the other 2 jobs had to wait until it finishes)?
> Maybe if you try to restrict the required memory for each job?
>
> Regards
>
> On Thu, Jan 18, 2024 at 4:46 PM Ümit Seren <uemit.seren(a)gmail.com> wrote:
>
>> This line also has to be changed:
>>
>>
>> #SBATCH --gpus-per-node=4  ->  #SBATCH --gpus-per-node=1
>>
>> --gpus-per-node seems to be the new parameter that is replacing the --gres=
>> one, so you can remove the --gres line completely.
>>
>>
>>
>> Best
>>
>> Ümit
>>
>>
>>
>> *From: *slurm-users <slurm-users-bounces(a)lists.schedmd.com> on behalf of
>> Kherfani, Hafedh (Professional Services, TC) <hafedh.kherfani(a)hpe.com>
>> *Date: *Thursday, 18. January 2024 at 15:40
>> *To: *Slurm User Community List <slurm-users(a)lists.schedmd.com>
>> *Subject: *Re: [slurm-users] Need help with running multiple
>> instances/executions of a batch script in parallel (with NVIDIA HGX A100
>> GPU as a Gres)
>>
>> Hi Noam and Matthias,
>>
>>
>>
>> Thanks both for your answers.
>>
>>
>>
>> I changed the "#SBATCH --gres=gpu:4" directive (in the batch script) to
>> "#SBATCH --gres=gpu:1" as you suggested, but it didn't make a difference:
>> running this batch script 3 times still results in the first job being in
>> a running state, while the second and third jobs remain in a pending
>> state.
>>
>>
>>
>> [slurmtest@c-a100-master test-batch-scripts]$ cat gpu-job.sh
>>
>> #!/bin/bash
>>
>> #SBATCH --job-name=gpu-job
>>
>> #SBATCH --partition=gpu
>>
>> #SBATCH --nodes=1
>>
>> #SBATCH --gpus-per-node=4
>>
>> #SBATCH --gres=gpu:1              # <<<< Changed from "4" to "1"
>>
>> #SBATCH --tasks-per-node=1
>>
>> #SBATCH --output=gpu_job_output.%j
>>
>> #SBATCH --error=gpu_job_error.%j
>>
>>
>>
>> hostname
>>
>> date
>>
>> sleep 40
>>
>> pwd
>>
>>
>>
>> [slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
>>
>> Submitted batch job *217*
>>
>> [slurmtest@c-a100-master test-batch-scripts]$ squeue
>>
>> JOBID PARTITION NAME USER ST TIME NODES
>> NODELIST(REASON)
>>
>> 217 gpu gpu-job slurmtes R 0:02 1
>> c-a100-cn01
>>
>> [slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
>>
>> Submitted batch job *218*
>>
>> [slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
>>
>> Submitted batch job *219*
>>
>> [slurmtest@c-a100-master test-batch-scripts]$ squeue
>>
>> JOBID PARTITION NAME USER ST TIME NODES
>> NODELIST(REASON)
>>
>> 219 gpu gpu-job slurmtes *PD* 0:00 1
>> (Priority)
>>
>> 218 gpu gpu-job slurmtes *PD* 0:00 1
>> (Resources)
>>
>> 217 gpu gpu-job slurmtes *R* 0:07 1
>> c-a100-cn01
>>
>>
>>
>> Basically I'm seeking some help/hints on how to tell Slurm, from the
>> batch script for example, "I want only 1 or 2 GPUs to be used/consumed by
>> the job", so that I can run the batch script/job a couple of times with the
>> sbatch command and confirm that we can indeed have multiple jobs each using
>> a GPU and running in parallel, at the same time.
>>
>>
>>
>> Makes sense ?
>>
>>
>>
>>
>>
>> Best regards,
>>
>>
>>
>> *Hafedh *
>>
>>
>>
>> *From:* slurm-users <slurm-users-bounces(a)lists.schedmd.com> *On Behalf
>> Of *Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
>> *Sent:* Thursday, 18 January 2024 2:30 PM
>> *To:* Slurm User Community List <slurm-users(a)lists.schedmd.com>
>> *Subject:* Re: [slurm-users] Need help with running multiple
>> instances/executions of a batch script in parallel (with NVIDIA HGX A100
>> GPU as a Gres)
>>
>>
>>
>> On Jan 18, 2024, at 7:31 AM, Matthias Loose <m.loose(a)mindcode.de> wrote:
>>
>>
>>
>> Hi Hafedh,
>>
>> I'm no expert on the GPU side of Slurm, but looking at your current
>> configuration, to me it is working as intended at the moment. You have defined
>> 4 GPUs and start multiple jobs, each consuming 4 GPUs. So the jobs wait
>> for the resource to be free again.
>>
>> I think what you need to look into is the MPS plugin, which seems to do
>> what you are trying to achieve:
>> https://slurm.schedmd.com/gres.html#MPS_Management
>>
>>
>>
>> I agree with the first paragraph. How many GPUs are you expecting each
>> job to use? I'd have assumed, based on the original text, that each job is
>> supposed to use 1 GPU, and the 4 jobs were supposed to be running
>> side-by-side on the one node you have (with 4 GPUs). If so, you need to
>> tell each job to request only 1 GPU, and currently each one is requesting 4.
>>
>>
>>
>> If your jobs are actually supposed to be using 4 GPUs each, I still don't
>> see any advantage to MPS (at least in what is my usual GPU usage pattern):
>> all the jobs will take longer to finish, because they are sharing the fixed
>> resource. If they take turns, at least the first ones finish as fast as
>> they can, and the last one will finish no later than it would have if they
>> were all time-sharing the GPUs. I guess NVIDIA had something in mind when
>> they developed MPS, so I guess our pattern may not be typical (or at least
>> not universal), and in that case the MPS plugin may well be what you need.
>>
>
>
> --
> Mohammed
>
Hi all,
I am having an issue with the new version of Slurm, 23.11.0-1.
I had already installed and configured slurm 23.02.3-1 on my cluster and
all the services were active and running properly.
After installing the new version of Slurm with the same procedure, the
slurmctld and slurmdbd daemons fail to start, all with the same error:
(code=exited, status=217/USER)
Investigating the problem with the command journalctl -xe, I find:
slurmctld.service: Failed to determine user credentials: No such process
slurmctld.service: Failed at step USER spawning /usr/sbin/slurmctld: No
such process
I had a look at the slurmctld.service file for both the slurm versions and
I found the following differences in the [Service] section.
From the slurmctld.service file of slurm 23.02.3-1:
[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmctld
EnvironmentFile=-/etc/default/slurmctld
ExecStart=/usr/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
TasksMax=infinity
From the slurmctld.service file of slurm 23.11.0-1:
[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/slurmctld
EnvironmentFile=-/etc/default/slurmctld
User=slurm
Group=slurm
ExecStart=/usr/sbin/slurmctld --systemd $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
TasksMax=infinity
I think the presence of the new lines regarding the slurm user might be
the problem, but I am not sure and I have no idea how to solve it.
Can anyone help me?
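(One thing worth checking, offered only as a guess based on the new User=slurm/Group=slurm lines: status=217/USER usually means systemd could not resolve that account on the controller host. A rough sketch:)

id slurm                                                        # does the account resolve on this host?
useradd --system --no-create-home --shell /sbin/nologin slurm   # if missing; pick a uid consistent across the cluster
systemctl daemon-reload
systemctl restart slurmdbd slurmctld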
Thanks in advance,
Miriam
Recently, I have built an HPC cluster with Slurm as the workload manager. Test
jobs with quantum chemistry codes have worked fine. However, production
jobs with lammps have shown unexpected behavior: when the first job
completes, normally or not, it causes the termination of the others on the
same compute node. Initially, I thought that was due to an MPI malfunction,
but this behavior is also observed for the serial lammps code. The lammps group
told me that this behavior could be generated by Slurm. My question to you is
which parameter in slurm.conf could be responsible for the termination
of the other jobs. I am using an epilogue script that works normally on
another cluster.
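(A guess rather than a confirmed cause: a common source of this symptom is an epilog that cleans up all of the submitting user's processes on the node, which also kills their other jobs there. A hedged sketch of a guard, assuming the usual SLURM_JOB_USER and SLURMD_NODENAME epilog environment variables are available:)

#!/bin/bash
# Only clean up the user's leftover processes if they have no other
# running jobs on this node.
if [ -z "$(squeue -h -u "$SLURM_JOB_USER" -w "$SLURMD_NODENAME" -t running)" ]; then
    pkill -u "$SLURM_JOB_USER" || true
fi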
Thanks.
Hi,
What are potential bad side effects of using a large/larger MessageTimeout?
And is there a value at which this setting is too large (long)?
Thanks,
Herc
Hello
I started a new AMD node, and the error is as follows:
"CPU frequency setting not configured for this node"
The extended log looks like this:
[2024-01-18T18:28:06.682] CPU frequency setting not configured for this node
[2024-01-18T18:28:06.691] slurmd started on Thu, 18 Jan 2024 18:28:06 +0200
[2024-01-18T18:28:06.691] CPUs=128 Boards=1 Sockets=1 Cores=64 Threads=2
Memory=256786 TmpDisk=875797 Uptime=4569 CPUSpecList=(null)
FeaturesAvail=(null) FeaturesActive=(null)
In the configuration file I have the following:
NodeName=awn-1[04] NodeAddr=192.168.4.[111] CPUs=128 RealMemory=256000
Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 Feature=HyperThread
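(A hedged aside, not a confirmed fix: this message generally just means slurmd could not find a usable cpufreq interface for the node, so one sanity check is whether the kernel exposes one at all:)

ls /sys/devices/system/cpu/cpu0/cpufreq/
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver    # e.g. acpi-cpufreq or amd-pstate

(If frequency control is actually wanted, the slurm.conf parameters CpuFreqDef and CpuFreqGovernors are the related knobs; if not, the message can usually be ignored.)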
Could you please help me?
Thank you
Felix
--
Dr. Eng. Farcas Felix
National Institute of Research and Development of Isotopic and Molecular Technology,
IT - Department - Cluj-Napoca, Romania
Mobile: +40742195323
If you run "scontrol show jobid <jobid>" of your pending job with the "(Resources)" tag you may see more about what is unavailable to your job. Slurm default configs can cause an entire compute node of resources to be "allocated" to a running job regardless of whether it needs all of them or not so you may need to alter one or both of the following settings to allow more than one job to run on a single node at once. You'll find these in your slurm.conf. Don't forget to "scontrol reconf"…
[View More] and even potentially restart both "slurmctld" & "slurmd" on your nodes if you do end up making changes.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
I hope this helps.
Kind regards,
Jason
----
Jason Macklin
Manager Cyberinfrastructure, Research Cyberinfrastructure
860.837.2142 t | 860.202.7779 m
jason.macklin(a)jax.org
The Jackson Laboratory
Maine | Connecticut | California | Shanghai
www.jax.org
The Jackson Laboratory: Leading the search for tomorrow's cures
________________________________
From: slurm-users <slurm-users-bounces(a)lists.schedmd.com> on behalf of slurm-users-request(a)lists.schedmd.com <slurm-users-request(a)lists.schedmd.com>
Sent: Thursday, January 18, 2024 9:46 AM
To: slurm-users(a)lists.schedmd.com <slurm-users(a)lists.schedmd.com>
Subject: [BULK] slurm-users Digest, Vol 75, Issue 26
Send slurm-users mailing list submissions to
slurm-users(a)lists.schedmd.com
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.schedmd.com/cgi-bin/mailman/listinfo/slurm-users
or, via email, send a message with subject or body 'help' to
slurm-users-request(a)lists.schedmd.com
You can reach the person managing the list at
slurm-users-owner(a)lists.schedmd.com
When replying, please edit your Subject line so it is more specific
than "Re: Contents of slurm-users digest..."
Today's Topics:
1. Re: Need help with running multiple instances/executions of a
batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)
(Baer, Troy)
----------------------------------------------------------------------
Message: 1
Date: Thu, 18 Jan 2024 14:46:48 +0000
From: "Baer, Troy" <troy(a)osc.edu>
To: Slurm User Community List <slurm-users(a)lists.schedmd.com>
Subject: Re: [slurm-users] Need help with running multiple
instances/executions of a batch script in parallel (with NVIDIA HGX
A100 GPU as a Gres)
Message-ID:
<CH0PR01MB6924127AF471DED69151805BCF712(a)CH0PR01MB6924.prod.exchangelabs.com>
Content-Type: text/plain; charset="utf-8"
Hi Hafedh,
Your job script has the sbatch directive "--gpus-per-node=4" set. I suspect that if you look at what's allocated to the running job by doing "scontrol show job <jobid>" and looking at the TRES field, it's been allocated 4 GPUs instead of one.
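For example (using job 217 from the thread; the exact field names can vary a little between Slurm versions), something like the following should show the GPU count actually granted:

scontrol show job 217 | grep -i tres
# expect to see gres/gpu=4 here if the job really was given four GPUs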
Regards,
--Troy
From: slurm-users <slurm-users-bounces(a)lists.schedmd.com> On Behalf Of Kherfani, Hafedh (Professional Services, TC)
Sent: Thursday, January 18, 2024 9:38 AM
To: Slurm User Community List <slurm-users(a)lists.schedmd.com>
Subject: Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)
Hi Noam and Matthias,
Thanks both for your answers.
I changed the "#SBATCH --gres=gpu:4" directive (in the batch script) to "#SBATCH --gres=gpu:1" as you suggested, but it didn't make a difference: running this batch script 3 times still results in the first job being in a running state, while the second and third jobs remain in a pending state.
[slurmtest@c-a100-master test-batch-scripts]$ cat gpu-job.sh
#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --gres=gpu:1              # <<<< Changed from "4" to "1"
#SBATCH --tasks-per-node=1
#SBATCH --output=gpu_job_output.%j
#SBATCH --error=gpu_job_error.%j
hostname
date
sleep 40
pwd
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 217
[slurmtest@c-a100-master test-batch-scripts]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
217 gpu gpu-job slurmtes R 0:02 1 c-a100-cn01
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 218
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 219
[slurmtest@c-a100-master test-batch-scripts]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
219 gpu gpu-job slurmtes PD 0:00 1 (Priority)
218 gpu gpu-job slurmtes PD 0:00 1 (Resources)
217 gpu gpu-job slurmtes R 0:07 1 c-a100-cn01
Basically I'm seeking some help/hints on how to tell Slurm, from the batch script for example, "I want only 1 or 2 GPUs to be used/consumed by the job", so that I can run the batch script/job a couple of times with the sbatch command and confirm that we can indeed have multiple jobs each using a GPU and running in parallel, at the same time.
Makes sense ?
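For reference, a minimal version of the script that follows the advice earlier in the thread (keep --gpus-per-node at 1 and drop the conflicting --gres line) might look like this; untested sketch:

#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1        # request a single GPU per node
#SBATCH --tasks-per-node=1
#SBATCH --output=gpu_job_output.%j
#SBATCH --error=gpu_job_error.%j
hostname
date
sleep 40
pwd

With this, up to four such jobs should be able to run side by side on the 4-GPU node, memory permitting.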
Best regards,
Hafedh
From: slurm-users <slurm-users-bounces(a)lists.schedmd.com<mailto:slurm-users-bounces@lists.schedmd.com>> On Behalf Of Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Sent: Thursday, 18 January 2024 2:30 PM
To: Slurm User Community List <slurm-users(a)lists.schedmd.com<mailto:slurm-users@lists.schedmd.com>>
Subject: Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)
On Jan 18, 2024, at 7:31 AM, Matthias Loose <m.loose(a)mindcode.de<mailto:m.loose@mindcode.de>> wrote:
Hi Hafedh,
I'm no expert on the GPU side of Slurm, but looking at your current configuration, to me it is working as intended at the moment. You have defined 4 GPUs and start multiple jobs, each consuming 4 GPUs. So the jobs wait for the resource to be free again.
I think what you need to look into is the MPS plugin, which seems to do what you are trying to achieve:
https://slurm.schedmd.com/gres.html#MPS_Management
I agree with the first paragraph. How many GPUs are you expecting each job to use? I'd have assumed, based on the original text, that each job is supposed to use 1 GPU, and the 4 jobs were supposed to be running side-by-side on the one node you have (with 4 GPUs). If so, you need to tell each job to request only 1 GPU, and currently each one is requesting 4.
If your jobs are actually supposed to be using 4 GPUs each, I still don't see any advantage to MPS (at least in what is my usual GPU usage pattern): all the jobs will take longer to finish, because they are sharing the fixed resource. If they take turns, at least the first ones finish as fast as they can, and the last one will finish no later than it would have if they were all time-sharing the GPUs. I guess NVIDIA had something in mind when they developed MPS, so I guess our pattern may not be typical (or at least not universal), and in that case the MPS plugin may well be what you need.
Hello all,
Is there an env variable in SLURM to tell where the slurm.conf is?
We would like to have, on the same client node, two possible types of submission addressing two different clusters.
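(The environment variable is SLURM_CONF; a minimal sketch, with the path below being only an example:)

export SLURM_CONF=/etc/slurm-clusterB/slurm.conf
sinfo          # now talks to the controller named in that file

A small wrapper script or shell alias per cluster is one common way to switch between the two configurations.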
Thanks in advance,
Christine
Hi,
In my HPC center, I found a Slurm job that was submitted with --gres=gpu:6, whereas the cluster has only four GPUs per node. It is a parallel job. Here is a printout of some relevant fields:
AllocCPUS 30
AllocGRES gpu:6
AllocTRES billing=30,cpu=30,gres/gpu=6,node=3
CPUTime 1-01:23:00
CPUTimeRAW 91380
Elapsed 00:50:46
JobID 20073
JobIDRaw 20073
JobName simple_cuda
NCPUS 30
NGPUS 6.0
What happened in this case? This job was asking for 3 nodes, 10 cores per node. When the user specified "--gres=gpu:6", does this mean six GPUs for the entire job, or six GPUs per node? Per the description in https://slurm.schedmd.com/gres.html#Running_Jobs, gres is "Generic resources required per node", so it would be illogical to request six GPUs per node. So what happened? Did Slurm quietly ignore the request and grant just one, or grant the max number (4)? Because apparently the job ran without error.
Wirawan Purwanto
Computational Scientist, HPC Group
Information Technology Services
Old Dominion University
Norfolk, VA 23529
Dear All,
I tried to implement a strict limit on the GrpTRESMins for
each user. The effect I'm trying to achieve is that after the
limit of GPU minutes is reached, no new jobs can be run.
No decay, no automatic resource replenishment. After the
limit on GPU minutes is reached, each user should ask for
more minutes.
But despite exceeding the limits users *can* run new jobs.
* When I'm adding a user to the cluster I set:
sacctmgr --immediate add user name=...
...
QOS=2gpu2d
GrpTRESMins=gres/gpu=20000
* In the "slurm.conf" ("safe" means limits and associations
are automatically set). Storage is MariaDB with SlurmDBD:
GresTypes=gpu
AccountingStorageTRES=gres/gpu
AccountingStorageEnforce=qos,safe
# This disables GPU minutes replenishing.
PriorityDecayHalfLife=0
PriorityUsageResetPeriod=NONE
But when I look at a user's account info and usage, you can
see that the limits are not enforced.
Account User Partition QOS GrpTRESMins
---------- ---------------- ------------ ------------ --------------------
redacted redacted a6000 2gpu2d
gres/gpu=10000
--------------------------------------------------------------------------------
Top 1 Users 2024-01-05T00:00:00 - 2024-01-17T19:59:59 (1108800 secs)
Usage reported in TRES Minutes
--------------------------------------------------------------------------------
Login Used TRES Name
------------ -------- ----------------
redacted 184311 gres/gpu
redacted 1558558 cpu
Could someone explain where the problem could be? Am I missing
something? Apparently yes :)
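(As a first debugging step, and only a suggestion: it is worth confirming that the limit is actually stored where the jobs are accounted, e.g. with something like)

sacctmgr show assoc where user=<login> format=Cluster,Account,User,QOS,GrpTRESMins
sacctmgr show qos where name=2gpu2d format=Name,GrpTRESMins,MaxTRESMins

(Limits can live on either the QOS or the user's association, and the two places are easy to mix up.)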
Kind regards
--
Kamil Wilczek [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]
I would like to add a preemptable queue to our cluster. Actually, I already
have. We simply want jobs submitted to that queue to be preempted if there are
no resources available for jobs in the other (high-priority) queues.
Conceptually very simple: no conditionals, no choices, just what I wrote.
However, it does not work as desired.
This is the relevant part:
grep -i Preemp /opt/slurm/slurm.conf
#PreemptType = preempt/partition_prio
PartitionName=regular DefMemPerCPU=4580 Default=True Nodes=node[01-12] State=UP
PreemptMode=off PriorityTier=200
PartitionName=All DefMemPerCPU=4580 Nodes=node[01-36] State=UP
PreemptMode=off PriorityTier=500
PartitionName=lowpriority DefMemPerCPU=4580 Nodes=node[01-36] State=UP
PreemptMode=cancel PriorityTier=100
That PreemptType setting (now commented) fully breaks slurm, everything
refuses to run with errors like
$ squeue
squeue: error: PreemptType and PreemptMode values incompatible
squeue: fatal: Unable to process configuration file
If I understand correctly the documentation at
https://slurm.schedmd.com/preempt.html that is because preemption cannot
cancel jobs based on partition priority, which (if true) is really
unfortunate. I understand that allowing cross-partition time-slicing could
be tricky and so I understand why that isn't allowed, but cancelling?
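(A hedged aside on the error itself, assuming it comes from the cluster-wide default: with PreemptType=preempt/partition_prio the global PreemptMode cannot stay at its default of OFF, so a configuration along these lines is the usual shape; per-partition PreemptMode then overrides it for partitions that must never be preempted. The "..." stands for the rest of the existing partition options:)

PreemptType=preempt/partition_prio
PreemptMode=CANCEL
PartitionName=regular ... PriorityTier=200 PreemptMode=off
PartitionName=All ... PriorityTier=500 PreemptMode=off
PartitionName=lowpriority ... PriorityTier=100 PreemptMode=cancel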
Anyway, I have a few questions:
1) is that correct and so should I avoid using either partition priority or
cancelling?
2) is there an easy way to trick slurm into requeing and then have those
jobs cancelled instead?
3) I guess the cleanest option would be to implement QoS, but I've never
done it and we don't really need it for anything else other than this. The
documentation looks complicated, but is it? The great Ole's website is
unavailable at the moment...
Thanks!!
Yes, that makes sense. Thank you!
What am I misunderstanding about how sacct filtering works here? I would have expected the second command to show the exact same results as the first.
[root@mickey ddrucker]# sacct --starttime $(date -d "7 days ago" +"%Y-%m-%d") -X --format JobID,JobName,State,Elapsed --name zsh
JobID JobName State Elapsed
------------ ---------- ---------- ----------
257713 zsh COMPLETED 00:01:02
257714 zsh COMPLETED 00:04:01
257715 zsh COMPLETED 00:03:01
257716 zsh COMPLETED 00:03:01
[root@mickey ddrucker]# sacct --starttime $(date -d "7 days ago" +"%Y-%m-%d") -X --format JobID,JobName,State,Elapsed --name zsh --state COMPLETED
JobID JobName State Elapsed
------------ ---------- ---------- ----------
[root@mickey ddrucker]# sinfo --version
slurm 21.08.8-2
--
Daniel M. Drucker, Ph.D.
Director of IT, MGB Imaging at Belmont
McLean Hospital, a Harvard Medical School Affiliate
> All I can say is that this has to do with --starttime and that you have to read the manual really carefully about how they interact, including when you have --endtime set. It’s a bit fiddly and annoying, IMO, and I can never quite remember how it works.
Oh, I think I understand. --starttime actually behaves differently when --state is present:
If states are given with the '-s' option then only jobs in this state at this time will be returned.
So is there a way to do what I want? I want to see jobs which
- started later than 7 days ago
- whose state is COMPLETED
Surely that's possible without resorting to grep?
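(For what it's worth, a hedged suggestion: when --state is given, sacct also wants an explicit end time, so adding -E usually restores the expected window, e.g.)

sacct -S $(date -d "7 days ago" +%F) -E now -X --name zsh --state COMPLETED --format JobID,JobName,State,Elapsed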
Daniel
We have shuttered two clusters and need to remove them from the database. To do this, do we remove the table spaces associated with the cluster names from the Slurm database?
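(A hedged note rather than confirmed advice: rather than touching tables by hand, sacctmgr has a supported way to do this, along the lines of)

sacctmgr show cluster
sacctmgr delete cluster old_cluster_name

(where old_cluster_name is a placeholder for each shuttered cluster's name; this removes the cluster and its associations from the accounting database.)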
Thanks,
Jeff
All good ideas Mick –
- I've restarted slurmd on all nodes – no effect
- Ran this on all nodes:
#!/bin/bash
uname -n
id slurm
id 59999
scontrol show config | grep SlurmUser
All show slurm being that 59999 user.
- The firewalld already has the internal network interface being used set to the trusted zone
I get a bit more info out of setting the slurmctld to debug level, but I'm not sure what to make of it TBH. I'm not sure what "_handle_mult_rc_ret: PERSIST_RC is 2002 from DBD_SEND_MULT_MSG(1474)" is trying to tell me.
Jan 10 10:48:26 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:48:31 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:48:35 kirby slurmctld[461138]: slurmctld: debug: accounting_storage/slurmdbd: _handle_mult_rc_ret: PERSIST_RC is 2002 from DBD_SEND_MULT_MSG(1474): DBD_SEND_MULT_MSG message from invalid uid
Jan 10 10:48:36 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:48:41 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:48:46 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:48:46 kirby slurmctld[461138]: slurmctld: debug: sched/backfill: _attempt_backfill: beginning
Jan 10 10:48:46 kirby slurmctld[461138]: slurmctld: debug: sched/backfill: _attempt_backfill: no jobs to backfill
Jan 10 10:48:51 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:48:53 kirby slurmctld[461138]: slurmctld: debug: accounting_storage/slurmdbd: _handle_mult_rc_ret: PERSIST_RC is 2002 from DBD_SEND_MULT_MSG(1474): DBD_SEND_MULT_MSG message from invalid uid
Jan 10 10:48:56 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:49:01 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:49:06 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:49:11 kirby slurmctld[461138]: slurmctld: debug: accounting_storage/slurmdbd: _handle_mult_rc_ret: PERSIST_RC is 2002 from DBD_SEND_MULT_MSG(1474): DBD_SEND_MULT_MSG message from invalid uid
Jan 10 10:49:11 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:49:16 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:49:17 kirby slurmctld[461138]: slurmctld: debug: sched: Running job scheduler for full queue.
Jan 10 10:49:21 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
A bit more info / another possible clue. While "sacctmgr list Account" or "sacctmgr list user" shows expected account groups and users, "sreport user top start=12/1/23" and "sreport cluster utilization start=12/1/23" both report empty tables.
Craig Stark, Ph.D.
Professor, Department of Neurobiology and Behavior
Director, Facility for Imaging and Brain Research (FIBRE)
Director, Campus Center for Neuroimaging (CCNI)
School of Biological Sciences, University of California, Irvine
cestark(a)uci.edu<mailto:cestark@uci.edu>
I'm just learning about Slurm. I understand that different
partitions can be prioritized separately, and can have different max time
limits. I was wondering whether or not there was a way to have a
finer-grained prioritization based on the time limit specified by a job,
within a single partition. Or perhaps this is already happening by default?
Would the backfill scheduler be best for this?
This ticket with SchedMD implies it's a munged issue:
https://bugs.schedmd.com/show_bug.cgi?id=1293
Is the munge daemon running on all systems? If it is, are all servers running a network time daemon such as chronyd or ntpd, and is the time in sync on all hosts?
Thanks Mick,
munge is seemingly running on all systems (systemctl status munge). I do get a warning about the munge file changing on disk, but I'm pretty sure that's from warewulf sync'ing files every minute. A sha256sum on the munge.key file on the compute nodes and host node says they're the same, so I think I can put that aside.
The management node runs chrony and the compute nodes sync to the management node.
[root@kirby uber]# chronyc tracking
Reference ID : 4A06A849 (t2.time.gq1.yahoo.com)
Stratum : 3
Ref time (UTC) : Mon Jan 08 22:26:44 2024
System time : 0.000032525 seconds slow of NTP time
Last offset : -0.000021390 seconds
RMS offset : 0.000055729 seconds
Frequency : 38.797 ppm slow
Residual freq : +0.001 ppm
Skew : 0.018 ppm
Root delay : 0.033342984 seconds
Root dispersion : 0.000524800 seconds
Update interval : 256.8 seconds
Leap status : Normal
vs
[root@sonic01 ~]# chronyc tracking
Reference ID : C0A80102 (warewulf)
Stratum : 4
Ref time (UTC) : Mon Jan 08 22:31:02 2024
System time : 0.000000120 seconds slow of NTP time
Last offset : -0.000000092 seconds
RMS offset : 0.000014737 seconds
Frequency : 47.495 ppm slow
Residual freq : +0.000 ppm
Skew : 0.066 ppm
Root delay : 0.033458963 seconds
Root dispersion : 0.000283949 seconds
Update interval : 64.2 seconds
Leap status : Normal
So, the compute node is talking to the host and the host is talking to generic NTP sources. "date" shows the same time on the compute nodes
3rd time trying to get this to come through to the list - hopefully this time works.
I've been running Slurm for several years now, but in setting it up on a new cluster, I'm hitting a recurring issue. I'm using MariaDB and configured it just as I had in my several-year-old setup and as in the docs. There's a "slurm" user (59999) on the OS (Rocky 9) that's on all the nodes, and I've added the slurm@localhost user as instructed (grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by 'PASSWORD'). But I keep getting things like this:
```
Dec 22 14:22:07 kirby slurmdbd[14518]: slurmdbd: error: DBD_SEND_MULT_MSG message from invalid uid 59999
Dec 22 14:22:07 kirby slurmdbd[14518]: slurmdbd: error: Processing last message from connection 7(192.168.1.2) uid(59999)
Dec 22 14:22:07 kirby slurmdbd[14518]: slurmdbd: error: CONN:7 DBD_REGISTER_CTLD message from invalid uid 59999
Dec 22 14:22:07 kirby slurmdbd[14518]: slurmdbd: error: CONN:7 Security violation, DBD_REGISTER_CTLD
Dec 22 14:22:07 kirby slurmdbd[14518]: slurmdbd: error: Processing last message from connection 7(192.168.1.2) uid(59999)
```
I'm a total SQL noob, but can at least verify that the user is in there:
MariaDB [(none)]> SELECT User, Host, Password FROM mysql.user;
+-------------+-----------+-------------------------------------------+
| User | Host | Password |
+-------------+-----------+-------------------------------------------+
| mariadb.sys | localhost | |
| root | localhost | invalid |
| mysql | localhost | invalid |
| slurm | localhost | *D6665ECF4F3CB12BCA836117F7727B6D0B78D644 |
+-------------+-----------+-------------------------------------------+
4 rows in set (0.002 sec)
Any thoughts as to where I might look to fix this?
Craig
Dear All,
I have a question regarding the fair-share factor of the multifactor
priority algorithm. My current understanding is that the fair-share
makes sure that different *accounts* have a fair share of the
computational power.
But what if my organisation's structure is flat and I have only one
account where all my users reside? Does the fair-share algorithm still work
in this situation -- does it take into account users (associations)
from this single account and try to assign a fair-share factor to each
user? Or does each user from this account have the same fair-share factor at
each iteration?
And what if I have, say, 3 accounts, but I do not want to calculate
fair-share between accounts, but between all associations from all
3 accounts? In other words, is there a fair-share factor for
users/associations instead of accounts?
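(For checking what is happening in practice, a small suggestion: the per-association picture, with one line per user inside each account, can be inspected with something like)

sshare -a -l

(which lists every association with its normalized shares, effective usage and resulting FairShare factor.)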
Kind regards
--
Kamil Wilczek [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]
Hi all,
Happy new year everyone!
I've been looking for a simple tool that reports how many resources are
actually consumed by a job, to help my colleagues and me adjust job
requirements. I could not find such a tool, or the ones mentioned on
this ML were not easy to install and use, so I have written a new one:
https://github.com/CEA-LIST/sprofile
It's a simple Python script which parses cgroup data and NVML data from the
NVIDIA driver. It reports duration, CPU load, peak RAM, GPU load and peak
GPU memory like so:
-- sprofile report (node03) --
Time:          0:00:03 / 1:00:00
CPU load:      2.0 / 4.0
RAM peak mem:  7G / 8G
GPU load:      0.2 / 2.0
GPU peak mem:  7G / 40G
The requirements are to use the slurm cgroup plugin and to enable
accounting on the GPU (nvidia-smi --accounting-mode=1).
I hope you find this useful; let me know if you find bugs or want to
contribute.
Regards,
Nicolas Granger
Hello,
We are soon to install new Slurm cluster at our site. That means that we will have a total of three clusters running Slurm. Only two, that is the new clusters, will share a common file system. The original cluster has its own file system is independent of the new arrivals. If possible, we would like to try to prevent users from making significant user of all the clusters and get a 'triple whammy'. In other words, is there any way to share the fairshare information between the clusters …
[View More]so that a user's use of one of the clusters impacts their usage on the other clusters – if that makes sense. Does anyone have any thoughts on this question, please?
Am I correct in thinking that federating clusters is related to my question? Do I gather correctly, however, that federation only works if there is a common database on a shared file system?
Best regards,
David
I get an HTTP 404 when I try to GET /slurmdb/v0.0.39/clusters or any other /slurmdb endpoint. I get this against multiple versions of Slurm, including 23.11.1. Using GET against /slurm/v0.0.39/ping works just fine. Is there something I need to do to turn slurmdb endpoints on?
--
Gary
In my case I would like to use a Slurm cluster as a CI/CD-like solution for building software images.
My current scripted full-system build takes 3-5 hours and is done serially. We could easily find places where we can choose to build things in parallel, hence the idea is to spawn parallel builds on the Linux Slurm cluster.
Example: we have a list of images we iterate over in a for loop to build each thing; the steps are: cd somedir, then type make or run a shell script in that directory.
The last step after the for loop would be to wait for all of the child builds to complete.
Once all child jobs are done we have a single job that combines or packages all the intermediate images.
We really want to use Slurm because our FPGA team will have a giant Slurm Linux cluster for Xilinx FPGA builds, and those nodes can do what we need for software purposes (reusing the existing cluster is a huge win for us).
My question is this:
Can somebody point me to some software build examples using Slurm? All I can seem to find is how to install it.
I see the srun and sbatch command man pages but no good examples.
Bonus would be something that integrates into a gitlab runner example or Jenkins in some way
All I can seem to find is how to install and administer Slurm, not how to use it.
Sent from my iPhone
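(A hedged sketch of one way to do this, with all directory, script and job names being hypothetical; it submits one build job per image with sbatch --parsable, then a packaging job that only starts when every build has succeeded:)

#!/bin/bash
# Submit each image build as its own Slurm job and collect the job IDs.
deps=""
for dir in image_a image_b image_c; do
    jid=$(sbatch --parsable --job-name="build-$dir" --wrap "make -C $dir")
    deps="$deps:$jid"
done

# Final packaging job runs only after all builds completed successfully.
sbatch --job-name=package --dependency=afterok${deps} --wrap "./package_images.sh"

(sbatch --wait, srun jobs backgrounded with a shell wait, or a job array are other reasonable shapes; a GitLab runner or Jenkins agent would typically just invoke a script like this on a login node.)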
Hi all,
A problem on slurm-23.02.4-1, 10.6.16-MariaDB; Maria and Slurmctld in
active/active, SlurmDB in active/off, shared IP. Shared spool via Gluster.
DB is an upgraded version of Slurm from somewhere 2017 (upgraded various
times). The question is whether we should give up and start from scratch or
if there's an easy fix.
Problem: whenever we add a new user and add it to sacctmgr, the user shows
up properly in sacct/sacctmgr – but never shows up in the sshare commands
after running some jobs. After restarting Slurm a couple of times it shows
up. The problem seems to have been there in the previous version as well.
Only error we can see in slurmdb log:
[2023-12-21T09:43:30.586] error: slurm_persist_conn_open: Something
happened with the receiving/processing of the persistent connection init
message to 10.141.255.253:6817
: (null)
[2023-12-21T09:43:30.586] error: slurmdb_send_accounting_update_persist:
Unable to open connection to registered cluster cluster.
[2023-12-21T09:43:30.586] error: slurm_receive_msg: No response to
persist_init
[2023-12-21T09:43:30.586] error: update cluster: No error to cluster at
10.141.255.253(6817)
[2023-12-21T09:43:30.586] debug2: DBD_FINI: CLOSE:1 COMMIT:0
[2023-12-21T09:43:30.586] debug4: accounting_storage/as_mysql:
acct_storage_p_commit: got 0 commits
AccountingStorageType=accounting_storage/slurmdbd
# jobaccounting
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldTimeout=60
SlurmdTimeout=60
TCPTimeout=60
MessageTimeout=60
Best regards,
Alex
Hi all,
I had a slurm partition gpu_gmx with the following configuration (Slurm version: 20.11.9):
> NodeName=node[09-11] Gres=gpu:rtx4080:1 Sockets=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62000 State=UNKNOWN
> NodeName=node[12-14] Gres=gpu:rtx4070ti:1 Sockets=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62000 State=UNKNOWN
> PartitionName=gpu_gmx Nodes=node[09-14] Default=NO MaxTime=UNLIMITED State=UP
A job running on node11 had a problem, which then triggered the reboot of all nodes (node[09-14]) within the same partition (cat /var/log/slurmctld.log):
> [2023-12-26T23:04:23.200] Batch JobId=25061 missing from batch node node11 (not found BatchStartTime after startup), Requeuing job
> [2023-12-26T23:04:23.200] _job_complete: JobId=25061 WTERMSIG 126
> [2023-12-26T23:04:23.200] _job_complete: JobId=25061 cancelled by node failure
> [2023-12-26T23:04:23.200] _job_complete: requeue JobId=25061 due to node failure
> [2023-12-26T23:04:23.200] _job_complete: JobId=25061 done
> [2023-12-26T23:04:23.200] validate_node_specs: Node node11 unexpectedly rebooted boot_time=1703603052 last response=1703602983
> [2023-12-26T23:04:23.222] validate_node_specs: Node node09 unexpectedly rebooted boot_time=1703603052 last response=1703602983
> [2023-12-26T23:04:23.579] Batch JobId=25060 missing from batch node node10 (not found BatchStartTime after startup), Requeuing job
> [2023-12-26T23:04:23.579] _job_complete: JobId=25060 WTERMSIG 126
> [2023-12-26T23:04:23.579] _job_complete: JobId=25060 cancelled by node failure
> [2023-12-26T23:04:23.579] _job_complete: requeue JobId=25060 due to node failure
> [2023-12-26T23:04:23.579] _job_complete: JobId=25060 done
> [2023-12-26T23:04:23.579] validate_node_specs: Node node10 unexpectedly rebooted boot_time=1703603052 last response=1703602983
> [2023-12-26T23:04:23.581] validate_node_specs: Node node14 unexpectedly rebooted boot_time=1703603051 last response=1703602983
> [2023-12-26T23:04:23.654] validate_node_specs: Node node13 unexpectedly rebooted boot_time=1703603052 last response=1703602983
> [2023-12-26T23:04:24.681] validate_node_specs: Node node12 unexpectedly rebooted boot_time=1703603053 last response=1703602983
> [2023-12-27T04:46:42.461] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25060 uid 0
> [2023-12-27T04:46:43.822] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25061 uid 0
The operating systems are CentOS 7.9.2009 on the master node, and CentOS 8.5.2111 on node[09-14]. Does anyone have a similar experience or a clue how to resolve this?
Thanks in advance.
Best,
Jinglei