I am writing to seek assistance with a critical issue on our single-node system managed by Slurm. Our jobs are queued and marked as awaiting resources, but they do not start even though resources appear to be available. I am new to Slurm; my only experience is a class on installing it, so I have no experience running or using it.
Issue Summary:
Main Problem: Of the jobs submitted, only one runs; the second stays pending and shows (Resources) in the NODELIST(REASON) column. I've checked that our single node has enough RAM (2 TB) and CPUs (64) available.
# COMPUTE NODES
NodeName=cusco CPUs=64 Sockets=2 CoresPerSocket=32 ThreadsPerCore=1 RealMemory=2052077 Gres=gpu:1,gpu:1,gpu:1,gpu:1
PartitionName=mainpart Default=YES MinNodes=1 DefaultTime=00:60:00 MaxTime=UNLIMITED AllowAccounts=ALL Nodes=ALL State=UP OverSubscribe=Force
System Details: We have a single-node setup with Slurm as the workload manager. The node appears to have sufficient resources for the queued jobs.
Troubleshooting Performed:
Configuration Checks: I have verified all Slurm configurations and the system's resource availability, which should not be limiting job execution.
Service Status: The Slurm daemon slurmdbd is active and running without any reported issues. System resource monitoring shows no shortages that would prevent job initiation.
Any guidance and help will be deeply appreciated!
What do “scontrol show node cusco” and “scontrol show job PENDING_JOB_ID” show?
As an example, one job we currently have is pending due to Resources: it requested 90 CPUs and 180 GB of memory, as seen in its ReqTRES= value, but the node it wants to run on has only 37 CPUs available (seen by comparing the node's CfgTRES= and AllocTRES= values).
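The comparison described above can be sketched in a few lines of Python. This is only an illustration, not a Slurm tool: the helper names `parse_tres` and `shortfalls` are made up here, and the TRES strings in the usage note are hypothetical values chosen to match the 90-requested versus 37-free example, not output from cusco. It assumes you paste in the CfgTRES= and AllocTRES= values from “scontrol show node” and the ReqTRES= value from “scontrol show job”.

```python
import re

# Multipliers to convert memory suffixes to megabytes (assumed K/M/G/T suffixes).
_UNIT_MB = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}


def parse_tres(tres):
    """Parse a TRES string such as 'cpu=64,mem=2052077M,gres/gpu=4' into a dict.

    Values with a K/M/G/T suffix are converted to megabytes; plain numbers
    are kept as they are. Entries that are not numeric are skipped.
    """
    parsed = {}
    for item in filter(None, tres.split(",")):
        key, _, value = item.partition("=")
        match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)", value)
        if match is None:
            continue
        number, unit = float(match.group(1)), match.group(2)
        parsed[key] = number * _UNIT_MB[unit] if unit else number
    return parsed


def shortfalls(cfg_tres, alloc_tres, req_tres):
    """Return {resource: (requested, free)} for each request that exceeds what is free.

    free = configured - allocated on the node. Resources the node does not
    list in CfgTRES, and the 'billing' weight, are ignored.
    """
    cfg, alloc, req = (parse_tres(s) for s in (cfg_tres, alloc_tres, req_tres))
    result = {}
    for key, wanted in req.items():
        if key == "billing" or key not in cfg:
            continue
        free = cfg[key] - alloc.get(key, 0)
        if wanted > free:
            result[key] = (wanted, free)
    return result
```

For instance, with the hypothetical values `shortfalls("cpu=128,mem=500000M,billing=128", "cpu=91,mem=100000M", "cpu=90,mem=180G,node=1,billing=90")`, the only entry returned is for `cpu`: 90 requested against 37 free. An empty result means the CPU and memory requests fit, and the reason for the pending state has to be looked for elsewhere.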
From: Alison Peterson via slurm-users slurm-users@lists.schedmd.com Date: Thursday, April 4, 2024 at 10:43 AM To: slurm-users@lists.schedmd.com slurm-users@lists.schedmd.com Subject: [slurm-users] SLURM configuration help
-- Alison Peterson IT Research Support Analyst Information Technology apeterson5@sdsu.edu O: 619-594-3364 San Diego State University | SDSU.edu 5500 Campanile Drive | San Diego, CA 92182-8080