After a Bright Computing/Base Command update, I'm encountering a slurm.conf error, as seen below. I tried removing the MemSpecLimit parameter from the node definitions, but the changes don't seem to be taking effect, even after restarting slurmd across the compute nodes and restarting slurmctld on the head node. I'm also suspicious of RealMemory being set to zero.
Any insight? Open to suggestions. Thanks ahead of time!
sbatch: error: NodeNames=ch3lahpccn[001-032] MemSpecLimit=0 is invalid, reset to 0
sbatch: error: NodeNames=ch3lahpcgpu1 MemSpecLimit=0 is invalid, reset to 0
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Hello,
On 2/6/25 9:50 PM, Chase Schuette via slurm-users wrote:
After a Bright Computing/Base Command update, I'm encountering a slurm.conf error, as seen below. I tried removing the MemSpecLimit parameter from the node definitions, but the changes don't seem to be taking effect, even after restarting slurmd across the compute nodes and restarting slurmctld on the head node. I'm also suspicious of RealMemory being set to zero.
Any insight? Open to suggestions. Thanks ahead of time!
You may find better help on the Nvidia forums, but typically with BCM the slurm.conf is partially auto-generated, including the node and partition definitions (be mindful of any comment blocks that mark those sections). Anything you hand-edit inside the generated section is liable to be overwritten by CMDaemon, which would explain why removing MemSpecLimit from the file didn't stick.
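From memory, the generated block is fenced by comments along these lines (exact wording and the parameters emitted vary by BCM version, so treat this purely as an illustration):

    # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
    NodeName=ch3lahpccn[001-032] ... RealMemory=... MemSpecLimit=...
    NodeName=ch3lahpcgpu1 ... RealMemory=... MemSpecLimit=...
    # END AUTOGENERATED SECTION -- DO NOT REMOVE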
The way you'd change those in a typical BCM+Slurm deployment is by using overlays. Here's an example cmsh path:
home;configurationoverlay;use "slurm-client";roles;use slurmclient;
From there, you should be able to show/get/set/clear the appropriate values. Just remember to commit your changes.
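A rough sketch of what that could look like (the parameter names below are illustrative; run "show" inside the slurmclient role to see what your BCM version actually exposes and whether realmemory/memspeclimit are settable there, and pick a realmemory value that matches your nodes):

    home; configurationoverlay; use "slurm-client"; roles; use slurmclient
    show
    clear memspeclimit
    set realmemory 385000
    commit

After the commit, CMDaemon should regenerate slurm.conf. You can then verify with "scontrol show node ch3lahpccn001" that RealMemory is non-zero and MemSpecLimit is sane; "slurmd -C" on a compute node prints the memory Slurm actually detects, which is a reasonable value to use for RealMemory.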
Best,