[slurm-users] CR_Core_Memory behavior
Christoph Brüning
christoph.bruening at uni-wuerzburg.de
Wed Aug 26 13:26:58 UTC 2020
Hello Durai,
you did not specify the amount of memory in your node configuration.
RealMemory defaults to 1 MB when unset, so your 1 MB job alone may
already consume all the memory the scheduler thinks the node has.
What does "scontrol show node slurm-gpu-1" say? Look for the
"RealMemory" field in the output.
Best,
Christoph
On 26/08/2020 11.35, Durai Arasan wrote:
> Hello,
>
> this is my node configuration:
>
> NodeName=slurm-gpu-1 NodeAddr=192.168.0.200 Procs=16 Gres=gpu:2
> State=UNKNOWN
> NodeName=slurm-gpu-2 NodeAddr=192.168.0.124 Procs=1 Gres=gpu:0
> State=UNKNOWN
> PartitionName=gpu Nodes=slurm-gpu-1 Default=NO MaxTime=INFINITE
> AllowAccounts=whitelist,gpu_users State=UP
> PartitionName=compute Nodes=slurm-gpu-1,slurm-gpu-2 Default=YES
> MaxTime=INFINITE AllowAccounts=whitelist State=UP
>
>
> and this is one of the job scripts. As you can see, --mem is set to
> 1M, so the request is minimal.
>
> #!/bin/bash
> #SBATCH -J Test1
> #SBATCH --nodelist=slurm-gpu-1
> #SBATCH --mem=1M
> #SBATCH --ntasks=1
> #SBATCH --cpus-per-task=1
> #SBATCH -o /home/centos/Test1-%j.out
> #SBATCH -e /home/centos/Test1-%j.err
> srun sleep 60
>
> Thanks,
> Durai
>
> On Wed, Aug 26, 2020 at 2:49 AM Jacqueline Scoggins
> <jscoggins at lbl.gov> wrote:
>
> Is the OverSubscribe variable set for your partitions?
> By default OverSubscribe=NO, which means that none of your cores will
> be shared with other jobs. With OverSubscribe set to YES or FORCE,
> you can append a number after FORCE to set how many jobs may run on
> each core of each node in the partition.
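> A hedged example of what that could look like in slurm.conf (the
> count of 4 is only an illustration):
>
> PartitionName=compute Nodes=slurm-gpu-1,slurm-gpu-2 Default=YES MaxTime=INFINITE AllowAccounts=whitelist OverSubscribe=FORCE:4 State=UP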
> Look at this page for a better understanding:
> https://slurm.schedmd.com/cons_res_share.html
>
> You can also check the OverSubscribe setting of a partition with the
> sinfo "%h" format option:
> sinfo -o '%P %.5a %.10h %N' | head
>
> PARTITION AVAIL OVERSUBSCR NODELIST
>
>
> Look at the sinfo options for further details.
>
>
> Jackie
>
>
> On Tue, Aug 25, 2020 at 9:58 AM Durai Arasan
> <arasan.durai at gmail.com> wrote:
>
> Hello,
>
> On our cluster we have SelectTypeParameters set to "CR_Core_Memory".
>
> Under this setting, multiple jobs should be able to run on the same
> node. But they refuse to be allocated to the same node: only one job
> runs on the node, and the rest of the jobs stay in pending state.
>
> When we changed SelectTypeParameters to "CR_Core", however, the
> issue was resolved: multiple jobs were successfully allocated to the
> same node and ran there concurrently.
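> For reference, the relevant slurm.conf lines look roughly like this
> (a sketch; SelectType could be select/cons_res or select/cons_tres):
>
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory  # node behaves as if exclusive
> #SelectTypeParameters=CR_Core        # jobs share the node as expected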
>
> Does anyone know why this happens? Why does including memory as a
> consumable resource lead to node-exclusive behavior?
>
> Thanks,
> Durai
>
--
Dr. Christoph Brüning
Universität Würzburg
Rechenzentrum
Am Hubland
D-97074 Würzburg
Tel.: +49 931 31-80499