[slurm-users] CR_Core_Memory behavior

Durai Arasan arasan.durai at gmail.com
Wed Aug 26 09:35:55 UTC 2020


Hello,

This is my node configuration:

NodeName=slurm-gpu-1 NodeAddr=192.168.0.200 Procs=16 Gres=gpu:2 State=UNKNOWN
NodeName=slurm-gpu-2 NodeAddr=192.168.0.124 Procs=1 Gres=gpu:0 State=UNKNOWN
PartitionName=gpu Nodes=slurm-gpu-1 Default=NO MaxTime=INFINITE AllowAccounts=whitelist,gpu_users State=UP
PartitionName=compute Nodes=slurm-gpu-1,slurm-gpu-2 Default=YES MaxTime=INFINITE AllowAccounts=whitelist State=UP
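
For completeness, here is a rough sketch of how the node lines might also declare memory when memory is treated as a consumable resource. The RealMemory and DefMemPerCPU values below are made-up illustrations, not our real numbers:

# slurm.conf sketch (illustrative values only)
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
# RealMemory is the usable RAM of the node in MB; if it is left unset,
# Slurm assumes a very small default, so there is almost no memory to allocate.
NodeName=slurm-gpu-1 NodeAddr=192.168.0.200 Procs=16 RealMemory=64000 Gres=gpu:2 State=UNKNOWN
NodeName=slurm-gpu-2 NodeAddr=192.168.0.124 Procs=1 RealMemory=8000 Gres=gpu:0 State=UNKNOWN
# Optional default per-CPU memory for jobs that do not request --mem explicitly
DefMemPerCPU=2000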


And this is one of the job scripts. You can see --mem is set to 1M, so the
memory request is minimal.

#!/bin/bash
#SBATCH -J Test1
#SBATCH --nodelist=slurm-gpu-1
#SBATCH --mem=1M
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH -o /home/centos/Test1-%j.out
#SBATCH -e /home/centos/Test1-%j.err
srun sleep 60
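
To reproduce the behavior we submit a few copies of this script and then check where they land. The script name below (test1.sbatch) is just a placeholder:

for i in 1 2 3 4; do sbatch test1.sbatch; done
squeue -o "%.8i %.10j %.2t %.10M %N"
scontrol show node slurm-gpu-1 | grep -E "CPUAlloc|RealMemory|AllocMem|FreeMem"

With CR_Core_Memory, comparing AllocMem against RealMemory on the node should
show whether there is anything left for a second job to be allocated.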

Thanks,
Durai

On Wed, Aug 26, 2020 at 2:49 AM Jacqueline Scoggins <jscoggins at lbl.gov>
wrote:

> What is OverSubscribe set to for your partitions? By default
> OverSubscribe=NO, which means that none of your cores will be shared
> with other jobs. With OverSubscribe set to YES or FORCE you can append a
> number after FORCE to control how many jobs can run on each core of each
> node in the partition.
> Look at this page for a better understanding:
> https://slurm.schedmd.com/cons_res_share.html
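>
> As a rough illustration (partition names reused from your config; the
> FORCE:4 value is just an example), the partition line would look like:
>
> PartitionName=gpu Nodes=slurm-gpu-1 Default=NO MaxTime=INFINITE OverSubscribe=FORCE:4 State=UP
>
> FORCE:4 means each core can be allocated to up to 4 jobs at a time.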
>
> You can also check the OverSubscribe setting of a partition using the
> sinfo -o "%h" option:
> sinfo -o '%P %.5a %.10h %N ' | head
>
> PARTITION AVAIL OVERSUBSCR NODELIST
>
>
> Look at the sinfo options for further details.
>
>
> Jackie
>
> On Tue, Aug 25, 2020 at 9:58 AM Durai Arasan <arasan.durai at gmail.com>
> wrote:
>
>> Hello,
>>
>> On our cluster we have SelectTypeParameters set to "CR_Core_Memory".
>>
>> Under these conditions multiple jobs should be able to run on the same
>> node. But they refuse to be allocated on the same node: only one job runs
>> on the node and the rest of the jobs stay in the pending state.
>>
>> However, when we changed SelectTypeParameters to "CR_Core", the issue was
>> resolved: multiple jobs were successfully allocated to the same node and
>> ran there concurrently.
>>
>> Does anyone know why this behavior occurs? Why does including memory as a
>> consumable resource lead to node-exclusive behavior?
>>
>> Thanks,
>> Durai
>>
>>