[slurm-users] CR_Core_Memory behavior

Durai Arasan arasan.durai at gmail.com
Wed Aug 26 09:35:55 UTC 2020


This is my node configuration:

NodeName=slurm-gpu-1 NodeAddr=  Procs=16 Gres=gpu:2
NodeName=slurm-gpu-2 NodeAddr=  Procs=1 Gres=gpu:0
PartitionName=gpu Nodes=slurm-gpu-1 Default=NO MaxTime=INFINITE AllowAccounts=whitelist,gpu_users State=UP
PartitionName=compute Nodes=slurm-gpu-1,slurm-gpu-2 Default=YES MaxTime=INFINITE AllowAccounts=whitelist State=UP

And this is one of the job scripts. You can see that --mem is set to 1M, so
the memory request is very small.

#SBATCH -J Test1
#SBATCH --nodelist=slurm-gpu-1
#SBATCH --mem=1M
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH -o /home/centos/Test1-%j.out
#SBATCH -e /home/centos/Test1-%j.err
srun sleep 60
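
For reference, the arithmetic behind treating memory as a consumable resource can be sketched in shell. This is a simplified model, not Slurm's actual scheduler: the function name jobs_that_fit and the 64000 MB figure are made up for illustration. It assumes every job asks for one core and the same amount of memory, and that a node accepts jobs until either cores or memory run out. The node memory of 1 MB reflects the documented slurm.conf default for RealMemory when it is not set on the NodeName line, as in the configuration above; whether that is what applies on this cluster is an assumption.

```shell
#!/bin/bash
# Simplified model (NOT Slurm's real scheduler): how many identical
# single-core jobs fit on one node at the same time.
# Usage: jobs_that_fit MODE NODE_CORES NODE_MEM_MB JOB_MEM_MB
#   MODE is CR_Core (only cores are consumed) or
#   CR_Core_Memory (cores and memory are both consumed).
#   JOB_MEM_MB must be greater than 0.
jobs_that_fit() {
    mode=$1
    cores=$2
    node_mem=$3
    job_mem=$4

    # With CR_Core, memory is not tracked, so cores are the only limit.
    if [ "$mode" = "CR_Core" ]; then
        echo "$cores"
        return
    fi

    # With CR_Core_Memory, the scarcer resource limits concurrency.
    by_mem=$(( node_mem / job_mem ))
    if [ "$by_mem" -lt "$cores" ]; then
        echo "$by_mem"
    else
        echo "$cores"
    fi
}

# 16 cores as on slurm-gpu-1; each job requests --mem=1M.
jobs_that_fit CR_Core        16 1     1   # cores only
jobs_that_fit CR_Core_Memory 16 1     1   # assumed default RealMemory of 1 MB
jobs_that_fit CR_Core_Memory 16 64000 1   # hypothetical RealMemory=64000
```

In this model the three calls print 16, 1 and 16: with only 1 MB of node memory known to the scheduler, a single --mem=1M job uses all of it, which would look like node-exclusive behavior.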


On Wed, Aug 26, 2020 at 2:49 AM Jacqueline Scoggins <jscoggins at lbl.gov>
wrote:

> What is OverSubscribe set to for your partitions? By default
> OverSubscribe=NO, which means that none of your cores will be shared
> with other jobs. With OverSubscribe set to YES or FORCE, you should set a
> number after FORCE to specify how many jobs can run on each core of each
> node in the partition.
> Look at this page for a better understanding:
> https://slurm.schedmd.com/cons_res_share.html
> You can also check the OverSubscribe setting of a partition using the
> sinfo -o "%h" option:
> sinfo -o '%P %.5a %.10h %N ' | head
> Look at the sinfo options for further details.
> Jackie
> On Tue, Aug 25, 2020 at 9:58 AM Durai Arasan <arasan.durai at gmail.com>
> wrote:
>> Hello,
>> On our cluster we have SelectTypeParameters set to "CR_Core_Memory".
>> Under these conditions multiple jobs should be able to run on the same
>> node. But they are never allocated to the same node: only one job runs
>> on the node and the rest of the jobs stay in the pending state.
>> When we changed SelectTypeParameters to "CR_Core" however, this issue was
>> resolved and multiple jobs were successfully allocated to the same node and
>> ran concurrently on the same node.
>> Does anyone know why this behavior is seen? Why does including memory as
>> a consumable resource lead to node-exclusive behavior?
>> Thanks,
>> Durai
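
One way to narrow down the question above might be to compare what Slurm reports for the node's memory with the partition's OverSubscribe setting. The sketch below is only an illustration: scontrol_field is a made-up helper name, and the sample line is hypothetical rather than output captured from this cluster. It assumes the space-separated KEY=value layout printed by scontrol show node and scontrol show partition.

```shell
#!/bin/bash
# Print the value of KEY from "KEY=value" pairs read on stdin.
# Only the first match is printed.
scontrol_field() {
    tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# On a cluster this could be used as, for example:
#   scontrol show node slurm-gpu-1 | scontrol_field RealMemory
#   scontrol show node slurm-gpu-1 | scontrol_field AllocMem
#   scontrol show partition compute | scontrol_field OverSubscribe

# Hypothetical sample text (not taken from the cluster in this thread):
sample='NodeName=slurm-gpu-1 CPUTot=16 RealMemory=1 AllocMem=1'
printf '%s\n' "$sample" | scontrol_field RealMemory
```

If RealMemory and AllocMem turn out to be equal while a single job is running, memory rather than cores would be the exhausted resource.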