[slurm-users] CR_Core_Memory behavior
Christoph Brüning
christoph.bruening at uni-wuerzburg.de
Wed Aug 26 13:26:58 UTC 2020
Hello Durai,
you did not specify the amount of memory in your node configuration.
RealMemory defaults to 1 MB when unset, so your 1 MB job alone may
already consume all the memory the scheduler thinks the node has.
What does "scontrol show node slurm-gpu-1" say? Look for the
"RealMemory" field in the output.
Best,
Christoph
On 26/08/2020 11.35, Durai Arasan wrote:
> Hello,
>
> this is my node configuration:
>
> NodeName=slurm-gpu-1 NodeAddr=192.168.0.200 Procs=16 Gres=gpu:2
> State=UNKNOWN
> NodeName=slurm-gpu-2 NodeAddr=192.168.0.124 Procs=1 Gres=gpu:0
> State=UNKNOWN
> PartitionName=gpu Nodes=slurm-gpu-1 Default=NO MaxTime=INFINITE
> AllowAccounts=whitelist,gpu_users State=UP
> PartitionName=compute Nodes=slurm-gpu-1,slurm-gpu-2 Default=YES
> MaxTime=INFINITE AllowAccounts=whitelist State=UP
>
>
> and this is one of the job scripts. As you can see, --mem is set to
> 1M, so the request is minimal.
>
> #!/bin/bash
> #SBATCH -J Test1
> #SBATCH --nodelist=slurm-gpu-1
> #SBATCH --mem=1M
> #SBATCH --ntasks=1
> #SBATCH --cpus-per-task=1
> #SBATCH -o /home/centos/Test1-%j.out
> #SBATCH -e /home/centos/Test1-%j.err
> srun sleep 60
>
> Thanks,
> Durai
>
> On Wed, Aug 26, 2020 at 2:49 AM Jacqueline Scoggins
> <jscoggins at lbl.gov> wrote:
>
> Is the OverSubscribe variable set for your partitions?
> By default OverSubscribe=NO, which means that none of your cores will
> be shared with other jobs. With OverSubscribe set to YES or FORCE,
> you can append a number after FORCE to set how many jobs may run on
> each core of each node in the partition.
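> A hedged example of what that could look like in slurm.conf (the
> count of 4 is only an illustration):
>
> PartitionName=compute Nodes=slurm-gpu-1,slurm-gpu-2 Default=YES MaxTime=INFINITE AllowAccounts=whitelist OverSubscribe=FORCE:4 State=UP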
> Look at this page for a better understanding:
> https://slurm.schedmd.com/cons_res_share.html
>
> You can also check the OverSubscribe setting of a partition with the
> sinfo "%h" format option:
> sinfo -o '%P %.5a %.10h %N' | head
>
> PARTITION AVAIL OVERSUBSCR NODELIST
>
>
> Look at the sinfo options for further details.
>
>
> Jackie
>
>
> On Tue, Aug 25, 2020 at 9:58 AM Durai Arasan
> <arasan.durai at gmail.com> wrote:
>
> Hello,
>
> On our cluster we have SelectTypeParameters set to "CR_Core_Memory".
>
> Under this setting, multiple jobs should be able to run on the same
> node. But they refuse to be allocated to the same node: only one job
> runs on the node, and the rest of the jobs stay in pending state.
>
> When we changed SelectTypeParameters to "CR_Core", however, the
> issue was resolved: multiple jobs were successfully allocated to the
> same node and ran there concurrently.
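> For reference, the relevant slurm.conf lines look roughly like this
> (a sketch; SelectType could be select/cons_res or select/cons_tres):
>
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory  # node behaves as if exclusive
> #SelectTypeParameters=CR_Core        # jobs share the node as expected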
>
> Does anyone know why this happens? Why does including memory as a
> consumable resource lead to node-exclusive behavior?
>
> Thanks,
> Durai
>
--
Dr. Christoph Brüning
Universität Würzburg
Rechenzentrum
Am Hubland
D-97074 Würzburg
Tel.: +49 931 31-80499