[slurm-users] Strange error, submission denied
Marcus Wagner
wagner at itc.rwth-aachen.de
Thu Feb 21 07:12:38 UTC 2019
Ahh, one thing I forgot: the following is working again ...
--ntasks=24 --ntasks-per-node=24
NumNodes=1 NumCPUs=48 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=48,mem=120000M,energy=63,node=1,billing=48
Socks/Node=* NtasksPerN:B:S:C=24:0:*:1 CoreSpec=*
MinCPUsNode=24 MinMemoryNode=120000M MinTmpDiskNode=0
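
For reference, the full submission that produced the record above would look
roughly like this (the wrapped hostname is only a placeholder command):

sbatch --ntasks=24 --ntasks-per-node=24 --wrap hostname
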
Best
Marcus
On 2/21/19 7:17 AM, Marcus Wagner wrote:
> Hi Andreas,
>
> I'll try to sum this up ;)
>
> First of all, I have now used a Broadwell node, so there is no interference
> from Skylake and sub-NUMA clustering.
>
> We are using Slurm 18.08.5-2.
>
> I have configured the node as slurmd -C tells me:
> NodeName=lnm596 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2
> RealMemory=120000
> Feature=bwx2650,hostok,hpcwork Weight=10430
> State=UNKNOWN
>
> This is what slurmctld knows about the node:
> NodeName=lnm596 Arch=x86_64 CoresPerSocket=12
> CPUAlloc=0 CPUTot=48 CPULoad=0.03
> AvailableFeatures=bwx2650,hostok,hpcwork
> ActiveFeatures=bwx2650,hostok,hpcwork
> Gres=(null)
> GresDrain=N/A
> GresUsed=gpu:0
> NodeAddr=lnm596 NodeHostName=lnm596 Version=18.08
> OS=Linux 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019
> RealMemory=120000 AllocMem=0 FreeMem=125507 Sockets=2 Boards=1
> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=10430 Owner=N/A
> MCS_label=N/A
> Partitions=future
> BootTime=2019-02-19T07:43:33 SlurmdStartTime=2019-02-20T12:08:54
> CfgTRES=cpu=48,mem=120000M,billing=48
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=120 LowestJoules=714879 ConsumedJoules=8059263
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
> Let's first begin with half of the node:
>
> --ntasks=12 -> 12 CPUs requested. I implicitly get the hyperthread for
> free (apart from the accounting).
> NumNodes=1 NumCPUs=24 NumTasks=12 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=24,mem=120000M,energy=46,node=1,billing=24
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
> MinCPUsNode=1 MinMemoryNode=120000M MinTmpDiskNode=0
>
> --ntasks=12 --cpus-per-task=2 -> 24 CPUs requested. I have now explicitly
> asked for 24 CPUs.
> NumNodes=1 NumCPUs=24 NumTasks=12 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
> TRES=cpu=24,mem=120000M,energy=55,node=1,billing=24
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
> MinCPUsNode=2 MinMemoryNode=120000M MinTmpDiskNode=0
>
> --ntasks=12 --ntasks-per-node=12 --cpus-per-task=2 -> 24 CPUs requested.
> Additional constraint: all 12 tasks should be on one node. Here, too, I
> asked for 24 CPUs.
> NumNodes=1 NumCPUs=24 NumTasks=12 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
> TRES=cpu=24,mem=120000M,energy=55,node=1,billing=24
> Socks/Node=* NtasksPerN:B:S:C=12:0:*:1 CoreSpec=*
> MinCPUsNode=24 MinMemoryNode=120000M MinTmpDiskNode=0
>
> Everything is fine up to now. Now I'll try to use the full node:
>
> --ntasks=24 -> 24 CPUs requested, implicitly got 48.
> NumNodes=1 NumCPUs=48 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=48,mem=120000M,energy=62,node=1,billing=48
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
> MinCPUsNode=1 MinMemoryNode=120000M MinTmpDiskNode=0
>
> --ntasks=24 --cpus-per-task=2 -> 48 CPUs explicitly requested.
> NumNodes=1 NumCPUs=48 NumTasks=24 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
> TRES=cpu=48,mem=120000M,energy=62,node=1,billing=48
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
> MinCPUsNode=2 MinMemoryNode=120000M MinTmpDiskNode=0
>
> And now the funny thing, which I don't understand:
> --ntasks=24 --ntasks-per-node=24 --cpus-per-task=2 -> 48 CPUs requested,
> all 24 tasks on one node. Slurm tells me:
> sbatch: error: Batch job submission failed: Requested node
> configuration is not available
>
> I would have expected the following job, which would have fit onto the
> node:
> NumNodes=1 NumCPUs=48 NumTasks=24 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
> TRES=cpu=48,mem=120000M,energy=62,node=1,billing=48
> Socks/Node=* NtasksPerN:B:S:C=24:0:*:1 CoreSpec=*
> MinCPUsNode=48 MinMemoryNode=120000M MinTmpDiskNode=0
>
> part of the sbatch -vvv output:
> sbatch: ntasks : 24 (set)
> sbatch: cpus_per_task : 2
> sbatch: nodes : 1 (set)
> sbatch: sockets-per-node : -2
> sbatch: cores-per-socket : -2
> sbatch: threads-per-core : -2
> sbatch: ntasks-per-node : 24
> sbatch: ntasks-per-socket : -2
> sbatch: ntasks-per-core : -2
>
> So, again, that is 24 tasks per node, 2 CPUs per task and 1 node,
> altogether 48 CPUs on one node, which fits perfectly, as one can see
> from the last two examples.
>
>
> I am just asking explicitly for what Slurm already gives me implicitly,
> or have I misunderstood something?
>
> We will have to look into this further internally. It might be that we
> have to give up CR_ONE_TASK_PER_CORE.
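>
> For context, CR_ONE_TASK_PER_CORE is one of the SelectTypeParameters flags
> in slurm.conf. A typical combination would look roughly like this (the
> other flags shown here are only an assumption, not necessarily our exact
> setup):
>
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE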
>
>
> Best
> Marcus
>
> P.S.:
> Sorry for the lengthy post
>
> On 2/20/19 11:59 AM, Henkel wrote:
>> Hi Chris,
>> Hi Marcus,
>>
>> Just want to understand the cause, too. I'll try to sum it up.
>>
>> Chris you have
>>
>> CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2
>>
>> and
>>
>> srun -C gpu -N 1 --ntasks-per-node=80 hostname
>>
>> works.
>>
>> Marcus has configured
>>
>> CPUs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=2
>> (slurmd -C says CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12
>> ThreadsPerCore=2)
>>
>> and
>>
>> CR_ONE_TASK_PER_CORE
>>
>> and
>>
>> srun -n 48 WORKS
>>
>> srun -N 1 --ntasks-per-node=48 DOESN'T WORK.
>>
>> I'm not sure whether it's caused by CR_ONE_TASK_PER_CORE, but at least that's
>> one of the major differences. I'm wondering if the effort to force using
>> only physical cores is duplicated by removing the 48 threads AND setting
>> CR_ONE_TASK_PER_CORE. My impression is that with CR_ONE_TASK_PER_CORE,
>> ntasks-per-node accounts for threads (you have set ThreadsPerCore=2),
>> hence only 24 may work, but CR_ONE_TASK_PER_CORE doesn't affect the
>> selection of 'cores only' with ntasks.
>>
>> We don't use CR_ONE_TASK_PER_CORE; our users either set -c 2 or
>> --hint=nomultithread, which results in core-only allocation, for example:
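>>
>> srun -N 1 --ntasks-per-node=24 -c 2 ./my_app
>> srun -N 1 --ntasks-per-node=24 --hint=nomultithread ./my_app
>>
>> (Illustrative commands only; ./my_app and the task counts are placeholders.)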
>>
>> You could also enforce this with a job_submit plugin (e.g. in Lua).
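>>
>> A minimal job_submit.lua sketch of that idea could look like the following.
>> This is purely illustrative and untested; the field names and the NO_VAL16
>> sentinel should be checked against your Slurm version.
>>
>> -- job_submit.lua: ask for both hardware threads per task so that each
>> -- task effectively gets a whole physical core.
>> function slurm_job_submit(job_desc, part_list, submit_uid)
>>     -- 65534 is NO_VAL16, i.e. the user did not set --cpus-per-task
>>     if job_desc.cpus_per_task == nil or job_desc.cpus_per_task >= 65534 then
>>         job_desc.cpus_per_task = 2
>>     end
>>     return slurm.SUCCESS
>> end
>>
>> function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
>>     return slurm.SUCCESS
>> end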
>>
>> The fact that CR_ONE_TASK_PER_CORE is described as "under change" in
>> the public bugs, and that there is a non-accessible bug about it, probably
>> suggests that it is better not to use it unless you have to.
>>
>> Best,
>>
>> Andreas
>>
>> On 2/20/19 7:49 AM, Chris Samuel wrote:
>>> On Tuesday, 19 February 2019 10:14:21 PM PST Marcus Wagner wrote:
>>>
>>>> sbatch -N 1 --ntasks-per-node=48 --wrap hostname
>>>> submission denied, got jobid 199805
>>> On one of our 40-core nodes with 2 hyperthreads per core:
>>>
>>> $ srun -C gpu -N 1 --ntasks-per-node=80 hostname | uniq -c
>>> 80 nodename02
>>>
>>> The spec is:
>>>
>>> CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2
>>>
>>> Hope this helps!
>>>
>>> All the best,
>>> Chris
>
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de