[slurm-users] Slurm and MPICH don't play well together (salloc)

Antony Cleave antony.cleave at gmail.com
Wed Dec 29 00:15:23 UTC 2021


Hi

I've not used MPICH for years, but I think I see the problem. By asking for
24 CPUs per task and specifying 2 tasks, you are asking Slurm to allocate 48
CPUs per node.

Your nodes have 24 CPUs in total, so you don't have any nodes that can
service this request.
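
If you want to double-check the per-node CPU count, something along these
lines should show it (going from memory, so verify the format codes):

$ sinfo -N -o "%N %c"    # one line per node: node name and CPU count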

Try asking for 24 tasks. I've only ever used cpus-per-task for hybrid
MPI/OpenMP codes, with 2 MPI tasks and 12 threads per task.
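
Roughly, and reusing your runscript.bash from the original command (untested
sketch, adjust to taste):

# pure MPI: 24 single-threaded tasks, i.e. 24 CPUs in total
salloc --ntasks=24 --verbose runscript.bash

# hybrid MPI/OpenMP: 2 tasks with 12 CPUs each, still 24 CPUs in total
salloc --ntasks=2 --cpus-per-task=12 --verbose runscript.bash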

Antony

On Tue, 28 Dec 2021, 23:02 Mccall, Kurt E. (MSFC-EV41), <
kurt.e.mccall at nasa.gov> wrote:

> Hi,
>
> My MPICH jobs are being launched and the desired number of processes is
> created, but when one of those processes tries to spawn a new process using
> MPI_Comm_spawn(), that process just spins in the polling code deep within
> the MPICH library. See the Slurm error message below. This all works
> without problems on other clusters that have Torque as the process
> manager. We are using Slurm 20.02.3 on Red Hat (kernel 4.18.0), and MPICH 4.0b1.
>
> salloc: defined options
> salloc: -------------------- --------------------
> salloc: cpus-per-task       : 24
> salloc: ntasks              : 2
> salloc: verbose             : 1
> salloc: -------------------- --------------------
> salloc: end of defined options
> salloc: Linear node selection plugin loaded with argument 4
> salloc: select/cons_res loaded with argument 4
> salloc: Cray/Aries node selection plugin loaded
> salloc: select/cons_tres loaded with argument 4
> salloc: Granted job allocation 34330
> srun: error: Unable to create step for job 34330: Requested node configuration is not available
>
> I’m wondering if the salloc command I am using is correct. I intend for
> it to launch 2 processes, one per node, but reserve 24 cores on each node
> for the 2 launched processes to spawn new processes using MPI_Comm_spawn.
> Could the reservation of all 24 cores make Slurm or MPICH think that there
> are no more cores available?
>
> salloc --ntasks=2 --cpus-per-task=24 --verbose runscript.bash …
>
> I think that our cluster’s compute nodes are configured correctly:
>
> $ scontrol show node=n001
>
> NodeName=n001 Arch=x86_64 CoresPerSocket=6
>    CPUAlloc=0 CPUTot=24 CPULoad=0.00
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=n001 NodeHostName=n001 Version=20.02.3
>    OS=Linux 4.18.0-348.el8.x86_64 #1 SMP Mon Oct 4 12:17:22 EDT 2021
>    RealMemory=128351 AllocMem=0 FreeMem=126160 Sockets=4 Boards=1
>    State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=normal,low,high
>    BootTime=2021-12-21T14:25:05 SlurmdStartTime=2021-12-21T14:25:52
>    CfgTRES=cpu=24,mem=128351M,billing=24
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> Thanks for any help.
>