[slurm-users] Getting multiple steps to run out of interactive allocation with MPI processes.

Snedden, Ali Ali.Snedden at nationwidechildrens.org
Tue Apr 21 14:01:04 UTC 2020


Hello,

I am running SLURM 17.11 and have a user with a complicated workflow.  The user wants 250 cores for 2 weeks to do some work semi-interactively.  I'm not going to give the user a reservation for this, because the whole point of having a scheduler is to minimize human intervention in job scheduling.

The code uses MPI (openmpi-1.8 with gcc-4.9.2).  The process I originally envisioned was to allocate an interactive job (a new shell gets spawned) and then run `mpirun`, with SLURM dispatching the work to the allocation.

i.e.

```
    [headnode01] $ salloc --ntasks=2 --nodes=2
                  (SLURM grants allocation on node[01,02] and new shell spawns)
    [headnode01] $ mpirun -np 2 ./executable   # SLURM dispatches work to node[01,02]
```

This doesn't work in the user's situation. Their workflow involves a master job that automatically spawns daughter MPI jobs (5 cores per job, for a total of 50 jobs), which get dispatched using `sbatch`.  It would be impractical to manage 50 interactive shells.
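Schematically, the user's pattern looks roughly like this (a minimal sketch only; `master.sh`, `daughter`, and the input file names are hypothetical stand-ins, not the user's actual code):

```
    #!/bin/bash
    # master.sh -- hypothetical sketch of the user's workflow: the master loops
    # over inputs and submits one 5-core MPI daughter job per input via sbatch.
    for i in $(seq 1 50); do
        sbatch --ntasks=5 --wrap="mpirun -np 5 ./daughter input_${i}.dat"
    done
```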

I was imagining doing something like the following:

1. Get an interactive allocation using `salloc`.
2. Submit a batch job that, internally, uses `srun --jobid=XXXX` to use the resources allocated in step 1 (see the sketch after this list).
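
Something like this (a sketch only; `XXXX` stands for the job ID that `salloc` reports, `step.sh` is a hypothetical wrapper script, and the resource numbers are illustrative):

```
    # Step 1: grab the long-lived allocation interactively.
    [headnode01] $ salloc --ntasks=250 --time=14-00:00:00
                  (SLURM grants the allocation as job XXXX and a new shell spawns)

    # Step 2: batch jobs then borrow pieces of that allocation via --jobid.
    [headnode01] $ cat step.sh
    #!/bin/bash
    srun --jobid=XXXX --ntasks=5 ./executable

    [headnode01] $ sbatch step.sh
```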

I created a simple code, `tmp.c`, to test this process.

`tmp.c`:

```
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char * argv[])
    {
        int taskID = -1;
        int ntasks = -1;
        int nSteps = 100;           // Number of loop iterations in this test
        int step = 0;               // Current step
        char hostname[250];
        hostname[249]='\0';

        /* MPI Initializations */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &taskID);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
        gethostname(hostname, sizeof(hostname));    // pass the real buffer size
        printf("Hello from Task %i on %s\n", taskID, hostname);

        /* Master Loop */
        for(step=0; step<nSteps; step++){
            printf("%i   : task %i   hostname %s\n", step, taskID, hostname);
            usleep(1000000);
            fflush(stdout);
            MPI_Barrier(MPI_COMM_WORLD);    // Ensure every task completes
        }

        MPI_Finalize();
        return 0;
    }
```

I compile, allocate resources, and then try to use `srun` to utilize those resources.

i.e.

```
    [headnode01] $ mpicc tmp.c -o tmp
    [headnode01] $ salloc --ntasks=3 --nodes=3
                  (SLURM grants allocation on node[14-16] and new shell spawns)
    [headnode01] $ srun --jobid=XXXX --ntasks=1 --pty ./tmp      # Done from a different shell, not the new shell...
    Hello from Task 0 on node14.cluster
    0   : task 0   hostname node14.cluster
    1   : task 0   hostname node14.cluster
    2   : task 0   hostname node14.cluster
```

Ok, this is expected.  1 MPI task running on 1 node with 1 core.  If I do

```
    [headnode01] $ srun --jobid=XXXX --ntasks=2 --pty ./tmp     # Done from a different shell, not the new shell...
    Hello from Task 0 on node14.cluster
    0   : task 0   hostname node14.cluster
    1   : task 0   hostname node14.cluster
    2   : task 0   hostname node14.cluster
```

This is unexpected. I would expect task 0 and task 1 to be on node[14,15], because I have 3 cores/tasks allocated across 3 nodes. Instead, if I look at node[14,15], I see that both nodes have a `tmp` process running, but I only catch the stdout from node14.  Why is that?

If I instead try it without `--pty`:

```
    srun --jobid=2440814 --ntasks=2 --mpi=openmpi ./tmp
    Hello from Task 0 on node14.cluster
    0   : task 0   hostname node14.cluster
    Hello from Task 0 on node15.cluster
    0   : task 0   hostname node15.cluster
    1   : task 0   hostname node14.cluster
    1   : task 0   hostname node15.cluster
```

This is also not what I want: I don't want two separate single-task instances of `tmp` running on two separate nodes; I want a single `tmp` run to utilize two cores on two different nodes.  I'd instead expect the output to be:

```
    Hello from Task 0 on node14.cluster
    0   : task 0   hostname node14.cluster
    Hello from Task 1 on node15.cluster
    0   : task 1   hostname node15.cluster
    1   : task 0   hostname node14.cluster
    1   : task 1   hostname node15.cluster
```

I can achieve the expected output above if I run

```
    sbatch --ntasks=2 --nodes=2 --wrap="mpirun -np 2 ./tmp"
```

but I'd like to do this interactively.


QUESTION:

How do I create an allocation and then utilize parts and pieces of that single allocation using `srun` with MPI processes?  I'd like an MPI process launched via `srun` to be able to utilize multiple cores spread across multiple nodes.

Best,

======================================
Ali Snedden, Ph.D.
HPC Scientific Programmer
The High Performance Computing Facility
Nationwide Children’s Hospital Research Institute
