[slurm-users] Unable to run multiple jobs parallel in slurm >= v20 using emulate mode

Alper Alimoglu alper.alimoglu at gmail.com
Sat Nov 13 12:07:42 UTC 2021


My goal is to set up a single-server `slurm` cluster (using only a single
computer) that can run multiple jobs in parallel.

On my node `nproc` returns 4, so I believe I should be able to run 4 jobs in
parallel as long as each uses a single core. To do this I run the controller
and the worker daemon on the same node.
When I submit four jobs at the same time, only one of them runs; the other
three stay pending with the message `queued and waiting for resources`.

I am using `Ubuntu 20.04.3 LTS`. I have observed that this approach works on
tag versions `<=19`:

```
$ git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
$ git checkout e2e21cb571ce88a6dd52989ec6fe30da8c4ef15f  # slurm-19-05-8-1
$ ./configure --enable-debug --enable-front-end --enable-multiple-slurmd
$ sudo make && sudo make install
```

but does not work on newer versions such as `slurm 20.02.1` or the `master`
branch.
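
To rule out a stale install when switching between tags, I also check which
version the installed binaries report (a quick sanity check, assuming the
default `/usr/local` prefix):

```
$ /usr/local/bin/sinfo -V          # should match the checked-out tag, e.g. slurm 19.05.8
$ /usr/local/sbin/slurmd -V        # all three should report the same version,
$ /usr/local/sbin/slurmctld -V     # otherwise an older install is still being picked up
```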

------

```
$ sinfo
Sat Nov 06 14:17:04 2021
NODELIST  NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
home1         1    debug*  idle    1 1:1:1      1        0      1   (null)   none
home2         1    debug*  idle    1 1:1:1      1        0      1   (null)   none
home3         1    debug*  idle    1 1:1:1      1        0      1   (null)   none
home4         1    debug*  idle    1 1:1:1      1        0      1   (null)   none
$ srun -N1 sleep 10  # runs
$ srun -N1 sleep 10  # queued and waiting for resources
$ srun -N1 sleep 10  # queued and waiting for resources
$ srun -N1 sleep 10  # queued and waiting for resources
```
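
To see why the other jobs stay pending, I look at the reason column in
`squeue` (just a format string I find readable; nothing here is specific to my
setup):

```
$ squeue -o "%.8i %.9P %.8T %.10M %R"   # %T = job state, %R = pending reason or node list
```

For the three waiting jobs the last column should show the scheduler's reason
(e.g. `Resources`), matching the `queued and waiting for resources` message
from `srun`.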

This is where I get lost: since this is [emulate mode][1], the jobs should be
able to run in parallel.


The way I build from source:

```bash
git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
./configure --enable-debug --enable-multiple-slurmd
make
sudo make install
```

--------

```
$ hostname -s
home
$ nproc
4
```

##### Compute node setup:

```
NodeName=home[1-4] NodeHostName=home NodeAddr=127.0.0.1 CPUs=1 ThreadsPerCore=1 Port=17001
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE:1
```

I have also tried: `NodeHostName=localhost`
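
To double-check how the controller sees each emulated node, I dump one node
record with `scontrol` (sketch; the fields of interest are `CPUTot`,
`CPUAlloc` and `State`):

```
$ scontrol show node home1 | grep -E "CPUAlloc|CPUTot|State"
```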

`slurm.conf` file:

```bash
ControlMachine=home  # $(hostname -s)
ControlAddr=127.0.0.1
ClusterName=cluster
SlurmUser=alper
MailProg=/home/user/slurm_mail_prog.sh
MinJobAge=172800  # 48 h
SlurmdSpoolDir=/var/spool/slurmd
SlurmdLogFile=/var/log/slurm/slurmd.%n.log
SlurmdPidFile=/var/run/slurmd.%n.pid
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPort=6820
SlurmctldPort=6821
StateSaveLocation=/tmp/slurmstate
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
Waittime=0
SchedulerType=sched/backfill
SelectType=select/linear
PriorityDecayHalfLife=0
PriorityUsageResetPeriod=NONE
AccountingStorageEnforce=limits
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreFlags=YES
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
NodeName=home[1-2] NodeHostName=home NodeAddr=127.0.0.1 CPUs=2 ThreadsPerCore=1 Port=17001
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE:1
```
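
For comparison with the `NodeName` line above, `slurmd -C` prints the hardware
configuration that slurmd itself detects on the host; on this machine I would
expect it to report 4 CPUs, matching `nproc`:

```
$ /usr/local/sbin/slurmd -C   # prints a NodeName=... CPUs=... line based on the real hardware
```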

`slurmdbd.conf`:

```bash
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdAddr=localhost
DbdHost=localhost
SlurmUser=alper
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageUser=alper
StoragePass=12345
```
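
Because `AccountingStorageEnforce=limits` is set, the cluster and my user need
to exist in the accounting database; a quick way to verify that with
`sacctmgr` (sketch, assuming `slurmdbd` is already up and the cluster name
`cluster` from `slurm.conf`):

```
$ sacctmgr show cluster
$ sacctmgr show associations cluster=cluster format=Cluster,Account,User
```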

The way I run slurm:

```
sudo /usr/local/sbin/slurmd
sudo /usr/local/sbin/slurmdbd &
sudo /usr/local/sbin/slurmctld -cDvvvvvv
```
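
As I read the [emulate-mode][1] FAQ, each emulated node normally gets its own
`slurmd` instance started with `-N <nodename>` (and its own port, spool and
PID paths; the `%n` patterns in `slurm.conf` cover the latter two). A sketch
of what that would look like for the `home[1-4]` layout:

```bash
# one slurmd per emulated node name (meant for builds with --enable-multiple-slurmd);
# each instance also needs a distinct Port= in its NodeName line
for n in home1 home2 home3 home4; do
    sudo /usr/local/sbin/slurmd -N "$n"
done
```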
---------

Related:
- [minimum number of computers for a slurm cluster](https://stackoverflow.com/a/27788311/2402577)
- [Running multiple worker daemons SLURM](https://stackoverflow.com/a/40707189/2402577)
- https://stackoverflow.com/a/47009930/2402577


  [1]: https://slurm.schedmd.com/faq.html#multi_slurmd