[slurm-users] Unable to run multiple jobs parallel in slurm >= v20 using emulate mode
Alper Alimoglu
alper.alimoglu at gmail.com
Sat Nov 13 12:07:42 UTC 2021
My goal is to set up a single server `slurm` cluster (only using a single
computer) that can run multiple jobs in parallel.
On my node `nproc` returns 4, so I believe I can run 4 jobs in parallel if
each of them uses a single core. To do this I run the controller and the
worker daemon on the same node.
When I submit four jobs at the same time, only one of them runs; the other
three stay pending with the message `queued and waiting for resources`.
I am using `Ubuntu 20.04.3 LTS`. I have observed that this approach works
on tag versions `<=19`:
```
$ git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
$ git checkout e2e21cb571ce88a6dd52989ec6fe30da8c4ef15f  # slurm-19-05-8-1
$ ./configure --enable-debug --enable-front-end --enable-multiple-slurmd
$ sudo make && sudo make install
```
but does not work on higher versions like `slurm 20.02.1` or its `master`
branch.
------
```
❯ sinfo
Sat Nov 06 14:17:04 2021
NODELIST  NODES  PARTITION  STATE  CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
home1     1      debug*     idle   1     1:1:1  1       0         1       (null)    none
home2     1      debug*     idle   1     1:1:1  1       0         1       (null)    none
home3     1      debug*     idle   1     1:1:1  1       0         1       (null)    none
home4     1      debug*     idle   1     1:1:1  1       0         1       (null)    none
$ srun -N1 sleep 10 # runs
$ srun -N1 sleep 10 # queued and waiting for resources
$ srun -N1 sleep 10 # queued and waiting for resources
$ srun -N1 sleep 10 # queued and waiting for resources
```
This is where I get lost: since this is [emulate mode][1], the jobs should
be able to run in parallel.
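For reference, this is roughly how I look at the pending jobs and the idle
CPUs while they wait (standard `squeue`/`sinfo`/`scontrol` options; the
format strings and the job id are just examples, nothing specific to this
setup):
```
# show job id, state and the scheduler's pending reason
$ squeue -o "%.8i %.10T %.30R"
# show per-node CPU counts as allocated/idle/other/total, plus node state
$ sinfo -N -o "%N %C %t"
# full details of one pending job (job id is an example)
$ scontrol show job 2
```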
The way I build from the source code:
```bash
git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
./configure --enable-debug --enable-multiple-slurmd
make
sudo make install
```
--------
```
$ hostname -s
home
$ nproc
4
```
##### Compute node setup:
```
NodeName=home[1-4] NodeHostName=home NodeAddr=127.0.0.1 CPUs=1 ThreadsPerCore=1 Port=17001
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE:1
```
I have also tried: `NodeHostName=localhost`
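To check how the controller actually registered one of the emulated nodes
(CPUs, state, reason), I believe `scontrol` should show it; `home1` here is
just one of the node names from the config above:
```
# dump the controller's view of an emulated node
$ scontrol show node home1
```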
`slurm.conf` file:
```bash
ControlMachine=home # $(hostname -s)
ControlAddr=127.0.0.1
ClusterName=cluster
SlurmUser=alper
MailProg=/home/user/slurm_mail_prog.sh
MinJobAge=172800 # 48 h
SlurmdSpoolDir=/var/spool/slurmd
SlurmdLogFile=/var/log/slurm/slurmd.%n.log
SlurmdPidFile=/var/run/slurmd.%n.pid
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPort=6820
SlurmctldPort=6821
StateSaveLocation=/tmp/slurmstate
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
Waittime=0
SchedulerType=sched/backfill
SelectType=select/linear
PriorityDecayHalfLife=0
PriorityUsageResetPeriod=NONE
AccountingStorageEnforce=limits
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreFlags=YES
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
NodeName=home[1-2] NodeHostName=home NodeAddr=127.0.0.1 CPUs=2 ThreadsPerCore=1 Port=17001
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE:1
```
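As far as I understand, `slurmd -C` prints the hardware the daemon detects
on the physical machine, which can be compared against the emulated
`NodeName` lines above (a generic check, nothing specific to emulate mode):
```
# print the node configuration slurmd detects on this machine, then exit
$ sudo /usr/local/sbin/slurmd -C
```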
`slurmdbd.conf`:
```bash
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdAddr=localhost
DbdHost=localhost
SlurmUser=alper
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageUser=alper
StoragePass=12345
```
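Since `AccountingStorageEnforce=limits` is set, my understanding is that the
cluster and the submitting user need associations in slurmdbd before jobs
are accepted; a minimal sketch of how that could be created (the account
name is a placeholder, the cluster and user names match my config above):
```bash
# register the cluster named in ClusterName=
sacctmgr -i add cluster cluster
# create an account and add the submitting user to it (account name is a placeholder)
sacctmgr -i add account normal Cluster=cluster
sacctmgr -i add user alper Account=normal
```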
The way I run slurm:
```
sudo /usr/local/sbin/slurmd
sudo /usr/local/sbin/slurmdbd &
sudo /usr/local/sbin/slurmctld -cDvvvvvv  # -c clears saved state, -D stays in the foreground
```
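My reading of the multiple-slurmd FAQ ([1]) is that each emulated node gets
its own `slurmd` started with `-N <nodename>`; a sketch of what I understand
that to look like with the node names from my config (I have not confirmed
this is the intended invocation here):
```
sudo /usr/local/sbin/slurmd -N home1
sudo /usr/local/sbin/slurmd -N home2
sudo /usr/local/sbin/slurmd -N home3
sudo /usr/local/sbin/slurmd -N home4
```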
---------
Related:
- [Minimum number of computers for a slurm cluster](https://stackoverflow.com/a/27788311/2402577)
- [Running multiple worker daemons SLURM](https://stackoverflow.com/a/40707189/2402577)
- https://stackoverflow.com/a/47009930/2402577
[1]: https://slurm.schedmd.com/faq.html#multi_slurmd