[slurm-users] Unable to run multiple jobs in parallel in Slurm >= v20 using emulate mode

Rodrigo Santibáñez rsantibanez.uchile at gmail.com
Sat Nov 13 23:07:52 UTC 2021


Hi Alper,

Maybe this is relevant to you:

*Can Slurm emulate nodes with more resources than physically exist on the
node?*
Yes. In the slurm.conf file, configure *SlurmdParameters=config_overrides*
and specify any desired node resource specifications (*CPUs*, *Sockets*,
*CoresPerSocket*, *ThreadsPerCore*, and/or *TmpDisk*). Slurm will use the
resource specification for each node that is given in *slurm.conf* and will
not check these specifications against those actually found on the node.
The system would best be configured with *TaskPlugin=task/none*, so that
launched tasks can run on any available CPU under operating system control.
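
For illustration, a minimal sketch of that advice on a 4-core machine (the
node names and counts below are just examples, not taken from your files):

```
# slurm.conf sketch: emulate four 1-CPU nodes on one physical host
SlurmdParameters=config_overrides
TaskPlugin=task/none
NodeName=home[1-4] NodeHostName=home NodeAddr=127.0.0.1 CPUs=1 ThreadsPerCore=1
```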

Best

On Sat, Nov 13, 2021 at 4:10 AM Alper Alimoglu <alper.alimoglu at gmail.com>
wrote:

> My goal is to set up a single-server `slurm` cluster (using only one
> computer) that can run multiple jobs in parallel.
>
> On my node `nproc` returns 4, so I believe I can run 4 jobs in parallel
> if each uses a single core. To do this I run the controller and the
> worker daemon on the same node.
> When I submit four jobs at the same time, only one of them runs; the
> other three stay pending with the message `queued and waiting for
> resources`.
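>
> To make the submissions overlap, each `srun` is launched from its own
> terminal, or equivalently backgrounded; a sketch:
>
> ```bash
> # sketch: submit four single-node jobs at the same time
> for i in 1 2 3 4; do srun -N1 sleep 10 & done
> wait
> ```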
>
> I am using `Ubuntu 20.04.3 LTS`. I have observed that this approach
> works on tag versions `<=19`:
>
> ```
> $ git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
> $ git checkout e2e21cb571ce88a6dd52989ec6fe30da8c4ef15f  # slurm-19-05-8-1
> $ ./configure --enable-debug --enable-front-end --enable-multiple-slurmd
> $ sudo make && sudo make install
> ```
>
> but does not work on higher versions like `slurm 20.02.1` or its `master`
> branch.
>
> ------
>
> ```
> ❯ sinfo
> Sat Nov 06 14:17:04 2021
> NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> home1          1    debug*        idle    1    1:1:1      1        0      1   (null)   none
> home2          1    debug*        idle    1    1:1:1      1        0      1   (null)   none
> home3          1    debug*        idle    1    1:1:1      1        0      1   (null)   none
> home4          1    debug*        idle    1    1:1:1      1        0      1   (null)   none
> $ srun -N1 sleep 10  # runs
> $ srun -N1 sleep 10  # queued and waiting for resources
> $ srun -N1 sleep 10  # queued and waiting for resources
> $ srun -N1 sleep 10  # queued and waiting for resources
> ```
>
> Here I get lost, since in [emulate mode][1] the jobs should be able to
> run in parallel.
>
>
> The way I build from the source code:
>
> ```bash
> git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
> ./configure --enable-debug --enable-multiple-slurmd
> make
> sudo make install
> ```
>
> --------
>
> ```
> $ hostname -s
> home
> $ nproc
> 4
> ```
>
> ##### Compute node setup:
>
> ```
> NodeName=home[1-4] NodeHostName=home NodeAddr=127.0.0.1 CPUs=1 ThreadsPerCore=1 Port=17001
> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE:1
> ```
>
> I have also tried: `NodeHostName=localhost`
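>
> For reference, the [emulate-mode][1] FAQ gives each emulated node its
> own port when several slurmd daemons share a host; a variant of the
> lines above under that assumption:
>
> ```
> NodeName=home1 NodeHostName=home NodeAddr=127.0.0.1 CPUs=1 ThreadsPerCore=1 Port=17001
> NodeName=home2 NodeHostName=home NodeAddr=127.0.0.1 CPUs=1 ThreadsPerCore=1 Port=17002
> NodeName=home3 NodeHostName=home NodeAddr=127.0.0.1 CPUs=1 ThreadsPerCore=1 Port=17003
> NodeName=home4 NodeHostName=home NodeAddr=127.0.0.1 CPUs=1 ThreadsPerCore=1 Port=17004
> ```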
>
> `slurm.conf` file:
>
> ```bash
> ControlMachine=home  # $(hostname -s)
> ControlAddr=127.0.0.1
> ClusterName=cluster
> SlurmUser=alper
> MailProg=/home/user/slurm_mail_prog.sh
> MinJobAge=172800  # 48 h
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmdLogFile=/var/log/slurm/slurmd.%n.log
> SlurmdPidFile=/var/run/slurmd.%n.pid
> AuthType=auth/munge
> CryptoType=crypto/munge
> MpiDefault=none
> ProctrackType=proctrack/pgid
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPort=6820
> SlurmctldPort=6821
> StateSaveLocation=/tmp/slurmstate
> SwitchType=switch/none
> TaskPlugin=task/none
> InactiveLimit=0
> Waittime=0
> SchedulerType=sched/backfill
> SelectType=select/linear
> PriorityDecayHalfLife=0
> PriorityUsageResetPeriod=NONE
> AccountingStorageEnforce=limits
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStoreFlags=YES
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> NodeName=home[1-2] NodeHostName=home NodeAddr=127.0.0.1 CPUs=2 ThreadsPerCore=1 Port=17001
> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE:1
> ```
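>
> (Since several slurmd daemons would share this host, I assume the spool
> directory also needs the per-node `%n` placeholder, matching the log and
> pid files above; a sketch:)
>
> ```
> SlurmdSpoolDir=/var/spool/slurmd.%n
> ```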
>
> `slurmdbd.conf`:
>
> ```bash
> AuthType=auth/munge
> AuthInfo=/var/run/munge/munge.socket.2
> DbdAddr=localhost
> DbdHost=localhost
> SlurmUser=alper
> DebugLevel=4
> LogFile=/var/log/slurm/slurmdbd.log
> PidFile=/var/run/slurmdbd.pid
> StorageType=accounting_storage/mysql
> StorageUser=alper
> StoragePass=12345
> ```
>
> The way I run slurm:
>
> ```
> sudo /usr/local/sbin/slurmd
> sudo /usr/local/sbin/slurmdbd &
> sudo /usr/local/sbin/slurmctld -cDvvvvvv
> ```
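>
> (In emulate mode, my understanding is that one slurmd is started per
> emulated node with `-N`; a sketch using the node names above:)
>
> ```
> sudo /usr/local/sbin/slurmd -N home1
> sudo /usr/local/sbin/slurmd -N home2
> sudo /usr/local/sbin/slurmd -N home3
> sudo /usr/local/sbin/slurmd -N home4
> ```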
> ---------
>
> Related:
> - [Minimum number of computers for a slurm cluster](https://stackoverflow.com/a/27788311/2402577)
> - [Running multiple worker daemons SLURM](https://stackoverflow.com/a/40707189/2402577)
> - https://stackoverflow.com/a/47009930/2402577
>
>
>   [1]: https://slurm.schedmd.com/faq.html#multi_slurmd
>