[slurm-users] Unable to run multiple jobs in parallel in slurm >= v20 using emulate mode

Alper Alimoglu alper.alimoglu at gmail.com
Sun Nov 14 12:09:17 UTC 2021


@Rodrigo Santibáñez I think I did not manage to explain my question clearly.

I am able to run `slurm` versions lower than 20 successfully, for example
`19-05-8-1`, but with the same configuration `slurm` version 20 or higher
does not work properly.
I am lost trying to figure out the correct configuration structure for
the latest stable `slurm` version.
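
For what it is worth, the only Slurm >= 20 specific setting I can find in
the FAQ text you quoted below is `SlurmdParameters=config_overrides`. A
minimal sketch of how I understand it would be added to my `slurm.conf`
(an assumption on my side, not something I have verified to solve the
problem):

```
# sketch only: per the FAQ below, with config_overrides the controller
# trusts the emulated node definitions in slurm.conf instead of checking
# them against the hardware actually found on the node
SlurmdParameters=config_overrides
NodeName=home[1-4] NodeHostName=home NodeAddr=127.0.0.1 CPUs=1 ThreadsPerCore=1 Port=17001
```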




On Sun, Nov 14, 2021 at 2:11 AM Rodrigo Santibáñez <
rsantibanez.uchile at gmail.com> wrote:

> Hi Alper,
>
> Maybe this is relevant to you:
>
> *Can Slurm emulate nodes with more resources than physically exist on the
> node?*
> Yes. In the slurm.conf file, configure *SlurmdParameters=config_overrides*
> and specify any desired node resource specifications (*CPUs*, *Sockets*,
> *CoresPerSocket*, *ThreadsPerCore*, and/or *TmpDisk*). Slurm will use the
> resource specification for each node that is given in *slurm.conf* and
> will not check these specifications against those actually found on the
> node. The system would best be configured with *TaskPlugin=task/none*, so
> that launched tasks can run on any available CPU under operating system
> control.
>
> Best
>
> On Sat, Nov 13, 2021 at 4:10 AM Alper Alimoglu <alper.alimoglu at gmail.com>
> wrote:
>
>> My goal is to set up a single server `slurm` cluster (only using a single
>> computer) that can run multiple jobs in parallel.
>>
>> On my node `nproc` returns 4, so I believe I can run 4 jobs in parallel
>> if each of them uses a single core. To do this I run the controller and
>> the worker daemon on the same node.
>> When I submit four jobs at the same time, only one of them runs; the
>> other three stay pending with the message `queued and waiting for
>> resources`.
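>>
>> For illustration, one way to submit four such single-core jobs at once
>> (a sketch; the exact `srun` calls I actually use are shown further
>> below):
>>
>> ```bash
>> # submit four independent single-task jobs, each asking for one CPU
>> for i in 1 2 3 4; do
>>     sbatch --ntasks=1 --wrap "sleep 60"
>> done
>> squeue  # only the first job shows RUNNING, the other three stay PENDING
>> ```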
>>
>> I am using `Ubuntu 20.04.3 LTS`. I have observed that this approach
>> works on tag versions `<=19`:
>>
>> ```
>> $ git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
>> $ git checkout e2e21cb571ce88a6dd52989ec6fe30da8c4ef15f  # slurm-19-05-8-1
>> $ ./configure --enable-debug --enable-front-end --enable-multiple-slurmd
>> $ sudo make && sudo make install
>> ```
>>
>> but it does not work on higher versions such as `slurm 20.02.1` or the
>> `master` branch.
>>
>> ------
>>
>> ```
>> ❯ sinfo
>> Sat Nov 06 14:17:04 2021
>> NODELIST  NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>> home1         1    debug*  idle    1 1:1:1      1        0      1   (null) none
>> home2         1    debug*  idle    1 1:1:1      1        0      1   (null) none
>> home3         1    debug*  idle    1 1:1:1      1        0      1   (null) none
>> home4         1    debug*  idle    1 1:1:1      1        0      1   (null) none
>> $ srun -N1 sleep 10  # runs
>> $ srun -N1 sleep 10  # queued and waiting for resources
>> $ srun -N1 sleep 10  # queued and waiting for resources
>> $ srun -N1 sleep 10  # queued and waiting for resources
>> ```
>>
>> Here I am lost, since in [emulate mode][1] they should be able to run
>> in parallel.
>>
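>> For completeness, the reason the extra jobs are held back can be
>> inspected with standard commands (a sketch):
>>
>> ```
>> $ squeue -o "%.8i %.9P %.10T %.20R"   # last column shows the pending reason
>> $ scontrol show job <jobid> | grep -i reason
>> $ sinfo -Nel                          # per-node CPU and state view
>> ```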
>>
>> The way I build from the source code:
>>
>> ```bash
>> git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
>> ./configure --enable-debug --enable-multiple-slurmd
>> make
>> sudo make install
>> ```
>>
>> --------
>>
>> ```
>> $ hostname -s
>> home
>> $ nproc
>> 4
>> ```
>>
>> ##### Compute node setup:
>>
>> ```
>> NodeName=home[1-4] NodeHostName=home NodeAddr=127.0.0.1 CPUs=1 ThreadsPerCore=1 Port=17001
>> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE:1
>> ```
>>
>> I have also tried: `NodeHostName=localhost`
>>
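>> As a sanity check, `slurmd -C` prints the node configuration that
>> slurmd detects on the machine, which can be compared against the
>> emulated values above (the output line below is roughly what I would
>> expect, not copied verbatim):
>>
>> ```
>> $ slurmd -C
>> NodeName=home CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=...
>> ```
>>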
>> `slurm.conf` file:
>>
>> ```bash
>> ControlMachine=home  # $(hostname -s)
>> ControlAddr=127.0.0.1
>> ClusterName=cluster
>> SlurmUser=alper
>> MailProg=/home/user/slurm_mail_prog.sh
>> MinJobAge=172800  # 48 h
>> SlurmdSpoolDir=/var/spool/slurmd
>> SlurmdLogFile=/var/log/slurm/slurmd.%n.log
>> SlurmdPidFile=/var/run/slurmd.%n.pid
>> AuthType=auth/munge
>> CryptoType=crypto/munge
>> MpiDefault=none
>> ProctrackType=proctrack/pgid
>> ReturnToService=1
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmdPort=6820
>> SlurmctldPort=6821
>> StateSaveLocation=/tmp/slurmstate
>> SwitchType=switch/none
>> TaskPlugin=task/none
>> InactiveLimit=0
>> Waittime=0
>> SchedulerType=sched/backfill
>> SelectType=select/linear
>> PriorityDecayHalfLife=0
>> PriorityUsageResetPeriod=NONE
>> AccountingStorageEnforce=limits
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStoreFlags=YES
>> JobCompType=jobcomp/none
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/none
>> NodeName=home[1-2] NodeHostName=home NodeAddr=127.0.0.1 CPUs=2 ThreadsPerCore=1 Port=17001
>> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE:1
>> ```
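>>
>> To double check what slurmctld actually loads from this file, the
>> running values can be queried once the daemons are up (a sketch):
>>
>> ```
>> $ scontrol show config | grep -E "SelectType|SchedulerType|SlurmdParameters"
>> $ scontrol show node home1   # CPUTot/CPUAlloc as seen by the controller
>> ```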
>>
>> `slurmdbd.conf`:
>>
>> ```bash
>> AuthType=auth/munge
>> AuthInfo=/var/run/munge/munge.socket.2
>> DbdAddr=localhost
>> DbdHost=localhost
>> SlurmUser=alper
>> DebugLevel=4
>> LogFile=/var/log/slurm/slurmdbd.log
>> PidFile=/var/run/slurmdbd.pid
>> StorageType=accounting_storage/mysql
>> StorageUser=alper
>> StoragePass=12345
>> ```
>>
>> The way I run slurm:
>>
>> ```
>> sudo /usr/local/sbin/slurmd
>> sudo /usr/local/sbin/slurmdbd &
>> sudo /usr/local/sbin/slurmctld -cDvvvvvv
>> ```
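>>
>> One thing I am not sure about: with `--enable-multiple-slurmd`, my
>> reading of the emulate-mode FAQ ([1] below) is that each emulated node
>> needs its own slurmd instance started with `-N <nodename>`, roughly
>> like this (a sketch only, I have not confirmed that this fixes the
>> issue):
>>
>> ```
>> sudo /usr/local/sbin/slurmd -N home1
>> sudo /usr/local/sbin/slurmd -N home2
>> sudo /usr/local/sbin/slurmd -N home3
>> sudo /usr/local/sbin/slurmd -N home4
>> ```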
>> ---------
>>
>> Related:
>> - minimum number of computers for a slurm cluster (
>> https://stackoverflow.com/a/27788311/2402577)
>> - [Running multiple worker daemons SLURM](
>> https://stackoverflow.com/a/40707189/2402577)
>> - https://stackoverflow.com/a/47009930/2402577
>>
>>
>>   [1]: https://slurm.schedmd.com/faq.html#multi_slurmd
>>
>