[slurm-users] Unable to run multiple jobs parallel in slurm >= v20 using emulate mode
Alper Alimoglu
alper.alimoglu at gmail.com
Mon Nov 22 17:21:03 UTC 2021
@Rodrigo Santibáñez Please see the updated setup related to my question.
I have compiled the latest tag `slurm-21-08-4-1` using `./configure
--enable-debug --enable-front-end` in order to test in (`non-emulator
git checkout slurm-21-08-4-1
./configure --enable-debug --enable-multiple-slurmd
sudo make install
I have the following lines in my `/usr/local/etc/slurm.conf` file:
NodeName=home NodeAddr= CPUs=4
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
I get the following error: `slurmd: fatal: Frontend not configured
correctly in slurm.conf. See FrontEndName in slurm.conf man page.`
When I try:
I get the following error message:
$ sudo slurmd -Dvvv
slurmd: debug: Log file re-opened
slurmd: error: _find_node_record: lookup failure for node "home"
slurmd: error: _find_node_record: lookup failure for node "home", alias
slurmd: error: slurmd initialization failed
Then I tried the following:
NodeName=home NodeHostName=localhost NodeAddr= CPUs=4
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
which keeps all the submitted jobs in a pending state with the following error:
`slurmctld: error: _slurm_rpc_node_registration node=home: Invalid node
name specified`
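A possible sanity check for the node-name errors above. This is only a minimal sketch, assuming `slurmd` looks itself up by the short hostname (as the `lookup failure for node "home"` message suggests) and that `slurm.conf` is plain text. The helper names `list_node_names` and `host_is_declared` are made up for illustration and are not part of Slurm; the matching is purely textual and does not expand ranges such as `home[1-4]`.

```bash
#!/bin/sh
# Hypothetical helper (not a Slurm tool): print the value of every
# NodeName= entry found at the start of a line in a slurm.conf-style file.
list_node_names() {
    conf_file="$1"
    # Capture the text after "NodeName=" up to the first whitespace.
    sed -n 's/^[[:space:]]*NodeName=\([^[:space:]]*\).*/\1/p' "$conf_file"
}

# Hypothetical helper: succeed (exit 0) only if the given short hostname
# appears verbatim as one of the declared NodeName values.
host_is_declared() {
    conf_file="$1"
    short_host="$2"
    # -x: whole-line match, -F: fixed string, -q: no output.
    list_node_names "$conf_file" | grep -qxF -- "$short_host"
}
```

It could be called as `host_is_declared /usr/local/etc/slurm.conf "$(hostname -s)" || echo "no matching NodeName"`; a mismatch here would at least be consistent with the lookup failure, though it does not prove that is the cause.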
On Sun, Nov 14, 2021 at 3:09 PM Alper Alimoglu <alper.alimoglu at gmail.com>
wrote:
> @Rodrigo Santibáñez I think I did not state my question clearly.
> I can successfully run `slurm` versions lower than 20, such as
> `19-05-8-1`. But with the same configuration, `slurm` version 20 or
> higher does not work properly.
> So I am lost trying to figure out the correct configuration for the
> latest stable slurm version.
> On Sun, Nov 14, 2021 at 2:11 AM Rodrigo Santibáñez <
> rsantibanez.uchile at gmail.com> wrote:
>> Hi Alper,
>> Maybe this is relevant to you:
>> *Can Slurm emulate nodes with more resources than physically exist on the
>> node?*
>> Yes. In the slurm.conf file, configure
>> *SlurmdParameters=config_overrides* and specify any desired node
>> resource specifications (*CPUs*, *Sockets*, *CoresPerSocket*,
>> *ThreadsPerCore*, and/or *TmpDisk*). Slurm will use the resource
>> specification for each node that is given in *slurm.conf* and will not
>> check these specifications against those actually found on the node. The
>> system would best be configured with *TaskPlugin=task/none*, so that
>> launched tasks can run on any available CPU under operating system control.
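>> A minimal sketch of what that FAQ entry describes, assuming a single
>> 4-core host named `home` that should advertise 8 CPUs (the node name and
>> the CPU count are illustrative assumptions, not values from the FAQ):
>> ```
>> # slurm.conf fragment: use the values below instead of the detected hardware
>> SlurmdParameters=config_overrides
>> TaskPlugin=task/none
>> NodeName=home CPUs=8 State=UNKNOWN
>> ```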
>> Best
>> On Sat, Nov 13, 2021 at 4:10 AM Alper Alimoglu <alper.alimoglu at gmail.com>
>> wrote:
>>> My goal is to set up a single-server `slurm` cluster (using only a
>>> single computer) that can run multiple jobs in parallel.
>>> On my node `nproc` returns 4, so I believe I can run 4 jobs in parallel
>>> if each uses a single core. To do so, I run the controller and the
>>> worker daemon on the same node.
>>> When I submit four jobs at the same time, only one of them runs and
>>> the other three stay pending with the following message:
>>> `queued and waiting for resources`.
>>> I am using `Ubuntu 20.04.3 LTS`. I have observed that this approach
>>> works on tag versions `<=19`:
>>> ```
>>> $ git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
>>> $ git checkout e2e21cb571ce88a6dd52989ec6fe30da8c4ef15f  # slurm-19-05-8-1
>>> $ ./configure --enable-debug --enable-front-end --enable-multiple-slurmd
>>> $ sudo make && sudo make install
>>> ```
>>> but it does not work on higher versions such as `slurm 20.02.1` or on
>>> the `master` branch.
>>> ------
>>> ```
>>> ❯ sinfo
>>> Sat Nov 06 14:17:04 2021
>>> home1  1  debug*  idle  1  1:1:1  1  0  1  (null)  none
>>> home2  1  debug*  idle  1  1:1:1  1  0  1  (null)  none
>>> home3  1  debug*  idle  1  1:1:1  1  0  1  (null)  none
>>> home4  1  debug*  idle  1  1:1:1  1  0  1  (null)  none
>>> $ srun -N1 sleep 10 # runs
>>> $ srun -N1 sleep 10 # queued and waiting for resources
>>> $ srun -N1 sleep 10 # queued and waiting for resources
>>> $ srun -N1 sleep 10 # queued and waiting for resources
>>> ```
>>> Here I am lost, since in [emulate-mode][1] they should be able to run
>>> in parallel.
>>> The way I build from the source code:
>>> ```bash
>>> git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
>>> ./configure --enable-debug --enable-multiple-slurmd
>>> make
>>> sudo make install
>>> ```
>>> --------
>>> ```
>>> $ hostname -s
>>> home
>>> $ nproc
>>> 4
>>> ```
>>> ##### Compute_node setup:
>>> ```
>>> NodeName=home[1-4] NodeHostName=home NodeAddr= CPUs=1
>>> ThreadsPerCore=1 Port=17001
>>> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>> Shared=FORCE:1
>>> ```
>>> I have also tried: `NodeHostName=localhost`
>>> `slurm.conf` file:
>>> ```bash
>>> ControlMachine=home # $(hostname -s)
>>> ControlAddr=
>>> ClusterName=cluster
>>> SlurmUser=alper
>>> MailProg=/home/user/slurm_mail_prog.sh
>>> MinJobAge=172800 # 48 h
>>> SlurmdSpoolDir=/var/spool/slurmd
>>> SlurmdLogFile=/var/log/slurm/slurmd.%n.log
>>> SlurmdPidFile=/var/run/slurmd.%n.pid
>>> AuthType=auth/munge
>>> CryptoType=crypto/munge
>>> MpiDefault=none
>>> ProctrackType=proctrack/pgid
>>> ReturnToService=1
>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>> SlurmdPort=6820
>>> SlurmctldPort=6821
>>> StateSaveLocation=/tmp/slurmstate
>>> SwitchType=switch/none
>>> TaskPlugin=task/none
>>> InactiveLimit=0
>>> Waittime=0
>>> SchedulerType=sched/backfill
>>> SelectType=select/linear
>>> PriorityDecayHalfLife=0
>>> PriorityUsageResetPeriod=NONE
>>> AccountingStorageEnforce=limits
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> AccountingStoreFlags=YES
>>> JobCompType=jobcomp/none
>>> JobAcctGatherFrequency=30
>>> JobAcctGatherType=jobacct_gather/none
>>> NodeName=home[1-2] NodeHostName=home NodeAddr= CPUs=2
>>> ThreadsPerCore=1 Port=17001
>>> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>> Shared=FORCE:1
>>> ```
>>> `slurmdbd.conf`:
>>> ```bash
>>> AuthType=auth/munge
>>> AuthInfo=/var/run/munge/munge.socket.2
>>> DbdAddr=localhost
>>> DbdHost=localhost
>>> SlurmUser=alper
>>> DebugLevel=4
>>> LogFile=/var/log/slurm/slurmdbd.log
>>> PidFile=/var/run/slurmdbd.pid
>>> StorageType=accounting_storage/mysql
>>> StorageUser=alper
>>> StoragePass=12345
>>> ```
>>> The way I run slurm:
>>> ```
>>> sudo /usr/local/sbin/slurmd
>>> sudo /usr/local/sbin/slurmdbd &
>>> sudo /usr/local/sbin/slurmctld -cDvvvvvv
>>> ```
>>> ---------
>>> Related:
>>> - minimum number of computers for a slurm cluster (
>>> https://stackoverflow.com/a/27788311/2402577)
>>> - [Running multiple worker daemons SLURM](
>>> https://stackoverflow.com/a/40707189/2402577)
>>> - https://stackoverflow.com/a/47009930/2402577
>>> [1]: https://slurm.schedmd.com/faq.html#multi_slurmd