[slurm-users] Unable to run multiple jobs parallel in slurm >= v20 using emulate mode

Alper Alimoglu alper.alimoglu at gmail.com
Mon Nov 22 17:21:03 UTC 2021


@Rodrigo Santibáñez Please see the updated setup related to my question.

I have compiled the latest tag `slurm-21-08-4-1` using `./configure
--enable-debug --enable-front-end` in order to test in front-end
(non-emulated) mode:

```
git checkout slurm-21-08-4-1
./configure --enable-debug --enable-front-end
make
sudo make install
```

I have the following lines in my `/usr/local/etc/slurm.conf` file:

```
NodeName=home NodeAddr=127.0.0.1 CPUs=4
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```

I get the following error: `slurmd: fatal: Frontend not configured
correctly in slurm.conf. See FrontEndName in slurm.conf man page.`


When I try:
```
FrontEndName=home
```

I get the following error message:

```
$ sudo slurmd -Dvvv
slurmd: debug:  Log file re-opened
slurmd: error: _find_node_record: lookup failure for node "home"
slurmd: error: _find_node_record: lookup failure for node "home", alias "home"
slurmd: error: slurmd initialization failed
```

Then I tried the following:

```
FrontEndName=127.0.0.1
NodeName=home NodeHostName=localhost NodeAddr=127.0.0.1 CPUs=4
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```

which keeps all the submitted jobs in a pending state, with the following
error message:

`slurmctld: error: _slurm_rpc_node_registration node=home: Invalid node
name specified`
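
For completeness, here is my current best guess at a front-end layout, based
only on my reading of the slurm.conf man page (the `FrontendName` /
`FrontendAddr` spelling and values are my assumptions; this exact combination
is untested on my side):

```
FrontendName=home FrontendAddr=127.0.0.1
NodeName=home[1-4] NodeHostName=home NodeAddr=127.0.0.1 CPUs=1
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```

Is this the intended structure for a `--enable-front-end` build, or is
front-end mode the wrong tool for a single-machine setup?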



On Sun, Nov 14, 2021 at 3:09 PM Alper Alimoglu <alper.alimoglu at gmail.com>
wrote:

> @Rodrigo Santibáñez I think I did not make my question clear.
>
> I am able to successfully run `slurm` versions lower than 20, such as
> `19-05-8-1`. But with the same configuration, slurm version 20 or higher
> does not work properly.
> So I am lost trying to figure out the correct configuration structure
> for the latest stable slurm version.
>
>
>
>
> On Sun, Nov 14, 2021 at 2:11 AM Rodrigo Santibáñez <
> rsantibanez.uchile at gmail.com> wrote:
>
>> Hi Alper,
>>
>> Maybe this is relevant to you:
>>
>> *Can Slurm emulate nodes with more resources than physically exist on the
>> node?*
>> Yes. In the slurm.conf file, configure
>> *SlurmdParameters=config_overrides* and specify any desired node
>> resource specifications (*CPUs*, *Sockets*, *CoresPerSocket*,
>> *ThreadsPerCore*, and/or *TmpDisk*). Slurm will use the resource
>> specification for each node that is given in *slurm.conf* and will not
>> check these specifications against those actually found on the node. The
>> system would best be configured with *TaskPlugin=task/none*, so that
>> launched tasks can run on any available CPU under operating system control.
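>>
>> A minimal sketch of those settings in a slurm.conf, assuming a 4-CPU
>> host named `home` (the node line is only an illustration, adjust to
>> your host):
>>
>> ```
>> SlurmdParameters=config_overrides
>> TaskPlugin=task/none
>> NodeName=home CPUs=4 ThreadsPerCore=1
>> ```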
>>
>> Best
>>
>> On Sat, Nov 13, 2021 at 4:10 AM Alper Alimoglu <alper.alimoglu at gmail.com>
>> wrote:
>>
>>> My goal is to set up a single-server `slurm` cluster (using only a
>>> single computer) that can run multiple jobs in parallel.
>>>
>>> On my node `nproc` returns 4, so I believe I can run 4 jobs in parallel
>>> if each uses a single core. To do this, I run the controller and the
>>> worker daemon on the same node.
>>> When I submit four jobs at the same time, only one of them runs; the
>>> other three stay pending with the message `queued and waiting for
>>> resources`.
>>>
>>> I am using `Ubuntu 20.04.3 LTS`. I have observed that this approach was
>>> working on tag versions `<=19`:
>>>
>>> ```
>>> $ git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
>>> $ git checkout e2e21cb571ce88a6dd52989ec6fe30da8c4ef15f  # slurm-19-05-8-1
>>> $ ./configure --enable-debug --enable-front-end --enable-multiple-slurmd
>>> $ sudo make && sudo make install
>>> ```
>>>
>>> but it does not work on higher versions like `slurm 20.02.1` or on the
>>> `master` branch.
>>>
>>> ------
>>>
>>> ```
>>> ❯ sinfo
>>> Sat Nov 06 14:17:04 2021
>>> NODELIST   NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>>> home1          1    debug*  idle    1 1:1:1      1        0      1   (null) none
>>> home2          1    debug*  idle    1 1:1:1      1        0      1   (null) none
>>> home3          1    debug*  idle    1 1:1:1      1        0      1   (null) none
>>> home4          1    debug*  idle    1 1:1:1      1        0      1   (null) none
>>> $ srun -N1 sleep 10  # runs
>>> $ srun -N1 sleep 10  # queued and waiting for resources
>>> $ srun -N1 sleep 10  # queued and waiting for resources
>>> $ srun -N1 sleep 10  # queued and waiting for resources
>>> ```
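>>>
>>> The reason the jobs stay pending can be printed with standard `squeue`
>>> format specifiers (`%i` job id, `%T` state, `%r` reason), e.g.:
>>>
>>> ```
>>> $ squeue --format="%.8i %.10T %r"
>>> ```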
>>>
>>> Here I get lost, since in [emulate mode][1] the jobs should be able to
>>> run in parallel.
>>>
>>>
>>> The way I build from the source code:
>>>
>>> ```bash
>>> git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
>>> ./configure --enable-debug --enable-multiple-slurmd
>>> make
>>> sudo make install
>>> ```
>>>
>>> --------
>>>
>>> ```
>>> $ hostname -s
>>> home
>>> $ nproc
>>> 4
>>> ```
>>>
>>> ##### Compute node setup:
>>>
>>> ```
>>> NodeName=home[1-4] NodeHostName=home NodeAddr=127.0.0.1 CPUs=1
>>> ThreadsPerCore=1 Port=17001
>>> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>> Shared=FORCE:1
>>> ```
>>>
>>> I have also tried: `NodeHostName=localhost`
>>>
>>> `slurm.conf` file:
>>>
>>> ```bash
>>> ControlMachine=home  # $(hostname -s)
>>> ControlAddr=127.0.0.1
>>> ClusterName=cluster
>>> SlurmUser=alper
>>> MailProg=/home/user/slurm_mail_prog.sh
>>> MinJobAge=172800  # 48 h
>>> SlurmdSpoolDir=/var/spool/slurmd
>>> SlurmdLogFile=/var/log/slurm/slurmd.%n.log
>>> SlurmdPidFile=/var/run/slurmd.%n.pid
>>> AuthType=auth/munge
>>> CryptoType=crypto/munge
>>> MpiDefault=none
>>> ProctrackType=proctrack/pgid
>>> ReturnToService=1
>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>> SlurmdPort=6820
>>> SlurmctldPort=6821
>>> StateSaveLocation=/tmp/slurmstate
>>> SwitchType=switch/none
>>> TaskPlugin=task/none
>>> InactiveLimit=0
>>> Waittime=0
>>> SchedulerType=sched/backfill
>>> SelectType=select/linear
>>> PriorityDecayHalfLife=0
>>> PriorityUsageResetPeriod=NONE
>>> AccountingStorageEnforce=limits
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> AccountingStoreFlags=YES
>>> JobCompType=jobcomp/none
>>> JobAcctGatherFrequency=30
>>> JobAcctGatherType=jobacct_gather/none
>>> NodeName=home[1-2] NodeHostName=home NodeAddr=127.0.0.1 CPUs=2
>>> ThreadsPerCore=1 Port=17001
>>> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>> Shared=FORCE:1
>>> ```
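>>>
>>> One thing I am not sure about: the [emulate-mode][1] FAQ gives each
>>> emulated slurmd its own port, while my node lines share `Port=17001`.
>>> A sketch of what I believe it intends (the port range is my guess):
>>>
>>> ```
>>> NodeName=home[1-2] NodeHostName=home NodeAddr=127.0.0.1 CPUs=2 Port=17001-17002
>>> ```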
>>>
>>> `slurmdbd.conf`:
>>>
>>> ```bash
>>> AuthType=auth/munge
>>> AuthInfo=/var/run/munge/munge.socket.2
>>> DbdAddr=localhost
>>> DbdHost=localhost
>>> SlurmUser=alper
>>> DebugLevel=4
>>> LogFile=/var/log/slurm/slurmdbd.log
>>> PidFile=/var/run/slurmdbd.pid
>>> StorageType=accounting_storage/mysql
>>> StorageUser=alper
>>> StoragePass=12345
>>> ```
>>>
>>> The way I run slurm:
>>>
>>> ```
>>> sudo /usr/local/sbin/slurmd
>>> sudo /usr/local/sbin/slurmdbd &
>>> sudo /usr/local/sbin/slurmctld -cDvvvvvv
>>> ```
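>>>
>>> My understanding of the same FAQ is that with `--enable-multiple-slurmd`
>>> each emulated node needs its own slurmd instance, started with an
>>> explicit node name, e.g. for the `home[1-2]` names above:
>>>
>>> ```
>>> sudo /usr/local/sbin/slurmd -N home1
>>> sudo /usr/local/sbin/slurmd -N home2
>>> ```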
>>> ---------
>>>
>>> Related:
>>> - [Minimum number of computers for a slurm cluster](https://stackoverflow.com/a/27788311/2402577)
>>> - [Running multiple worker daemons SLURM](https://stackoverflow.com/a/40707189/2402577)
>>> - https://stackoverflow.com/a/47009930/2402577
>>>
>>>
>>>   [1]: https://slurm.schedmd.com/faq.html#multi_slurmd
>>>
>>