[slurm-users] New slurm configuration - multiple jobs per host

Lyn Gerner schedulerqueen at gmail.com
Fri Jun 3 00:51:29 UTC 2022


Jake, my hunch is that your jobs are getting hung up on memory allocation:
with memory as a consumable resource (CR_Core_Memory) and no default memory
limit configured, Slurm assigns all of a node's memory to each job as it
runs, so a second job cannot fit on the node. You can verify with "scontrol
show job". If that's what's happening, try setting a DefMemPerCPU value for
your partition(s).
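
For example, something along these lines (the 1024 MB figure is purely
illustrative; pick a per-CPU default that suits your nodes):

  PartitionName=interactive Nodes=compute002 MaxTime=INFINITE OverSubscribe=FORCE DefMemPerCPU=1024
  PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE DefMemPerCPU=1024

You can check what a running job was actually allocated with (substitute a
real job ID):

  scontrol show job <jobid> | grep -i mem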

Best of luck,
Lyn

On Thu, May 26, 2022 at 1:39 PM Jake Jellinek <jakejellinek at outlook.com>
wrote:

> Hi Ole
>
> I only added the oversubscribe option because it didn’t work without it -
> so in fact, it appears not to have made any difference.
>
> I thought the RealMemory option just meant that no jobs would be offered
> to a node that didn’t have AT LEAST that amount of RAM.
> My large node has more than 64GB RAM (and more will be allocated later)
> but I have yet to get to a memory issue…still working on cores
>
>
> jake at compute001:~$ slurmd -C
> NodeName=compute001 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64359
> UpTime=0-06:58:54
>
>
> Thanks
> Jake
>
> > On 26 May 2022, at 21:11, Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
> wrote:
> >
> > Hi Jake,
> >
> > Firstly, which Slurm version and which OS do you use?
> >
> > Next, try simplifying by removing the oversubscribe configuration.  Read
> > the slurm.conf manual page about OverSubscribe; it looks a bit tricky.
> >
> > The RealMemory=1000 is extremely low and might prevent jobs from
> > starting!  Run "slurmd -C" on the nodes to read appropriate node
> > parameters for slurm.conf.
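> >
> > For example, the values that "slurmd -C" prints can be pasted straight
> > into slurm.conf as the node definition (illustrative; substitute the
> > output from your own nodes):
> >
> >   NodeName=compute001 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64359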
> >
> > I hope this helps.
> >
> > /Ole
> >
> >
> >> On 26-05-2022 21:12, Jake Jellinek wrote:
> >> Hi
> >> I am just building my first Slurm setup and have got everything running
> >> – well, almost.
> >> I have a two-node configuration. All of my setup exists on a single
> >> Hyper-V server, and I have divided up the resources to create my VMs.
> >> One node I will use for heavy duty work; this is called compute001
> >> One node I will use for normal work; this is called compute002
> >> My compute node specification in slurm.conf is
> >> NodeName=DEFAULT CPUs=1 RealMemory=1000 State=UNKNOWN
> >> NodeName=compute001 CPUs=32
> >> NodeName=compute002 CPUs=2
> >> The partition specification is
> >> PartitionName=DEFAULT State=UP
> >> PartitionName=interactive Nodes=compute002 MaxTime=INFINITE OverSubscribe=FORCE
> >> PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE
> >> I have added the OverSubscribe=FORCE option as I want more than one job
> >> to be able to land on my interactive/simulation queues.
> >> All of the nodes and the cluster master start up fine and they all talk
> >> to each other, but no matter what I do, I cannot get my cluster to
> >> accept more than one job per node.
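> >> For example (a minimal version of what I am trying, using simple sleep
> >> jobs), the second of these submissions just sits pending:
> >>   sbatch -p interactive --wrap 'sleep 300'
> >>   sbatch -p interactive --wrap 'sleep 300'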
> >> Can you help me determine where I am going wrong?
> >> Thanks a lot
> >> Jake
> >> The entire slurm.conf is pasted below
> >> # slurm.conf file generated by configurator.html.
> >> ClusterName=pm-slurm
> >> SlurmctldHost=slurm-master
> >> MpiDefault=none
> >> ProctrackType=proctrack/cgroup
> >> ReturnToService=2
> >> SlurmctldPidFile=/var/run/slurmctld.pid
> >> SlurmctldPort=6817
> >> SlurmdPidFile=/var/run/slurmd.pid
> >> SlurmdPort=6818
> >> SlurmdSpoolDir=/var/spool/slurmd
> >> SlurmUser=slurm
> >> StateSaveLocation=/home/slurm/var/spool/slurmctld
> >> SwitchType=switch/none
> >> TaskPlugin=task/cgroup
> >> #
> >> # TIMERS
> >> InactiveLimit=0
> >> KillWait=30
> >> MinJobAge=300
> >> SlurmctldTimeout=120
> >> SlurmdTimeout=300
> >> Waittime=0
> >> #
> >> # SCHEDULING
> >> SchedulerType=sched/backfill
> >> SelectType=select/cons_tres
> >> SelectTypeParameters=CR_Core_Memory
> >> #
> >> # LOGGING AND ACCOUNTING
> >> JobAcctGatherFrequency=30
> >> JobAcctGatherType=jobacct_gather/cgroup
> >> SlurmctldDebug=info
> >> SlurmctldLogFile=/var/log/slurmctld.log
> >> SlurmdDebug=info
> >> SlurmdLogFile=/var/log/slurmd.log
> >> # COMPUTE NODES
> >> NodeName=DEFAULT CPUs=1 RealMemory=1000 State=UNKNOWN
> >> NodeName=compute001 CPUs=32
> >> NodeName=compute002 CPUs=2
> >> PartitionName=DEFAULT State=UP
> >> PartitionName=interactive Nodes=compute002 MaxTime=INFINITE OverSubscribe=FORCE
> >> PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE
> >
> >
>