[slurm-users] New slurm configuration - multiple jobs per host

Jake Jellinek jakejellinek at outlook.com
Thu May 26 20:36:55 UTC 2022


Hi Ole

I only added the oversubscribe option because it didn’t work without it - so in fact, it appears not to have made any difference

I thought the RealMemory option just meant that no jobs would be offered to a node unless it had AT LEAST that amount of RAM
My large node has more than 64GB of RAM (and more will be allocated later), but I have yet to hit a memory issue… still working on cores


jake at compute001:~$ slurmd -C
NodeName=compute001 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64359
UpTime=0-06:58:54
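
Presumably the slurm.conf entry for this node should carry those values across, so something like the following (just the figures from the slurmd -C output above, in place of my current CPUs=32 RealMemory=1000):

NodeName=compute001 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64359 State=UNKNOWN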


Thanks
Jake

> On 26 May 2022, at 21:11, Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> wrote:
> 
> Hi Jake,
> 
> Firstly, which Slurm version and which OS do you use?
> 
> Next, try simplifying by removing the oversubscribe configuration.  Read the slurm.conf manual page about oversubscribe; it looks a bit tricky.
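> For example, the partition definitions could be pared back to something like
>   PartitionName=interactive Nodes=compute002 MaxTime=INFINITE
>   PartitionName=simulation Nodes=compute001 MaxTime=30
> and OverSubscribe reintroduced later once jobs are sharing nodes as expected.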
> 
> The RealMemory=1000 is extremely low and might prevent jobs from starting!  Run "slurmd -C" on the nodes to read appropriate node parameters for slurm.conf.
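> For example, running
>   slurmd -C
> on each compute node prints a NodeName line that can be copied into slurm.conf as-is (just drop the UpTime line it also prints).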
> 
> I hope this helps.
> 
> /Ole
> 
> 
>> On 26-05-2022 21:12, Jake Jellinek wrote:
>> Hi
>> I am just building my first Slurm setup and have got everything running – well, almost.
>> I have a two-node configuration. All of my setup exists on a single Hyper-V server and I have divided up the resources to create my VMs
>> One node I will use for heavy duty work; this is called compute001
>> One node I will use for normal work; this is called compute002
>> My compute node specification in slurm.conf is
>> NodeName=DEFAULT CPUs=1 RealMemory=1000 State=UNKNOWN
>> NodeName=compute001 CPUs=32
>> NodeName=compute002 CPUs=2
>> The partition specification is
>> PartitionName=DEFAULT State=UP
>> PartitionName=interactive Nodes=compute002 MaxTime=INFINITE OverSubscribe=FORCE
>> PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE
>> I have added the OverSubscribe=FORCE option as I want more than one job to be able to land on my interactive/simulation queues.
>> All of the nodes and the cluster master start up fine and they all talk to each other, but no matter what I do, I cannot get my cluster to accept more than one job per node.
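>> For example, submitting two trivial jobs one after the other, something along the lines of
>>   sbatch -p interactive --wrap="sleep 300"
>>   sbatch -p interactive --wrap="sleep 300"
>> leaves the second job queued until the first finishes, rather than both running on compute002 at the same time.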
>> Can you help me determine where I am going wrong?
>> Thanks a lot
>> Jake
>> The entire slurm.conf is pasted below
>> # slurm.conf file generated by configurator.html.
>> ClusterName=pm-slurm
>> SlurmctldHost=slurm-master
>> MpiDefault=none
>> ProctrackType=proctrack/cgroup
>> ReturnToService=2
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmdPidFile=/var/run/slurmd.pid
>> SlurmdPort=6818
>> SlurmdSpoolDir=/var/spool/slurmd
>> SlurmUser=slurm
>> StateSaveLocation=/home/slurm/var/spool/slurmctld
>> SwitchType=switch/none
>> TaskPlugin=task/cgroup
>> #
>> # TIMERS
>> InactiveLimit=0
>> KillWait=30
>> MinJobAge=300
>> SlurmctldTimeout=120
>> SlurmdTimeout=300
>> Waittime=0
>> #
>> # SCHEDULING
>> SchedulerType=sched/backfill
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core_Memory
>> #
>> # LOGGING AND ACCOUNTING
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/cgroup
>> SlurmctldDebug=info
>> SlurmctldLogFile=/var/log/slurmctld.log
>> SlurmdDebug=info
>> SlurmdLogFile=/var/log/slurmd.log
>> # COMPUTE NODES
>> NodeName=DEFAULT CPUs=1 RealMemory=1000 State=UNKNOWN
>> NodeName=compute001 CPUs=32
>> NodeName=compute002 CPUs=2
>> PartitionName=DEFAULT State=UP
>> PartitionName=interactive Nodes=compute002 MaxTime=INFINITE OverSubscribe=FORCE
>> PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE
> 
> 

