<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
</head>
<body style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">
Hello Joe,
<div><br>
</div>
<div>You haven't defined any memory allocation or oversubscription in your slurm.conf, so by default Slurm allocates a full node's worth of memory to each job. There are several ways to handle this, but what you probably want is to make both CPU and memory consumable resources with the parameter:</div>
<div><br>
</div>
<div>SelectTypeParameters=CR_CPU_Memory</div>
<div><br>
</div>
<div>Then you'll want to define the amount of memory (in megabytes) on each node as part of its node definition with</div>
<div><br>
</div>
<div>RealMemory=</div>
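<div><br></div>
<div>For example, if your compute nodes had roughly 128 GB of RAM (that number is just an illustration; use the RealMemory value that slurmd -C reports on your nodes), the node line could look something like:</div>
<div><br></div>
<div>NodeName=hpc2-comp[01-02] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=128000 State=UNKNOWN</div>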
<div><br>
</div>
<div>Lastly, you'll need to define a default memory allocation (in megabytes) per job, typically as memory per CPU, with</div>
<div><br>
</div>
<div>DefMemPerCPU=</div>
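<div><br></div>
<div>For example (an illustrative value only; with 32 CPUs and the hypothetical 128000 MB above, this would let a default 1-CPU job use about 1/32 of the node's memory):</div>
<div><br></div>
<div>DefMemPerCPU=4000</div>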
<div><br>
</div>
<div>With those changes, a job that doesn't request memory explicitly will by default be allocated #CPUs x DefMemPerCPU megabytes. You can then use either --mem or --mem-per-cpu at submission time to request more or less memory for a job.</div>
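<div><br></div>
<div>For instance, with your submission line you could request memory explicitly like this (the numbers are only placeholders; --mem is per node, --mem-per-cpu is per allocated CPU):</div>
<div><br></div>
<div>sbatch -J test -a1-10 -t 999:00:00 -N 1 -n 1 -p debug --mem=2000 sbatch.slurm</div>
<div>sbatch -J test -a1-10 -t 999:00:00 -N 1 -n 1 -p debug --mem-per-cpu=2000 sbatch.slurm</div>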
<div><br>
</div>
<div>There's also oversubscription, where you allow jobs to collectively use more memory than is available on the node. In that case you don't technically need to define memory for each job, but you run the risk that a single job uses all of it and the nodes hit OOM errors.</div>
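<div><br></div>
<div>As a rough sketch of that approach (I wouldn't recommend it over tracking memory): if you leave memory out of the consumable resources, for example</div>
<div><br></div>
<div>SelectType=select/cons_tres</div>
<div>SelectTypeParameters=CR_CPU</div>
<div><br></div>
<div>then Slurm schedules on CPUs only and doesn't enforce per-job memory limits, which effectively allows memory to be oversubscribed, with the OOM risk mentioned above.</div>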
<div><br>
<div>
<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">
Regards,<br>
<br>
--<br>
Willy Markuske<br>
<br>
HPC Systems Engineer</div>
<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">
MS Data Science and Engineering<br>
SDSC - Research Data Services<br>
(619) 519-4435<br>
<div>wmarkuske@sdsc.edu</div>
</div>
</div>
<div><br>
<blockquote type="cite">
<div>On Jun 16, 2023, at 12:43, Joe Waliga <jwaliga@umich.edu> wrote:</div>
<br class="Apple-interchange-newline">
<div>
<div>Hello,<br>
<br>
(This is my first time submitting a question to the list)<br>
<br>
We have a test-HPC with 1 login node and 2 compute nodes. When we submit 90 jobs to the test-HPC, only one job runs per node. We seem to be allocating all of a node's memory to that one job, and the other jobs can't run until the memory is freed up.<br>
<br>
Any ideas on what we need to change in order to free up the memory?<br>
<br>
~ ~<br>
<br>
We noticed this from the 'slurmctld.log' ...<br>
<br>
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_*<br>
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_*<br>
<br>
The test-HPC is running on hardware, but we also created a test-HPC using a set of 3 VMs built with Vagrant on a VirtualBox backend.<br>
<br>
I have included some of the 'slurmctld.log' file, the batch submission script, the slurm.conf file (of the hardware-based test-HPC), and the 'Vagrantfile' (in case someone wants to recreate our test-HPC in a set of VMs).<br>
<br>
- Joe<br>
<br>
<br>
----- (some of) slurmctld.log -----------------------------------------<br>
<br>
[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)<br>
[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)<br>
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_7(71)<br>
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_7(71)<br>
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_7(71) on 0 nodes<br>
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: test 0 fail: insufficient resources<br>
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: no job_resources info for JobId=71_7(71) rc=-1<br>
[2023-06-15T20:11:32.631] debug2: select/cons_tres: select_p_job_test: evaluating JobId=71_7(71)<br>
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: JobId=71_7(71) node_mode:Normal alloc_mode:Test_Only<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list & exc_cores<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]<br>
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: nodes: min:1 max:1 requested:1 avail:2<br>
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)<br>
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)<br>
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_7(71) on 2 nodes<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/enter<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15<br>
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: SELECT_TYPE: 32 CPUs on hpc2-comp01(state:1), mem 1/1<br>
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Node:hpc2-comp01 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 ThreadsPerCore:2<br>
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[0] Cores:8<br>
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[1] Cores:8<br>
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: SELECT_TYPE: 32 CPUs on hpc2-comp02(state:1), mem 1/1<br>
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Node:hpc2-comp02 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 ThreadsPerCore:2<br>
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[0] Cores:8<br>
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[1] Cores:8<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/elim_nodes<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15<br>
[2023-06-15T20:11:32.632] select/cons_tres: _eval_nodes: set:0 consec CPUs:64 nodes:2:hpc2-comp[01-02] begin:0 end:1 required:-1 weight:511<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/choose_nodes<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp01<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/sync_cores<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp01<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15<br>
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: test 0 pass: test_only<br>
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: no job_resources info for JobId=71_7(71) rc=0<br>
[2023-06-15T20:11:32.632] debug3: sched: JobId=71_*. State=PENDING. Reason=Resources. Priority=4294901759. Partition=debug.<br>
[2023-06-15T20:11:56.645] debug: Spawning ping agent for hpc2-comp[01-02]<br>
[2023-06-15T20:11:56.645] debug2: Spawning RPC agent for msg_type REQUEST_PING<br>
[2023-06-15T20:11:56.646] debug3: Tree sending to hpc2-comp01<br>
[2023-06-15T20:11:56.646] debug2: Tree head got back 0 looking for 2<br>
[2023-06-15T20:11:56.646] debug3: Tree sending to hpc2-comp02<br>
[2023-06-15T20:11:56.647] debug2: Tree head got back 1<br>
[2023-06-15T20:11:56.647] debug2: Tree head got back 2<br>
[2023-06-15T20:11:56.651] debug2: node_did_resp hpc2-comp01<br>
[2023-06-15T20:11:56.651] debug2: node_did_resp hpc2-comp02<br>
[2023-06-15T20:11:57.329] debug: sched/backfill: _attempt_backfill: beginning<br>
[2023-06-15T20:11:57.329] debug: sched/backfill: _attempt_backfill: 1 jobs to backfill<br>
[2023-06-15T20:11:57.329] debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=71_*.<br>
[2023-06-15T20:11:57.329] debug2: select/cons_tres: select_p_job_test: evaluating JobId=71_*<br>
[2023-06-15T20:11:57.329] select/cons_tres: common_job_test: JobId=71_* node_mode:Normal alloc_mode:Will_Run<br>
[2023-06-15T20:11:57.329] select/cons_tres: core_array_log: node_list & exc_cores<br>
[2023-06-15T20:11:57.329] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]<br>
[2023-06-15T20:11:57.329] select/cons_tres: common_job_test: nodes: min:1 max:1 requested:1 avail:2<br>
[2023-06-15T20:11:57.330] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)<br>
[2023-06-15T20:11:57.330] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)<br>
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_*<br>
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_*<br>
[2023-06-15T20:11:57.330] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_* on 0 nodes<br>
[2023-06-15T20:11:57.330] select/cons_tres: _job_test: SELECT_TYPE: test 0 fail: insufficient resources<br>
[2023-06-15T20:11:57.330] select/cons_tres: _will_run_test: JobId=71_5(76): overlap=1<br>
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_5(76) action:normal<br>
[2023-06-15T20:11:57.330] ====================<br>
[2023-06-15T20:11:57.330] JobId=71_5(76) nhosts:1 ncpus:1 node_req:1 nodes=hpc2-comp01<br>
[2023-06-15T20:11:57.330] Node[0]:<br>
[2023-06-15T20:11:57.330] Mem(MB):1:0 Sockets:2 Cores:8 CPUs:2:0<br>
[2023-06-15T20:11:57.330] Socket[0] Core[0] is allocated<br>
[2023-06-15T20:11:57.330] --------------------<br>
[2023-06-15T20:11:57.330] cpu_array_value[0]:2 reps:1<br>
[2023-06-15T20:11:57.330] ====================<br>
[2023-06-15T20:11:57.330] debug3: select/cons_tres: job_res_rm_job: removed JobId=71_5(76) from part debug row 0<br>
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_5(76) finished<br>
[2023-06-15T20:11:57.330] select/cons_tres: _will_run_test: JobId=71_6(77): overlap=1<br>
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_6(77) action:normal<br>
[2023-06-15T20:11:57.330] ====================<br>
[2023-06-15T20:11:57.330] JobId=71_6(77) nhosts:1 ncpus:1 node_req:1 nodes=hpc2-comp02<br>
[2023-06-15T20:11:57.330] Node[0]:<br>
[2023-06-15T20:11:57.330] Mem(MB):1:0 Sockets:2 Cores:8 CPUs:2:0<br>
[2023-06-15T20:11:57.330] Socket[0] Core[0] is allocated<br>
[2023-06-15T20:11:57.330] --------------------<br>
[2023-06-15T20:11:57.330] cpu_array_value[0]:2 reps:1<br>
[2023-06-15T20:11:57.330] ====================<br>
<br>
----- batch script -----------------------------------<br>
<br>
#!/bin/bash<br>
<br>
echo "Running on: ${SLURM_CLUSTER_NAME}, node list: ${SLURM_JOB_NODELIST}, node names: ${SLURMD_NODENAME} in: `pwd` at `date`"<br>
echo "SLURM_NTASKS: ${SLURM_NTASKS} SLURM_TASKS_PER_NODE: ${SLURM_TASKS_PER_NODE} "<br>
echo "SLURM_ARRAY_TASK_ID: ${SLURM_ARRAY_TASK_ID}"<br>
echo "SLURM_MEM_PER_CPU: ${SLURM_MEM_PER_CPU}"<br>
<br>
sleep 3600<br>
<br>
echo "END"<br>
<br>
Here is the sbatch command to run it:<br>
<br>
sbatch -J test -a1-10 -t 999:00:00 -N 1 -n 1 -p debug sbatch.slurm<br>
<br>
----- slurm.conf -----------------------------------<br>
<br>
# slurm.conf file generated by configurator.html.<br>
# Put this file on all nodes of your cluster.<br>
# See the slurm.conf man page for more information.<br>
#<br>
ClusterName=cluster<br>
SlurmctldHost=hpc2-comp00<br>
#SlurmctldHost=<br>
#DisableRootJobs=NO<br>
#EnforcePartLimits=NO<br>
#Epilog=<br>
#EpilogSlurmctld=<br>
#FirstJobId=1<br>
#MaxJobId=67043328<br>
#GresTypes=<br>
#GroupUpdateForce=0<br>
#GroupUpdateTime=600<br>
#JobFileAppend=0<br>
#JobRequeue=1<br>
#JobSubmitPlugins=lua<br>
#KillOnBadExit=0<br>
#LaunchType=launch/slurm<br>
#Licenses=foo*4,bar<br>
#MailProg=/bin/mail<br>
#MaxJobCount=10000<br>
#MaxStepCount=40000<br>
#MaxTasksPerNode=512<br>
MpiDefault=none<br>
#MpiParams=ports=#-#<br>
#PluginDir=<br>
#PlugStackConfig=<br>
#PrivateData=jobs<br>
ProctrackType=proctrack/linuxproc<br>
#Prolog=<br>
#PrologFlags=<br>
#PrologSlurmctld=<br>
#PropagatePrioProcess=0<br>
#PropagateResourceLimits=<br>
#PropagateResourceLimitsExcept=<br>
#RebootProgram=<br>
ReturnToService=1<br>
SlurmctldPidFile=/var/run/slurmctld.pid<br>
SlurmctldPort=6817<br>
SlurmdPidFile=/var/run/slurmd.pid<br>
SlurmdPort=6818<br>
SlurmdSpoolDir=/var/spool/slurmd<br>
SlurmUser=slurm<br>
#SlurmdUser=root<br>
SrunPortRange=60001-60005<br>
#SrunEpilog=<br>
#SrunProlog=<br>
StateSaveLocation=/var/spool/slurmctld<br>
SwitchType=switch/none<br>
#TaskEpilog=<br>
#TaskPlugin=task/affinity,task/cgroup<br>
TaskPlugin=task/none<br>
#TaskProlog=<br>
#TopologyPlugin=topology/tree<br>
#TmpFS=/tmp<br>
#TrackWCKey=no<br>
#TreeWidth=<br>
#UnkillableStepProgram=<br>
#UsePAM=0<br>
#<br>
#<br>
# TIMERS<br>
#BatchStartTimeout=10<br>
#CompleteWait=0<br>
#EpilogMsgTime=2000<br>
#GetEnvTimeout=2<br>
#HealthCheckInterval=0<br>
#HealthCheckProgram=<br>
InactiveLimit=0<br>
KillWait=30<br>
#MessageTimeout=10<br>
#ResvOverRun=0<br>
MinJobAge=300<br>
#OverTimeLimit=0<br>
SlurmctldTimeout=120<br>
SlurmdTimeout=300<br>
#UnkillableStepTimeout=60<br>
#VSizeFactor=0<br>
Waittime=0<br>
#<br>
#<br>
# SCHEDULING<br>
#DefMemPerCPU=0<br>
#MaxMemPerCPU=0<br>
#SchedulerTimeSlice=30<br>
SchedulerType=sched/backfill<br>
SelectType=select/cons_tres<br>
#<br>
#<br>
# JOB PRIORITY<br>
#PriorityFlags=<br>
#PriorityType=priority/basic<br>
#PriorityDecayHalfLife=<br>
#PriorityCalcPeriod=<br>
#PriorityFavorSmall=<br>
#PriorityMaxAge=<br>
#PriorityUsageResetPeriod=<br>
#PriorityWeightAge=<br>
#PriorityWeightFairshare=<br>
#PriorityWeightJobSize=<br>
#PriorityWeightPartition=<br>
#PriorityWeightQOS=<br>
#<br>
#<br>
# LOGGING AND ACCOUNTING<br>
#AccountingStorageEnforce=0<br>
AccountingStorageHost=hpc2-comp00<br>
AccountingStoragePass=/var/run/munge/munge.socket.2<br>
AccountingStoragePort=6819<br>
AccountingStorageType=accounting_storage/slurmdbd<br>
AccountingStorageUser=slurm<br>
AccountingStoreFlags=job_comment,job_env,job_extra,job_script<br>
#JobCompHost=localhost<br>
#JobCompLoc=slurm_jobcomp_db<br>
##JobCompParams=<br>
#JobCompPass=/var/run/munge/munge.socket.2<br>
#JobCompPort=3306<br>
#JobCompType=jobcomp/mysql<br>
JobCompType=jobcomp/none<br>
#JobCompUser=slurm<br>
#JobContainerType=job_container/none<br>
JobAcctGatherFrequency=30<br>
JobAcctGatherType=jobacct_gather/linux<br>
# Enabled next line - 06-15-2023<br>
SlurmctldDebug=debug5<br>
SlurmctldLogFile=/var/log/slurmctld.log<br>
# Enabled next line - 06-15-2023<br>
SlurmdDebug=debug5<br>
SlurmdLogFile=/var/log/slurmd.log<br>
#SlurmSchedLogFile=<br>
<br>
# Added next line : 06-15-2023<br>
DebugFlags=Cgroup,CPU_Bind,Data,Gres,NodeFeatures,SelectType,Steps,TraceJobs<br>
<br>
#<br>
# POWER SAVE SUPPORT FOR IDLE NODES (optional)<br>
#SuspendProgram=<br>
#ResumeProgram=<br>
#SuspendTimeout=<br>
#ResumeTimeout=<br>
#ResumeRate=<br>
#SuspendExcNodes=<br>
#SuspendExcParts=<br>
#SuspendRate=<br>
#SuspendTime=<br>
#<br>
#<br>
# COMPUTE NODES<br>
NodeName=hpc2-comp[01-02] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN<br>
PartitionName=debug Nodes=hpc2-comp[01-02] Default=YES MaxTime=INFINITE State=UP<br>
<br>
----- Vagrantfile file -----------------------------------<br>
<br>
# -*- mode: ruby -*-<br>
# vi: set ft=ruby :<br>
<br>
# All Vagrant configuration is done below. The "2" in Vagrant.configure<br>
# configures the configuration version (we support older styles for<br>
# backwards compatibility). Please don't change it unless you know what<br>
# you're doing.<br>
Vagrant.configure("2") do |config|<br>
# The most common configuration options are documented and commented below.<br>
# For a complete reference, please see the online documentation at<br>
# https://docs.vagrantup.com.<br>
<br>
# Every Vagrant development environment requires a box. You can search for<br>
# boxes at https://vagrantcloud.com/search.<br>
config.vm.box = "generic/fedora37"<br>
<br>
# Stop vagrant from generating a new key for each host to allow ssh between<br>
# machines.<br>
config.ssh.insert_key = false<br>
<br>
# The Vagrant commands are too limited to configure a NAT network,<br>
# so run the VBoxManager commands by hand.<br>
config.vm.provider "virtualbox" do |vbox|<br>
# Add nic2 (eth1 on the guest VM) as the physical router. Never change<br>
# nic1, because that's what the host uses to communicate with the guest VM.<br>
vbox.customize ["modifyvm", :id,<br>
"--nic2", "bridged",<br>
"--bridge-adapter2", "enp8s0"]<br>
end<br>
<br>
# Common provisioning for all guest VMs.<br>
config.vm.provision "shell", inline: <<-SHELL<br>
# Show which command is being run to associate with command output!<br>
set -x<br>
<br>
# Remove spurious hosts from the VM image.<br>
sed -i '/fedora37/d' /etc/hosts<br>
sed -i '/^127[.]0[.]1[.]1/d' /etc/hosts<br>
<br>
# Add NAT network to /etc/hosts.<br>
for host in 10.0.1.{100..102}<br>
do<br>
hostname=hpc2-comp${host:8}<br>
grep -q $host /etc/hosts ||<br>
echo "$host<span class="Apple-tab-span" style="white-space:pre"> </span>
$hostname" >> /etc/hosts<br>
done<br>
unset host hostname<br>
<br>
# Use latest set of packages.<br>
dnf -y update<br>
<br>
# Install MUNGE.<br>
dnf -y install munge<br>
<br>
# Create the SLURM user.<br>
id -u slurm ||<br>
useradd -r -s /sbin/nologin -d /etc/slurm -c "SLURM job scheduler" slurm<br>
SHELL<br>
<br>
config.vm.define "hpc2_comp00" do |hpc2_comp00|<br>
hpc2_comp00.vm.hostname = "hpc2-comp00"<br>
hpc2_comp00.vm.synced_folder ".", "/vagrant", automount: true<br>
hpc2_comp00.vm.provision :shell, inline: <<-SHELL<br>
# Show which command is being run to associate with command output!<br>
set -x<br>
<br>
# Set static IP address for NAT network.<br>
HOST=10.0.1.100<br>
ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection<br>
sed "s|address1=10.0.1.100|address1=${HOST}|" \<br>
/vagrant/eth1.nmconnection > $ETH1<br>
chmod go-r $ETH1<br>
nmcli con load $ETH1<br>
unset HOST<br>
<br>
# Create the MUNGE key.<br>
[[ -f /etc/munge/munge.key ]] || sudo -u munge /usr/sbin/mungekey -v<br>
cp -av /etc/munge/munge.key /vagrant/<br>
<br>
# Enable and start munge.<br>
systemctl enable munge<br>
systemctl start munge<br>
systemctl status munge<br>
<br>
# Setup database on the head node:<br>
dnf -y install mariadb-devel mariadb-server<br>
<br>
# Set recommended memory (5%-50% RAM) and timeout.<br>
CNF=/etc/my.cnf.d/mariadb-server.cnf<br>
<br>
# Note we need to use a double backslash for the newline character below<br>
# because of Vagrant's inline shell script handling.<br>
<br>
MYSQL_RAM=$(awk '/^MemTotal/ {printf "%.0f\\n", $2*0.05}' /proc/meminfo)<br>
grep -q innodb_buffer_pool_size $CNF ||<br>
sed -i '/InnoDB/a innodb_buffer_pool_size='${MYSQL_RAM}K $CNF<br>
grep -q innodb_lock_wait_timeout $CNF ||<br>
sed -i '/innodb_buffer_pool_size/a innodb_lock_wait_timeout=900' $CNF<br>
unset CNF MYSQL_RAM<br>
<br>
# Run the head node services:<br>
systemctl enable mariadb<br>
systemctl start mariadb<br>
systemctl status mariadb<br>
<br>
# Secure the server.<br>
#<br>
# Send interactive commands using printf per<br>
# https://unix.stackexchange.com/a/112348<br>
printf "%s\n" "" n n y y y y | mariadb-secure-installation<br>
<br>
# Install the RPM package builder for SLURM.<br>
dnf -y install rpmdevtools<br>
<br>
# Download SLURM.<br>
wget -nc https://download.schedmd.com/slurm/slurm-23.02.0.tar.bz2<br>
# Install the source package dependencies to determine the dependencies<br>
# to build the binary.<br>
dnf -y install \<br>
dbus-devel \<br>
freeipmi-devel \<br>
hdf5-devel \<br>
http-parser-devel \<br>
json-c-devel \<br>
libcurl-devel \<br>
libjwt-devel \<br>
libyaml-devel \<br>
lua-devel \<br>
lz4-devel \<br>
man2html \<br>
readline-devel \<br>
rrdtool-devel<br>
SLURM_SRPM=~/rpmbuild/SRPMS/slurm-23.02.0-1.fc37.src.rpm<br>
# Create SLURM .rpmmacros file.<br>
cp -av /vagrant/.rpmmacros .<br>
[[ -f ${SLURM_SRPM} ]] ||<br>
rpmbuild -ts slurm-23.02.0.tar.bz2 |& tee build-slurm-source.log<br>
# Installs the source package dependencies to build the binary.<br>
dnf -y builddep ${SLURM_SRPM}<br>
unset SLURM_SRPM<br>
# Build the SLURM binaries.<br>
SLURM_RPM=~/rpmbuild/RPMS/x86_64/slurm-23.02.0-1.fc37.x86_64.rpm<br>
[[ -f ${SLURM_RPM} ]] ||<br>
rpmbuild -ta slurm-23.02.0.tar.bz2 |& tee build-slurm-binary.log<br>
unset SLURM_RPM<br>
<br>
# Copy SLURM packages to the compute nodes.<br>
DIR_RPM=~/rpmbuild/RPMS/x86_64<br>
cp -av ${DIR_RPM}/slurm-23*.rpm ${DIR_RPM}/slurm-slurmd-23*.rpm /vagrant/ &&<br>
touch /vagrant/sentinel-copied-rpms.done<br>
<br>
# Install all SLURM packages on the head node.<br>
find ${DIR_RPM} -type f -not -name '*-slurmd-*' -not -name '*-torque-*' \<br>
-exec dnf -y install {} +<br>
unset DIR_RPM<br>
# Copy the configuration files.<br>
cp -av /vagrant/slurmdbd.conf /etc/slurm/slurmdbd.conf<br>
chown slurm:slurm /etc/slurm/slurmdbd.conf<br>
chmod 600 /etc/slurm/slurmdbd.conf<br>
cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf<br>
chown root:root /etc/slurm/slurm.conf<br>
chmod 644 /etc/slurm/slurm.conf<br>
<br>
# Now create the slurm MySQL user.<br>
SLURM_PASSWORD=$(awk -vFS='=' '/StoragePass/ {print $2}' /etc/slurm/slurmdbd.conf)<br>
DBD_HOST=localhost<br>
# https://docs.fedoraproject.org/en-US/quick-docs/installing-mysql-mariadb/#_start_mysql_service_and_enable_at_loggin<br>
mysql -e "create user if not exists 'slurm'@'$DBD_HOST' identified by '$SLURM_PASSWORD';"<br>
mysql -e "grant all on slurm_acct_db.* to 'slurm'@'$DBD_HOST';"<br>
mysql -e "show grants for 'slurm'@'$DBD_HOST';"<br>
DBD_HOST=hpc2-comp00<br>
mysql -e "create user if not exists 'slurm'@'$DBD_HOST' identified by '$SLURM_PASSWORD';"<br>
mysql -e "grant all on slurm_acct_db.* to 'slurm'@'$DBD_HOST';"<br>
mysql -e "show grants for 'slurm'@'$DBD_HOST';"<br>
unset SLURM_PASSWORD DBD_HOST<br>
<br>
systemctl enable slurmdbd<br>
systemctl start slurmdbd<br>
systemctl status slurmdbd<br>
<br>
systemctl enable slurmctld<br>
mkdir -p /var/spool/slurmctld<br>
chown slurm:slurm /var/spool/slurmctld<br>
# Open ports for slurmctld (6817) and slurmdbd (6819).<br>
firewall-cmd --add-port=6817/tcp<br>
firewall-cmd --add-port=6819/tcp<br>
firewall-cmd --runtime-to-permanent<br>
systemctl start slurmctld<br>
systemctl status slurmctld<br>
<br>
# Clear any previous node DOWN errors.<br>
sinfo -s<br>
sinfo -R<br>
scontrol update nodename=ALL state=RESUME<br>
sinfo -s<br>
sinfo -R<br>
SHELL<br>
end<br>
<br>
config.vm.define "hpc2_comp01" do |hpc2_comp01|<br>
hpc2_comp01.vm.hostname = "hpc2-comp01"<br>
hpc2_comp01.vm.synced_folder ".", "/vagrant", automount: true<br>
hpc2_comp01.vm.provision :shell, inline: <<-SHELL<br>
# Show which command is being run to associate with command output!<br>
set -x<br>
<br>
# Set static IP address for NAT network.<br>
HOST=10.0.1.101<br>
ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection<br>
sed "s|address1=10.0.1.100|address1=${HOST}|" \<br>
/vagrant/eth1.nmconnection > $ETH1<br>
chmod go-r $ETH1<br>
nmcli con load $ETH1<br>
unset HOST<br>
<br>
# Copy the MUNGE key.<br>
KEY=/etc/munge/munge.key<br>
cp -av /vagrant/munge.key /etc/munge/<br>
chown munge:munge $KEY<br>
chmod 600 $KEY<br>
<br>
# Enable and start munge.<br>
systemctl enable munge<br>
systemctl start munge<br>
systemctl status munge<br>
<br>
# SLURM packages to be installed on compute nodes.<br>
DIR_RPM=~/rpmbuild/RPMS/x86_64<br>
mkdir -p ${DIR_RPM}<br>
rsync -avP /vagrant/slurm*.rpm ${DIR_RPM}/<br>
dnf -y install ${DIR_RPM}/slurm-23*.rpm<br>
dnf -y install ${DIR_RPM}/slurm-slurmd-23*.rpm<br>
unset DIR_RPM<br>
# Copy the configuration file.<br>
mkdir -p /etc/slurm<br>
cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf<br>
chown root:root /etc/slurm/slurm.conf<br>
chmod 644 /etc/slurm/slurm.conf<br>
# Only enable slurmd on the worker nodes.<br>
systemctl enable slurmd<br>
# Open port for slurmd (6818).<br>
firewall-cmd --add-port=6818/tcp<br>
firewall-cmd --runtime-to-permanent<br>
# Open port range for srun.<br>
SRUN_PORT_RANGE=$(awk -vFS='=' '/SrunPortRange/ {print $2}' /etc/slurm/slurm.conf)<br>
firewall-cmd --add-port=$SRUN_PORT_RANGE/tcp<br>
firewall-cmd --runtime-to-permanent<br>
systemctl start slurmd<br>
systemctl status slurmd<br>
<br>
# Clear any previous node DOWN errors.<br>
sinfo -s<br>
sinfo -R<br>
scontrol update nodename=ALL state=RESUME<br>
sinfo -s<br>
sinfo -R<br>
SHELL<br>
end<br>
<br>
config.vm.define "hpc2_comp02" do |hpc2_comp02|<br>
hpc2_comp02.vm.hostname = "hpc2-comp02"<br>
hpc2_comp02.vm.synced_folder ".", "/vagrant", automount: true<br>
hpc2_comp02.vm.provision :shell, inline: <<-SHELL<br>
# Show which command is being run to associate with command output!<br>
set -x<br>
<br>
# Set static IP address for NAT network.<br>
HOST=10.0.1.102<br>
ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection<br>
sed "s|address1=10.0.1.100|address1=${HOST}|" \<br>
/vagrant/eth1.nmconnection > $ETH1<br>
chmod go-r $ETH1<br>
nmcli con load $ETH1<br>
unset HOST<br>
<br>
# Copy the MUNGE key.<br>
KEY=/etc/munge/munge.key<br>
cp -av /vagrant/munge.key /etc/munge/<br>
chown munge:munge $KEY<br>
chmod 600 $KEY<br>
<br>
# Enable and start munge.<br>
systemctl enable munge<br>
systemctl start munge<br>
systemctl status munge<br>
<br>
# SLURM packages to be installed on compute nodes.<br>
DIR_RPM=~/rpmbuild/RPMS/x86_64<br>
mkdir -p ${DIR_RPM}<br>
rsync -avP /vagrant/slurm*.rpm ${DIR_RPM}/<br>
dnf -y install ${DIR_RPM}/slurm-23*.rpm<br>
dnf -y install ${DIR_RPM}/slurm-slurmd-23*.rpm<br>
unset DIR_RPM<br>
# Copy the configuration file.<br>
mkdir -p /etc/slurm<br>
cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf<br>
chown root:root /etc/slurm/slurm.conf<br>
chmod 644 /etc/slurm/slurm.conf<br>
# Only enable slurmd on the worker nodes.<br>
systemctl enable slurmd<br>
# Open port for slurmd (6818).<br>
firewall-cmd --add-port=6818/tcp<br>
firewall-cmd --runtime-to-permanent<br>
systemctl start slurmd<br>
systemctl status slurmd<br>
<br>
# Clear any previous node DOWN errors.<br>
sinfo -s<br>
sinfo -R<br>
scontrol update nodename=ALL state=RESUME<br>
sinfo -s<br>
sinfo -R<br>
SHELL<br>
end<br>
end<br>
<br>
-------------------------------------------------------------------------<br>
<br>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</body>
</html>