<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
</head>
<body style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">
Hello Joe,
<div><br>
</div>
<div>You haven't defined any memory allocation or oversubscription in your slurm.conf, so by default Slurm allocates a full node's worth of memory to each job. There are several ways to handle this, but what you probably want is to make both CPU and memory consumable resources with the parameter:</div>
<div><br>
</div>
<div>SelectTypeParameters=CR_CPU_Memory</div>
<div><br>
</div>
<div>Then you'll want to define the amount of memory (in megabytes) on each node as part of its node definition with</div>
<div><br>
</div>
<div>RealMemory=</div>
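<div><br></div>
<div>For example, if your compute nodes had roughly 128 GB of RAM (that number is just an illustration; use the RealMemory value that slurmd -C reports on your nodes), the node line could look something like:</div>
<div><br></div>
<div>NodeName=hpc2-comp[01-02] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=128000 State=UNKNOWN</div>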
<div><br>
</div>
<div>Lastly, you'll need to define a default memory allocation (in megabytes) per job, typically as memory per CPU, with</div>
<div><br>
</div>
<div>DefMemPerCPU=</div>
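<div><br></div>
<div>For example (an illustrative value only; with 32 CPUs and the hypothetical 128000 MB above, this would let a default 1-CPU job use about 1/32 of the node's memory):</div>
<div><br></div>
<div>DefMemPerCPU=4000</div>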
<div><br>
</div>
<div>With those changes, a job that doesn't request memory explicitly will by default be allocated #CPUs x DefMemPerCPU megabytes. You can then use either --mem or --mem-per-cpu at submission time to request more or less memory for a job.</div>
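<div><br></div>
<div>For instance, with your submission line you could request memory explicitly like this (the numbers are only placeholders; --mem is per node, --mem-per-cpu is per allocated CPU):</div>
<div><br></div>
<div>sbatch -J test -a1-10 -t 999:00:00 -N 1 -n 1 -p debug --mem=2000 sbatch.slurm</div>
<div>sbatch -J test -a1-10 -t 999:00:00 -N 1 -n 1 -p debug --mem-per-cpu=2000 sbatch.slurm</div>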
<div><br>
</div>
<div>There's also oversubscription, where you allow jobs to collectively use more memory than is available on the node. In that case you don't technically need to define memory for each job, but you run the risk that a single job uses all of it and the nodes hit OOM errors.</div>
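<div><br></div>
<div>As a rough sketch of that approach (I wouldn't recommend it over tracking memory): if you leave memory out of the consumable resources, for example</div>
<div><br></div>
<div>SelectType=select/cons_tres</div>
<div>SelectTypeParameters=CR_CPU</div>
<div><br></div>
<div>then Slurm schedules on CPUs only and doesn't enforce per-job memory limits, which effectively allows memory to be oversubscribed, with the OOM risk mentioned above.</div>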
<div><br>
<div>
<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">
Regards,<br>
<br>
--<br>
Willy Markuske<br>
<br>
HPC Systems Engineer</div>
<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">
MS Data Science and Engineering<br>
SDSC - Research Data Services<br>
(619) 519-4435<br>
<div>wmarkuske@sdsc.edu</div>
</div>
</div>
<div><br>
<blockquote type="cite">
<div>On Jun 16, 2023, at 12:43, Joe Waliga <jwaliga@umich.edu> wrote:</div>
<br class="Apple-interchange-newline">
<div>
<div>Hello,<br>
<br>
(This is my first time submitting a question to the list)<br>
<br>
We have a test-HPC with 1 login node and 2 compute nodes. When we submit 90 jobs to the test-HPC, only one job runs per node. We seem to be allocating all of a node's memory to that one job, and the other jobs can't run until the memory is freed up.<br>
<br>
Any ideas on what we need to change in order to free up the memory?<br>
<br>
~ ~<br>
<br>
We noticed this from the 'slurmctld.log' ...<br>
<br>
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_*<br>
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_*<br>
<br>
The test-HPC is running on hardware, but we also created a test-HPC using a set of 3 VMs built with Vagrant on a VirtualBox backend.<br>
<br>
I have included some of the 'slurmctld.log' file, the batch submission script, the slurm.conf file (of the hardware-based test-HPC), and the 'Vagrantfile' (in case someone wants to recreate our test-HPC in a set of VMs).<br>
<br>
- Joe<br>
<br>
<br>
----- (some of) slurmctld.log -----------------------------------------<br>
<br>
[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)<br>
[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)<br>
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_7(71)<br>
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_7(71)<br>
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_7(71) on 0 nodes<br>
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: test 0 fail: insufficient resources<br>
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: no job_resources info for JobId=71_7(71) rc=-1<br>
[2023-06-15T20:11:32.631] debug2: select/cons_tres: select_p_job_test: evaluating JobId=71_7(71)<br>
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: JobId=71_7(71) node_mode:Normal alloc_mode:Test_Only<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list & exc_cores<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]<br>
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: nodes: min:1 max:1 requested:1 avail:2<br>
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)<br>
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)<br>
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_7(71) on 2 nodes<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/enter<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15<br>
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: SELECT_TYPE: 32 CPUs on hpc2-comp01(state:1), mem 1/1<br>
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Node:hpc2-comp01 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 ThreadsPerCore:2<br>
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[0] Cores:8<br>
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[1] Cores:8<br>
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: SELECT_TYPE: 32 CPUs on hpc2-comp02(state:1), mem 1/1<br>
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Node:hpc2-comp02 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 ThreadsPerCore:2<br>
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[0] Cores:8<br>
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[1] Cores:8<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/elim_nodes<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15<br>
[2023-06-15T20:11:32.632] select/cons_tres: _eval_nodes: set:0 consec CPUs:64 nodes:2:hpc2-comp[01-02] begin:0 end:1 required:-1 weight:511<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/choose_nodes<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp01<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/sync_cores<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp01<br>
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15<br>
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: test 0 pass: test_only<br>
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: no job_resources info for JobId=71_7(71) rc=0<br>
[2023-06-15T20:11:32.632] debug3: sched: JobId=71_*. State=PENDING. Reason=Resources. Priority=4294901759. Partition=debug.<br>
[2023-06-15T20:11:56.645] debug: Spawning ping agent for hpc2-comp[01-02]<br>
[2023-06-15T20:11:56.645] debug2: Spawning RPC agent for msg_type REQUEST_PING<br>
[2023-06-15T20:11:56.646] debug3: Tree sending to hpc2-comp01<br>
[2023-06-15T20:11:56.646] debug2: Tree head got back 0 looking for 2<br>
[2023-06-15T20:11:56.646] debug3: Tree sending to hpc2-comp02<br>
[2023-06-15T20:11:56.647] debug2: Tree head got back 1<br>
[2023-06-15T20:11:56.647] debug2: Tree head got back 2<br>
[2023-06-15T20:11:56.651] debug2: node_did_resp hpc2-comp01<br>
[2023-06-15T20:11:56.651] debug2: node_did_resp hpc2-comp02<br>
[2023-06-15T20:11:57.329] debug: sched/backfill: _attempt_backfill: beginning<br>
[2023-06-15T20:11:57.329] debug: sched/backfill: _attempt_backfill: 1 jobs to backfill<br>
[2023-06-15T20:11:57.329] debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=71_*.<br>
[2023-06-15T20:11:57.329] debug2: select/cons_tres: select_p_job_test: evaluating JobId=71_*<br>
[2023-06-15T20:11:57.329] select/cons_tres: common_job_test: JobId=71_* node_mode:Normal alloc_mode:Will_Run<br>
[2023-06-15T20:11:57.329] select/cons_tres: core_array_log: node_list & exc_cores<br>
[2023-06-15T20:11:57.329] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]<br>
[2023-06-15T20:11:57.329] select/cons_tres: common_job_test: nodes: min:1 max:1 requested:1 avail:2<br>
[2023-06-15T20:11:57.330] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)<br>
[2023-06-15T20:11:57.330] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)<br>
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_*<br>
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_*<br>
[2023-06-15T20:11:57.330] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_* on 0 nodes<br>
[2023-06-15T20:11:57.330] select/cons_tres: _job_test: SELECT_TYPE: test 0 fail: insufficient resources<br>
[2023-06-15T20:11:57.330] select/cons_tres: _will_run_test: JobId=71_5(76): overlap=1<br>
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_5(76) action:normal<br>
[2023-06-15T20:11:57.330] ====================<br>
[2023-06-15T20:11:57.330] JobId=71_5(76) nhosts:1 ncpus:1 node_req:1 nodes=hpc2-comp01<br>
[2023-06-15T20:11:57.330] Node[0]:<br>
[2023-06-15T20:11:57.330] Mem(MB):1:0 Sockets:2 Cores:8 CPUs:2:0<br>
[2023-06-15T20:11:57.330] Socket[0] Core[0] is allocated<br>
[2023-06-15T20:11:57.330] --------------------<br>
[2023-06-15T20:11:57.330] cpu_array_value[0]:2 reps:1<br>
[2023-06-15T20:11:57.330] ====================<br>
[2023-06-15T20:11:57.330] debug3: select/cons_tres: job_res_rm_job: removed JobId=71_5(76) from part debug row 0<br>
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_5(76) finished<br>
[2023-06-15T20:11:57.330] select/cons_tres: _will_run_test: JobId=71_6(77): overlap=1<br>
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_6(77) action:normal<br>
[2023-06-15T20:11:57.330] ====================<br>
[2023-06-15T20:11:57.330] JobId=71_6(77) nhosts:1 ncpus:1 node_req:1 nodes=hpc2-comp02<br>
[2023-06-15T20:11:57.330] Node[0]:<br>
[2023-06-15T20:11:57.330] Mem(MB):1:0 Sockets:2 Cores:8 CPUs:2:0<br>
[2023-06-15T20:11:57.330] Socket[0] Core[0] is allocated<br>
[2023-06-15T20:11:57.330] --------------------<br>
[2023-06-15T20:11:57.330] cpu_array_value[0]:2 reps:1<br>
[2023-06-15T20:11:57.330] ====================<br>
<br>
----- batch script -----------------------------------<br>
<br>
#!/bin/bash<br>
<br>
echo "Running on: ${SLURM_CLUSTER_NAME}, node list: ${SLURM_JOB_NODELIST}, node names: ${SLURMD_NODENAME} in: `pwd` at `date`"<br>
echo "SLURM_NTASKS: ${SLURM_NTASKS} SLURM_TASKS_PER_NODE: ${SLURM_TASKS_PER_NODE} "<br>
echo "SLURM_ARRAY_TASK_ID: ${SLURM_ARRAY_TASK_ID}"<br>
echo "SLURM_MEM_PER_CPU: ${SLURM_MEM_PER_CPU}"<br>
<br>
sleep 3600<br>
<br>
echo "END"<br>
<br>
Here is the sbatch command to run it:<br>
<br>
sbatch -J test -a1-10 -t 999:00:00 -N 1 -n 1 -p debug sbatch.slurm<br>
<br>
----- slurm.conf -----------------------------------<br>
<br>
# slurm.conf file generated by configurator.html.<br>
# Put this file on all nodes of your cluster.<br>
# See the slurm.conf man page for more information.<br>
#<br>
ClusterName=cluster<br>
SlurmctldHost=hpc2-comp00<br>
#SlurmctldHost=<br>
#DisableRootJobs=NO<br>
#EnforcePartLimits=NO<br>
#Epilog=<br>
#EpilogSlurmctld=<br>
#FirstJobId=1<br>
#MaxJobId=67043328<br>
#GresTypes=<br>
#GroupUpdateForce=0<br>
#GroupUpdateTime=600<br>
#JobFileAppend=0<br>
#JobRequeue=1<br>
#JobSubmitPlugins=lua<br>
#KillOnBadExit=0<br>
#LaunchType=launch/slurm<br>
#Licenses=foo*4,bar<br>
#MailProg=/bin/mail<br>
#MaxJobCount=10000<br>
#MaxStepCount=40000<br>
#MaxTasksPerNode=512<br>
MpiDefault=none<br>
#MpiParams=ports=#-#<br>
#PluginDir=<br>
#PlugStackConfig=<br>
#PrivateData=jobs<br>
ProctrackType=proctrack/linuxproc<br>
#Prolog=<br>
#PrologFlags=<br>
#PrologSlurmctld=<br>
#PropagatePrioProcess=0<br>
#PropagateResourceLimits=<br>
#PropagateResourceLimitsExcept=<br>
#RebootProgram=<br>
ReturnToService=1<br>
SlurmctldPidFile=/var/run/slurmctld.pid<br>
SlurmctldPort=6817<br>
SlurmdPidFile=/var/run/slurmd.pid<br>
SlurmdPort=6818<br>
SlurmdSpoolDir=/var/spool/slurmd<br>
SlurmUser=slurm<br>
#SlurmdUser=root<br>
SrunPortRange=60001-60005<br>
#SrunEpilog=<br>
#SrunProlog=<br>
StateSaveLocation=/var/spool/slurmctld<br>
SwitchType=switch/none<br>
#TaskEpilog=<br>
#TaskPlugin=task/affinity,task/cgroup<br>
TaskPlugin=task/none<br>
#TaskProlog=<br>
#TopologyPlugin=topology/tree<br>
#TmpFS=/tmp<br>
#TrackWCKey=no<br>
#TreeWidth=<br>
#UnkillableStepProgram=<br>
#UsePAM=0<br>
#<br>
#<br>
# TIMERS<br>
#BatchStartTimeout=10<br>
#CompleteWait=0<br>
#EpilogMsgTime=2000<br>
#GetEnvTimeout=2<br>
#HealthCheckInterval=0<br>
#HealthCheckProgram=<br>
InactiveLimit=0<br>
KillWait=30<br>
#MessageTimeout=10<br>
#ResvOverRun=0<br>
MinJobAge=300<br>
#OverTimeLimit=0<br>
SlurmctldTimeout=120<br>
SlurmdTimeout=300<br>
#UnkillableStepTimeout=60<br>
#VSizeFactor=0<br>
Waittime=0<br>
#<br>
#<br>
# SCHEDULING<br>
#DefMemPerCPU=0<br>
#MaxMemPerCPU=0<br>
#SchedulerTimeSlice=30<br>
SchedulerType=sched/backfill<br>
SelectType=select/cons_tres<br>
#<br>
#<br>
# JOB PRIORITY<br>
#PriorityFlags=<br>
#PriorityType=priority/basic<br>
#PriorityDecayHalfLife=<br>
#PriorityCalcPeriod=<br>
#PriorityFavorSmall=<br>
#PriorityMaxAge=<br>
#PriorityUsageResetPeriod=<br>
#PriorityWeightAge=<br>
#PriorityWeightFairshare=<br>
#PriorityWeightJobSize=<br>
#PriorityWeightPartition=<br>
#PriorityWeightQOS=<br>
#<br>
#<br>
# LOGGING AND ACCOUNTING<br>
#AccountingStorageEnforce=0<br>
AccountingStorageHost=hpc2-comp00<br>
AccountingStoragePass=/var/run/munge/munge.socket.2<br>
AccountingStoragePort=6819<br>
AccountingStorageType=accounting_storage/slurmdbd<br>
AccountingStorageUser=slurm<br>
AccountingStoreFlags=job_comment,job_env,job_extra,job_script<br>
#JobCompHost=localhost<br>
#JobCompLoc=slurm_jobcomp_db<br>
##JobCompParams=<br>
#JobCompPass=/var/run/munge/munge.socket.2<br>
#JobCompPort=3306<br>
#JobCompType=jobcomp/mysql<br>
JobCompType=jobcomp/none<br>
#JobCompUser=slurm<br>
#JobContainerType=job_container/none<br>
JobAcctGatherFrequency=30<br>
JobAcctGatherType=jobacct_gather/linux<br>
# Enabled next line - 06-15-2023<br>
SlurmctldDebug=debug5<br>
SlurmctldLogFile=/var/log/slurmctld.log<br>
# Enabled next line - 06-15-2023<br>
SlurmdDebug=debug5<br>
SlurmdLogFile=/var/log/slurmd.log<br>
#SlurmSchedLogFile=<br>
<br>
# Added next line : 06-15-2023<br>
DebugFlags=Cgroup,CPU_Bind,Data,Gres,NodeFeatures,SelectType,Steps,TraceJobs<br>
<br>
#<br>
# POWER SAVE SUPPORT FOR IDLE NODES (optional)<br>
#SuspendProgram=<br>
#ResumeProgram=<br>
#SuspendTimeout=<br>
#ResumeTimeout=<br>
#ResumeRate=<br>
#SuspendExcNodes=<br>
#SuspendExcParts=<br>
#SuspendRate=<br>
#SuspendTime=<br>
#<br>
#<br>
# COMPUTE NODES<br>
NodeName=hpc2-comp[01-02] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN<br>
PartitionName=debug Nodes=hpc2-comp[01-02] Default=YES MaxTime=INFINITE State=UP<br>
<br>
----- Vagrantfile file -----------------------------------<br>
<br>
# -*- mode: ruby -*-<br>
# vi: set ft=ruby :<br>
<br>
# All Vagrant configuration is done below. The "2" in Vagrant.configure<br>
# configures the configuration version (we support older styles for<br>
# backwards compatibility). Please don't change it unless you know what<br>
# you're doing.<br>
Vagrant.configure("2") do |config|<br>
# The most common configuration options are documented and commented below.<br>
# For a complete reference, please see the online documentation at<br>
# https://docs.vagrantup.com.<br>
<br>
# Every Vagrant development environment requires a box. You can search for<br>
# boxes at https://vagrantcloud.com/search.<br>
config.vm.box = "generic/fedora37"<br>
<br>
# Stop vagrant from generating a new key for each host to allow ssh between<br>
# machines.<br>
config.ssh.insert_key = false<br>
<br>
# The Vagrant commands are too limited to configure a NAT network,<br>
# so run the VBoxManager commands by hand.<br>
config.vm.provider "virtualbox" do |vbox|<br>
# Add nic2 (eth1 on the guest VM) as the physical router. Never change<br>
# nic1, because that's what the host uses to communicate with the guest VM.<br>
vbox.customize ["modifyvm", :id,<br>
"--nic2", "bridged",<br>
"--bridge-adapter2", "enp8s0"]<br>
end<br>
<br>
# Common provisioning for all guest VMs.<br>
config.vm.provision "shell", inline: <<-SHELL<br>
# Show which command is being run to associate with command output!<br>
set -x<br>
<br>
# Remove spurious hosts from the VM image.<br>
sed -i '/fedora37/d' /etc/hosts<br>
sed -i '/^127[.]0[.]1[.]1/d' /etc/hosts<br>
<br>
# Add NAT network to /etc/hosts.<br>
for host in 10.0.1.{100..102}<br>
do<br>
hostname=hpc2-comp${host:8}<br>
grep -q $host /etc/hosts ||<br>
echo "$host<span class="Apple-tab-span" style="white-space:pre"> </span>
$hostname" >> /etc/hosts<br>
done<br>
unset host hostname<br>
<br>
# Use latest set of packages.<br>
dnf -y update<br>
<br>
# Install MUNGE.<br>
dnf -y install munge<br>
<br>
# Create the SLURM user.<br>
id -u slurm ||<br>
useradd -r -s /sbin/nologin -d /etc/slurm -c "SLURM job scheduler" slurm<br>
SHELL<br>
<br>
config.vm.define "hpc2_comp00" do |hpc2_comp00|<br>
hpc2_comp00.vm.hostname = "hpc2-comp00"<br>
hpc2_comp00.vm.synced_folder ".", "/vagrant", automount: true<br>
hpc2_comp00.vm.provision :shell, inline: <<-SHELL<br>
# Show which command is being run to associate with command output!<br>
set -x<br>
<br>
# Set static IP address for NAT network.<br>
HOST=10.0.1.100<br>
ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection<br>
sed "s|address1=10.0.1.100|address1=${HOST}|" \<br>
/vagrant/eth1.nmconnection > $ETH1<br>
chmod go-r $ETH1<br>
nmcli con load $ETH1<br>
unset HOST<br>
<br>
# Create the MUNGE key.<br>
[[ -f /etc/munge/munge.key ]] || sudo -u munge /usr/sbin/mungekey -v<br>
cp -av /etc/munge/munge.key /vagrant/<br>
<br>
# Enable and start munge.<br>
systemctl enable munge<br>
systemctl start munge<br>
systemctl status munge<br>
<br>
# Setup database on the head node:<br>
dnf -y install mariadb-devel mariadb-server<br>
<br>
# Set recommended memory (5%-50% RAM) and timeout.<br>
CNF=/etc/my.cnf.d/mariadb-server.cnf<br>
<br>
# Note we need to use a double backslash for the newline character below<br>
# because of Vagrant's inline shell script handling.<br>
<br>
MYSQL_RAM=$(awk '/^MemTotal/ {printf "%.0f\\n", $2*0.05}' /proc/meminfo)<br>
grep -q innodb_buffer_pool_size $CNF ||<br>
sed -i '/InnoDB/a innodb_buffer_pool_size='${MYSQL_RAM}K $CNF<br>
grep -q innodb_lock_wait_timeout $CNF ||<br>
sed -i '/innodb_buffer_pool_size/a innodb_lock_wait_timeout=900' $CNF<br>
unset CNF MYSQL_RAM<br>
<br>
# Run the head node services:<br>
systemctl enable mariadb<br>
systemctl start mariadb<br>
systemctl status mariadb<br>
<br>
# Secure the server.<br>
#<br>
# Send interactive commands using printf per<br>
# https://unix.stackexchange.com/a/112348<br>
printf "%s\n" "" n n y y y y | mariadb-secure-installation<br>
<br>
# Install the RPM package builder for SLURM.<br>
dnf -y install rpmdevtools<br>
<br>
# Download SLURM.<br>
wget -nc https://download.schedmd.com/slurm/slurm-23.02.0.tar.bz2<br>
# Install the source package dependencies to determine the dependencies<br>
# to build the binary.<br>
dnf -y install \<br>
dbus-devel \<br>
freeipmi-devel \<br>
hdf5-devel \<br>
http-parser-devel \<br>
json-c-devel \<br>
libcurl-devel \<br>
libjwt-devel \<br>
libyaml-devel \<br>
lua-devel \<br>
lz4-devel \<br>
man2html \<br>
readline-devel \<br>
rrdtool-devel<br>
SLURM_SRPM=~/rpmbuild/SRPMS/slurm-23.02.0-1.fc37.src.rpm<br>
# Create SLURM .rpmmacros file.<br>
cp -av /vagrant/.rpmmacros .<br>
[[ -f ${SLURM_SRPM} ]] ||<br>
rpmbuild -ts slurm-23.02.0.tar.bz2 |& tee build-slurm-source.log<br>
# Installs the source package dependencies to build the binary.<br>
dnf -y builddep ${SLURM_SRPM}<br>
unset SLURM_SRPM<br>
# Build the SLURM binaries.<br>
SLURM_RPM=~/rpmbuild/RPMS/x86_64/slurm-23.02.0-1.fc37.x86_64.rpm<br>
[[ -f ${SLURM_RPM} ]] ||<br>
rpmbuild -ta slurm-23.02.0.tar.bz2 |& tee build-slurm-binary.log<br>
unset SLURM_RPM<br>
<br>
# Copy SLURM packages to the compute nodes.<br>
DIR_RPM=~/rpmbuild/RPMS/x86_64<br>
cp -av ${DIR_RPM}/slurm-23*.rpm ${DIR_RPM}/slurm-slurmd-23*.rpm /vagrant/ &&<br>
touch /vagrant/sentinel-copied-rpms.done<br>
<br>
# Install all SLURM packages on the head node.<br>
find ${DIR_RPM} -type f -not -name '*-slurmd-*' -not -name '*-torque-*' \<br>
-exec dnf -y install {} +<br>
unset DIR_RPM<br>
# Copy the configuration files.<br>
cp -av /vagrant/slurmdbd.conf /etc/slurm/slurmdbd.conf<br>
chown slurm:slurm /etc/slurm/slurmdbd.conf<br>
chmod 600 /etc/slurm/slurmdbd.conf<br>
cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf<br>
chown root:root /etc/slurm/slurm.conf<br>
chmod 644 /etc/slurm/slurm.conf<br>
<br>
# Now create the slurm MySQL user.<br>
SLURM_PASSWORD=$(awk -vFS='=' '/StoragePass/ {print $2}' /etc/slurm/slurmdbd.conf)<br>
DBD_HOST=localhost<br>
# https://docs.fedoraproject.org/en-US/quick-docs/installing-mysql-mariadb/#_start_mysql_service_and_enable_at_loggin<br>
mysql -e "create user if not exists 'slurm'@'$DBD_HOST' identified by '$SLURM_PASSWORD';"<br>
mysql -e "grant all on slurm_acct_db.* to 'slurm'@'$DBD_HOST';"<br>
mysql -e "show grants for 'slurm'@'$DBD_HOST';"<br>
DBD_HOST=hpc2-comp00<br>
mysql -e "create user if not exists 'slurm'@'$DBD_HOST' identified by '$SLURM_PASSWORD';"<br>
mysql -e "grant all on slurm_acct_db.* to 'slurm'@'$DBD_HOST';"<br>
mysql -e "show grants for 'slurm'@'$DBD_HOST';"<br>
unset SLURM_PASSWORD DBD_HOST<br>
<br>
systemctl enable slurmdbd<br>
systemctl start slurmdbd<br>
systemctl status slurmdbd<br>
<br>
systemctl enable slurmctld<br>
mkdir -p /var/spool/slurmctld<br>
chown slurm:slurm /var/spool/slurmctld<br>
# Open ports for slurmctld (6817) and slurmdbd (6819).<br>
firewall-cmd --add-port=6817/tcp<br>
firewall-cmd --add-port=6819/tcp<br>
firewall-cmd --runtime-to-permanent<br>
systemctl start slurmctld<br>
systemctl status slurmctld<br>
<br>
# Clear any previous node DOWN errors.<br>
sinfo -s<br>
sinfo -R<br>
scontrol update nodename=ALL state=RESUME<br>
sinfo -s<br>
sinfo -R<br>
SHELL<br>
end<br>
<br>
config.vm.define "hpc2_comp01" do |hpc2_comp01|<br>
hpc2_comp01.vm.hostname = "hpc2-comp01"<br>
hpc2_comp01.vm.synced_folder ".", "/vagrant", automount: true<br>
hpc2_comp01.vm.provision :shell, inline: <<-SHELL<br>
# Show which command is being run to associate with command output!<br>
set -x<br>
<br>
# Set static IP address for NAT network.<br>
HOST=10.0.1.101<br>
ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection<br>
sed "s|address1=10.0.1.100|address1=${HOST}|" \<br>
/vagrant/eth1.nmconnection > $ETH1<br>
chmod go-r $ETH1<br>
nmcli con load $ETH1<br>
unset HOST<br>
<br>
# Copy the MUNGE key.<br>
KEY=/etc/munge/munge.key<br>
cp -av /vagrant/munge.key /etc/munge/<br>
chown munge:munge $KEY<br>
chmod 600 $KEY<br>
<br>
# Enable and start munge.<br>
systemctl enable munge<br>
systemctl start munge<br>
systemctl status munge<br>
<br>
# SLURM packages to be installed on compute nodes.<br>
DIR_RPM=~/rpmbuild/RPMS/x86_64<br>
mkdir -p ${DIR_RPM}<br>
rsync -avP /vagrant/slurm*.rpm ${DIR_RPM}/<br>
dnf -y install ${DIR_RPM}/slurm-23*.rpm<br>
dnf -y install ${DIR_RPM}/slurm-slurmd-23*.rpm<br>
unset DIR_RPM<br>
# Copy the configuration file.<br>
mkdir -p /etc/slurm<br>
cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf<br>
chown root:root /etc/slurm/slurm.conf<br>
chmod 644 /etc/slurm/slurm.conf<br>
# Only enable slurmd on the worker nodes.<br>
systemctl enable slurmd<br>
# Open port for slurmd (6818).<br>
firewall-cmd --add-port=6818/tcp<br>
firewall-cmd --runtime-to-permanent<br>
# Open port range for srun.<br>
SRUN_PORT_RANGE=$(awk -vFS='=' '/SrunPortRange/ {print $2}' /etc/slurm/slurm.conf)<br>
firewall-cmd --add-port=$SRUN_PORT_RANGE/tcp<br>
firewall-cmd --runtime-to-permanent<br>
systemctl start slurmd<br>
systemctl status slurmd<br>
<br>
# Clear any previous node DOWN errors.<br>
sinfo -s<br>
sinfo -R<br>
scontrol update nodename=ALL state=RESUME<br>
sinfo -s<br>
sinfo -R<br>
SHELL<br>
end<br>
<br>
config.vm.define "hpc2_comp02" do |hpc2_comp02|<br>
hpc2_comp02.vm.hostname = "hpc2-comp02"<br>
hpc2_comp02.vm.synced_folder ".", "/vagrant", automount: true<br>
hpc2_comp02.vm.provision :shell, inline: <<-SHELL<br>
# Show which command is being run to associate with command output!<br>
set -x<br>
<br>
# Set static IP address for NAT network.<br>
HOST=10.0.1.102<br>
ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection<br>
sed "s|address1=10.0.1.100|address1=${HOST}|" \<br>
/vagrant/eth1.nmconnection > $ETH1<br>
chmod go-r $ETH1<br>
nmcli con load $ETH1<br>
unset HOST<br>
<br>
# Copy the MUNGE key.<br>
KEY=/etc/munge/munge.key<br>
cp -av /vagrant/munge.key /etc/munge/<br>
chown munge:munge $KEY<br>
chmod 600 $KEY<br>
<br>
# Enable and start munge.<br>
systemctl enable munge<br>
systemctl start munge<br>
systemctl status munge<br>
<br>
# SLURM packages to be installed on compute nodes.<br>
DIR_RPM=~/rpmbuild/RPMS/x86_64<br>
mkdir -p ${DIR_RPM}<br>
rsync -avP /vagrant/slurm*.rpm ${DIR_RPM}/<br>
dnf -y install ${DIR_RPM}/slurm-23*.rpm<br>
dnf -y install ${DIR_RPM}/slurm-slurmd-23*.rpm<br>
unset DIR_RPM<br>
# Copy the configuration file.<br>
mkdir -p /etc/slurm<br>
cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf<br>
chown root:root /etc/slurm/slurm.conf<br>
chmod 644 /etc/slurm/slurm.conf<br>
# Only enable slurmd on the worker nodes.<br>
systemctl enable slurmd<br>
# Open port for slurmd (6818).<br>
firewall-cmd --add-port=6818/tcp<br>
firewall-cmd --runtime-to-permanent<br>
systemctl start slurmd<br>
systemctl status slurmd<br>
<br>
# Clear any previous node DOWN errors.<br>
sinfo -s<br>
sinfo -R<br>
scontrol update nodename=ALL state=RESUME<br>
sinfo -s<br>
sinfo -R<br>
SHELL<br>
end<br>
end<br>
<br>
-------------------------------------------------------------------------<br>
<br>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</body>
</html>