[slurm-users] Need to free up memory for running more than one job on a node
Markuske, William
wmarkuske at sdsc.edu
Fri Jun 16 19:59:08 UTC 2023
Hello Joe,
You haven't defined any memory allocation or oversubscription in your slurm.conf, so by default Slurm gives a full node's worth of memory to each job. There are several options, but what you probably want is to make both CPU and memory consumable resources with the parameter:
SelectTypeParameters=CR_CPU_Memory
Then you'll want to define the amount of memory (in megabytes) on a node as part of the definition with
RealMemory=
Lastly, you'll need to define a default memory allocation (in megabytes) per job, typically as memory per CPU, with
DefMemPerCPU=
With those changes, a job submitted without an explicit memory request is given #CPUs x DefMemPerCPU megabytes. You can then use the --mem or --mem-per-cpu flags to request more or less memory for a job.
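Taken together, a minimal sketch of the relevant slurm.conf lines might look like this (the DefMemPerCPU and RealMemory figures are placeholders, not values for your hardware; running `slurmd -C` on a compute node will report its actual RealMemory):

```
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
DefMemPerCPU=2048
NodeName=hpc2-comp[01-02] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000 State=UNKNOWN
```

With these settings a default 1-CPU job would be allocated 1 x 2048 MB, leaving the rest of the node's memory free for other jobs. You'll generally need to restart slurmctld (and slurmd on the nodes) after changing node definitions.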
There's also oversubscription, where you can allow jobs to use more memory than is available on the node. Then you don't technically need to define memory for each job, but you run into the issue that a single job could use all of it and cause OOM errors on the nodes.
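If you do want that behavior instead, one sketch (an assumption about the intent here, not a tested recipe) is to keep memory out of the consumable resources so Slurm schedules on CPUs only:

```
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
```

With CR_CPU, memory is not tracked per job, so jobs on a node can collectively exceed its physical memory; the kernel OOM killer, not Slurm, then decides what gets killed.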
Regards,
--
Willy Markuske
HPC Systems Engineer
MS Data Science and Engineering
SDSC - Research Data Services
(619) 519-4435
wmarkuske at sdsc.edu
On Jun 16, 2023, at 12:43, Joe Waliga <jwaliga at umich.edu> wrote:
Hello,
(This is my first time submitting a question to the list)
We have a test-HPC with 1 login node and 2 compute nodes. When we submit 90 jobs onto the test-HPC, we can only run one job per node. We seem to be allocating all memory to the one job, and other jobs can't run until the memory is freed up.
Any ideas on what we need to change in order to free up the memory?
~ ~
We noticed this from the 'slurmctld.log' ...
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_*
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_*
The test-HPC is running on hardware, but we also created a test-HPC using a 3 VM set constructed by Vagrant running on a Virtualbox backend.
I have included some of the 'slurmctld.log' file, the batch submission script, the slurm.conf file (of the hardware based test-HPC), and the 'Vagrantfile' file (in case someone wants to recreate our test-HPC in a set of VMs.)
- Joe
----- (some of) slurmctld.log -----------------------------------------
[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_7(71)
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_7(71)
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_7(71) on 0 nodes
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: test 0 fail: insufficient resources
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: no job_resources info for JobId=71_7(71) rc=-1
[2023-06-15T20:11:32.631] debug2: select/cons_tres: select_p_job_test: evaluating JobId=71_7(71)
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: JobId=71_7(71) node_mode:Normal alloc_mode:Test_Only
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list & exc_cores
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: nodes: min:1 max:1 requested:1 avail:2
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_7(71) on 2 nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/enter
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: SELECT_TYPE: 32 CPUs on hpc2-comp01(state:1), mem 1/1
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Node:hpc2-comp01 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 ThreadsPerCore:2
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[0] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[1] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: SELECT_TYPE: 32 CPUs on hpc2-comp02(state:1), mem 1/1
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Node:hpc2-comp02 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 ThreadsPerCore:2
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[0] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[1] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/elim_nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: _eval_nodes: set:0 consec CPUs:64 nodes:2:hpc2-comp[01-02] begin:0 end:1 required:-1 weight:511
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/choose_nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp01
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/sync_cores
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp01
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: test 0 pass: test_only
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: no job_resources info for JobId=71_7(71) rc=0
[2023-06-15T20:11:32.632] debug3: sched: JobId=71_*. State=PENDING. Reason=Resources. Priority=4294901759. Partition=debug.
[2023-06-15T20:11:56.645] debug: Spawning ping agent for hpc2-comp[01-02]
[2023-06-15T20:11:56.645] debug2: Spawning RPC agent for msg_type REQUEST_PING
[2023-06-15T20:11:56.646] debug3: Tree sending to hpc2-comp01
[2023-06-15T20:11:56.646] debug2: Tree head got back 0 looking for 2
[2023-06-15T20:11:56.646] debug3: Tree sending to hpc2-comp02
[2023-06-15T20:11:56.647] debug2: Tree head got back 1
[2023-06-15T20:11:56.647] debug2: Tree head got back 2
[2023-06-15T20:11:56.651] debug2: node_did_resp hpc2-comp01
[2023-06-15T20:11:56.651] debug2: node_did_resp hpc2-comp02
[2023-06-15T20:11:57.329] debug: sched/backfill: _attempt_backfill: beginning
[2023-06-15T20:11:57.329] debug: sched/backfill: _attempt_backfill: 1 jobs to backfill
[2023-06-15T20:11:57.329] debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=71_*.
[2023-06-15T20:11:57.329] debug2: select/cons_tres: select_p_job_test: evaluating JobId=71_*
[2023-06-15T20:11:57.329] select/cons_tres: common_job_test: JobId=71_* node_mode:Normal alloc_mode:Will_Run
[2023-06-15T20:11:57.329] select/cons_tres: core_array_log: node_list & exc_cores
[2023-06-15T20:11:57.329] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]
[2023-06-15T20:11:57.329] select/cons_tres: common_job_test: nodes: min:1 max:1 requested:1 avail:2
[2023-06-15T20:11:57.330] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:57.330] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_*
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_*
[2023-06-15T20:11:57.330] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_* on 0 nodes
[2023-06-15T20:11:57.330] select/cons_tres: _job_test: SELECT_TYPE: test 0 fail: insufficient resources
[2023-06-15T20:11:57.330] select/cons_tres: _will_run_test: JobId=71_5(76): overlap=1
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_5(76) action:normal
[2023-06-15T20:11:57.330] ====================
[2023-06-15T20:11:57.330] JobId=71_5(76) nhosts:1 ncpus:1 node_req:1 nodes=hpc2-comp01
[2023-06-15T20:11:57.330] Node[0]:
[2023-06-15T20:11:57.330] Mem(MB):1:0 Sockets:2 Cores:8 CPUs:2:0
[2023-06-15T20:11:57.330] Socket[0] Core[0] is allocated
[2023-06-15T20:11:57.330] --------------------
[2023-06-15T20:11:57.330] cpu_array_value[0]:2 reps:1
[2023-06-15T20:11:57.330] ====================
[2023-06-15T20:11:57.330] debug3: select/cons_tres: job_res_rm_job: removed JobId=71_5(76) from part debug row 0
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_5(76) finished
[2023-06-15T20:11:57.330] select/cons_tres: _will_run_test: JobId=71_6(77): overlap=1
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_6(77) action:normal
[2023-06-15T20:11:57.330] ====================
[2023-06-15T20:11:57.330] JobId=71_6(77) nhosts:1 ncpus:1 node_req:1 nodes=hpc2-comp02
[2023-06-15T20:11:57.330] Node[0]:
[2023-06-15T20:11:57.330] Mem(MB):1:0 Sockets:2 Cores:8 CPUs:2:0
[2023-06-15T20:11:57.330] Socket[0] Core[0] is allocated
[2023-06-15T20:11:57.330] --------------------
[2023-06-15T20:11:57.330] cpu_array_value[0]:2 reps:1
[2023-06-15T20:11:57.330] ====================
----- batch script -----------------------------------
#!/bin/bash
echo "Running on: ${SLURM_CLUSTER_NAME}, node list: ${SLURM_JOB_NODELIST}, node names: ${SLURMD_NODENAME} in: `pwd` at `date`"
echo "SLURM_NTASKS: ${SLURM_NTASKS} SLURM_TASKS_PER_NODE: ${SLURM_TASKS_PER_NODE} "
echo "SLURM_ARRAY_TASK_ID: ${SLURM_ARRAY_TASK_ID}"
echo "SLURM_MEM_PER_CPU: ${SLURM_MEM_PER_CPU}"
sleep 3600
echo "END"
Here is the sbatch command to run it:
sbatch -J test -a1-10 -t 999:00:00 -N 1 -n 1 -p debug sbatch.slurm
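For reference, with memory configured as a consumable resource the same submission can also request an explicit amount per CPU (the 1000 MB below is only an illustrative figure):

sbatch -J test -a1-10 -t 999:00:00 -N 1 -n 1 --mem-per-cpu=1000 -p debug sbatch.slurm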
----- slurm.conf -----------------------------------
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=hpc2-comp00
#SlurmctldHost=
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/linuxproc
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
SrunPortRange=60001-60005
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
#TaskPlugin=task/affinity,task/cgroup
TaskPlugin=task/none
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
AccountingStorageHost=hpc2-comp00
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreFlags=job_comment,job_env,job_extra,job_script
#JobCompHost=localhost
#JobCompLoc=slurm_jobcomp_db
##JobCompParams=
#JobCompPass=/var/run/munge/munge.socket.2
#JobCompPort=3306
#JobCompType=jobcomp/mysql
JobCompType=jobcomp/none
#JobCompUser=slurm
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
# Enabled next line - 06-15-2023
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurmctld.log
# Enabled next line - 06-15-2023
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
# Added next line : 06-15-2023
DebugFlags=Cgroup,CPU_Bind,Data,Gres,NodeFeatures,SelectType,Steps,TraceJobs
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=hpc2-comp[01-02] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=hpc2-comp[01-02] Default=YES MaxTime=INFINITE State=UP
----- Vagrantfile file -----------------------------------
# -*- mode: ruby -*-
# vi: set ft=ruby :
# All Vagrant configuration is done below. The "2" in Vagrant.configure
# configures the configuration version (we support older styles for
# backwards compatibility). Please don't change it unless you know what
# you're doing.
Vagrant.configure("2") do |config|
# The most common configuration options are documented and commented below.
# For a complete reference, please see the online documentation at
# https://docs.vagrantup.com.
# Every Vagrant development environment requires a box. You can search for
# boxes at https://vagrantcloud.com/search.
config.vm.box = "generic/fedora37"
# Stop vagrant from generating a new key for each host to allow ssh between
# machines.
config.ssh.insert_key = false
# The Vagrant commands are too limited to configure a NAT network,
# so run the VBoxManager commands by hand.
config.vm.provider "virtualbox" do |vbox|
# Add nic2 (eth1 on the guest VM) as the physical router. Never change
# nic1, because that's what the host uses to communicate with the guest VM.
vbox.customize ["modifyvm", :id,
"--nic2", "bridged",
"--bridge-adapter2", "enp8s0"]
end
# Common provisioning for all guest VMs.
config.vm.provision "shell", inline: <<-SHELL
# Show which command is being run to associate with command output!
set -x
# Remove spurious hosts from the VM image.
sed -i '/fedora37/d' /etc/hosts
sed -i '/^127[.]0[.]1[.]1/d' /etc/hosts
# Add NAT network to /etc/hosts.
for host in 10.0.1.{100..102}
do
hostname=hpc2-comp${host:8}
grep -q $host /etc/hosts ||
echo "$host $hostname" >> /etc/hosts
done
unset host hostname
# Use latest set of packages.
dnf -y update
# Install MUNGE.
dnf -y install munge
# Create the SLURM user.
id -u slurm ||
useradd -r -s /sbin/nologin -d /etc/slurm -c "SLURM job scheduler" slurm
SHELL
config.vm.define "hpc2_comp00" do |hpc2_comp00|
hpc2_comp00.vm.hostname = "hpc2-comp00"
hpc2_comp00.vm.synced_folder ".", "/vagrant", automount: true
hpc2_comp00.vm.provision :shell, inline: <<-SHELL
# Show which command is being run to associate with command output!
set -x
# Set static IP address for NAT network.
HOST=10.0.1.100
ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection
sed "s|address1=10.0.1.100|address1=${HOST}|" \
/vagrant/eth1.nmconnection > $ETH1
chmod go-r $ETH1
nmcli con load $ETH1
unset HOST
# Create the MUNGE key.
[[ -f /etc/munge/munge.key ]] || sudo -u munge /usr/sbin/mungekey -v
cp -av /etc/munge/munge.key /vagrant/
# Enable and start munge.
systemctl enable munge
systemctl start munge
systemctl status munge
# Setup database on the head node:
dnf -y install mariadb-devel mariadb-server
# Set recommended memory (5%-50% RAM) and timeout.
CNF=/etc/my.cnf.d/mariadb-server.cnf
# Note: the newline character below needs a double backslash
# because this is a Vagrant inline shell script.
MYSQL_RAM=$(awk '/^MemTotal/ {printf "%.0f\\n", $2*0.05}' /proc/meminfo)
grep -q innodb_buffer_pool_size $CNF ||
sed -i '/InnoDB/a innodb_buffer_pool_size='${MYSQL_RAM}K $CNF
grep -q innodb_lock_wait_timeout $CNF ||
sed -i '/innodb_buffer_pool_size/a innodb_lock_wait_timeout=900' $CNF
unset CNF MYSQL_RAM
# Run the head node services:
systemctl enable mariadb
systemctl start mariadb
systemctl status mariadb
# Secure the server.
#
# Send interactive commands using printf per
# https://unix.stackexchange.com/a/112348
printf "%s\n" "" n n y y y y | mariadb-secure-installation
# Install the RPM package builder for SLURM.
dnf -y install rpmdevtools
# Download SLURM.
wget -nc https://download.schedmd.com/slurm/slurm-23.02.0.tar.bz2
# Install the source package dependencies to determine the dependencies
# to build the binary.
dnf -y install \
dbus-devel \
freeipmi-devel \
hdf5-devel \
http-parser-devel \
json-c-devel \
libcurl-devel \
libjwt-devel \
libyaml-devel \
lua-devel \
lz4-devel \
man2html \
readline-devel \
rrdtool-devel
SLURM_SRPM=~/rpmbuild/SRPMS/slurm-23.02.0-1.fc37.src.rpm
# Create SLURM .rpmmacros file.
cp -av /vagrant/.rpmmacros .
[[ -f ${SLURM_SRPM} ]] ||
rpmbuild -ts slurm-23.02.0.tar.bz2 |& tee build-slurm-source.log
# Install the source package dependencies needed to build the binary.
dnf -y builddep ${SLURM_SRPM}
unset SLURM_SRPM
# Build the SLURM binaries.
SLURM_RPM=~/rpmbuild/RPMS/x86_64/slurm-23.02.0-1.fc37.x86_64.rpm
[[ -f ${SLURM_RPM} ]] ||
rpmbuild -ta slurm-23.02.0.tar.bz2 |& tee build-slurm-binary.log
unset SLURM_RPM
# Copy SLURM packages to the compute nodes.
DIR_RPM=~/rpmbuild/RPMS/x86_64
cp -av ${DIR_RPM}/slurm-23*.rpm ${DIR_RPM}/slurm-slurmd-23*.rpm /vagrant/ &&
touch /vagrant/sentinel-copied-rpms.done
# Install all SLURM packages on the head node.
find ${DIR_RPM} -type f -not -name '*-slurmd-*' -not -name '*-torque-*' \
-exec dnf -y install {} +
unset DIR_RPM
# Copy the configuration files.
cp -av /vagrant/slurmdbd.conf /etc/slurm/slurmdbd.conf
chown slurm:slurm /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf
chown root:root /etc/slurm/slurm.conf
chmod 644 /etc/slurm/slurm.conf
# Now create the slurm MySQL user.
SLURM_PASSWORD=$(awk -vFS='=' '/StoragePass/ {print $2}' /etc/slurm/slurmdbd.conf)
DBD_HOST=localhost
# https://docs.fedoraproject.org/en-US/quick-docs/installing-mysql-mariadb/#_start_mysql_service_and_enable_at_loggin
mysql -e "create user if not exists 'slurm'@'$DBD_HOST' identified by '$SLURM_PASSWORD';"
mysql -e "grant all on slurm_acct_db.* to 'slurm'@'$DBD_HOST';"
mysql -e "show grants for 'slurm'@'$DBD_HOST';"
DBD_HOST=hpc2-comp00
mysql -e "create user if not exists 'slurm'@'$DBD_HOST' identified by '$SLURM_PASSWORD';"
mysql -e "grant all on slurm_acct_db.* to 'slurm'@'$DBD_HOST';"
mysql -e "show grants for 'slurm'@'$DBD_HOST';"
unset SLURM_PASSWORD DBD_HOST
systemctl enable slurmdbd
systemctl start slurmdbd
systemctl status slurmdbd
systemctl enable slurmctld
mkdir -p /var/spool/slurmctld
chown slurm:slurm /var/spool/slurmctld
# Open ports for slurmctld (6817) and slurmdbd (6819).
firewall-cmd --add-port=6817/tcp
firewall-cmd --add-port=6819/tcp
firewall-cmd --runtime-to-permanent
systemctl start slurmctld
systemctl status slurmctld
# Clear any previous node DOWN errors.
sinfo -s
sinfo -R
scontrol update nodename=ALL state=RESUME
sinfo -s
sinfo -R
SHELL
end
config.vm.define "hpc2_comp01" do |hpc2_comp01|
hpc2_comp01.vm.hostname = "hpc2-comp01"
hpc2_comp01.vm.synced_folder ".", "/vagrant", automount: true
hpc2_comp01.vm.provision :shell, inline: <<-SHELL
# Show which command is being run to associate with command output!
set -x
# Set static IP address for NAT network.
HOST=10.0.1.101
ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection
sed "s|address1=10.0.1.100|address1=${HOST}|" \
/vagrant/eth1.nmconnection > $ETH1
chmod go-r $ETH1
nmcli con load $ETH1
unset HOST
# Copy the MUNGE key.
KEY=/etc/munge/munge.key
cp -av /vagrant/munge.key /etc/munge/
chown munge:munge $KEY
chmod 600 $KEY
# Enable and start munge.
systemctl enable munge
systemctl start munge
systemctl status munge
# SLURM packages to be installed on compute nodes.
DIR_RPM=~/rpmbuild/RPMS/x86_64
mkdir -p ${DIR_RPM}
rsync -avP /vagrant/slurm*.rpm ${DIR_RPM}/
dnf -y install ${DIR_RPM}/slurm-23*.rpm
dnf -y install ${DIR_RPM}/slurm-slurmd-23*.rpm
unset DIR_RPM
# Copy the configuration file.
mkdir -p /etc/slurm
cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf
chown root:root /etc/slurm/slurm.conf
chmod 644 /etc/slurm/slurm.conf
# Only enable slurmd on the worker nodes.
systemctl enable slurmd
# Open port for slurmd (6818).
firewall-cmd --add-port=6818/tcp
firewall-cmd --runtime-to-permanent
# Open port range for srun.
SRUN_PORT_RANGE=$(awk -vFS='=' '/SrunPortRange/ {print $2}' /etc/slurm/slurm.conf)
firewall-cmd --add-port=$SRUN_PORT_RANGE/tcp
firewall-cmd --runtime-to-permanent
systemctl start slurmd
systemctl status slurmd
# Clear any previous node DOWN errors.
sinfo -s
sinfo -R
scontrol update nodename=ALL state=RESUME
sinfo -s
sinfo -R
SHELL
end
config.vm.define "hpc2_comp02" do |hpc2_comp02|
hpc2_comp02.vm.hostname = "hpc2-comp02"
hpc2_comp02.vm.synced_folder ".", "/vagrant", automount: true
hpc2_comp02.vm.provision :shell, inline: <<-SHELL
# Show which command is being run to associate with command output!
set -x
# Set static IP address for NAT network.
HOST=10.0.1.102
ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection
sed "s|address1=10.0.1.100|address1=${HOST}|" \
/vagrant/eth1.nmconnection > $ETH1
chmod go-r $ETH1
nmcli con load $ETH1
unset HOST
# Copy the MUNGE key.
KEY=/etc/munge/munge.key
cp -av /vagrant/munge.key /etc/munge/
chown munge:munge $KEY
chmod 600 $KEY
# Enable and start munge.
systemctl enable munge
systemctl start munge
systemctl status munge
# SLURM packages to be installed on compute nodes.
DIR_RPM=~/rpmbuild/RPMS/x86_64
mkdir -p ${DIR_RPM}
rsync -avP /vagrant/slurm*.rpm ${DIR_RPM}/
dnf -y install ${DIR_RPM}/slurm-23*.rpm
dnf -y install ${DIR_RPM}/slurm-slurmd-23*.rpm
unset DIR_RPM
# Copy the configuration file.
mkdir -p /etc/slurm
cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf
chown root:root /etc/slurm/slurm.conf
chmod 644 /etc/slurm/slurm.conf
# Only enable slurmd on the worker nodes.
systemctl enable slurmd
# Open port for slurmd (6818).
firewall-cmd --add-port=6818/tcp
firewall-cmd --runtime-to-permanent
systemctl start slurmd
systemctl status slurmd
# Clear any previous node DOWN errors.
sinfo -s
sinfo -R
scontrol update nodename=ALL state=RESUME
sinfo -s
sinfo -R
SHELL
end
end