[slurm-users] Need to free up memory for running more than one job on a node

Joe Waliga jwaliga at umich.edu
Fri Jun 16 19:43:58 UTC 2023


Hello,

(This is my first time submitting a question to the list)

We have a test-HPC with 1 login node and 2 compute nodes. When we 
submit 90 jobs onto the test-HPC, only one job runs per node. We 
seem to be allocating all of a node's memory to that one job, and the 
other jobs cannot run until the memory is freed up.

Any ideas on what we need to change in order to free up the memory?
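
In case it helps with the diagnosis, here is a sketch of how we might check what memory Slurm believes each node has (standard sinfo/scontrol options; the node name is one of ours):

```shell
# Show configured memory (%m, in MB) and free memory (%e) for every node.
sinfo -N -o "%N %m %e"

# Show RealMemory and AllocMem as slurmctld sees them for one compute node.
scontrol show node hpc2-comp01 | grep -E 'RealMemory|AllocMem'
```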

~ ~

We noticed this from the 'slurmctld.log' ...

[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: 
Not considering node hpc2-comp01, allocated memory = 1 and all memory 
requested for JobId=71_*
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: 
Not considering node hpc2-comp02, allocated memory = 1 and all memory 
requested for JobId=71_*

The test-HPC runs on hardware, but we also recreated it as a set of 3 
VMs built with Vagrant on a VirtualBox backend.

I have included some of the 'slurmctld.log' file, the batch submission 
script, the slurm.conf file (of the hardware based test-HPC), and the 
'Vagrantfile' file (in case someone wants to recreate our test-HPC in a 
set of VMs.)

- Joe


----- (some of) slurmctld.log -----------------------------------------

[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: 
Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 
ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 
AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: 
Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 
ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 
AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: 
Not considering node hpc2-comp01, allocated memory = 1 and all memory 
requested for JobId=71_7(71)
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: 
Not considering node hpc2-comp02, allocated memory = 1 and all memory 
requested for JobId=71_7(71)
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: 
evaluating JobId=71_7(71) on 0 nodes
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: test 
0 fail: insufficient resources
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: no 
job_resources info for JobId=71_7(71) rc=-1
[2023-06-15T20:11:32.631] debug2: select/cons_tres: select_p_job_test: 
evaluating JobId=71_7(71)
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: 
JobId=71_7(71) node_mode:Normal alloc_mode:Test_Only
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list & 
exc_cores
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: nodes: 
min:1 max:1 requested:1 avail:2
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: 
Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 
ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 
AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: 
Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 
ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 
AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: 
evaluating JobId=71_7(71) on 2 nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
_select_nodes/enter
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: 
SELECT_TYPE: 32 CPUs on hpc2-comp01(state:1), mem 1/1
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: 
Node:hpc2-comp01 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 
ThreadsPerCore:2
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: 
   Socket[0] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: 
   Socket[1] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: 
SELECT_TYPE: 32 CPUs on hpc2-comp02(state:1), mem 1/1
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: 
Node:hpc2-comp02 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 
ThreadsPerCore:2
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: 
   Socket[0] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: 
   Socket[1] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
_select_nodes/elim_nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: _eval_nodes: set:0 consec 
CPUs:64 nodes:2:hpc2-comp[01-02] begin:0 end:1 required:-1 weight:511
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
_select_nodes/choose_nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
node_list:hpc2-comp01
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
_select_nodes/sync_cores
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
node_list:hpc2-comp01
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
core_list:node[0]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: test 
0 pass: test_only
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: no 
job_resources info for JobId=71_7(71) rc=0
[2023-06-15T20:11:32.632] debug3: sched: JobId=71_*. State=PENDING. 
Reason=Resources. Priority=4294901759. Partition=debug.
[2023-06-15T20:11:56.645] debug:  Spawning ping agent for hpc2-comp[01-02]
[2023-06-15T20:11:56.645] debug2: Spawning RPC agent for msg_type 
REQUEST_PING
[2023-06-15T20:11:56.646] debug3: Tree sending to hpc2-comp01
[2023-06-15T20:11:56.646] debug2: Tree head got back 0 looking for 2
[2023-06-15T20:11:56.646] debug3: Tree sending to hpc2-comp02
[2023-06-15T20:11:56.647] debug2: Tree head got back 1
[2023-06-15T20:11:56.647] debug2: Tree head got back 2
[2023-06-15T20:11:56.651] debug2: node_did_resp hpc2-comp01
[2023-06-15T20:11:56.651] debug2: node_did_resp hpc2-comp02
[2023-06-15T20:11:57.329] debug:  sched/backfill: _attempt_backfill: 
beginning
[2023-06-15T20:11:57.329] debug:  sched/backfill: _attempt_backfill: 1 
jobs to backfill
[2023-06-15T20:11:57.329] debug2: sched/backfill: _attempt_backfill: 
entering _try_sched for JobId=71_*.
[2023-06-15T20:11:57.329] debug2: select/cons_tres: select_p_job_test: 
evaluating JobId=71_*
[2023-06-15T20:11:57.329] select/cons_tres: common_job_test: JobId=71_* 
node_mode:Normal alloc_mode:Will_Run
[2023-06-15T20:11:57.329] select/cons_tres: core_array_log: node_list & 
exc_cores
[2023-06-15T20:11:57.329] select/cons_tres: core_array_log: 
node_list:hpc2-comp[01-02]
[2023-06-15T20:11:57.329] select/cons_tres: common_job_test: nodes: 
min:1 max:1 requested:1 avail:2
[2023-06-15T20:11:57.330] select/cons_tres: node_data_dump: 
Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 
ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 
AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:57.330] select/cons_tres: node_data_dump: 
Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 
ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 
AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: 
Not considering node hpc2-comp01, allocated memory = 1 and all memory 
requested for JobId=71_*
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: 
Not considering node hpc2-comp02, allocated memory = 1 and all memory 
requested for JobId=71_*
[2023-06-15T20:11:57.330] select/cons_tres: _job_test: SELECT_TYPE: 
evaluating JobId=71_* on 0 nodes
[2023-06-15T20:11:57.330] select/cons_tres: _job_test: SELECT_TYPE: test 
0 fail: insufficient resources
[2023-06-15T20:11:57.330] select/cons_tres: _will_run_test: 
JobId=71_5(76): overlap=1
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: 
JobId=71_5(76) action:normal
[2023-06-15T20:11:57.330] ====================
[2023-06-15T20:11:57.330] JobId=71_5(76) nhosts:1 ncpus:1 node_req:1 
nodes=hpc2-comp01
[2023-06-15T20:11:57.330] Node[0]:
[2023-06-15T20:11:57.330]   Mem(MB):1:0  Sockets:2  Cores:8  CPUs:2:0
[2023-06-15T20:11:57.330]   Socket[0] Core[0] is allocated
[2023-06-15T20:11:57.330] --------------------
[2023-06-15T20:11:57.330] cpu_array_value[0]:2 reps:1
[2023-06-15T20:11:57.330] ====================
[2023-06-15T20:11:57.330] debug3: select/cons_tres: job_res_rm_job: 
removed JobId=71_5(76) from part debug row 0
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: 
JobId=71_5(76) finished
[2023-06-15T20:11:57.330] select/cons_tres: _will_run_test: 
JobId=71_6(77): overlap=1
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: 
JobId=71_6(77) action:normal
[2023-06-15T20:11:57.330] ====================
[2023-06-15T20:11:57.330] JobId=71_6(77) nhosts:1 ncpus:1 node_req:1 
nodes=hpc2-comp02
[2023-06-15T20:11:57.330] Node[0]:
[2023-06-15T20:11:57.330]   Mem(MB):1:0  Sockets:2  Cores:8  CPUs:2:0
[2023-06-15T20:11:57.330]   Socket[0] Core[0] is allocated
[2023-06-15T20:11:57.330] --------------------
[2023-06-15T20:11:57.330] cpu_array_value[0]:2 reps:1
[2023-06-15T20:11:57.330] ====================

----- batch script -----------------------------------

#!/bin/bash

echo "Running on: ${SLURM_CLUSTER_NAME}, node list: ${SLURM_JOB_NODELIST}, node names: ${SLURMD_NODENAME} in: `pwd` at `date`"
echo "SLURM_NTASKS: ${SLURM_NTASKS} SLURM_TASKS_PER_NODE: ${SLURM_TASKS_PER_NODE}"
echo "SLURM_ARRAY_TASK_ID: ${SLURM_ARRAY_TASK_ID}"
echo "SLURM_MEM_PER_CPU: ${SLURM_MEM_PER_CPU}"

sleep 3600

echo "END"

Here is the sbatch command to run it:

sbatch -J test -a1-10 -t 999:00:00 -N 1 -n 1 -p debug sbatch.slurm
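
For comparison, an illustrative variant of the same command with an explicit per-job memory request (the 100 MB figure is made up). Without some memory limit, a job may be treated as requesting all of a node's memory, which would match the "all memory requested" log lines above:

```shell
# Same submission, but each array task asks for 100 MB instead of
# (implicitly) the whole node's memory. 100M is an illustrative value.
sbatch -J test -a1-10 -t 999:00:00 -N 1 -n 1 --mem=100M -p debug sbatch.slurm
```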

----- slurm.conf -----------------------------------

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=hpc2-comp00
#SlurmctldHost=
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/linuxproc
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
SrunPortRange=60001-60005
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
#TaskPlugin=task/affinity,task/cgroup
TaskPlugin=task/none
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
AccountingStorageHost=hpc2-comp00
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreFlags=job_comment,job_env,job_extra,job_script
#JobCompHost=localhost
#JobCompLoc=slurm_jobcomp_db
##JobCompParams=
#JobCompPass=/var/run/munge/munge.socket.2
#JobCompPort=3306
#JobCompType=jobcomp/mysql
JobCompType=jobcomp/none
#JobCompUser=slurm
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
# Enabled next line - 06-15-2023
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurmctld.log
# Enabled next line - 06-15-2023
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=

# Added next line : 06-15-2023
DebugFlags=Cgroup,CPU_Bind,Data,Gres,NodeFeatures,SelectType,Steps,TraceJobs

#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=hpc2-comp[01-02] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=hpc2-comp[01-02] Default=YES MaxTime=INFINITE State=UP
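
Judging by the AvailMem:1 / AllocMem:1 values in the log, one guess: the NodeName line above sets no RealMemory, so Slurm assumes 1 MB per node, and the first job on a node then holds "all" of its memory. A sketch of the line with memory configured (RealMemory is in MB; 64000 is an illustrative value, the real one can be read from `slurmd -C` on each node):

```
NodeName=hpc2-comp[01-02] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000 State=UNKNOWN
```

Setting DefMemPerCPU in the SCHEDULING section would similarly give jobs a bounded default memory request instead of the whole node.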

----- Vagrantfile file -----------------------------------

# -*- mode: ruby -*-
# vi: set ft=ruby :

# All Vagrant configuration is done below. The "2" in Vagrant.configure
# configures the configuration version (we support older styles for
# backwards compatibility). Please don't change it unless you know what
# you're doing.
Vagrant.configure("2") do |config|
   # The most common configuration options are documented and commented below.
   # For a complete reference, please see the online documentation at
   # https://docs.vagrantup.com.

   # Every Vagrant development environment requires a box. You can search for
   # boxes at https://vagrantcloud.com/search.
   config.vm.box = "generic/fedora37"

   # Stop vagrant from generating a new key for each host to allow ssh between
   # machines.
   config.ssh.insert_key = false

   # The Vagrant commands are too limited to configure a NAT network,
   # so run the VBoxManager commands by hand.
   config.vm.provider "virtualbox" do |vbox|
     # Add nic2 (eth1 on the guest VM) as the physical router.  Never change
     # nic1, because that's what the host uses to communicate with the guest VM.
     vbox.customize ["modifyvm", :id,
                     "--nic2", "bridged",
                     "--bridge-adapter2", "enp8s0"]
   end

   # Common provisioning for all guest VMs.
   config.vm.provision "shell", inline: <<-SHELL
     # Show which command is being run to associate with command output!
     set -x

     # Remove spurious hosts from the VM image.
     sed -i '/fedora37/d' /etc/hosts
     sed -i '/^127[.]0[.]1[.]1/d' /etc/hosts

     # Add NAT network to /etc/hosts.
     for host in 10.0.1.{100..102}
     do
         hostname=hpc2-comp${host:8}
         grep -q $host /etc/hosts ||
              echo "$host	$hostname" >> /etc/hosts
     done
     unset host hostname

     # Use latest set of packages.
     dnf -y update

     # Install MUNGE.
     dnf -y install munge

     # Create the SLURM user.
     id -u slurm ||
        useradd -r -s /sbin/nologin -d /etc/slurm -c "SLURM job scheduler" slurm
   SHELL

   config.vm.define "hpc2_comp00" do |hpc2_comp00|
     hpc2_comp00.vm.hostname = "hpc2-comp00"
     hpc2_comp00.vm.synced_folder ".", "/vagrant", automount: true
     hpc2_comp00.vm.provision :shell, inline: <<-SHELL
       # Show which command is being run to associate with command output!
       set -x

       # Set static IP address for NAT network.
       HOST=10.0.1.100
       ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection
       sed "s|address1=10.0.1.100|address1=${HOST}|" \
           /vagrant/eth1.nmconnection > $ETH1
       chmod go-r $ETH1
       nmcli con load $ETH1
       unset HOST

       # Create the MUNGE key.
       [[ -f /etc/munge/munge.key ]] || sudo -u munge /usr/sbin/mungekey -v
       cp -av /etc/munge/munge.key /vagrant/

       # Enable and start munge.
       systemctl enable munge
       systemctl start munge
       systemctl status munge

       # Setup database on the head node:
       dnf -y install mariadb-devel mariadb-server

       # Set recommended memory (5%-50% of RAM) and timeout.
       CNF=/etc/my.cnf.d/mariadb-server.cnf

       # Note we need to use a double backslash for the newline character
       # below because of how Vagrant handles inline shell scripts.

       MYSQL_RAM=$(awk '/^MemTotal/ {printf "%.0f\\n", $2*0.05}' /proc/meminfo)
       grep -q innodb_buffer_pool_size $CNF ||
            sed -i '/InnoDB/a innodb_buffer_pool_size='${MYSQL_RAM}K $CNF
       grep -q innodb_lock_wait_timeout $CNF ||
            sed -i '/innodb_buffer_pool_size/a innodb_lock_wait_timeout=900' $CNF
       unset CNF MYSQL_RAM

       # Run the head node services:
       systemctl enable mariadb
       systemctl start mariadb
       systemctl status mariadb

       # Secure the server.
       #
       # Send interactive commands using printf per
       # https://unix.stackexchange.com/a/112348
       printf "%s\n" "" n n y y y y | mariadb-secure-installation

       # Install the RPM package builder for SLURM.
       dnf -y install rpmdevtools

       # Download SLURM.
       wget -nc https://download.schedmd.com/slurm/slurm-23.02.0.tar.bz2
       # Install the source package dependencies to determine the
       # dependencies needed to build the binary packages.
       dnf -y install \
           dbus-devel \
           freeipmi-devel \
           hdf5-devel \
           http-parser-devel \
           json-c-devel \
           libcurl-devel \
           libjwt-devel \
           libyaml-devel \
           lua-devel \
           lz4-devel \
           man2html \
           readline-devel \
           rrdtool-devel
       SLURM_SRPM=~/rpmbuild/SRPMS/slurm-23.02.0-1.fc37.src.rpm
       # Create SLURM .rpmmacros file.
       cp -av /vagrant/.rpmmacros .
       [[ -f ${SLURM_SRPM} ]] ||
             rpmbuild -ts slurm-23.02.0.tar.bz2 |& tee build-slurm-source.log
       # Install the build dependencies for the binary packages.
       dnf -y builddep ${SLURM_SRPM}
       unset SLURM_SRPM
       # Build the SLURM binaries.
       SLURM_RPM=~/rpmbuild/RPMS/x86_64/slurm-23.02.0-1.fc37.x86_64.rpm
       [[ -f ${SLURM_RPM} ]] ||
             rpmbuild -ta slurm-23.02.0.tar.bz2 |& tee build-slurm-binary.log
       unset SLURM_RPM

       # Copy SLURM packages to the compute nodes.
       DIR_RPM=~/rpmbuild/RPMS/x86_64
       cp -av ${DIR_RPM}/slurm-23*.rpm ${DIR_RPM}/slurm-slurmd-23*.rpm /vagrant/ &&
          touch /vagrant/sentinel-copied-rpms.done

       # Install all SLURM packages on the head node.
       find ${DIR_RPM} -type f -not -name '*-slurmd-*' -not -name '*-torque-*' \
            -exec dnf -y install {} +
       unset DIR_RPM
       # Copy the configuration files.
       cp -av /vagrant/slurmdbd.conf /etc/slurm/slurmdbd.conf
       chown slurm:slurm /etc/slurm/slurmdbd.conf
       chmod 600 /etc/slurm/slurmdbd.conf
       cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf
       chown root:root /etc/slurm/slurm.conf
       chmod 644 /etc/slurm/slurm.conf

       # Now create the slurm MySQL user.
       SLURM_PASSWORD=$(awk -vFS='=' '/StoragePass/ {print $2}' /etc/slurm/slurmdbd.conf)
       DBD_HOST=localhost
       # https://docs.fedoraproject.org/en-US/quick-docs/installing-mysql-mariadb/#_start_mysql_service_and_enable_at_loggin
       mysql -e "create user if not exists 'slurm'@'$DBD_HOST' identified by '$SLURM_PASSWORD';"
       mysql -e "grant all on slurm_acct_db.* to 'slurm'@'$DBD_HOST';"
       mysql -e "show grants for 'slurm'@'$DBD_HOST';"
       DBD_HOST=hpc2-comp00
       mysql -e "create user if not exists 'slurm'@'$DBD_HOST' identified by '$SLURM_PASSWORD';"
       mysql -e "grant all on slurm_acct_db.* to 'slurm'@'$DBD_HOST';"
       mysql -e "show grants for 'slurm'@'$DBD_HOST';"
       unset SLURM_PASSWORD DBD_HOST

       systemctl enable slurmdbd
       systemctl start slurmdbd
       systemctl status slurmdbd

       systemctl enable slurmctld
       mkdir -p /var/spool/slurmctld
       chown slurm:slurm /var/spool/slurmctld
       # Open ports for slurmctld (6817) and slurmdbd (6819).
       firewall-cmd --add-port=6817/tcp
       firewall-cmd --add-port=6819/tcp
       firewall-cmd --runtime-to-permanent
       systemctl start slurmctld
       systemctl status slurmctld

       # Clear any previous node DOWN errors.
       sinfo -s
       sinfo -R
       scontrol update nodename=ALL state=RESUME
       sinfo -s
       sinfo -R
   SHELL
   end

   config.vm.define "hpc2_comp01" do |hpc2_comp01|
     hpc2_comp01.vm.hostname = "hpc2-comp01"
     hpc2_comp01.vm.synced_folder ".", "/vagrant", automount: true
     hpc2_comp01.vm.provision :shell, inline: <<-SHELL
       # Show which command is being run to associate with command output!
       set -x

       # Set static IP address for NAT network.
       HOST=10.0.1.101
       ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection
       sed "s|address1=10.0.1.100|address1=${HOST}|" \
           /vagrant/eth1.nmconnection > $ETH1
       chmod go-r $ETH1
       nmcli con load $ETH1
       unset HOST

       # Copy the MUNGE key.
       KEY=/etc/munge/munge.key
       cp -av /vagrant/munge.key /etc/munge/
       chown munge:munge $KEY
       chmod 600 $KEY

       # Enable and start munge.
       systemctl enable munge
       systemctl start munge
       systemctl status munge

       # SLURM packages to be installed on compute nodes.
       DIR_RPM=~/rpmbuild/RPMS/x86_64
       mkdir -p ${DIR_RPM}
       rsync -avP /vagrant/slurm*.rpm ${DIR_RPM}/
       dnf -y install ${DIR_RPM}/slurm-23*.rpm
       dnf -y install ${DIR_RPM}/slurm-slurmd-23*.rpm
       unset DIR_RPM
       # Copy the configuration file.
       mkdir -p /etc/slurm
       cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf
       chown root:root /etc/slurm/slurm.conf
       chmod 644 /etc/slurm/slurm.conf
       # Only enable slurmd on the worker nodes.
       systemctl enable slurmd
       # Open port for slurmd (6818).
       firewall-cmd --add-port=6818/tcp
       firewall-cmd --runtime-to-permanent
       # Open port range for srun.
       SRUN_PORT_RANGE=$(awk -vFS='=' '/SrunPortRange/ {print $2}' /etc/slurm/slurm.conf)
       firewall-cmd --add-port=$SRUN_PORT_RANGE/tcp
       firewall-cmd --runtime-to-permanent
       systemctl start slurmd
       systemctl status slurmd

       # Clear any previous node DOWN errors.
       sinfo -s
       sinfo -R
       scontrol update nodename=ALL state=RESUME
       sinfo -s
       sinfo -R
   SHELL
   end

   config.vm.define "hpc2_comp02" do |hpc2_comp02|
     hpc2_comp02.vm.hostname = "hpc2-comp02"
     hpc2_comp02.vm.synced_folder ".", "/vagrant", automount: true
     hpc2_comp02.vm.provision :shell, inline: <<-SHELL
       # Show which command is being run to associate with command output!
       set -x

       # Set static IP address for NAT network.
       HOST=10.0.1.102
       ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection
       sed "s|address1=10.0.1.100|address1=${HOST}|" \
           /vagrant/eth1.nmconnection > $ETH1
       chmod go-r $ETH1
       nmcli con load $ETH1
       unset HOST

       # Copy the MUNGE key.
       KEY=/etc/munge/munge.key
       cp -av /vagrant/munge.key /etc/munge/
       chown munge:munge $KEY
       chmod 600 $KEY

       # Enable and start munge.
       systemctl enable munge
       systemctl start munge
       systemctl status munge

       # SLURM packages to be installed on compute nodes.
       DIR_RPM=~/rpmbuild/RPMS/x86_64
       mkdir -p ${DIR_RPM}
       rsync -avP /vagrant/slurm*.rpm ${DIR_RPM}/
       dnf -y install ${DIR_RPM}/slurm-23*.rpm
       dnf -y install ${DIR_RPM}/slurm-slurmd-23*.rpm
       unset DIR_RPM
       # Copy the configuration file.
       mkdir -p /etc/slurm
       cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf
       chown root:root /etc/slurm/slurm.conf
       chmod 644 /etc/slurm/slurm.conf
       # Only enable slurmd on the worker nodes.
       systemctl enable slurmd
       # Open port for slurmd (6818).
       firewall-cmd --add-port=6818/tcp
       firewall-cmd --runtime-to-permanent
       systemctl start slurmd
       systemctl status slurmd

       # Clear any previous node DOWN errors.
       sinfo -s
       sinfo -R
       scontrol update nodename=ALL state=RESUME
       sinfo -s
       sinfo -R
   SHELL
   end
end

-------------------------------------------------------------------------


