[slurm-users] final stages of cloud infrastructure setup
nathan norton
nathan at nanoservices.com.au
Sat May 18 22:02:59 UTC 2019
Hi,
I am in the process of setting up Slurm on Amazon cloud
infrastructure. All is going well: I can elastically start and stop
nodes as jobs run. I am hitting a few small teething issues, probably
because I do not yet understand some of the terminology. At a high
level, all the nodes given to end users in the cloud are
hyper-threaded, so I want Slurm to treat them as hyper-threaded nodes.
All nodes run the latest CentOS 7. I would also like each job to run
inside a cgroup and stay pinned where it starts rather than migrating
between CPUs. Most of this is working, apart from the issues below.
My use case: I have an in-house binary application that is
single-threaded and does no message passing or anything like that. The
application is compute-bound, not memory-bound.
So on each node I would like to run 16 instances in parallel, one per
hardware thread. As can be seen below, if I launch the app via srun it
runs on every thread (CPUs 0-15). However, if I run it via sbatch, the
tasks only land on CPUs 0-7 instead of CPUs 0-15.
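For reference, the effect I am after on a single node is roughly this
(./backtest stands in here for our in-house binary):

srun -n16 --ntasks-per-core=2 --cpu_bind=threads ./backtest

i.e. 16 single-threaded tasks, one bound to each of the 16 hardware
threads.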
Another question: what would be the best way to retry failed jobs? I
can rerun the whole batch again, but I only want to rerun a single
step of the batch.
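For example, supposing array task 7 of job 106491 had failed, I
imagine something along one of these lines, but I am not sure which is
the intended mechanism:

scontrol requeue 106491_7          # requeue just that array task
sbatch --array=7 nathan.batch.sh   # or resubmit only that index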
Please see below for the output of various commands, along with my
slurm.conf and cgroup.conf files.
Many thanks
Nathan.
______________________________________________________________________
btuser@bt_slurm_login001[domain]% slurmd -V
slurm 18.08.6-2
______________________________________________________________________
btuser@bt_slurm_login001[domain]% cat nathan.batch.sh
#!/bin/bash
#SBATCH --job-name=nathan_test
#SBATCH --ntasks=1
#SBATCH --array=1-32
#SBATCH --ntasks-per-core=2
hostname
srun --hint=multithread -n1 --exclusive --cpu_bind=threads cat /proc/self/status | grep -i cpus_allowed_list
btuser@bt_slurm_login001[domain]%
btuser@bt_slurm_login001[domain]% sbatch nathan.batch.sh
Submitted batch job 106491
btuser@bt_slurm_login001[domain]% cat slurm-106491_*
btuser@bt_slurm_login001[domain]% cat slurm-106491_*
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 1
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 2
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 3
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 4
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 5
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 6
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 7
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 0
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 1
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 2
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 0
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 3
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 4
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 5
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 6
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 7
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 0
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 1
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 2
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 3
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 4
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 1
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 5
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 6
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 7
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 2
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 3
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 4
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 5
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 6
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 7
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 0
______________________________________________________________________
btuser@bt_slurm_login001[domain]% srun -n32 --exclusive --cpu_bind=threads cat /proc/self/status | grep -i cpus_allowed_list
Cpus_allowed_list: 12
Cpus_allowed_list: 13
Cpus_allowed_list: 15
Cpus_allowed_list: 0
Cpus_allowed_list: 8
Cpus_allowed_list: 1
Cpus_allowed_list: 9
Cpus_allowed_list: 2
Cpus_allowed_list: 10
Cpus_allowed_list: 11
Cpus_allowed_list: 4
Cpus_allowed_list: 5
Cpus_allowed_list: 6
Cpus_allowed_list: 14
Cpus_allowed_list: 7
Cpus_allowed_list: 3
Cpus_allowed_list: 0
Cpus_allowed_list: 8
Cpus_allowed_list: 1
Cpus_allowed_list: 9
Cpus_allowed_list: 2
Cpus_allowed_list: 10
Cpus_allowed_list: 3
Cpus_allowed_list: 11
Cpus_allowed_list: 4
Cpus_allowed_list: 12
Cpus_allowed_list: 5
Cpus_allowed_list: 6
Cpus_allowed_list: 14
Cpus_allowed_list: 7
Cpus_allowed_list: 15
Cpus_allowed_list: 13
______________________________________________________________________
[sysadmin@ip-10-0-8-88 ~]$ slurmd -C
NodeName=ip-10-0-8-88 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=30986
UpTime=0-00:06:10
______________________________________________________________________
Cloud server stats:
[sysadmin@ip-10-0-8-88 ~]$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0 0 0 0 0:0:0:0 yes
1 0 0 1 1:1:1:0 yes
2 0 0 2 2:2:2:0 yes
3 0 0 3 3:3:3:0 yes
4 0 0 4 4:4:4:0 yes
5 0 0 5 5:5:5:0 yes
6 0 0 6 6:6:6:0 yes
7 0 0 7 7:7:7:0 yes
8 0 0 0 0:0:0:0 yes
9 0 0 1 1:1:1:0 yes
10 0 0 2 2:2:2:0 yes
11 0 0 3 3:3:3:0 yes
12 0 0 4 4:4:4:0 yes
13 0 0 5 5:5:5:0 yes
14 0 0 6 6:6:6:0 yes
15 0 0 7 7:7:7:0 yes
______________________________________________________________________
# slurm.conf file generated by configurator easy.html.
SlurmctldHost=bt_slurm_master
MailProg=/dev/null
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmd/slurmctld.pid
SlurmdPidFile=/var/run/slurmd/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmctld/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
TaskPluginParam=Threads
SlurmdTimeout=500
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
AccountingStorageType=accounting_storage/none
ClusterName=simplecluster
JobAcctGatherType=jobacct_gather/none
PropagatePrioProcess=2
MaxTasksPerNode=16
ResumeProgram=/bt/admin/slurm/etc/slurm_ec2_startup.sh
ResumeTimeout=900
ResumeRate=0
SuspendProgram=/bt/admin/slurm/etc/slurm_ec2_shutdown.sh
SuspendTime=600
SuspendTimeout=120
TreeWidth=1024
SuspendRate=0
NodeName=ip-10-0-8-[2-100] CPUs=16 RealMemory=27648 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=CLOUD
NodeName=bt_slurm_login00[1-10] RealMemory=512 State=DOWN
PartitionName=backtest Nodes=ip-10-0-8-[2-100] Default=YES MaxTime=300 Oversubscribe=NO State=UP
______________________________________________________________________
[sysadmin@bt_slurm_master ~]$ cat /etc/slurm/cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
CgroupMountpoint="/sys/fs/cgroup"
TaskAffinity=yes
ConstrainCores=yes
ConstrainRAMSpace=no
______________________________________________________________________
JobId=107424 ArrayJobId=107423 ArrayTaskId=1 JobName=aijing_test
UserId=btuser(1001) GroupId=users(100) MCS_label=N/A
Priority=4294901612 Nice=0 Account=(null) QOS=(null)
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:01:06 TimeLimit=05:00:00 TimeMin=N/A
SubmitTime=2019-05-17T01:55:56 EligibleTime=2019-05-17T01:55:56
AccrueTime=Unknown
StartTime=2019-05-17T01:55:56 EndTime=2019-05-17T01:57:02 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-05-17T01:55:56
Partition=backtest AllocNode:Sid=bt_slurm_login001:14002
ReqNodeList=(null) ExcNodeList=(null)
NodeList=ip-10-0-8-88
BatchHost=ip-10-0-8-88
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
Nodes=ip-10-0-8-88 CPU_IDs=0-1 Mem=0 GRES_IDX=
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/bt/data/backtester/destination_tables_94633/batch_run.sh
WorkDir=/bt/data/backtester/destination_tables_94633
StdErr=/bt/data/backtester/destination_tables_94633/slurm-107423_1.out
StdIn=/dev/null
StdOut=/bt/data/backtester/destination_tables_94633/slurm-107423_1.out
Power=