[slurm-users] final stages of cloud infrastructure set up
nathan norton
nathan at nanoservices.com.au
Tue May 21 09:37:03 UTC 2019
Unfortunately that didn't work. However, I modified my slurm.conf to
lie and say each node has 16 CPUs with 1 thread per core, and now
everything is working fine.
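
For anyone else who hits this, the workaround node definition is
roughly the following (just a sketch of that change; all other values
are unchanged from my original config below):

    NodeName=ip-10-0-8-[2-100] CPUs=16 RealMemory=27648 Sockets=1 CoresPerSocket=16 ThreadsPerCore=1 State=CLOUD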
One remaining issue with CLOUD-state machines: when I run scontrol show
nodes, they don't show up. Is there a way I can get their info when
they are not 'running'?
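
For example, while one of the cloud nodes is idle/powered down,
something like this currently returns nothing for it (node name taken
from the examples below):

    scontrol show node ip-10-0-8-88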
Thanks
Nathan
On 20/5/19 12:04 am, Riebs, Andy wrote:
>
> Just looking at this quickly, have you tried specifying
> “hint=multithread” as an sbatch parameter?
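>
> e.g. something along the lines of adding this near the top of the
> batch script (untested, just a thought):
>
>     #SBATCH --hint=multithread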
>
> From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of nathan norton
> Sent: Saturday, May 18, 2019 6:03 PM
> To: slurm-users at lists.schedmd.com
> Subject: [slurm-users] final stages of cloud infrastructure set up
>
> Hi,
>
> I am in the process of setting up Slurm on Amazon cloud
> infrastructure. All is going well: I can elastically start and stop
> nodes when jobs run. I am running into a few small teething issues
> that are probably due to me not understanding some of the terminology
> here. At a high level, all the nodes given to end users in the cloud
> are hyper-threaded, so I want to use my nodes as hyper-threaded nodes.
> All nodes run the latest CentOS 7. I would also like each job to run
> in a cgroup and not migrate around after it starts. As I said, I think
> most of it is working, except for the few issues described below.
>
> My use case: I have an in-house binary application that is
> single-threaded and does no message passing or anything like that.
> The application is not memory bound, only compute bound.
>
> So on a node I would like to be able to run 16 instances in parallel.
> As can be seen below, if I launch the app via srun it runs on every
> hardware thread of a CPU. However, if I run it via the sbatch command,
> it only runs on CPUs 0-7 instead of CPUs 0-15.
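>
> To make the intent concrete, per node I effectively want the
> equivalent of this (just a sketch; ./mybinary stands in for the
> in-house application):
>
>     srun -n16 --hint=multithread --cpu_bind=threads ./mybinary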
>
> Another question: what would be the best way to retry failed jobs? I
> can rerun the whole batch again, but I only want to rerun a single
> step in the batch (see the sketch just below).
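>
> For instance, would resubmitting just the failed array index be the
> right approach? Something like this (a sketch; 7 is a hypothetical
> failed index of the array job shown below):
>
>     sbatch --array=7 nathan.batch.sh
>
> Or is there a cleaner way, e.g. "scontrol requeue <jobid_taskid>"?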
>
> Please see below for the output of various commands, as well as my
> slurm.conf file.
>
> Many thanks
>
> Nathan.
>
> ______________________________________________________________________
>
> btuser at bt_slurm_login001[domain ]% slurmd -V
> slurm 18.08.6-2
>
> ______________________________________________________________________
>
> btuser at bt_slurm_login001[domain ]% cat nathan.batch.sh
> #!/bin/bash
> #SBATCH --job-name=nathan_test
> #SBATCH --ntasks=1
> #SBATCH --array=1-32
> #SBATCH --ntasks-per-core=2
> hostname
> srun --hint=multithread -n1 --exclusive --cpu_bind=threads cat /proc/self/status | grep -i cpus_allowed_list
> btuser at bt_slurm_login001[domain ]%
>
> btuser at bt_slurm_login001[domain ]% sbatch Nathan.batch.sh
> Submitted batch job 106491
> btuser at bt_slurm_login001[domain ]% cat slurm-106491_*
> btuser at bt_slurm_login001[domain ]% cat slurm-106491_*
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list: 1
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list: 2
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list: 3
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list: 4
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list: 5
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list: 6
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list: 7
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list: 0
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list: 1
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list: 2
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list: 0
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list: 3
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list: 4
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list: 5
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list: 6
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list: 7
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list: 0
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list: 1
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list: 2
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list: 3
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list: 4
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list: 1
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list: 5
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list: 6
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list: 7
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list: 2
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list: 3
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list: 4
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list: 5
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list: 6
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list: 7
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list: 0
>
> ______________________________________________________________________
>
> btuser at bt_slurm_login001[domain ]% srun -n32 --exclusive --cpu_bind=threads cat /proc/self/status | grep -i cpus_allowed_list
> Cpus_allowed_list: 12
> Cpus_allowed_list: 13
> Cpus_allowed_list: 15
> Cpus_allowed_list: 0
> Cpus_allowed_list: 8
> Cpus_allowed_list: 1
> Cpus_allowed_list: 9
> Cpus_allowed_list: 2
> Cpus_allowed_list: 10
> Cpus_allowed_list: 11
> Cpus_allowed_list: 4
> Cpus_allowed_list: 5
> Cpus_allowed_list: 6
> Cpus_allowed_list: 14
> Cpus_allowed_list: 7
> Cpus_allowed_list: 3
> Cpus_allowed_list: 0
> Cpus_allowed_list: 8
> Cpus_allowed_list: 1
> Cpus_allowed_list: 9
> Cpus_allowed_list: 2
> Cpus_allowed_list: 10
> Cpus_allowed_list: 3
> Cpus_allowed_list: 11
> Cpus_allowed_list: 4
> Cpus_allowed_list: 12
> Cpus_allowed_list: 5
> Cpus_allowed_list: 6
> Cpus_allowed_list: 14
> Cpus_allowed_list: 7
> Cpus_allowed_list: 15
> Cpus_allowed_list: 13
>
> ______________________________________________________________________
>
> [sysadmin at ip-10-0-8-88 ~]$ slurmd -C
> NodeName=ip-10-0-8-88 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=30986
> UpTime=0-00:06:10
>
> ______________________________________________________________________
>
> Cloud server stats:
> [sysadmin at ip-10-0-8-88 ~]$ lscpu -e
> CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
> 0 0 0 0 0:0:0:0 yes
> 1 0 0 1 1:1:1:0 yes
> 2 0 0 2 2:2:2:0 yes
> 3 0 0 3 3:3:3:0 yes
> 4 0 0 4 4:4:4:0 yes
> 5 0 0 5 5:5:5:0 yes
> 6 0 0 6 6:6:6:0 yes
> 7 0 0 7 7:7:7:0 yes
> 8 0 0 0 0:0:0:0 yes
> 9 0 0 1 1:1:1:0 yes
> 10 0 0 2 2:2:2:0 yes
> 11 0 0 3 3:3:3:0 yes
> 12 0 0 4 4:4:4:0 yes
> 13 0 0 5 5:5:5:0 yes
> 14 0 0 6 6:6:6:0 yes
> 15 0 0 7 7:7:7:0 yes
>
> ______________________________________________________________________
>
> # slurm.conf file generated by configurator easy.html.
> SlurmctldHost=bt_slurm_master
> MailProg=/dev/null
> MpiDefault=none
> ProctrackType=proctrack/cgroup
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmd/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd/slurmd.pid
> SlurmdSpoolDir=/var/spool/slurmctld/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/spool/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/cgroup
> TaskPluginParam=Threads
> SlurmdTimeout=500
> FastSchedule=1
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU
> AccountingStorageType=accounting_storage/none
> ClusterName=simplecluster
> JobAcctGatherType=jobacct_gather/none
> PropagatePrioProcess=2
> MaxTasksPerNode=16
> ResumeProgram=/bt/admin/slurm/etc/slurm_ec2_startup.sh
> ResumeTimeout=900
> ResumeRate=0
> SuspendProgram=/bt/admin/slurm/etc/slurm_ec2_shutdown.sh
> SuspendTime=600
> SuspendTimeout=120
> TreeWidth=1024
> SuspendRate=0
> NodeName=ip-10-0-8-[2-100] CPUs=16 RealMemory=27648 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=CLOUD
> NodeName=bt_slurm_login00[1-10] RealMemory=512 State=DOWN
> PartitionName=backtest Nodes=ip-10-0-8-[2-100] Default=YES MaxTime=300 Oversubscribe=NO State=UP
>
> ______________________________________________________________________
>
> [sysadmin at bt_slurm_master ~]$ cat /etc/slurm/cgroup.conf
> ###
> #
> # Slurm cgroup support configuration file
> #
> # See man slurm.conf and man cgroup.conf for further
> # information on cgroup configuration parameters
> #--
> CgroupAutomount=yes
> CgroupMountpoint="/sys/fs/cgroup"
> TaskAffinity=yes
> ConstrainCores=yes
> ConstrainRAMSpace=no
>
> ______________________________________________________________________
>
> JobId=107424 ArrayJobId=107423 ArrayTaskId=1 JobName=aijing_test
> UserId=btuser(1001) GroupId=users(100) MCS_label=N/A
> Priority=4294901612 Nice=0 Account=(null) QOS=(null)
> JobState=COMPLETED Reason=None Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=00:01:06 TimeLimit=05:00:00 TimeMin=N/A
> SubmitTime=2019-05-17T01:55:56 EligibleTime=2019-05-17T01:55:56
> AccrueTime=Unknown
> StartTime=2019-05-17T01:55:56 EndTime=2019-05-17T01:57:02 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-05-17T01:55:56
> Partition=backtest AllocNode:Sid=bt_slurm_login001:14002
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=ip-10-0-8-88
> BatchHost=ip-10-0-8-88
> NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=1,node=1,billing=1
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> Nodes=ip-10-0-8-88 CPU_IDs=0-1 Mem=0 GRES_IDX=
> MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/bt/data/backtester/destination_tables_94633/batch_run.sh
> WorkDir=/bt/data/backtester/destination_tables_94633
> StdErr=/bt/data/backtester/destination_tables_94633/slurm-107423_1.out
> StdIn=/dev/null
> StdOut=/bt/data/backtester/destination_tables_94633/slurm-107423_1.out
> Power=
>