[slurm-users] final stages of cloud infrastructure set up

nathan norton nathan at nanoservices.com.au
Tue May 21 09:37:03 UTC 2019


Unfortunately that didn't work.

However, I modified my slurm.conf to lie and say I had 16 CPUs with 1
thread each, and now everything is working fine.
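For anyone finding this thread later, the workaround described above amounts to a node definition along these lines (a sketch only, reusing the node names and memory value from the slurm.conf quoted below; it deliberately misreports the real 8-core/2-thread topology):

```
# slurm.conf sketch of the workaround: advertise each hardware thread as a
# full core, so Slurm schedules all 16 "CPUs" independently. Note this
# contradicts what slurmd -C reports for these hosts (8 cores x 2 threads).
NodeName=ip-10-0-8-[2-100] CPUs=16 RealMemory=27648 Sockets=1 CoresPerSocket=16 ThreadsPerCore=1 State=CLOUD
```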

One remaining issue with CLOUD-state machines: when I run scontrol show
nodes they don't show up. Is there a way I can get their info while they
are not 'running'?
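(One possible answer, offered tentatively: the slurm.conf PrivateData option has a "cloud" flag which, despite the name, makes powered-down CLOUD nodes visible in sinfo and scontrol output. Worth verifying against the man page for the release in use, 18.08 here:)

```
# slurm.conf fragment (assumption - check "man slurm.conf" for your release):
# the "cloud" flag makes powered-down CLOUD-state nodes appear in
# sinfo / scontrol show nodes output instead of being hidden.
PrivateData=cloud
```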

Thanks
Nathan

On 20/5/19 12:04 am, Riebs, Andy wrote:
>
> Just looking at this quickly, have you tried specifying 
> “hint=multithread” as an sbatch parameter?
>
> From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On
> Behalf Of nathan norton
> Sent: Saturday, May 18, 2019 6:03 PM
> To: slurm-users at lists.schedmd.com
> Subject: [slurm-users] final stages of cloud infrastructure set up
>
> Hi,
>
> I am in the process of setting up Slurm on Amazon cloud
> infrastructure. All is going well: I can elastically start and stop
> nodes when jobs run. I am running into a few small teething issues,
> probably due to me not understanding some of the terminology. At a
> high level, all the nodes given to end users in the cloud are
> hyper-threaded, so I want to use my nodes as hyper-threaded nodes.
> All nodes are running the latest CentOS 7. I would also like each job
> to run in a cgroup and not migrate around after it starts. As I said,
> I think most of it is working except for the few issues below.
>
> My use case: I have an in-house-built binary application that is
> single threaded and does no message passing or anything like that.
> The application is not memory bound, only compute bound.
>
> So on a node I would like to be able to run 16 instances in parallel.
> As can be seen below, if I launch the single app via srun it runs on
> every hardware thread. However, if I run it via sbatch, as can be
> seen it only runs on CPUs 0-7 instead of CPUs 0-15.
>
> Another question: what is the best way to retry failed jobs? I can
> rerun the whole batch again, but I only want to rerun a single step
> of it.
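> (A possible approach for the retry question, sketched under the
> assumption that each "step" is one array task of the script below:
> sbatch accepts an explicit index list for --array, and command-line
> options override the #SBATCH directives in the script, so a single
> failed task can be resubmitted on its own.)

```
# Resubmit only array task 7 of the batch script quoted below; the
# command-line --array overrides the "#SBATCH --array=1-32" directive.
sbatch --array=7 nathan.batch.sh

# A list of specific indices also works:
sbatch --array=3,7,12 nathan.batch.sh
```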
>
> Please see below for the output of various commands as well as my
> slurm.conf file.
>
> Many thanks
>
> Nathan.
>
> ______________________________________________________________________
>
> btuser at bt_slurm_login001[domain ]% slurmd -V
> slurm 18.08.6-2
>
> ______________________________________________________________________
>
> btuser at bt_slurm_login001[domain ]% cat nathan.batch.sh
> #!/bin/bash
> #SBATCH --job-name=nathan_test
> #SBATCH --ntasks=1
> #SBATCH --array=1-32
> #SBATCH --ntasks-per-core=2
> hostname
> srun --hint=multithread -n1 --exclusive --cpu_bind=threads cat /proc/self/status | grep -i cpus_allowed_list
> btuser at bt_slurm_login001[domain ]%
>
> btuser at bt_slurm_login001[domain ]% sbatch Nathan.batch.sh
> Submitted batch job 106491
> btuser at bt_slurm_login001[domain ]% cat slurm-106491_*
> btuser at bt_slurm_login001[domain ]% cat slurm-106491_*
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list:      1
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list:      2
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list:      3
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list:      4
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list:      5
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list:      6
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list:      7
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list:      0
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list:      1
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list:      2
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list:      0
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list:      3
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list:      4
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list:      5
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list:      6
> ip-10-0-8-90.ec2.internal
> Cpus_allowed_list:      7
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list:      0
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list:      1
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list:      2
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list:      3
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list:      4
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list:      1
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list:      5
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list:      6
> ip-10-0-8-91.ec2.internal
> Cpus_allowed_list:      7
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list:      2
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list:      3
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list:      4
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list:      5
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list:      6
> ip-10-0-8-88.ec2.internal
> Cpus_allowed_list:      7
> ip-10-0-8-89.ec2.internal
> Cpus_allowed_list:      0
>
> ______________________________________________________________________
>
> btuser at bt_slurm_login001[domain ]% srun -n32 --exclusive --cpu_bind=threads cat /proc/self/status | grep -i cpus_allowed_list
> Cpus_allowed_list:      12
> Cpus_allowed_list:      13
> Cpus_allowed_list:      15
> Cpus_allowed_list:      0
> Cpus_allowed_list:      8
> Cpus_allowed_list:      1
> Cpus_allowed_list:      9
> Cpus_allowed_list:      2
> Cpus_allowed_list:      10
> Cpus_allowed_list:      11
> Cpus_allowed_list:      4
> Cpus_allowed_list:      5
> Cpus_allowed_list:      6
> Cpus_allowed_list:      14
> Cpus_allowed_list:      7
> Cpus_allowed_list:      3
> Cpus_allowed_list:      0
> Cpus_allowed_list:      8
> Cpus_allowed_list:      1
> Cpus_allowed_list:      9
> Cpus_allowed_list:      2
> Cpus_allowed_list:      10
> Cpus_allowed_list:      3
> Cpus_allowed_list:      11
> Cpus_allowed_list:      4
> Cpus_allowed_list:      12
> Cpus_allowed_list:      5
> Cpus_allowed_list:      6
> Cpus_allowed_list:      14
> Cpus_allowed_list:      7
> Cpus_allowed_list:      15
> Cpus_allowed_list:      13
>
> ______________________________________________________________________
>
> [sysadmin at ip-10-0-8-88 ~]$ slurmd -C
> NodeName=ip-10-0-8-88 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=30986
> UpTime=0-00:06:10
>
> ______________________________________________________________________
>
> Cloud server stats:
>
> [sysadmin at ip-10-0-8-88 ~]$ lscpu -e
> CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
> 0   0    0      0    0:0:0:0       yes
> 1   0    0      1    1:1:1:0       yes
> 2   0    0      2    2:2:2:0       yes
> 3   0    0      3    3:3:3:0       yes
> 4   0    0      4    4:4:4:0       yes
> 5   0    0      5    5:5:5:0       yes
> 6   0    0      6    6:6:6:0       yes
> 7   0    0      7    7:7:7:0       yes
> 8   0    0      0    0:0:0:0       yes
> 9   0    0      1    1:1:1:0       yes
> 10  0    0      2    2:2:2:0       yes
> 11  0    0      3    3:3:3:0       yes
> 12  0    0      4    4:4:4:0       yes
> 13  0    0      5    5:5:5:0       yes
> 14  0    0      6    6:6:6:0       yes
> 15  0    0      7    7:7:7:0       yes
>
> ______________________________________________________________________
>
> # slurm.conf file generated by configurator easy.html.
> SlurmctldHost=bt_slurm_master
> MailProg=/dev/null
> MpiDefault=none
> ProctrackType=proctrack/cgroup
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmd/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd/slurmd.pid
> SlurmdSpoolDir=/var/spool/slurmctld/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/spool/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/cgroup
> TaskPluginParam=Threads
> SlurmdTimeout=500
> FastSchedule=1
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU
> AccountingStorageType=accounting_storage/none
> ClusterName=simplecluster
> JobAcctGatherType=jobacct_gather/none
> PropagatePrioProcess=2
> MaxTasksPerNode=16
> ResumeProgram=/bt/admin/slurm/etc/slurm_ec2_startup.sh
> ResumeTimeout=900
> ResumeRate=0
> SuspendProgram=/bt/admin/slurm/etc/slurm_ec2_shutdown.sh
> SuspendTime=600
> SuspendTimeout=120
> TreeWidth=1024
> SuspendRate=0
> NodeName=ip-10-0-8-[2-100] CPUs=16 RealMemory=27648 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=CLOUD
> NodeName=bt_slurm_login00[1-10] RealMemory=512 State=DOWN
> PartitionName=backtest Nodes=ip-10-0-8-[2-100] Default=YES MaxTime=300 Oversubscribe=NO State=UP
>
> ______________________________________________________________________
>
> [sysadmin at bt_slurm_master ~]$ cat /etc/slurm/cgroup.conf
> ###
> #
> # Slurm cgroup support configuration file
> #
> # See man slurm.conf and man cgroup.conf for further
> # information on cgroup configuration parameters
> #--
> CgroupAutomount=yes
> CgroupMountpoint="/sys/fs/cgroup"
> TaskAffinity=yes
> ConstrainCores=yes
> ConstrainRAMSpace=no
>
> ______________________________________________________________________
>
> JobId=107424 ArrayJobId=107423 ArrayTaskId=1 JobName=aijing_test
>    UserId=btuser(1001) GroupId=users(100) MCS_label=N/A
>    Priority=4294901612 Nice=0 Account=(null) QOS=(null)
>    JobState=COMPLETED Reason=None Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    DerivedExitCode=0:0
>    RunTime=00:01:06 TimeLimit=05:00:00 TimeMin=N/A
>    SubmitTime=2019-05-17T01:55:56 EligibleTime=2019-05-17T01:55:56
>    AccrueTime=Unknown
>    StartTime=2019-05-17T01:55:56 EndTime=2019-05-17T01:57:02 Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    LastSchedEval=2019-05-17T01:55:56
>    Partition=backtest AllocNode:Sid=bt_slurm_login001:14002
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=ip-10-0-8-88
>    BatchHost=ip-10-0-8-88
>    NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=1,node=1,billing=1
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>      Nodes=ip-10-0-8-88 CPU_IDs=0-1 Mem=0 GRES_IDX=
>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/bt/data/backtester/destination_tables_94633/batch_run.sh
>    WorkDir=/bt/data/backtester/destination_tables_94633
>    StdErr=/bt/data/backtester/destination_tables_94633/slurm-107423_1.out
>    StdIn=/dev/null
>    StdOut=/bt/data/backtester/destination_tables_94633/slurm-107423_1.out
>    Power=
>
