[slurm-users] Having errors trying to run a packed jobs script

Marius Cetateanu MCetateanu at softkinetic.com
Tue Nov 7 05:44:06 MST 2017


Hi,

I see that parts of my message were scrubbed, so I will try to post the relevant info below
(if that does not comply with the mailing list rules, please let me know and point me in the
right direction for conveying this kind of information).

slurmctld.log

[2017-11-06T11:38:26.623] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
[2017-11-06T11:40:45.063] _slurm_rpc_submit_batch_job JobId=1489 usec=505
[2017-11-06T11:40:45.625] backfill: Started JobId=1489 in main_compute on cn_burebista
[2017-11-06T11:40:45.697] _pick_step_nodes: Configuration for job 1489 is complete
[2017-11-06T11:44:48.289] slurmctld: agent retry_list size is 101
[2017-11-06T11:44:48.289]    retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:51:12.132] slurmctld: agent retry_list size is 101
[2017-11-06T11:51:12.132]    retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:52:12.835] job_step_signal step 1489.56 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.59 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.63 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.52 not found
[2017-11-06T11:52:12.838] job_step_signal step 1489.53 not found
[2017-11-06T11:52:12.842] job_step_signal step 1489.54 not found
[2017-11-06T11:52:12.842] job_step_signal step 1489.60 not found
[2017-11-06T11:52:12.856] job_step_signal step 1489.58 not found
[2017-11-06T11:52:12.856] job_step_signal step 1489.61 not found
[2017-11-06T11:52:12.862] job_step_signal step 1489.55 not found
[2017-11-06T11:52:12.875] job_step_signal step 1489.62 not found
[2017-11-06T11:52:12.884] job_step_signal step 1489.51 not found
[2017-11-06T11:52:13.007] job_step_signal step 1489.57 not found
[2017-11-06T11:52:13.191] job_step_signal step 1489.50 not found
[2017-11-06T11:52:58.625] job_step_signal step 1489.61 not found
[2017-11-06T11:52:58.625] job_step_signal step 1489.57 not found
[2017-11-06T11:52:58.625] job_step_signal step 1489.55 not found

script.sh
#!/bin/bash
# job parameters
#SBATCH --job-name=preprocess_movies
#SBATCH --output=preprocess_movies.log
# needed resources
#SBATCH --ntasks=50
#SBATCH --mem-per-cpu=2GB

shopt -s globstar
FILES=$(find ./full_training/raw -name "*.skv")
OUTPUT=./output_files

# operations
echo "[preprocess] Job started at $(date)"

# job steps
for file in ${FILES}
do
    echo "[preprocess] Processing file: ${file##*/}"
    echo "[preprocess] Output to: $OUTPUT/${file##*/}"
    srun -n1 --exclusive ./preprocessing-r 1 0 1 ${file} $OUTPUT/${file##*/} &
done
wait
echo "[preprocess] Job ended at $(date)"
 
sinfo, squeue output
% sinfo -Nle                                                                                                                                                                                                                          
Mon Nov  6 12:31:51 2017
NODELIST      NODES     PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
cn_burebista      1 main_compute*       mixed   56   2:14:2 256000        0      1   (null) none                

% squeue                                                                                                                                                                                                                    
            JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1489 main_comp preproce mcetatea  R      51:11      1 cn_burebista


slurm.conf
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=draco
ControlMachine=zalmoxis
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
UsePAM=1
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
# OpenHPC default configuration
PropagateResourceLimitsExcept=MEMLOCK
SlurmdLogFile=/var/log/slurm.log
SlurmctldLogFile=/var/log/slurmctld.log
Epilog=/etc/slurm/slurm.epilog.clean
# ?? How to set up more node names ??
NodeName=cn_burebista Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
PartitionName=normal Nodes=cn_burebista Default=YES MaxTime=24:00:00 State=UP
ReturnToService=1
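
Regarding the "?? How to set up more node names ??" comment above: for illustration only
(the cn[01-03] names and their specs are made up), additional nodes can be declared either
with one NodeName line per node or with a hostlist range expression, and then listed in the
partition, e.g.:

NodeName=cn_burebista Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
NodeName=cn[01-03] Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
PartitionName=normal Nodes=cn_burebista,cn[01-03] Default=YES MaxTime=24:00:00 State=UP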



----------------------------------------------------------------------

Message: 1
Date: Tue, 7 Nov 2017 09:12:17 +0000
From: Marius Cetateanu <MCetateanu at softkinetic.com>
To: "slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
Subject: [slurm-users] Having errors trying to run a packed jobs
        script
Message-ID:
        <AM0PR0502MB36829B5109E51C3C823BADBBD9510 at AM0PR0502MB3682.eurprd05.prod.outlook.com>
        
Content-Type: text/plain; charset="iso-8859-1"


Hi,


I am new to Slurm and I'm having some trouble scheduling my tasks correctly.
I have a very small cluster (if it can even be called a cluster) with only
one node for the moment; the node is a dual Xeon with 14 cores per socket,
hyper-threaded, and 256GB of memory, running CentOS 7.3.

I have a single-threaded process that I would like to run over a series of
input files (around 370). I have found that the packed jobs scenario fits
what I'm trying to achieve, so I would like to run 50 instances of my
process at the same time over different input files.

The moment I submit my script I can see that 50 instances of my process are
started and running, but just a bit afterwards only 5 or so of them are
still running - so I only get full load for the first 50 instances and not
afterwards.


In slurmctld.log I can see messages of this type:

"[2017-11-06T11:56:39.228] job_step_signal step 1489.107 not found"

and in my script output file I can see:

"srun: Job step creation temporarily disabled, retrying"


At this point I'm sifting through the documentation and online information trying to
figure out what is going on. I have attached my slurmctld log file, slurm config file,
script, and the output I get from sinfo, squeue and the like.


Any pointers on how to attack this problem would be much appreciated.


Thank you



--
Marius Cetateanu | Senior Software Engineer
T +32 2 888 42 60
F +32 2 647 48 55
E mce at softkinetic.com
YT www.youtube.com/softkinetic
Boulevard de la Plaine 11, 1050, Brussels, Belgium
Registration No: RPM/RPR Brussels 0811 784 189

Our e-mail communication disclaimers & liability are available at:
www.softkinetic.com/disclaimer.aspx

[Non-text attachments scrubbed by the mailing list: slurm.conf, tools.out, preprocess.sh, slurmctld.log]
