[slurm-users] Having errors trying to run a packed jobs script
Marius Cetateanu
MCetateanu at softkinetic.com
Tue Nov 7 05:44:06 MST 2017
Hi,
I see that parts of my message were scrubbed, so I will try to post the relevant
info below. (If that does not abide by the mailing list rules, please let me know
and point me in the right direction for conveying this kind of information.)
slurmctld.log
[2017-11-06T11:38:26.623] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
[2017-11-06T11:40:45.063] _slurm_rpc_submit_batch_job JobId=1489 usec=505
[2017-11-06T11:40:45.625] backfill: Started JobId=1489 in main_compute on cn_burebista
[2017-11-06T11:40:45.697] _pick_step_nodes: Configuration for job 1489 is complete
[2017-11-06T11:44:48.289] slurmctld: agent retry_list size is 101
[2017-11-06T11:44:48.289] retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:51:12.132] slurmctld: agent retry_list size is 101
[2017-11-06T11:51:12.132] retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:52:12.835] job_step_signal step 1489.56 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.59 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.63 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.52 not found
[2017-11-06T11:52:12.838] job_step_signal step 1489.53 not found
[2017-11-06T11:52:12.842] job_step_signal step 1489.54 not found
[2017-11-06T11:52:12.842] job_step_signal step 1489.60 not found
[2017-11-06T11:52:12.856] job_step_signal step 1489.58 not found
[2017-11-06T11:52:12.856] job_step_signal step 1489.61 not found
[2017-11-06T11:52:12.862] job_step_signal step 1489.55 not found
[2017-11-06T11:52:12.875] job_step_signal step 1489.62 not found
[2017-11-06T11:52:12.884] job_step_signal step 1489.51 not found
[2017-11-06T11:52:13.007] job_step_signal step 1489.57 not found
[2017-11-06T11:52:13.191] job_step_signal step 1489.50 not found
[2017-11-06T11:52:58.625] job_step_signal step 1489.61 not found
[2017-11-06T11:52:58.625] job_step_signal step 1489.57 not found
[2017-11-06T11:52:58.625] job_step_signal step 1489.55 not found
script.sh
#!/bin/bash
# job parameters
#SBATCH --job-name=preprocess_movies
#SBATCH --output=preprocess_movies.log
# needed resources: 50 single-CPU tasks at 2 GB each (100 GB of the node's 256 GB)
#SBATCH --ntasks=50
#SBATCH --mem-per-cpu=2G
shopt -s globstar  # bash builtin (hence the bash shebang); unused here, since find does the file matching
FILES=$(find ./full_training/raw -name "*.skv")
OUTPUT=./output_files
# operations
echo "[preprocess] Job started at $(date)"
# job steps
for file in ${FILES}  # intentionally unquoted so word splitting yields one path per file
do
    echo "[preprocess] Processing file: ${file##*/}"
    echo "[preprocess] Output to: ${OUTPUT}/${file##*/}"
    srun -n1 --exclusive ./preprocessing-r 1 0 1 "${file}" "${OUTPUT}/${file##*/}" &
done
wait
echo "[preprocess] Job ended at $(date)"
sinfo, squeue output
% sinfo -Nle
Mon Nov 6 12:31:51 2017
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
cn_burebista 1 main_compute* mixed 56 2:14:2 256000 0 1 (null) none
% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1489 main_comp preproce mcetatea R 51:11 1 cn_burebista
slurm.conf
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=draco
ControlMachine=zalmoxis
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
UsePAM=1
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
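# CR_CPU_Memory: both CPUs and memory are consumable resources, so every
# srun -n1 step consumes one CPU plus its --mem-per-cpu share of the node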
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
# OpenHPC default configuration
PropagateResourceLimitsExcept=MEMLOCK
SlurmdLogFile=/var/log/slurm.log
SlurmctldLogFile=/var/log/slurmctld.log
Epilog=/etc/slurm/slurm.epilog.clean
# ?? How to set up more node names ??
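# Example with hypothetical hostnames cn[01-04] - more nodes can be defined
# with additional NodeName lines, or compactly with a bracketed hostlist range:
#   NodeName=cn[01-04] Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
# and then referenced in a partition as Nodes=cn_burebista,cn[01-04]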
NodeName=cn_burebista Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
PartitionName=normal Nodes=cn_burebista Default=YES MaxTime=24:00:00 State=UP
ReturnToService=1
----------------------------------------------------------------------
Message: 1
Date: Tue, 7 Nov 2017 09:12:17 +0000
From: Marius Cetateanu <MCetateanu at softkinetic.com>
To: "slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
Subject: [slurm-users] Having errors trying to run a packed jobs script
Hi,
I am new to Slurm and I'm having some issues scheduling my tasks correctly.
I have a very small cluster (if it can even be called a cluster) with only
one node for the moment; the node is a dual-Xeon machine with 14 cores per
socket, hyper-threading enabled and 256GB of memory, running CentOS 7.3.
I have a single-threaded process which I would like to run over a series of
input files (around 370). I have found that the packed-jobs scenario fits
what I'm trying to achieve, so I would like to run 50 instances of my process
at the same time over different input files.
The moment I submit my script I can see 50 instances of my process started
and running, but just a bit afterwards only 5 or so of them are still
running - so I only get full load for the first 50 instances and not
afterwards.
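In essence, the pattern (condensed from the full script above; $OUTPUT holds
the destination directory) is:
#SBATCH --ntasks=50          # allocation for 50 concurrent single-CPU steps
for file in $(find ./full_training/raw -name "*.skv"); do
    # each step takes one task; --exclusive keeps steps from sharing CPUs
    srun -n1 --exclusive ./preprocessing-r 1 0 1 "${file}" "${OUTPUT}/${file##*/}" &
done
wait                         # return only when all steps have finished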
In the slurmctld.log I can see this type of message:
"[2017-11-06T11:56:39.228] job_step_signal step 1489.107 not found"
and in my script output file I can see:
"srun: Job step creation temporarily disabled, retrying"
At this point I'm sifting through documentation and online info trying to
figure out what is going on. I have attached my slurmctld log file, slurm
config file, script, and the output I get from sinfo, stat and the like.
Any pointers on how to attack this problem would be much appreciated.
Thank you
--
Marius Cetateanu | Senior Software Engineer
T +32 2 888 42 60
F +32 2 647 48 55
E mce at softkinetic.com
YT www.youtube.com/softkinetic
Boulevard de la Plaine 11, 1050, Brussels, Belgium
Registration No: RPM/RPR Brussels 0811 784 189
Our e-mail communication disclaimers & liability are available at: www.softkinetic.com/disclaimer.aspx
-------------- next part --------------
[Attachments scrubbed by the list archive: an HTML copy of the message,
slurm.conf, tools.out, preprocess.sh and slurmctld.log - their contents are
reposted in the follow-up above.]
End of slurm-users Digest, Vol 1, Issue 5
*****************************************