[slurm-users] Having errors trying to run a packed jobs script
Marius Cetateanu
MCetateanu at softkinetic.com
Tue Nov 7 02:12:17 MST 2017
Hi,
I am new to Slurm and I'm having trouble scheduling my tasks
correctly.
I have a very small cluster (if it can even be called a cluster) with only
one node for the moment; the node is a dual Xeon with 14 cores/socket,
hyper-threaded, with 256 GB of memory, running CentOS 7.3.
I have a single-threaded process which I would like to run
over a series of input files (around 370). I have found that the packed
jobs scenario fits what I'm trying to achieve, so I would like to
run 50 instances of my process at the same time, each over a different input file.
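For reference, the packed jobs pattern described above is commonly written along the following lines. This is only a minimal sketch under stated assumptions, not the attached preprocess.sh: the names `./my_process`, `inputs`, `LAUNCH`, `PROCESS`, `INPUT_DIR` and `run_packed` are hypothetical placeholders, and the `#SBATCH` values simply mirror the 50 concurrent single-threaded instances mentioned here. Newer Slurm releases may need different step options (for example `--exact`), so check the srun documentation for the installed version.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=50          # assumption: one task slot per concurrent instance
#SBATCH --cpus-per-task=1    # the process is single threaded

# Hypothetical placeholders; adjust to the real binary and input directory.
LAUNCH=${LAUNCH:-"srun --exclusive -N1 -n1"}
PROCESS=${PROCESS:-./my_process}
INPUT_DIR=${INPUT_DIR:-inputs}

run_packed() {
    local f
    for f in "$INPUT_DIR"/*; do
        # One job step per input file, started in the background.
        # LAUNCH is deliberately unquoted so it splits into command + options.
        $LAUNCH "$PROCESS" "$f" &
    done
    # Keep the batch script alive until every step has finished.
    wait
}

# Only launch when running inside a Slurm allocation.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_packed
fi
```

With this layout each `srun` asks for a single task slot inside the allocation, and steps that cannot start immediately are expected to wait for a free slot rather than fail. The final `wait` matters: without it the batch script would exit while steps are still running.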
Right after I submit my script I can see 50 instances of my process
started and running, but shortly afterwards I can see only 5 or so of them
still running. So I only get full load for the first 50 instances and not
afterwards.
In slurmctld.log I can see messages of this type:
"[2017-11-06T11:56:39.228] job_step_signal step 1489.107 not found"
and in my script output file I can see:
"srun: Job step creation temporarily disabled, retrying"
At this point I'm sifting through the documentation and online information trying
to figure out what is going on. I have attached my slurmctld log file, Slurm config
file, script, and the output I get from sinfo, stat and the like.
Any pointers on how to attack this problem would be much appreciated.
Thank you
<pre>
--
Marius Cetateanu | Senior Software Engineer
T +32 2 888 42 60
F +32 2 647 48 55
E mce at softkinetic.com
YT www.youtube.com/softkinetic
Boulevard de la Plaine 11, 1050, Brussels, Belgium
Registration No: RPM/RPR Brussels 0811 784 189
Our e-mail communication disclaimers & liability are available
at: www.softkinetic.com/disclaimer.aspx
</pre>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurm.conf
Type: application/octet-stream
Size: 2229 bytes
Desc: slurm.conf
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20171107/ac948558/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tools.out
Type: application/octet-stream
Size: 4142 bytes
Desc: tools.out
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20171107/ac948558/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: preprocess.sh
Type: application/x-shellscript
Size: 3385 bytes
Desc: preprocess.sh
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20171107/ac948558/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurmctld.log
Type: text/x-log
Size: 3700 bytes
Desc: slurmctld.log
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20171107/ac948558/attachment-0003.bin>