[slurm-users] Having errors trying to run a packed jobs script

Marius Cetateanu MCetateanu at softkinetic.com
Tue Nov 7 02:12:17 MST 2017


Hi,


I am new to Slurm and I'm having some issues scheduling my tasks correctly.

I have a very small cluster (if it can even be called a cluster) with only
one node for the moment; the node is a dual Xeon with 14 cores/socket,
hyper-threaded, with 256 GB of memory, running CentOS 7.3.


I have a single-threaded process which I would like to run
over a series of input files (around 370). I have found that the packed
jobs scenario fits what I'm trying to achieve: I would like to
run 50 instances of my process at the same time, each on a different input file.
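
For readers unfamiliar with the pattern, a packed-jobs batch script generally looks something like the sketch below. This is not the attached preprocess.sh; `run_packed`, `LAUNCH`, `./my_process` and the input path are hypothetical names used only for illustration. The srun options shown assume a Slurm release of that era, where `--exclusive` on a job step dedicates CPUs to that step, so check the srun man page for the installed version.

```shell
#!/bin/bash
#SBATCH --ntasks=50          # room for 50 single-core tasks at once
#SBATCH --cpus-per-task=1

# Step launcher (assumption): one task per step, CPUs not shared between steps.
# It can be overridden, e.g. LAUNCH=env, to try the loop without Slurm.
LAUNCH=${LAUNCH:-srun --exclusive --ntasks=1 --nodes=1}

# Hypothetical helper: start one job step per input file, then wait for all.
# Usage: run_packed COMMAND FILE...
run_packed() {
    local cmd=$1; shift
    local f
    for f in "$@"; do
        # $LAUNCH is left unquoted on purpose so its options split into words.
        $LAUNCH "$cmd" "$f" &
    done
    wait   # keep the batch script alive until every step has finished
}

# Example call (hypothetical program and paths):
#   run_packed ./my_process /data/inputs/*.dat
```

In this sketch every file gets its own background job step. Under Slurm, steps that do not fit into the allocation are expected to wait until a CPU frees up, and the final `wait` keeps the batch job from exiting while steps are still pending or running.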


As soon as I submit my script I can see 50 instances of my process
started and running, but shortly afterwards I can see only 5 or so of them
running. So I only get full load for the first 50 instances and not
afterwards.


In slurmctld.log I can see messages of this type:

"[2017-11-06T11:56:39.228] job_step_signal step 1489.107 not found"

and in my script output file I can see:

"srun: Job step creation temporarily disabled, retrying"


At this point I'm sifting through documentation and online info trying to figure
out what is going on. I have attached my slurmctld log file, Slurm config file, script and
the output I get from sinfo, stat and the like.


Any pointers on how to attack this problem would be much appreciated.


Thank you



--
Marius Cetateanu | Senior Software Engineer
T +32 2 888 42 60
F +32 2 647 48 55
E mce at softkinetic.com
YT www.youtube.com/softkinetic
Boulevard de la Plaine 11, 1050, Brussels, Belgium
Registration No: RPM/RPR Brussels 0811 784 189

Our e-mail communication disclaimers & liability are available
at: www.softkinetic.com/disclaimer.aspx

-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurm.conf
Type: application/octet-stream
Size: 2229 bytes
Desc: slurm.conf
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20171107/ac948558/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tools.out
Type: application/octet-stream
Size: 4142 bytes
Desc: tools.out
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20171107/ac948558/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: preprocess.sh
Type: application/x-shellscript
Size: 3385 bytes
Desc: preprocess.sh
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20171107/ac948558/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurmctld.log
Type: text/x-log
Size: 3700 bytes
Desc: slurmctld.log
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20171107/ac948558/attachment-0003.bin>

