[slurm-users] Having errors trying to run a packed jobs script

Wed Nov 8 02:57:29 MST 2017

Date: Tue, 7 Nov 2017 11:19:32 +0100
From: Benjamin Redling <benjamin.rampe at uni-jena.de>
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] Having errors trying to run a packed jobs
	script
Message-ID: <6979a04b-c9c0-badd-b57b-34d4d0ec8157 at uni-jena.de>
Content-Type: text/plain; charset=UTF-8

Hi Benjamin,

Thank you for the answer.

> Bigger than a small cluster a decade ago... ;) Nice workhorse I guess.
It is sufficient for the moment :)

[...]
>> The moment I schedule my script I can see that there are 50 instances
>> of my process started and running, but just a bit afterwards I can see
>> only 5 or so of them running - so I only get full load for the first
>> 50 instances and not afterwards.
> "a bit afterwards" is too vague to reason about anything aside from sched_interval just being at its default of 60s:

I know it was not the best choice of words. Before scheduling my script I start "top" on the
compute node, so I can see that the first batch of job steps is scheduled simultaneously,
but after that only 4 to 6 processes are running, which results in very poor utilization of the CPU
resources. I get the following output in the logs:

[2017-11-06T11:40:45.625] backfill: Started JobId=1489 in main_compute on cn_burebista
[2017-11-06T11:40:45.697] _pick_step_nodes: Configuration for job 1489 is complete
[2017-11-06T11:44:48.289] slurmctld: agent retry_list size is 101
[2017-11-06T11:44:48.289]    retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:51:12.132] slurmctld: agent retry_list size is 101
[2017-11-06T11:51:12.132]    retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:52:12.835] job_step_signal step 1489.56 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.59 not found

...
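
To see how often these messages recur over a longer run, the excerpt can be summarized with a small script. This is only a sketch: the regular expressions are written against the lines shown above, the function name summarize is made up for illustration, and the sample text is copied from the excerpt rather than read from a real slurmctld log file.

```python
import re

# Patterns written against the log excerpt above (an assumption, not a
# description of every slurmctld log format).
LOG_LINE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$")
STEP_NOT_FOUND = re.compile(r"job_step_signal step (\d+)\.(\d+) not found")
RETRY_LIST = re.compile(r"agent retry_list size is (\d+)")

SAMPLE = """\
[2017-11-06T11:40:45.625] backfill: Started JobId=1489 in main_compute on cn_burebista
[2017-11-06T11:44:48.289] slurmctld: agent retry_list size is 101
[2017-11-06T11:44:48.289]    retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:51:12.132] slurmctld: agent retry_list size is 101
[2017-11-06T11:52:12.835] job_step_signal step 1489.56 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.59 not found
"""


def summarize(lines):
    """Collect (job, step) pairs reported as not found and retry_list sizes."""
    missing_steps = []
    retry_sizes = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines without a leading [timestamp]
        msg = match.group("msg")
        step = STEP_NOT_FOUND.search(msg)
        if step:
            missing_steps.append((int(step.group(1)), int(step.group(2))))
        retry = RETRY_LIST.search(msg)
        if retry:
            retry_sizes.append((match.group("ts"), int(retry.group(1))))
    return missing_steps, retry_sizes


missing, retries = summarize(SAMPLE.splitlines())
print("steps not found:", missing)
print("retry_list sizes:", retries)
```

Feeding it the full log instead of SAMPLE would show whether the retry_list size stays at 101 for the whole run or only around the times quoted here.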

> What's the (average) runtime of the jobs?
> If your jobs are not running longer than the sched_interval default you might want to *decrease* that.

The average runtime of a job is 4 minutes. I am preprocessing small "video" files.
I also tried a smaller batch (fewer job steps) by reducing the task count to "--ntasks=25". This seems to improve
the total time it takes to process all the files a bit, but not drastically.
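
For a rough sense of what the low utilization costs, the total time can be estimated from the 4-minute step runtime and the number of steps that actually run at once. This is a back-of-the-envelope sketch: the file count N_FILES is a hypothetical number chosen for illustration, the function name is made up, and the model assumes steps run in full waves with no scheduling delay between them.

```python
import math


def estimated_wall_minutes(n_files, concurrent_steps, minutes_per_step=4.0):
    """Rough wall time if steps run in waves of `concurrent_steps`."""
    if concurrent_steps < 1:
        raise ValueError("concurrent_steps must be at least 1")
    waves = math.ceil(n_files / concurrent_steps)
    return waves * minutes_per_step


N_FILES = 500  # hypothetical; the real number of files is not stated here

# 50 and 25 are the --ntasks values tried; 5 is roughly what "top" shows
# after the first batch has finished.
for concurrent in (50, 25, 5):
    minutes = estimated_wall_minutes(N_FILES, concurrent)
    print(f"{concurrent:>2} concurrent steps: ~{minutes:.0f} minutes")
```

Under these assumptions, dropping from 50 to about 5 concurrent steps makes the run roughly ten times longer, which is why the idle CPUs matter more than the choice between 50 and 25 tasks.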

Best Regards
Marius