[slurm-users] ticking time bomb? launching too many jobs in parallel

Tue Aug 27 08:22:38 UTC 2019

Hi Guillaume,

The performance of slurmctld depends strongly on the hardware of the 
server it runs on, so take that into account when considering your 
question.

SchedMD recommends that the slurmctld server have only a few, but very 
fast, CPU cores in order to ensure the best responsiveness.

The file system for /var/spool/slurmctld/ should be mounted on the 
fastest possible disks (SSD or NVMe if possible).

You should also read the Large Cluster Administration Guide at 
https://slurm.schedmd.com/big_sys.html

Furthermore, it may be a good idea to run the MySQL database server on 
a separate machine so that it doesn't slow down slurmctld.

Best regards,
Ole

On 8/27/19 9:45 AM, Guillaume Perrault Archambault wrote:
> Hi Paul,
> 
> Thanks a lot for your suggestion.
> 
> The cluster I'm using has thousands of users, so I'm doubtful the admins 
> will change this setting just for me. But I'll mention it to the support 
> team I'm working with.
> 
> I was hoping more for something that can be done on the user end.
> 
> Is there some way for the user to measure whether the scheduler is in 
> RPC saturation? And then if it is, I could make sure my script doesn't 
> launch too many jobs in parallel.
> 
> Sorry if my question is too vague. I don't understand the backend of the 
> Slurm scheduler very well, so my questions use the limited terminology 
> of a user.
> 
> My concern is just to make sure that my scripts don't send out more 
> commands (simultaneously) than the scheduler can handle.
> 
> For example, as an extreme scenario, suppose a user forks off 1000 
> sbatch commands in parallel. Is that more than the scheduler can handle? 
> As a user, how can I know whether it is?
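
On the user-side part of this question: one way to keep a script from 
firing too many submissions at once is to cap how many sbatch calls are 
in flight at any moment. Below is a minimal Python sketch of that idea. 
It does not measure RPC saturation; it only bounds the load your own 
script generates. The function name submit_throttled, its parameters, and 
the default of 4 parallel submissions are assumptions made for 
illustration. A sensible limit depends on the cluster and is best agreed 
with the admins.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def submit_throttled(commands, max_parallel=4, delay=0.0, run=subprocess.run):
    """Run each command with at most `max_parallel` in flight at once.

    commands:     list of argument lists, e.g. [["sbatch", "job1.sh"], ...]
    max_parallel: upper bound on simultaneous submissions (assumed value,
                  not a Slurm recommendation)
    delay:        optional pause in seconds after each submission, to
                  space out requests from each worker
    run:          callable that executes one command; defaults to
                  subprocess.run and can be swapped out for testing
    Returns the exit codes in the same order as `commands`.
    """
    def worker(cmd):
        result = run(cmd, capture_output=True, text=True)
        if delay:
            time.sleep(delay)
        return result.returncode

    # The pool size is what enforces the cap: no more than max_parallel
    # worker threads exist, so no more than that many commands run at once.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(worker, commands))
```

For the 1000-job scenario, a call such as 
submit_throttled([["sbatch", f"job{i}.sh"] for i in range(1000)], 
max_parallel=4, delay=0.1) would submit everything while never having 
more than four sbatch processes talking to the controller at once. The 
script names here are hypothetical.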