[slurm-users] Slurm overhead

Mahmood Naderan mahmood.nt at gmail.com
Tue Apr 24 10:14:00 MDT 2018


Hi Bill,
In order to shut down the Slurm daemon on the compute node, is it fine
to just kill /usr/sbin/slurmd? Or is there a better and safer way to do that?
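For example, would something like the following be a reasonable way to do
it, assuming slurmd is managed as a systemd service on the node (the node
name here is just an example)?

    scontrol update NodeName=compute-0-0 State=DRAIN Reason="timing test"
    systemctl stop slurmd            # run on compute-0-0 itself

    # ...and afterwards, to bring the node back into service:
    systemctl start slurmd
    scontrol update NodeName=compute-0-0 State=RESUME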

Regards,
Mahmood




On Sun, Apr 22, 2018 at 5:44 PM, Bill Barth <bbarth at tacc.utexas.edu> wrote:
> Mahmood,
>
> If you have exclusive control of this system and can afford to have compute-0-0 out of production for a while, you can do a simple test:
>
> Shut Slurm down on compute-0-0
> Login directly to compute-0-0
> Run the timing experiment there
> Compare the results to both of the other experiments you have already run on this node and the head node.
>
> The big deal here is to make sure that Slurm is stopped during one of your experiments, and you didn’t say whether you did that or not. If you did, then maybe you do have something to worry about.
>
> This takes Slurm out of the loop. It’s possible that something else about compute-0-0 will show itself after you do this test, but this way you can eliminate the overhead of the running Slurm processes. One possibility that comes to my mind is that if compute-0-0 is a multi-socket node, then you may have no or incorrect task and memory binding under Slurm (i.e. your processes may be unbound with memory being allocated on one socket but Linux letting them run on the other), which could easily lead to large performance differences. We don’t require or let Slurm do bindings for us but require our users to use numactl or the MPI runtime to handle it for them. Maybe you should look into that after you eliminate direct interference from Slurm.
>
> Best,
> Bill.
>
> --
> Bill Barth, Ph.D., Director, HPC
> bbarth at tacc.utexas.edu        |   Phone: (512) 232-7069
> Office: ROC 1.435            |   Fax:   (512) 475-9445
>
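As a concrete illustration of Bill's suggestions (this assumes compute-0-0
is a two-socket node; ./your_app and the socket number 0 are just
placeholders for the real benchmark and binding), the direct test and an
explicit one-socket binding might look something like:

    # on compute-0-0, with slurmd stopped
    numactl --hardware                               # show sockets / NUMA nodes and their memory
    time ./your_app                                  # baseline timing with Slurm out of the loop
    numactl --cpunodebind=0 --membind=0 ./your_app   # pin both CPUs and memory to socket 0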


