[slurm-users] Slurm overhead

Bill Barth bbarth at tacc.utexas.edu
Sun Apr 22 07:14:57 MDT 2018


If you have exclusive control of this system and can afford to take compute-0-0 out of production for a while, you can run a simple test:

Shut Slurm down on compute-0-0.
Log in directly to compute-0-0.
Run the timing experiment there.
Compare the results to the two experiments you have already run, on this node and on the head node.
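In shell terms, the procedure above might look roughly like this. The node name, drain reason, and service-management commands are placeholders for whatever your site uses; the timing part is just plain wall-clock arithmetic, sketched here with a sleep standing in for the real program:

```shell
# On the head node (hypothetical commands, adapt to your site):
#   scontrol update NodeName=compute-0-0 State=DRAIN Reason="timing test"
#   ssh compute-0-0 'sudo systemctl stop slurmd'
# Then, on compute-0-0 itself, time the run directly:
start=$(date +%s)
sleep 1                      # stand-in for the real program under test
end=$(date +%s)
elapsed=$((end - start))
echo "wall clock: ${elapsed} s"
```

Comparing this number against the Slurm-reported elapsed time for the same run is the whole experiment.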

The key here is to make sure that Slurm is stopped during one of your experiments, and you didn’t say whether you did that or not. If you did, then maybe you have something to worry about.

This takes Slurm out of the loop. It’s possible that something else about compute-0-0 will show itself after you do this test, but at least you will have eliminated the overhead of the running Slurm processes. One possibility that comes to mind: if compute-0-0 is a multi-socket node, you may have missing or incorrect task and memory binding under Slurm (i.e. your processes may be unbound, with memory allocated on one socket while Linux lets them run on the other), which can easily produce large performance differences. We don’t let Slurm do bindings for us; we require our users to use numactl or the MPI runtime to handle it. That is worth looking into after you eliminate direct interference from Slurm.
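A quick way to check the binding theory is to ask the kernel which CPUs a process is actually allowed to run on; an unbound task lists every core on the node. This is a Linux-only sketch, and the numactl line is a hypothetical example (program name and socket numbers are assumptions, not from this thread):

```shell
# Show which CPUs the current process may run on; under Slurm, run this
# inside an srun step to see what (if anything) Slurm bound the task to.
grep Cpus_allowed_list /proc/self/status
# Hypothetical explicit binding, keeping CPU and memory on socket 0:
#   numactl --cpunodebind=0 --membind=0 ./my_program
```

If the allowed-CPU list spans both sockets while memory sits on one, cross-socket traffic alone can account for a large slowdown.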


Bill Barth, Ph.D., Director, HPC
bbarth at tacc.utexas.edu        |   Phone: (512) 232-7069
Office: ROC 1.435            |   Fax:   (512) 475-9445

On 4/22/18, 1:06 AM, "slurm-users on behalf of Mahmood Naderan" <slurm-users-bounces at lists.schedmd.com on behalf of mahmood.nt at gmail.com> wrote:

    I ran some other tests and got nearly the same results. The 4
    minutes in my previous post means about 50% overhead; 24000
    minutes of direct run time becomes about 35000 minutes via Slurm.
    I will post details later. The methodology I used is:
    1- Submit a job to a specific node (compute-0-0) via Slurm on the
    frontend and get the elapsed run time (or add a time command to the
    job script).
    2- ssh to the specific node (compute-0-0) and run the program directly
    with the time command.
    So, the hardware is the same. I have to say that the frontend
    differs slightly from compute-0-0, but that is not important because,
    as I said before, the program is installed in /usr and not on the
    shared file system.
    I think the overhead of the Slurm process that queries the node to
    collect runtime information is not negligible. For example, squeue
    updates the run time every second. How can I tell Slurm not to query
    so often, for example, to update the node information every 10
    seconds? Though I am not sure how much effect that has.
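If the accounting sampler is the suspected source of overhead, one knob worth checking is JobAcctGatherFrequency in slurm.conf, which controls how often slurmd samples each task for accounting. This is an assumption on the editor's part, not something confirmed in this thread, and the values below are illustrative:

```
# slurm.conf fragment: sample job accounting every 30 seconds
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=task=30
```

Note that squeue's displayed run time comes from the controller's own clock, not from polling the compute node, so this setting mainly affects the per-task sampling on the node itself.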
    On Fri, Apr 20, 2018 at 10:39 AM, Loris Bennett
    <loris.bennett at fu-berlin.de> wrote:
    > Hi Mahmood,
    > Rather than the overhead being 50%, maybe it is just 4 minutes.  If
    > another job runs for a week, that might not be a problem.  In addition,
    > you just have one data point, so it is rather difficult to draw any
    > conclusion.
    > However, I think that it is unlikely that Slurm is responsible for
    > this difference.  What can happen is that, if a node is powered down
    > before the job starts, then the clock starts ticking as soon as the job
    > is assigned to the node.  This means that the elapsed time also includes
    > the time for the node to be provisioned.  If this is not relevant in
    > your case, then you are probably just not comparing like with like,
    > e.g. is the hardware underlying /tmp identical in both cases?
    > Cheers,
    > Loris
    > --
    > Dr. Loris Bennett (Mr.)
    > ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de
