[slurm-users] Detecting non-MPI jobs running on multiple nodes

Thu Sep 29 13:21:18 UTC 2022

Hi Davide,

That is an interesting idea.  We already do some averaging, but over the
whole of the past month.  For each user we use the output of seff to
generate two scatterplots: CPU-efficiency vs. CPU-hours and
memory-efficiency vs. GB-hours.  See

  https://www.fu-berlin.de/en/sites/high-performance-computing/Dokumentation/Statistik
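
For reference, the two quantities on the CPU scatterplot can be sketched
as follows.  This is only an illustration, assuming the usual definition
of CPU efficiency as TotalCPU divided by (Elapsed x AllocCPUS) and the
time format [DD-][HH:]MM:SS[.mmm] that sacct prints for those fields.
The function names are made up for this example and are not part of
seff or Slurm.

```python
import re

# Slurm-style duration: optional "DD-", optional "HH:", then "MM:SS[.mmm]".
_DURATION = re.compile(
    r"^(?:(?P<days>\d+)-)?(?:(?P<hours>\d+):)?"
    r"(?P<minutes>\d+):(?P<seconds>\d+(?:\.\d+)?)$"
)


def slurm_duration_to_seconds(text):
    """Convert a duration such as '1-02:03:04' or '03:04.500' to seconds."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days = int(match.group("days") or 0)
    hours = int(match.group("hours") or 0)
    minutes = int(match.group("minutes"))
    seconds = float(match.group("seconds"))
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_hours(elapsed, alloc_cpus):
    """Core-walltime of a job in CPU-hours (hypothetical helper)."""
    return slurm_duration_to_seconds(elapsed) * alloc_cpus / 3600.0


def cpu_efficiency(total_cpu, elapsed, alloc_cpus):
    """Fraction of the allocated core-walltime that was actually used."""
    core_walltime = slurm_duration_to_seconds(elapsed) * alloc_cpus
    if core_walltime <= 0:
        return 0.0
    return slurm_duration_to_seconds(total_cpu) / core_walltime


# A job with 40 allocated cores that ran for 10 hours but only
# accumulated 100 hours of CPU time is 25% efficient.
print(cpu_efficiency("4-04:00:00", "10:00:00", 40))
print(cpu_hours("10:00:00", 40))
```

An efficiency close to 1/(number of nodes) on a multi-node job is a hint,
though not proof, that only one node was doing any work.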

However, I am mainly interested in being able to cancel some of the inefficient
jobs before they have run for too long.
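
One possible heuristic for that, as a rough sketch only: sample the
number of busy cores per node of a running job at regular intervals and
flag the job once several consecutive samples show load on at most one
of its nodes.  Requiring consecutive samples is meant to avoid flagging
an MPI job that is merely in a serial phase.  How the samples are
collected is left open here, and the names, the threshold and the
number of samples are all assumptions chosen for illustration.

```python
def only_one_node_busy(busy_cores_by_node, idle_threshold=0.5):
    """True if the job spans several nodes but at most one shows load.

    busy_cores_by_node maps a node name to the number of busy cores
    measured on that node for this job (hypothetical input).
    """
    if len(busy_cores_by_node) < 2:
        return False  # single-node jobs are not the target here
    active = [
        node
        for node, busy in busy_cores_by_node.items()
        if busy > idle_threshold
    ]
    return len(active) <= 1


def should_flag(samples, min_consecutive=3):
    """True if the most recent min_consecutive samples all look single-node.

    samples is a list of per-node measurements, oldest first.
    """
    if len(samples) < min_consecutive:
        return False
    recent = samples[-min_consecutive:]
    return all(only_one_node_busy(sample) for sample in recent)


# Example: a job on two nodes where node002 never does any work.
history = [
    {"node001": 7.9, "node002": 0.0},
    {"node001": 8.0, "node002": 0.1},
    {"node001": 7.8, "node002": 0.0},
]
print(should_flag(history))
```

A flagged job would still need a human look before cancelling, since a
long serial phase in a genuine MPI job would trigger the same pattern.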

Cheers,

Loris

Davide DelVento <davide.quantum at gmail.com> writes:

> At my previous job there were cron jobs running everywhere measuring
> possibly idle cores which were eventually averaged out for the
> duration of the job, and reported (the day after) via email to the
> user support team.
> I believe they stopped doing so when compute became (relatively) cheap
> at the expense of memory and I/O becoming expensive.
>
> I know, it does not help you much, but perhaps something to think about
>
> On Thu, Sep 29, 2022 at 1:29 AM Loris Bennett
> <loris.bennett at fu-berlin.de> wrote:
>>
>> Hi,
>>
>> Has anyone already come up with a good way to identify non-MPI jobs which
>> request multiple cores but don't restrict themselves to a single node,
>> leaving cores idle on all but the first node?
>>
>> I can see that this is potentially not easy, since an MPI job might
>> still have phases where only one core is actually being used.
>>
>> Cheers,
>>
>> Loris
>>
>> --
>> Dr. Loris Bennett (Herr/Mr)
>> ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de
>>
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de