[slurm-users] SLURM heterogeneous jobs, a little help needed plz

Frava fravadona at gmail.com
Fri Mar 22 11:07:06 UTC 2019


Hi all,

I think it's not that easy to keep SLURM up to date on a cluster of more
than 3k nodes with a lot of users. To give an idea, that cluster is only a
little more than 2 years old, yet my submission today got JOBID 12711473 and
the queue holds 9769 jobs (squeue | wc -l). In two years there were only two
maintenances that impacted the users, and each one was announced a few
months in advance. They told me that they do plan to update SLURM, but not
until late 2019 because they have other things to do before that. Also, I'm
the only one asking for heterogeneous jobs...

Rafael.

On Thu, Mar 21, 2019 at 22:19, Prentice Bisbal <pbisbal at pppl.gov> wrote:

> On 3/21/19 4:40 PM, Reuti wrote:
>
> >> On 21.03.2019 at 16:26, Prentice Bisbal <pbisbal at pppl.gov> wrote:
> >>
> >>
> >> On 3/20/19 1:58 PM, Christopher Samuel wrote:
> >>> On 3/20/19 4:20 AM, Frava wrote:
> >>>
> >>>> Hi Chris, thank you for the reply.
> >>>> The team that manages that cluster is not very fond of upgrading
> SLURM, which I understand.
> >> As a system admin who manages clusters myself, I don't understand this.
> Our job is to provide and maintain resources for our users. Part of that
> maintenance is to provide updates for security, performance, and
> functionality (new features) reasons. HPC has always been a leading-edge
> kind of field, so I feel this is even more important for HPC admins.
> >>
> >> Yes, there can be issues caused by updates, but those can be handled with
> proper planning: Have a plan to do the actual upgrade, have a plan to test
> for issues, and have a plan to revert to an earlier version if issues are
> discovered. This is work, but it's really not all that much work, and this
> is exactly the work we are being paid to do as cluster admins.
> > Besides the work on the admins' side, the users are also involved:
> exchanging libraries also means rerunning the test suites of their
> applications.
> >
> > -- Reuti
>
> That implies the users actually wrote test suites. ;-)