[slurm-users] FSU & Slurm

Sean Caron scaron at umich.edu
Fri Apr 13 13:59:01 MDT 2018

I respect this is a technical list and that SchedMD is running it so I will
say my bit once and keep it short but I think it's important to get these
ideas out there. These thoughts are mine and do not constitute any official
statement on the part of my employer.

My lab management probably likes using SLURM, but mostly because it doesn't
cost them anything. To me, it's just one of many things I need to wrangle
every day. When it behaves, I like SLURM alright. When it doesn't behave, I
like it less :)

Honestly, I see the value in the support offering. It's a small fraction of
the amount of money that my management has invested in hardware, data
center facilities and so on. The HPC batch queue is pretty central to the
work my employer does on a daily basis. But my management feels the cost of
paid support is excessive and thus I am left to manage SLURM on my own
largely by trial and error. We are not a huge site: a few hundred nodes.
Unlike many HPC sites which are centrally funded and operated services at
their respective institutions with a regular budget, my employer is a 100%
grant-funded lab that does computing in-house. In that kind of funding
environment, it's challenging to find large chunks of money for software.

What leaves a little bit of a sour taste in my mouth is the feeling that
SchedMD goes out of their way to make things challenging for people in my
position, in an effort to drive people towards paid support. In contrast to
most open source projects that I am familiar with, developer presence on
the lists is zero. SchedMD barely even accepts bug reports if you are not a
paid customer. The published documentation for SLURM is a little bit terse
(more a dictionary than an encyclopedia), with very little information
about theory of operation, best practices, or appropriate tuning parameters
for various use cases and cluster scales. Ancillary materials such as
conference slide decks contain information that is increasingly out of
date, and there are few resources replacing them as time goes on. The
ecosystem for community SLURM support outside of paying SchedMD is
basically nil, hampered by the fact that most sites probably do just call
SchedMD when they have issues, so we never get to see hard problems
reported on the mailing list or places like Stack Exchange, not to mention
much in the way of serious troubleshooting and problem resolution. Going it
alone is a hard row to hoe. And it's just a little frustrating when you
try to reach out to the community in a time of need and you get nothing
back but a solicitation to pay if you want help.

So, rant over. And I know it's not SchedMD's fault. To end it on a
constructive note, I'd really just like to see SchedMD maybe entertain some
more flexibility in support offerings. Realistically, my group is not a
needy customer. If we bought support, we would probably use it once or
twice a year just for deployment and initial tuning assistance. Once SLURM
is up and running, it doesn't generally need much care and feeding. We
don't upgrade versions of SLURM too often. It would change my world to see
SchedMD offer single-incident, pay-per-engagement support, or paid
consulting for initial deployment, at a reduced cost compared to a full
year-long contract. These are the types of offerings I could pitch to my management
with reasonable expectation they would accept. We could get the support we
need, and SchedMD can make a little extra money on sites that are unwilling
or unable to go for the full support contract. Win-win!



On Fri, Apr 13, 2018 at 12:27 PM, Patrick Goetz <pgoetz at math.utexas.edu>
wrote:

> On 04/11/2018 02:35 PM, Sean Caron wrote:
>> As a protest to asking questions on this list and getting solicitations
>> for pay-for support, let me give you some advice for free :)
> Now, now.  Paid support is how they keep the project going.  You like
> using Slurm, right?