[slurm-users] Heterogeneous HPC

Jennings, Michael E mej at lanl.gov
Tue Sep 24 17:20:22 UTC 2019


I have not personally used Sarus, so I can't speak intelligently on its pros/cons, security profile, etc.  We (LANL) had already started down the path of adding OCI compliance to Charliecloud (mainly for compatibility reasons, not because we had a specific user request) when I first learned about Sarus (at last year's HPCXXL & HPC/AI Advisory Council's Swiss Conference in Lugano).

What I can say is that CSCS has a strong track record in the container space, having worked extensively on Shifter with the folks at NERSC even before there was documentation on how exactly to do that.  So they know their stuff.  However, my understanding is that they still use a setuid-root runtime rather than relying on user namespaces, so even though I think their code is going to be more reliable & secure than some alternatives, some of the same caveats/warnings still apply.

Ultimately my advice is to use user namespaces, whatever your runtime choice might be, whether you choose Charliecloud or not.  I'd much prefer each admin/site choose their container solution(s) based on solid, factual information -- regardless of what that choice is -- than simply pick something because they heard a marketing talk on one or because it was the right choice for some other site.

So if Sarus sounds like the right solution for your use cases, I encourage you to give it due consideration.  Just keep in mind the risks, and if you feel comfortable doing so (assuming they don't already support it), maybe contribute a patch for user namespace support! :-)
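
For anyone who wants to check this on their own hosts, here is a minimal sketch (Python, standard library only) of the two questions raised above: is a given runtime binary setuid-root, and does the kernel appear to allow user namespaces at all? The helper names are hypothetical and purely illustrative. Reading /proc/sys/user/max_user_namespaces assumes a reasonably recent Linux kernel; some distributions gate unprivileged user namespaces behind additional, distro-specific sysctls that this sketch does not examine.

```python
import os
import stat
import sys


def is_setuid_root(st):
    """Return True if a stat result describes a root-owned file with the setuid bit set."""
    return st.st_uid == 0 and bool(st.st_mode & stat.S_ISUID)


def max_user_namespaces(path="/proc/sys/user/max_user_namespaces"):
    """Return the per-user namespace limit as an int, or None if it cannot be read.

    A value of 0 means user namespaces cannot be created. None means the file
    is missing or unreadable (older kernel, non-Linux host, restricted /proc).
    """
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # Report the user namespace limit first, then check each path given on the command line.
    limit = max_user_namespaces()
    if limit is None:
        print("user namespaces: limit could not be read")
    elif limit == 0:
        print("user namespaces: disabled (limit is 0)")
    else:
        print(f"user namespaces: limit is {limit}")

    for binary in sys.argv[1:]:
        try:
            verdict = "setuid-root" if is_setuid_root(os.stat(binary)) else "not setuid-root"
        except OSError as exc:
            verdict = f"could not stat ({exc.strerror})"
        print(f"{binary}: {verdict}")
```

Saved under any name you like and run with the path to your runtime's launcher binary as an argument, it prints one line per path. Treat the output as a starting point for a risk assessment, not a verdict: a nonzero limit does not by itself prove that unprivileged users can create user namespaces on a given distribution.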

HTH,
Michael

--
Michael E. Jennings <mej at lanl.gov>
HPC Systems Team, Los Alamos National Laboratory
Bldg. 03-2327, Rm. 2341     W: +1 (505) 606-0605


________________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Stijn De Weirdt <stijn.deweirdt at ugent.be>
Sent: Friday, September 20, 2019 2:34:33 AM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] Heterogeneous HPC

hi michael,

very interesting feedback!
have you ever tried/looked at https://github.com/eth-cscs/sarus?

stijn

On 9/20/19 9:11 AM, Mahmood Naderan wrote:
> I appreciate the replies.
> I will try to test Charliecloud to see what is what...
>
>
> On Fri, Sep 20, 2019, 10:37 Fulcomer, Samuel <samuel_fulcomer at brown.edu>
> wrote:
>
>>
>>
>> Thanks! and I'll watch the video...
>>
>> Privileged containers!.... never!....
>>
>> On Thu, Sep 19, 2019 at 9:06 PM Michael Jennings <mej at lanl.gov> wrote:
>>
>>> On Thursday, 19 September 2019, at 19:27:38 (-0400),
>>> Fulcomer, Samuel wrote:
>>>
>>>> I obviously haven't been keeping up with any security concerns over the
>>>> use of Singularity. In a 2-3 sentence nutshell, what are they?
>>>
>>> So before I do that, if you have a few minutes, I do think you'll find
>>> it worth your time to go to https://youtu.be/H6VrjowOOF4?t=2361 (it'll
>>> start about 39 minutes in) and watch at least those next 8 or so minutes.
>>> I go into some detail about the security track records of multiple
>>> container runtimes and provide factual data so that folks can make their
>>> own risk assessments rather than just giving my personal opinion.  (The
>>> video does cut off the right side of the slides, but the slide deck is
>>> available at
>>> https://permalink.lanl.gov/object/tr?what=info:lanl-repo/lareport/LA-UR-19-22663
>>> for anyone interested.)
>>>
>>> If you really don't want to watch the video, though, I can provide a few
>>> of the data points.
>>>
>>> First off, if you have not read it before, you really should read
>>> Matthias Gerstner's assessment after doing a code review and security
>>> audit on Singularity 2.6.0 to see if it could be packaged for SuSE:
>>> https://www.openwall.com/lists/oss-security/2018/12/12/2
>>> The quotes I used on the slide for my talk came from comments he made in
>>> the linked SuSE Bugzilla bug -- which, for unknown reasons, was
>>> re-locked by SuSE after previously being unlocked once the bug report
>>> was public! -- regarding whether or not, and under what constraints, to
>>> include and support Singularity on SuSE.  Matthias is a widely respected
>>> security expert in the OSS community, so I trust his assessment and
>>> insight.  And his audit alone found 5 or 6 CVE-worthy vulnerabilities at
>>> once.
>>>
>>> Additionally, as I mentioned in the video, during the 3-year period
>>> 2016-2018, there were at least 17 different vulnerabilities found in
>>> Singularity.  Also, of the 9 releases they did during 2018, 7 of those
>>> were security releases to fix vulnerabilities (and frequently more than
>>> 1 at a time).  That's...not great.  Especially in an environment like
>>> ours where saying "security is important" is an understatement of
>>> nuclear proportions! ;-)
>>>
>>> And finally, while we were hopeful that the rewrite in Go (version 3.0
>>> and above) would correct the security failings in the code, there've
>>> already been multiple serious vulnerabilities (all grouped together
>>> under a single CVE identifier, CVE-2019-11328), at least one of which
>>> was essentially a replica of one of the flaws fixed in 2.6.0 under
>>> CVE-2018-12021!  And you don't need to take my word for it, either:
>>> https://www.openwall.com/lists/oss-security/2019/05/16/1
>>>
>>> It's hard to say if the above trend will continue...but not all sites
>>> can afford to take those kinds of risks.
>>>
>>> And while Shifter's security track record is spotless to date, I would
>>> still summarize the overall lesson to be learned as, "Don't use
>>> privileged container runtimes.  Use user namespaces.  That's what
>>> they're there for."  And before anyone yells at me, yes I know
>>> Singularity advertises user namespace support and non-setuid operation.
>>> But it doesn't seem to be very widely used or adequately exercised, and
>>> AFAICT the default mode of operation in both RPMs and build-from-src is
>>> via setuid binaries.  So using a natively unprivileged runtime still
>>> seems the less risky choice, in my personal assessment.
>>>
>>> Yes, I know that was more than a "2-3 sentence nutshell," but hopefully
>>> it was helpful anyway! :-)
>>>
>>> Michael
>>>
>>> --
>>> Michael E. Jennings <mej at lanl.gov>
>>> HPC Systems Team, Los Alamos National Laboratory
>>> Bldg. 03-2327, Rm. 2341     W: +1 (505) 606-0605
>>>
>>
>



