[slurm-users] Heterogeneous HPC

Michael Jennings mej at lanl.gov
Thu Sep 19 21:02:39 UTC 2019


On Thursday, 19 September 2019, at 20:00:40 (+0000),
Goetz, Patrick G wrote:

> On 9/19/19 8:22 AM, Thomas M. Payerle wrote:
> > one of our clusters
> > is still running RHEL6, and while containers based on Ubuntu 16,
> > Debian 8, or RHEL7 all appear to work properly,
> > containers based on Ubuntu 18 or Debian 9 will die with "Kernel too
> > old" errors.
> 
> I think the idea generally is to have your container host be the newest 
> version, with containers providing a home for legacy (or at most 
> contemporary) software stacks.  Never heard of anyone trying to do it 
> the other way around, but appreciate this proof of concept that it's a 
> bad idea.

And contrary to popular belief, it's not just the kernel that comes
into play with container compatibility.  There's a surprisingly complex,
nuanced interplay between the kernel, glibc, libgcc, and
/lib64/ld-linux-x86-64.so.2 for almost any containerized application,
to say nothing of the standard challenges of MPI libraries, HSN/GPU
device interfaces/drivers, etc.  With containers, many of the old
problems are new again!  You'd cringe if I told you about some of the
nastier container compatibility challenges we've run into or heard about.
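
As a small illustration of just the kernel/glibc piece: the "Kernel too
old" failure quoted above typically comes from the container's glibc
refusing to start on a kernel older than the minimum it was built to
support.  A minimal sketch of that comparison follows.  The function
names and the 3.2 floor are assumptions made for illustration; the real
floor depends on how the glibc inside the container was configured.

```python
import platform
import re


def parse_kernel_release(release):
    """Turn a release string such as '2.6.32-754.el6.x86_64' into (2, 6, 32)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # A missing patch level (e.g. '5.4') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def kernel_is_new_enough(host_release, minimum=(3, 2, 0)):
    """Return True if the host kernel is at least `minimum`.

    The default of (3, 2, 0) is an assumed floor, not a value read
    from any particular container image.
    """
    return parse_kernel_release(host_release) >= minimum


def check_this_host(minimum=(3, 2, 0)):
    """Apply the same check to the kernel this script is running on."""
    return kernel_is_new_enough(platform.release(), minimum)
```

With that assumed floor, a RHEL6-style release string such as
"2.6.32-754.el6.x86_64" fails the check, while a RHEL7-style
"3.10.0-1062.el7.x86_64" passes.  A check like this covers only the
kernel; it says nothing about libgcc, the dynamic loader, MPI, or
device drivers.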

There are general rules of thumb that will help container portability
(limit container size, link statically as much as possible, build
against the oldest kernel and libraries you need to support but run on
the newest host you can, etc.), but the biggest one is:  Don't expect
containers to solve all your problems; they won't.  Much of the time
you're just exchanging the prior set of problems for a new set. :-)

And don't forget to test with VMs as well, not just containers.
Depending on the workload's complexity, its computation and I/O
patterns, and similar factors, an appropriately paravirtualized VM
could wind up on par with native/containerized code in many
circumstances, or at least within acceptable tolerances.  And VMs offer
many advantages in terms of safety, separation, and sanity vs.
containerization.
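
One hedged way to make "within acceptable tolerances" concrete is to
time the same job natively and in the VM several times and compare the
medians.  The sketch below assumes you already have wall-clock timings
in seconds; the function names and the 5% default tolerance are
placeholders for illustration, not a recommendation.

```python
from statistics import median


def relative_overhead(native_seconds, candidate_seconds):
    """Fractional slowdown of the candidate (VM or container) vs. native.

    Uses the median of each set of runs to damp outliers.
    """
    native = median(native_seconds)
    if native <= 0:
        raise ValueError("native timings must be positive")
    return (median(candidate_seconds) - native) / native


def within_tolerance(native_seconds, candidate_seconds, tolerance=0.05):
    """True if the candidate's slowdown is at most `tolerance` (assumed 5%)."""
    return relative_overhead(native_seconds, candidate_seconds) <= tolerance
```

For example, native runs of about 10.0 s against VM runs of about
10.2 s give an overhead of roughly 2%, which passes a 5% tolerance;
VM runs of about 12.0 s would not.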

Michael

-- 
Michael E. Jennings <mej at lanl.gov>
HPC Systems Team, Los Alamos National Laboratory
Bldg. 03-2327, Rm. 2341     W: +1 (505) 606-0605


