[slurm-users] Job allocating more CPUs than requested

Sat Sep 22 08:44:13 MDT 2018

Anecdotally, I’ve had a user cause load averages of 10x the node’s core count. The user caught it and cancelled the job before I noticed it myself. In the less severe cases I’ve watched live, I’ve never noticed any effect other than the excessive load average. Viewed from ‘top’, the offending process was still confined to 100% CPU, or whatever it had reserved.
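
For a sense of scale, here is a rough sketch of the arithmetic from the quoted question below. It assumes each runnable thread adds about 1 to the load average, which is only an approximation. The function names are made up for illustration and are not part of Slurm or NHC.

```python
import os


def expected_load(tasks, threads_per_task, other_busy_cores):
    """Rough load estimate, assuming each runnable thread adds about 1."""
    return tasks * threads_per_task + other_busy_cores


def oversubscription_ratio(load, core_count):
    """Load average expressed as a multiple of the node's core count."""
    return load / core_count


if __name__ == "__main__":
    # The extreme case from the quoted message: -n24, 24 threads per task,
    # plus 23 cores in use by someone else.
    print(expected_load(24, 24, 23))

    # Compare this node's current 1-minute load average to its core count.
    # os.getloadavg() is only available on Unix-like systems.
    one_minute_load = os.getloadavg()[0]
    cores = os.cpu_count() or 1
    print(round(oversubscription_ratio(one_minute_load, cores), 2))
```

A ratio well above 1 only says that threads are queued for CPU time. As noted above, with cgroups in place the job itself stays confined to whatever it reserved.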

-- 
Mike Renfro  / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Sep 21, 2018, at 11:35 PM, Ryan Novosielski <novosirj at rutgers.edu> wrote:
> 
> I apologize for potentially thread hijacking here, but it's in the
> spirit of the original question I guess.
> 
> We constrain using cgroups, and occasionally someone will request 1
> core (-n1 -c1) and then run something that asks for way more
> cores/threads, or that tries to use the whole machine. They won't
> succeed obviously. Is this any sort of problem? It seems to me that
> trying to run 24 threads on a single core might generate some sort of
> overhead, and that I/O could be increased, but I'm not sure. What I do
> know is that if someone does this -- let's say in the extreme by
> running something -n24 that itself tries to run 24 threads in each
> task -- and someone uses the other 23 cores, you'll end up with a load
> average near 24*24+23. Does this make any difference? We have NHC set
> to offline such nodes, but that affects job preemption. What sort of
> choices do others make in this area?