[slurm-users] 4 sockets but "

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Fri Jul 23 10:43:35 UTC 2021


Hi Diego,

On 7/23/21 12:36 PM, Diego Zuccato wrote:
>> I believe that slurmd reports the 15 minute CPU load average to the 
>> slurmctld, only.  So you got this information already.
> Yup. It's just unexpected: if you don't know, you run pestat and see that 
> an idle node does have a very high load :)
> My users would think someone is breaking the rules...

Well, Slurm reports the 15-minute load average.  I guess users will have 
to learn that, because we can't print help information every time.

>> If you run "pestat -F" it will show you (in red color) the nodes where 
>> the CPU load is outside the expected range, as given by the number of 
>> allocated cores.  That covers your situation when 0 CPUs are allocated.
> That's how I noticed it.

Yes, pestat can be quite helpful :-)
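
As a rough illustration of that kind of check, here is a minimal sketch. It is not pestat's actual code: the function name load_flag and the default margin of 2 are made up for the example.

```shell
# Hypothetical check, NOT pestat's actual logic: flag a node when its load
# average differs from the number of allocated CPUs by more than a margin.
# Usage: load_flag ALLOC_CPUS CPU_LOAD [MARGIN]   (MARGIN defaults to 2)
load_flag() {
    awk -v alloc="$1" -v load="$2" -v margin="${3:-2}" 'BEGIN {
        if (load > alloc + margin)      print "high"
        else if (load < alloc - margin) print "low"
        else                            print "ok"
    }'
}

# An idle node (0 CPUs allocated) that still carries a load of 12.5:
load_flag 0 12.5
```

The two inputs correspond to values such as CPUAlloc and CPULoad printed by "scontrol show node"; a node with 0 allocated CPUs and a high load comes out as "high", which is the situation discussed above.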

>> I'm wondering what information you get from slurmtop, which you're 
>> missing from pestat?  Maybe an opportunity for improvement :-)
> Well, it shows semi-graphically the CPU allocations for the various jobs, 
> so users can tell at a glance if there are useable nodes for their job.

For finding idle nodes, there are better tools:

* sinfo -t idle

* showpartitions (download from 
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/partitions)

>> I added a little code to pestat now that calculates the longest hostname 
>> (minimum 8, truncated to 20 chars).  This is done by querying Slurm with 
>> "sinfo -N -O NodeList".  Can you try out this new version on your cluster?
>> Download: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
...
> Once fixed, it seems to work OK and columns are aligned. Not the first 
> time long names give us problems :( (users are even worse...).

Oops, I have now fixed this bug in the master branch, thanks!
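
For reference, the width rule described above (longest hostname, minimum 8, truncated to 20 characters) could be sketched roughly like this. It is an illustration under my own assumptions, not the code in pestat; the helper name hostname_width is hypothetical.

```shell
# Hypothetical helper, not taken from pestat: read one hostname per line on
# stdin and print the column width to use, i.e. the length of the longest
# name, but never less than 8 and never more than 20 characters.
hostname_width() {
    awk -v min=8 -v max=20 '
        { n = length($1); if (n > w) w = n }   # $1 ignores padding spaces
        END {
            if (w < min) w = min
            if (w > max) w = max
            print w
        }
    '
}

# On a cluster (assumes sinfo is in PATH; -h suppresses the header line):
#   width=$(sinfo -h -N -O NodeList | hostname_width)
#   printf "%-${width}.${width}s %s\n" "$node" "$state"
```

The printf format in the last comment both pads and truncates the name to the computed width, which keeps the columns aligned even with very long hostnames.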

/Ole


