[slurm-users] sacct: error

Eric F. Alemany ealemany at stanford.edu
Fri May 4 10:45:19 MDT 2018


Hi Patrick
Hi Ray

Happy Friday!
Thank you both for your quick replies. This is what I found out.

With Patrick's one-liner it works fine:
NodeName=radonc[01-04] CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2

With Ray's suggestion I get an error message for each node. Here is the error message from just one node:
sacct: error: NodeNames=radonc01 CPUs=32 doesn't match Sockets*CoresPerSocket*ThreadsPerCore (16), resetting CPUs
The interesting thing is that if you follow the Sockets*CoresPerSocket*ThreadsPerCore formula, 2 x 8 x 2 = 32, yet the message above says (16). Strange, no?
Also, as Ray suggested, NodeAddr=10.112.0.5,10.112.0.6,10.112.0.14,10.112.0.16 with commas between the IPs works fine.

So for now I will stay with Patrick’s one-liner. Although this solution did not give any error messages, I am still worried that Slurm still thinks that Sockets*CoresPerSocket*ThreadsPerCore is 16.
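To double-check, I suppose I can compare what slurmd detects on a node with what the controller has recorded (assuming slurmd is running on the compute nodes), e.g. for radonc01:

slurmd -C                      # on the compute node: prints the detected hardware in slurm.conf format
scontrol show node radonc01    # on the master: shows the values the controller is currently using

That should tell me whether Slurm really registered 16 CPUs or 32.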

FYI: the /etc/hosts file on each machine (master and execute nodes) looks like this:
10.112.0.25             radoncmaster.stanford.edu       radoncmaster
10.112.0.5              radonc01.stanford.edu           radonc01
10.112.0.6              radonc02.stanford.edu           radonc02
10.112.0.14             radonc03.stanford.edu           radonc03
10.112.0.16             radonc04.stanford.edu           radonc04

Now, when I run sacct it says:
SLURM accounting storage is disabled
which I am OK with since I only have two post-docs at the moment.

How can I test my cluster with a sample job and make sure it uses all the CPUs and RAM?
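As a rough sketch of what I have in mind (the script name test.sbatch is just a placeholder), something like this should ask for every core on all four nodes and, with --mem=0, all of each node's memory:

#!/bin/bash
#SBATCH --job-name=cluster-test
#SBATCH --nodes=4                # all four radonc nodes
#SBATCH --ntasks-per-node=32     # one task per CPU as defined in slurm.conf
#SBATCH --mem=0                  # --mem=0 requests all of a node's memory
#SBATCH --time=00:05:00

# each task prints its hostname; 128 lines collapse to a per-node count
srun hostname | sort | uniq -c

I would submit it with "sbatch test.sbatch" and then check the allocation with squeue and "scontrol show job <jobid>" -- but is that the right approach?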

Thank you for your help and patience with me.

Best,
Eric
_____________________________________________________________________________________________________

Eric F.  Alemany
System Administrator for Research

Division of Radiation & Cancer  Biology
Department of Radiation Oncology

Stanford University School of Medicine
Stanford, California 94305

Tel: 1-650-498-7969 (No Texting)
Fax: 1-650-723-7382



On May 4, 2018, at 6:14 AM, Patrick Goetz <pgoetz at math.utexas.edu> wrote:

I concur with this.  Make sure your nodes are in the /etc/hosts file on the SMS.  Also, if you name them by base + numerical sequence, you can configure them with a single line in Slurm (using the example below):

NodeName=radonc[01-04] CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2

On 05/04/2018 12:05 AM, Raymond Wan wrote:
Hi Eric,
On Fri, May 4, 2018 at 6:04 AM, Eric F. Alemany <ealemany at stanford.edu> wrote:
# COMPUTE NODES
NodeName=radonc[01-04] NodeAddr=10.112.0.5 10.112.0.6 10.112.0.14
10.112.0.16 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8
ThreadsPerCore=2   State=UNKNOWN
PartitionName=debug Nodes=radonc[01-04] Default=YES MaxTime=INFINITE
State=UP
I don't know what the problem is, but my *guess* based on my own
configuration file is that we have one node per line under "NodeName".
We also don't have NodeAddr but maybe that's ok.  This means the IP
addresses of the nodes in our cluster are hard-coded in /etc/hosts.
Also, State is not given.
So, if I formatted yours to look like ours, it would look something like:
NodeName=radonc01 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8
ThreadsPerCore=2
NodeName=radonc02 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8
ThreadsPerCore=2
NodeName=radonc03 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8
ThreadsPerCore=2
NodeName=radonc04 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8
ThreadsPerCore=2
PartitionName=debug Nodes=radonc[01-04] Default=YES MaxTime=INFINITE State=UP
Maybe the problem is with the NodeAddr because you might have to
separate the values with a comma instead of a space?  With spaces, it
might have problems parsing?
That's my guess...
Ray

