Hi,
My GPU testing system (named “gpu-node”) is a simple computer with one socket and an “Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz” processor. Running “lscpu”, I can see there are 4 cores per socket, 2 threads per core, and 8 CPUs:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 26
Model name: Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz
My “gres.conf” file is:
NodeName=gpu-node Name=gpu Type=GeForce-GTX-TITAN-X File=/dev/nvidia0 CPUs=0-1
NodeName=gpu-node Name=gpu Type=GeForce-GTX-TITAN-Black File=/dev/nvidia1 CPUs=2-3
Running “numactl -H” on the “gpu-node” host reports:
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 7809 MB
node 0 free: 6597 MB
node distances:
node 0
0: 10
In this file, CPUs 0-1 are assigned to the first GPU and CPUs 2-3 to the second GPU. However, “lscpu” shows 8 CPUs. If I rewrite “gres.conf” like this:
NodeName=gpu-node Name=gpu Type=GeForce-GTX-TITAN-X File=/dev/nvidia0 CPUs=0-3
NodeName=gpu-node Name=gpu Type=GeForce-GTX-TITAN-Black File=/dev/nvidia1 CPUs=4-7
then, when I run “scontrol reconfigure”, the slurmctld log reports this error:
[2024-06-05T11:42:18.558] error: _node_config_validate: gres/gpu: invalid GRES core specification (4-7) on node gpu-node
So I think SLURM can only use physical cores, not threads, which means my node can only offer 4 cores (as “lscpu” shows). But in “gres.conf” I have to write “CPUs”, not “Cores”. Is that right?
But if “numactl -H” shows 8 CPUs, why can’t I use these 8 CPUs in “gres.conf”?
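To make my reasoning concrete, here is a small Python sketch of the arithmetic I have in mind. It only uses the numbers from the “lscpu” output above. The assumption that SLURM checks the GRES range against core indices (0-3) instead of logical CPU indices (0-7) is my guess based on the error message, not something I have confirmed, and the helper names are hypothetical.

```python
# Hypothetical sketch. Values are copied from the lscpu output above.
# ASSUMPTION (my guess from the slurmctld error, not confirmed): the GRES
# "CPUs" range is validated against core indices, not hardware-thread indices.

LSCPU_OUTPUT = """\
CPU(s): 8
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
"""


def parse_lscpu(text: str) -> dict[str, str]:
    """Turn 'Key: value' lines into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def parse_range(spec: str) -> list[int]:
    """Expand a simple spec such as '0-3' or '0,2' into a list of indices."""
    ids = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids


info = parse_lscpu(LSCPU_OUTPUT)
sockets = int(info["Socket(s)"])
cores_per_socket = int(info["Core(s) per socket"])
threads_per_core = int(info["Thread(s) per core"])

# Logical CPUs = sockets x cores x threads; this should equal "CPU(s)".
logical_cpus = sockets * cores_per_socket * threads_per_core
# Physical cores ignore the second hardware thread of each core.
physical_cores = sockets * cores_per_socket


def fits_in_cores(spec: str) -> bool:
    """True if every index in spec is a valid core index (0 .. physical_cores-1)."""
    return all(0 <= i < physical_cores for i in parse_range(spec))


for spec in ["0-1", "2-3", "0-3", "4-7"]:
    print(spec, fits_in_cores(spec))
```

If that assumption holds, “4-7” is the only range that falls outside the 4 core indices, which matches the error in the slurmctld log.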
Sorry for the long email.
Thanks.