[slurm-users] 4 sockets but "

Diego Zuccato diego.zuccato at unibo.it
Wed Jul 21 09:56:56 UTC 2021


Hello all.

I'm speechless.
I suspended testing config changes to go update another machine. In the 
last test I had added "CPUs=192" to the node definition, restarted 
slurmctld, and nothing changed.
When I returned, I checked again and Slurm reported 192 CPUs! Magic?
I have now removed CPUs=192, restarted slurmctld, and it still sees all 
the CPUs...
What should I think?
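
For context, the node line I was experimenting with has this general 
shape (the hostname placeholder and the exact socket/core/thread split 
below are illustrative, not copied from my slurm.conf; only the 
4-socket, 192-CPU totals come from this thread):

    # Illustrative node definition: 4 sockets x 24 cores x 2 threads = 192 CPUs
    NodeName=str957-mtx-XX Sockets=4 CoresPerSocket=24 ThreadsPerCore=2 CPUs=192 State=UNKNOWN

One plausible explanation for the "magic": with FastSchedule=0 in 
slurm.conf, slurmctld takes a node's resources from what its slurmd 
actually reports at registration rather than from the node line, so the 
full CPU count may only have appeared once the node's slurmd 
re-registered, independently of my edits.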

But another problem has surfaced: slurmtop does not seem to handle this 
many CPUs gracefully and throws a lot of errors, but that should be 
manageable...
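
For anyone wanting to check the same thing, two standard commands that 
show what slurmctld currently believes about a node (the node name 
below is a placeholder):

    sinfo -N -o "%N %c %X %Y %Z"       # per node: name, CPUs, sockets, cores/socket, threads/core
    scontrol show node str957-mtx-XX | grep CPUTot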

Tks for the help.

BYtE,
  Diego

On 21/07/2021 11:01, Diego Zuccato wrote:
> Uff... A bit mangled... Correcting and resending.
> 
> On 21/07/2021 08:18, Diego Zuccato wrote:
>> Hi Ahmet.
>>
>> On 20/07/2021 18:02, mercan wrote:
>>> Did you check the slurmctld log for a complaint about the host line? 
>>> If slurmctld cannot recognize a parameter, it may give up processing 
>>> the whole host line.
>> Yup. Nothing there :(
>>
>> [2021-07-21T08:13:14.984] slurmctld version 18.08.5-2 started on 
>> cluster oph
>> [2021-07-21T08:13:16.990] error: _shutdown_bu_thread:send/recv 
>> str957-cluster2: Connection timed out
>> [2021-07-21T08:13:17.809] layouts: no layout to initialize
>> [2021-07-21T08:13:17.828] error: read_slurm_conf: default partition 
>> not set.
>> [2021-07-21T08:13:17.829] layouts: loading entities/relations information
>> [2021-07-21T08:13:17.829] Recovered state of 34 nodes
>> [2021-07-21T08:13:17.829] Down nodes: str957-mtx-[21-22]
>> [2021-07-21T08:13:17.829] Recovered JobId=33656 Assoc=377
>> [...cut...]
>> [2021-07-21T08:13:17.831] Recovered information about 45 jobs
>> [2021-07-21T08:13:17.831] cons_res: select_p_node_init
>> [2021-07-21T08:13:17.831] cons_res: preparing for 8 partitions
>> [2021-07-21T08:13:17.832] Recovered state of 0 reservations
>> [2021-07-21T08:13:17.833] cons_res: select_p_reconfigure
>> [2021-07-21T08:13:17.833] cons_res: select_p_node_init
>> [2021-07-21T08:13:17.833] cons_res: preparing for 8 partitions
>> [2021-07-21T08:13:17.833] Running as primary controller
>> [2021-07-21T08:13:17.833] Registering slurmctld at port 6817 with 
>> slurmdbd.
>> [2021-07-21T08:13:18.220] No parameter for mcs plugin, default values set
>> [2021-07-21T08:13:18.220] mcs: MCSParameters = (null). ondemand set.
>> [2021-07-21T08:13:23.226] 
>> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2 
>>
>> [2021-07-21T08:13:23.226] _build_node_list: No nodes satisfy 
>> JobId=33762 requirements in partition b6
>> [2021-07-21T08:13:23.227] _build_node_list: No nodes satisfy 
>> JobId=33808 requirements in partition b4
>>
>> (str957-cluster2 is the second frontend/login node that I've had to 
>> take offline for an unrelated problem).
> And str957-mtx-[21-22] are not yet installed.
> 
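
P.S.: for nodes that are defined in slurm.conf but not physically 
installed yet, Slurm also accepts State=FUTURE in the node definition, 
so slurmctld neither tries to contact them nor schedules on them; a 
minimal sketch (topology values illustrative):

    NodeName=str957-mtx-[21-22] Sockets=4 CoresPerSocket=24 ThreadsPerCore=2 State=FUTURE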

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


