[slurm-users] Accounting Information from slurmdbd does not reach slurmctld

Marcus Wagner wagner at itc.rwth-aachen.de
Mon Mar 23 07:26:50 UTC 2020


Hi Pascal,

are slurmdbd and slurmctld running on the same host?
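
In any case, it might be worth checking which host slurmctld expects slurmdbd on (AccountingStorageHost) and whether slurmdbd is actually reachable there. Just a rough sketch, host names and ports may differ on your setup:

  # on the slurmctld host: which accounting storage settings are active?
  scontrol show config | grep -Ei 'AccountingStorage|ClusterName'

  # on the slurmdbd host: is the daemon listening (default port 6819)?
  ss -tlnp | grep slurmdbd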

Best
Marcus

On 20.03.2020 at 18:12, Pascal Klink wrote:
> Hi Chris,
> 
> Thanks for the quick answer! I tried the 'sacctmgr show clusters' command, which gave
> 
> Cluster     ControlHost     ControlPort   RPC   Share     ... QOS                   Def QOS
> ----------  --------------- ------------  ----- --------- ... --------------------  ---------
> iascluster  127.0.0.1       6817          8192  1             normal
> 
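> (For a narrower listing one could probably also restrict the columns directly, e.g. with something like 'sacctmgr show cluster format=Cluster,ControlHost,ControlPort,RPC', though I have not double-checked the exact field names.)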
> 
> I removed the columns that had no value between the 'Share' and 'QOS' columns. Below you can also see the relevant output of slurmctld and slurmdbd (both were running in debug mode):
> 
> slurmctld:
> 
> [2020-03-18T22:59:52.441] debug:  sched: slurmctld starting
> [2020-03-18T22:59:52.441] slurmctld version 17.11.2 started on cluster iascluster
> [2020-03-18T22:59:52.442] Munge cryptographic signature plugin loaded
> [2020-03-18T22:59:52.442] preempt/none loaded
> [2020-03-18T22:59:52.442] debug:  Checkpoint plugin loaded: checkpoint/none
> [2020-03-18T22:59:52.442] debug:  AcctGatherEnergy NONE plugin loaded
> [2020-03-18T22:59:52.442] debug:  AcctGatherProfile NONE plugin loaded
> [2020-03-18T22:59:52.442] debug:  AcctGatherInterconnect NONE plugin loaded
> [2020-03-18T22:59:52.442] debug:  AcctGatherFilesystem NONE plugin loaded
> [2020-03-18T22:59:52.442] debug:  Job accounting gather LINUX plugin loaded
> [2020-03-18T22:59:52.442] ExtSensors NONE plugin loaded
> [2020-03-18T22:59:52.442] debug:  switch NONE plugin loaded
> [2020-03-18T22:59:52.442] debug:  power_save module disabled, SuspendTime < 0
> [2020-03-18T22:59:52.442] debug:  No backup controller to shutdown
> [2020-03-18T22:59:52.442] Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
> [2020-03-18T22:59:52.442] debug:  Munge authentication plugin loaded
> [2020-03-18T22:59:52.443] debug:  slurmdbd: Sent PersistInit msg
> [2020-03-18T22:59:52.443] slurmdbd: recovered 0 pending RPCs
> [2020-03-18T22:59:52.447] debug:  Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
> [2020-03-18T22:59:52.447] layouts: no layout to initialize
> [2020-03-18T22:59:52.447] topology NONE plugin loaded
> [2020-03-18T22:59:52.447] debug:  No DownNodes
> [2020-03-18T22:59:52.492] debug:  Log file re-opened
> [2020-03-18T22:59:52.492] error: chown(var/log/slurm/slurmctld.log, 64030, 64030): No such file or directory
> [2020-03-18T22:59:52.492] sched: Backfill scheduler plugin loaded
> [2020-03-18T22:59:52.493] route default plugin loaded
> [2020-03-18T22:59:52.493] layouts: loading entities/relations information
> [2020-03-18T22:59:52.493] debug:  layouts: 7/7 nodes in hash table, rc=0
> [2020-03-18T22:59:52.493] debug:  layouts: loading stage 1
> [2020-03-18T22:59:52.493] debug:  layouts: loading stage 1.1 (restore state)
> [2020-03-18T22:59:52.493] debug:  layouts: loading stage 2
> [2020-03-18T22:59:52.493] debug:  layouts: loading stage 3
> [2020-03-18T22:59:52.493] Recovered state of 7 nodes
> [2020-03-18T22:59:52.493] Down nodes: cn[01-07]
> [2020-03-18T22:59:52.493] Recovered information about 0 jobs
> [2020-03-18T22:59:52.493] debug:  Updating partition uid access list
> [2020-03-18T22:59:52.493] Recovered state of 0 reservations
> [2020-03-18T22:59:52.493] State of 0 triggers recovered
> [2020-03-18T22:59:52.493] _preserve_plugins: backup_controller not specified
> [2020-03-18T22:59:52.493] Running as primary controller
> [2020-03-18T22:59:52.493] debug:  No BackupController, not launching heartbeat.
> [2020-03-18T22:59:52.493] Registering slurmctld at port 6817 with slurmdbd.
> [2020-03-18T22:59:52.528] debug:  No feds to retrieve from state
> [2020-03-18T22:59:52.572] debug:  Priority MULTIFACTOR plugin loaded
> [2020-03-18T22:59:52.572] No parameter for mcs plugin, default values set
> [2020-03-18T22:59:52.572] mcs: MCSParameters = (null). ondemand set.
> [2020-03-18T22:59:52.572] debug:  mcs none plugin loaded
> [2020-03-18T22:59:52.573] debug:  power_save mode not enabled
> [2020-03-18T22:59:55.578] debug:  Spawning registration agent for cn[01-07] 7 hosts
> [2020-03-18T23:00:05.591] agent/is_node_resp: node:cn01 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
> [2020-03-18T23:00:05.591] agent/is_node_resp: node:cn02 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
> [2020-03-18T23:00:05.591] agent/is_node_resp: node:cn07 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
> [2020-03-18T23:00:05.592] agent/is_node_resp: node:cn04 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
> [2020-03-18T23:00:05.592] agent/is_node_resp: node:cn03 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
> [2020-03-18T23:00:05.592] agent/is_node_resp: node:cn05 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
> [2020-03-18T23:00:05.592] agent/is_node_resp: node:cn06 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
> [2020-03-18T23:00:22.494] debug:  backfill: beginning
> [2020-03-18T23:00:22.494] debug:  backfill: no jobs to backfill
> [2020-03-18T23:00:52.666] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
> [2020-03-18T23:00:52.666] debug:  sched: Running job scheduler
> [2020-03-18T23:01:35.732] debug:  Spawning registration agent for cn[01-07] 7 hosts
> [2020-03-18T23:01:52.758] debug:  sched: Running job scheduler
> [2020-03-18T23:02:52.850] debug:  sched: Running job scheduler
> [2020-03-18T23:03:15.886] debug:  Spawning registration agent for cn[01-07] 7 hosts
> [2020-03-18T23:03:52.942] debug:  sched: Running job scheduler
> [2020-03-18T23:04:01.636] Node cn01 now responding
> [2020-03-18T23:04:01.636] node cn01 returned to service
> [2020-03-18T23:04:01.955] debug:  sched: Running job scheduler
> [2020-03-18T23:04:02.524] debug:  backfill: beginning
> [2020-03-18T23:04:02.524] debug:  backfill: no jobs to backfill
> [2020-03-18T23:04:52.032] debug:  sched: Running job scheduler
> [2020-03-18T23:04:55.322] debug:  Spawning registration agent for cn[02-07] 6 hosts
> [2020-03-18T23:04:57.574] _slurm_rpc_submit_batch_job: JobId=776 InitPrio=1000000 usec=801
> [2020-03-18T23:04:58.327] debug:  sched: Running job scheduler
> [2020-03-18T23:04:58.327] sched: Allocate JobID=776_2(776) NodeList=cn01 #CPUs=32 Partition=amd
> [2020-03-18T23:04:58.528] debug:  backfill: beginning
> [2020-03-18T23:04:58.528] debug:  backfill: no jobs to backfill
> [2020-03-18T23:05:28.529] debug:  backfill: beginning
> [2020-03-18T23:05:28.529] debug:  backfill: no jobs to backfill
> [2020-03-18T23:05:52.545] debug:  sched: Running job scheduler
> [2020-03-18T23:06:35.611] debug:  Spawning registration agent for cn[02-07] 6 hosts
> [2020-03-18T23:06:52.638] debug:  sched: Running job scheduler
> [2020-03-18T23:07:52.731] debug:  sched: Running job scheduler
> [2020-03-18T23:08:15.766] debug:  Spawning ping agent for cn01
> [2020-03-18T23:08:15.766] debug:  Spawning registration agent for cn[02-07] 6 hosts
> [2020-03-18T23:08:16.552] debug:  backfill: beginning
> [2020-03-18T23:08:16.552] debug:  backfill: no jobs to backfill
> [2020-03-18T23:08:20.711] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 776_2 uid 1000
> [2020-03-18T23:08:20.711] error: job_str_signal: Security violation JOB_CANCEL RPC for jobID 776_2 from uid 1000
> [2020-03-18T23:08:20.711] _slurm_rpc_kill_job: job_str_signal() job 776_2 sig 9 returned Access/permission denied
> [2020-03-18T23:08:31.656] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 776_2 uid 1001
> [2020-03-18T23:08:31.656] _job_signal: 9 of running JobID=776_2(776) State=0x8004 NodeCnt=1 successful 0x8004
> [2020-03-18T23:08:33.661] debug:  sched: Running job scheduler
> [2020-03-18T23:08:46.552] debug:  backfill: beginning
> [2020-03-18T23:08:46.552] debug:  backfill: no jobs to backfill
> [2020-03-18T23:08:52.822] debug:  sched: Running job scheduler
> [2020-03-18T23:09:52.913] debug:  Updating partition uid access list
> [2020-03-18T23:09:52.913] debug:  sched: Running job scheduler
> [2020-03-18T23:09:55.920] debug:  Spawning registration agent for cn[02-07] 6 hosts
> [2020-03-18T23:10:52.170] debug:  sched: Running job scheduler
> [2020-03-18T23:11:35.234] debug:  Spawning ping agent for cn01
> [2020-03-18T23:11:35.234] debug:  Spawning registration agent for cn[02-07] 6 hosts
> [2020-03-18T23:11:35.574] debug:  backfill: beginning
> [2020-03-18T23:11:35.574] debug:  backfill: no jobs to backfill
> [2020-03-18T23:11:52.259] debug:  sched: Running job scheduler
> [2020-03-18T23:12:05.574] debug:  backfill: beginning
> [2020-03-18T23:12:05.575] debug:  backfill: no jobs to backfill
> [2020-03-18T23:12:52.353] debug:  sched: Running job scheduler
> [2020-03-18T23:13:15.389] debug:  Spawning registration agent for cn[02-07] 6 hosts
> [2020-03-18T23:13:52.445] debug:  sched: Running job scheduler
> [2020-03-18T23:13:52.587] debug:  backfill: beginning
> [2020-03-18T23:13:52.587] debug:  backfill: no jobs to backfill
> [2020-03-18T23:14:22.588] debug:  backfill: beginning
> [2020-03-18T23:14:22.588] debug:  backfill: no jobs to backfill
> [2020-03-18T23:14:52.537] debug:  sched: Running job scheduler
> [2020-03-18T23:14:55.543] debug:  Spawning ping agent for cn01
> [2020-03-18T23:14:55.543] debug:  Spawning registration agent for cn[02-07] 6 hosts
> [2020-03-18T23:14:55.588] debug:  backfill: beginning
> [2020-03-18T23:14:55.589] debug:  backfill: no jobs to backfill
> [2020-03-18T23:15:25.589] debug:  backfill: beginning
> [2020-03-18T23:15:25.589] debug:  backfill: no jobs to backfill
> [2020-03-18T23:15:52.848] debug:  sched: Running job scheduler
> [2020-03-18T23:16:35.915] debug:  Spawning registration agent for cn[02-07] 6 hosts
> [2020-03-18T23:16:52.940] debug:  sched: Running job scheduler
> [2020-03-18T23:17:52.031] debug:  sched: Running job scheduler
> [2020-03-18T23:18:15.066] debug:  Spawning ping agent for cn01
> [2020-03-18T23:18:15.066] debug:  Spawning registration agent for cn[02-07] 6 hosts
> [2020-03-18T23:18:15.611] debug:  backfill: beginning
> [2020-03-18T23:18:15.611] debug:  backfill: no jobs to backfill
> [2020-03-18T23:18:45.611] debug:  backfill: beginning
> [2020-03-18T23:18:45.611] debug:  backfill: no jobs to backfill
> [2020-03-18T23:18:52.124] debug:  sched: Running job scheduler
> [2020-03-18T23:19:52.216] debug:  Updating partition uid access list
> [2020-03-18T23:19:52.216] debug:  sched: Running job scheduler
> [2020-03-18T23:19:55.265] debug:  Spawning registration agent for cn[02-07] 6 hosts
> [2020-03-18T23:20:52.538] debug:  sched: Running job scheduler
> [2020-03-18T23:21:35.604] debug:  Spawning ping agent for cn01
> [2020-03-18T23:21:35.604] debug:  Spawning registration agent for cn[02-07] 6 hosts
> [2020-03-18T23:21:35.635] debug:  backfill: beginning
> [2020-03-18T23:21:35.635] debug:  backfill: no jobs to backfill
> [2020-03-18T23:21:52.630] debug:  sched: Running job scheduler
> [2020-03-18T23:22:05.635] debug:  backfill: beginning
> [2020-03-18T23:22:05.635] debug:  backfill: no jobs to backfill
> [2020-03-18T23:22:52.723] debug:  sched: Running job scheduler
> [2020-03-18T23:23:15.759] debug:  Spawning registration agent for cn[02-07] 6 hosts
> [2020-03-18T23:23:52.817] debug:  sched: Running job scheduler
> [2020-03-18T23:24:52.911] debug:  sched: Running job scheduler
> ... (it continues like this. The errors about nodes not responding are just because we only started one of the 7 computers for this minimal example.)
> 
> slurmdbd:
> 
> [2020-03-18T22:58:31.121] debug:  Log file re-opened
> [2020-03-18T22:58:31.121] debug:  Munge authentication plugin loaded
> [2020-03-18T22:58:31.178] Accounting storage MYSQL plugin loaded
> [2020-03-18T22:58:31.182] slurmdbd version 17.11.2 started
> [2020-03-18T22:58:57.099] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:1000 IP:127.0.0.1 CONN:7
> [2020-03-18T22:59:52.443] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:64030 IP:127.0.0.1 CONN:8
> [2020-03-18T23:05:20.868] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:0 IP:127.0.0.1 CONN:7
> [2020-03-18T23:05:21.751] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:0 IP:127.0.0.1 CONN:10
> [2020-03-18T23:05:22.444] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:0 IP:127.0.0.1 CONN:7
> [2020-03-18T23:05:23.015] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:0 IP:127.0.0.1 CONN:10
> [2020-03-18T23:06:30.725] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:0 IP:127.0.0.1 CONN:7
> [2020-03-18T23:08:08.334] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:0 IP:127.0.0.1 CONN:10
> [2020-03-18T23:08:39.463] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:0 IP:127.0.0.1 CONN:7
> [2020-03-18T23:14:03.809] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:1000 IP:127.0.0.1 CONN:10
> [2020-03-18T23:14:09.006] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:1000 IP:127.0.0.1 CONN:7
> [2020-03-18T23:15:22.435] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:1000 IP:127.0.0.1 CONN:10
> [2020-03-19T09:20:14.836] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:1000 IP:127.0.0.1 CONN:7
> [2020-03-19T09:20:22.345] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:1000 IP:127.0.0.1 CONN:10
> [2020-03-19T09:20:47.873] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:1000 IP:127.0.0.1 CONN:7
> [2020-03-19T10:22:42.668] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:1000 IP:127.0.0.1 CONN:10
> [2020-03-19T11:46:54.823] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:1000 IP:127.0.0.1 CONN:7
> [2020-03-19T17:39:40.260] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:1000 IP:127.0.0.1 CONN:10
> [2020-03-19T23:42:54.806] debug:  REQUEST_PERSIST_INIT: CLUSTER:iascluster VERSION:8192 UID:0 IP:127.0.0.1 CONN:7
> 
> Best,
> Pascal
> 
> 


