[slurm-users] slurmctld up and running but not really working
Julien Rey
julien.rey at univ-paris-diderot.fr
Tue Jul 19 14:28:46 UTC 2022
Hello,
I am currently facing an issue with an old install of slurm (17.02.11).
However, I cannot upgrade this version because I had trouble with
database migration in the past (when upgrading to 17.11) and this
install is due to be replaced in the coming months. For the time
being I have to keep it running because some of our services still rely
on it.
This issue occurred after a power outage.
slurmctld is up and running; however, when I run "sinfo", I end up
with this message after a few minutes:
slurm_load_partitions: Unable to contact slurm controller (connect failure)
I set SlurmctldDebug=7 in slurm.conf and DebugLevel=7 in slurmdbd.conf;
however, the logs don't show any specific error that would explain why
the slurm controller cannot be contacted.
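For reference, the debug settings look roughly like this (a minimal
sketch of my local setup; 7 should map to the debug3 level that shows
up in the logs below):

    # /etc/slurm-llnl/slurm.conf
    SlurmctldDebug=7

    # /etc/slurm-llnl/slurmdbd.conf
    DebugLevel=7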
Any help would be greatly appreciated.
/var/log/slurm-llnl/slurmctld.log:
[2022-07-19T15:17:58.342] debug3: Version in assoc_usage header is 7936
[2022-07-19T15:17:58.345] debug3: Version in qos_usage header is 7936
[2022-07-19T15:17:58.345] debug: Reading slurm.conf file:
/etc/slurm-llnl/slurm.conf
[2022-07-19T15:17:58.347] debug: Ignoring obsolete SchedulerPort option.
[2022-07-19T15:17:58.347] debug3: layouts: layouts_init()...
[2022-07-19T15:17:58.347] layouts: no layout to initialize
[2022-07-19T15:17:58.347] debug3: Trying to load plugin
/usr/local/lib/slurm/topology_none.so
[2022-07-19T15:17:58.347] topology NONE plugin loaded
[2022-07-19T15:17:58.347] debug3: Success.
[2022-07-19T15:17:58.348] debug: No DownNodes
[2022-07-19T15:17:58.348] debug3: Version in last_conf_lite header is 7936
[2022-07-19T15:17:58.349] debug3: Trying to load plugin
/usr/local/lib/slurm/jobcomp_none.so
[2022-07-19T15:17:58.349] debug3: Success.
[2022-07-19T15:17:58.349] debug3: Trying to load plugin
/usr/local/lib/slurm/sched_backfill.so
[2022-07-19T15:17:58.349] sched: Backfill scheduler plugin loaded
[2022-07-19T15:17:58.349] debug3: Success.
[2022-07-19T15:17:58.350] debug3: Trying to load plugin
/usr/local/lib/slurm/route_default.so
[2022-07-19T15:17:58.350] route default plugin loaded
[2022-07-19T15:17:58.350] debug3: Success.
[2022-07-19T15:17:58.355] layouts: loading entities/relations information
[2022-07-19T15:17:58.355] debug3: layouts: loading node node0
[2022-07-19T15:17:58.356] debug3: layouts: loading node node1
[2022-07-19T15:17:58.356] debug3: layouts: loading node node2
[2022-07-19T15:17:58.356] debug3: layouts: loading node node3
[2022-07-19T15:17:58.356] debug3: layouts: loading node node4
[2022-07-19T15:17:58.356] debug3: layouts: loading node node5
[2022-07-19T15:17:58.356] debug3: layouts: loading node node6
[2022-07-19T15:17:58.356] debug3: layouts: loading node node7
[2022-07-19T15:17:58.356] debug3: layouts: loading node node8
[2022-07-19T15:17:58.356] debug3: layouts: loading node node9
[2022-07-19T15:17:58.356] debug3: layouts: loading node node10
[2022-07-19T15:17:58.356] debug3: layouts: loading node node11
[2022-07-19T15:17:58.356] debug3: layouts: loading node node12
[2022-07-19T15:17:58.356] debug3: layouts: loading node node13
[2022-07-19T15:17:58.356] debug3: layouts: loading node node14
[2022-07-19T15:17:58.356] debug3: layouts: loading node node15
[2022-07-19T15:17:58.356] debug3: layouts: loading node node16
[2022-07-19T15:17:58.356] debug3: layouts: loading node node17
[2022-07-19T15:17:58.356] debug3: layouts: loading node node18
[2022-07-19T15:17:58.356] debug3: layouts: loading node node19
[2022-07-19T15:17:58.356] debug3: layouts: loading node node20
[2022-07-19T15:17:58.356] debug3: layouts: loading node node21
[2022-07-19T15:17:58.356] debug3: layouts: loading node node22
[2022-07-19T15:17:58.356] debug3: layouts: loading node node23
[2022-07-19T15:17:58.356] debug3: layouts: loading node node24
[2022-07-19T15:17:58.356] debug3: layouts: loading node node25
[2022-07-19T15:17:58.356] debug3: layouts: loading node node26
[2022-07-19T15:17:58.356] debug3: layouts: loading node node27
[2022-07-19T15:17:58.356] debug3: layouts: loading node node28
[2022-07-19T15:17:58.356] debug3: layouts: loading node node29
[2022-07-19T15:17:58.356] debug3: layouts: loading node node30
[2022-07-19T15:17:58.356] debug3: layouts: loading node node31
[2022-07-19T15:17:58.356] debug3: layouts: loading node node42
[2022-07-19T15:17:58.356] debug3: layouts: loading node node43
[2022-07-19T15:17:58.356] debug3: layouts: loading node node44
[2022-07-19T15:17:58.356] debug3: layouts: loading node node45
[2022-07-19T15:17:58.356] debug3: layouts: loading node node46
[2022-07-19T15:17:58.356] debug3: layouts: loading node node47
[2022-07-19T15:17:58.356] debug3: layouts: loading node node49
[2022-07-19T15:17:58.356] debug3: layouts: loading node node50
[2022-07-19T15:17:58.356] debug3: layouts: loading node node51
[2022-07-19T15:17:58.356] debug3: layouts: loading node node52
[2022-07-19T15:17:58.356] debug3: layouts: loading node node53
[2022-07-19T15:17:58.356] debug3: layouts: loading node node54
[2022-07-19T15:17:58.356] debug3: layouts: loading node node55
[2022-07-19T15:17:58.356] debug3: layouts: loading node node56
[2022-07-19T15:17:58.356] debug3: layouts: loading node node60
[2022-07-19T15:17:58.356] debug3: layouts: loading node node61
[2022-07-19T15:17:58.356] debug3: layouts: loading node node62
[2022-07-19T15:17:58.356] debug3: layouts: loading node node63
[2022-07-19T15:17:58.356] debug3: layouts: loading node node64
[2022-07-19T15:17:58.356] debug3: layouts: loading node node65
[2022-07-19T15:17:58.356] debug3: layouts: loading node node66
[2022-07-19T15:17:58.356] debug3: layouts: loading node node67
[2022-07-19T15:17:58.356] debug3: layouts: loading node node68
[2022-07-19T15:17:58.356] debug3: layouts: loading node node73
[2022-07-19T15:17:58.356] debug3: layouts: loading node node74
[2022-07-19T15:17:58.356] debug3: layouts: loading node node75
[2022-07-19T15:17:58.356] debug3: layouts: loading node node76
[2022-07-19T15:17:58.356] debug3: layouts: loading node node77
[2022-07-19T15:17:58.356] debug3: layouts: loading node node78
[2022-07-19T15:17:58.356] debug3: layouts: loading node node100
[2022-07-19T15:17:58.356] debug3: layouts: loading node node101
[2022-07-19T15:17:58.356] debug3: layouts: loading node node102
[2022-07-19T15:17:58.356] debug3: layouts: loading node node103
[2022-07-19T15:17:58.356] debug3: layouts: loading node node104
[2022-07-19T15:17:58.356] debug3: layouts: loading node node105
[2022-07-19T15:17:58.356] debug3: layouts: loading node node106
[2022-07-19T15:17:58.356] debug3: layouts: loading node node107
[2022-07-19T15:17:58.356] debug3: layouts: loading node node108
[2022-07-19T15:17:58.356] debug3: layouts: loading node node109
[2022-07-19T15:17:58.356] debug: layouts: 71/71 nodes in hash table, rc=0
[2022-07-19T15:17:58.356] debug: layouts: loading stage 1
[2022-07-19T15:17:58.356] debug: layouts: loading stage 1.1 (restore state)
[2022-07-19T15:17:58.356] debug: layouts: loading stage 2
[2022-07-19T15:17:58.356] debug: layouts: loading stage 3
[2022-07-19T15:17:58.356] error: Node state file
/var/lib/slurm-llnl/slurmctld/node_state too small
[2022-07-19T15:17:58.356] error: NOTE: Trying backup state save file.
Information may be lost!
[2022-07-19T15:17:58.356] debug3: Version string in node_state header is
PROTOCOL_VERSION
[2022-07-19T15:17:58.357] Recovered state of 71 nodes
[2022-07-19T15:17:58.357] error: Job state file
/var/lib/slurm-llnl/slurmctld/job_state too small
[2022-07-19T15:17:58.357] error: NOTE: Trying backup state save file.
Jobs may be lost!
[2022-07-19T15:17:58.357] error: Incomplete job state save file
[2022-07-19T15:17:58.357] Recovered information about 0 jobs
[2022-07-19T15:17:58.357] cons_res: select_p_node_init
[2022-07-19T15:17:58.357] cons_res: preparing for 7 partitions
[2022-07-19T15:17:58.357] debug: Ports available for reservation
10000-30000
[2022-07-19T15:17:58.359] debug2: init_requeue_policy:
kill_invalid_depend is set to 0
[2022-07-19T15:17:58.359] debug: Updating partition uid access list
[2022-07-19T15:17:58.359] debug3: Version string in resv_state header is
PROTOCOL_VERSION
[2022-07-19T15:17:58.359] Recovered state of 0 reservations
[2022-07-19T15:17:58.359] State of 0 triggers recovered
/var/log/slurm-llnl/slurmdbd.log:
[2022-07-19T15:00:45.265] debug3: Trying to load plugin
/usr/local/lib/slurm/auth_munge.so
[2022-07-19T15:00:45.265] debug: Munge authentication plugin loaded
[2022-07-19T15:00:45.265] debug3: Success.
[2022-07-19T15:00:45.265] debug3: Trying to load plugin
/usr/local/lib/slurm/accounting_storage_mysql.so
[2022-07-19T15:00:45.268] debug2: mysql_connect() called for db
slurm_acct_db
[2022-07-19T15:00:45.402] debug2: It appears the table conversions have
already taken place, hooray!
[2022-07-19T15:00:48.146] Accounting storage MYSQL plugin loaded
[2022-07-19T15:00:48.147] debug3: Success.
[2022-07-19T15:00:48.153] debug2: ArchiveDir = /home/slurm
[2022-07-19T15:00:48.153] debug2: ArchiveScript = (null)
[2022-07-19T15:00:48.153] debug2: AuthInfo =
/var/run/munge/munge.socket.2
[2022-07-19T15:00:48.154] debug2: AuthType = auth/munge
[2022-07-19T15:00:48.154] debug2: CommitDelay = 0
[2022-07-19T15:00:48.154] debug2: DbdAddr = localhost
[2022-07-19T15:00:48.154] debug2: DbdBackupHost = (null)
[2022-07-19T15:00:48.154] debug2: DbdHost = localhost
[2022-07-19T15:00:48.154] debug2: DbdPort = 6819
[2022-07-19T15:00:48.154] debug2: DebugFlags = (null)
[2022-07-19T15:00:48.154] debug2: DebugLevel = 7
[2022-07-19T15:00:48.154] debug2: DefaultQOS = (null)
[2022-07-19T15:00:48.154] debug2: LogFile =
/var/log/slurm-llnl/slurmdbd.log
[2022-07-19T15:00:48.154] debug2: MessageTimeout = 10
[2022-07-19T15:00:48.154] debug2: PidFile =
/var/run/slurm-llnl/slurmdbd.pid
[2022-07-19T15:00:48.154] debug2: PluginDir = /usr/local/lib/slurm
[2022-07-19T15:00:48.154] debug2: PrivateData = none
[2022-07-19T15:00:48.154] debug2: PurgeEventAfter = 730 days
[2022-07-19T15:00:48.154] debug2: PurgeJobAfter = 730 days
[2022-07-19T15:00:48.154] debug2: PurgeResvAfter = 730 days
[2022-07-19T15:00:48.154] debug2: PurgeStepAfter = 730 days
[2022-07-19T15:00:48.154] debug2: PurgeSuspendAfter = 730 days
[2022-07-19T15:00:48.154] debug2: PurgeTXNAfter = NONE
[2022-07-19T15:00:48.154] debug2: PurgeUsageAfter = NONE
[2022-07-19T15:00:48.154] debug2: SlurmUser = slurm(64030)
[2022-07-19T15:00:48.154] debug2: StorageBackupHost = (null)
[2022-07-19T15:00:48.154] debug2: StorageHost = localhost
[2022-07-19T15:00:48.154] debug2: StorageLoc = slurm_acct_db
[2022-07-19T15:00:48.154] debug2: StoragePort = 3306
[2022-07-19T15:00:48.154] debug2: StorageType =
accounting_storage/mysql
[2022-07-19T15:00:48.154] debug2: StorageUser = slurm
[2022-07-19T15:00:48.154] debug2: TCPTimeout = 2
[2022-07-19T15:00:48.154] debug2: TrackWCKey = 0
[2022-07-19T15:00:48.154] debug2: TrackSlurmctldDown= 0
[2022-07-19T15:00:48.154] debug2: acct_storage_p_get_connection: request
new connection 1
[2022-07-19T15:00:48.430] slurmdbd version 17.02.11 started
[2022-07-19T15:00:48.431] debug2: running rollup at Tue Jul 19 15:00:48 2022
[2022-07-19T15:00:48.435] debug2: No need to roll cluster clusterdev
this day 1658181600 <= 1658181600
[2022-07-19T15:00:48.435] debug2: No need to roll cluster clusterdev
this month 1656626400 <= 1656626400
[2022-07-19T15:00:48.436] debug2: Got 1 of 2 rolled up
[2022-07-19T15:00:48.454] error: We have more time than is possible
(1576800+2160000+0)(3736800) > 3456000 for cluster cluster(960) from
2022-07-19T14:00:00 - 2022-07-19T15:00:00 tres 1
[2022-07-19T15:00:48.456] debug2: No need to roll cluster cluster this
day 1658181600 <= 1658181600
[2022-07-19T15:00:48.457] debug2: No need to roll cluster cluster this
month 1656626400 <= 1656626400
[2022-07-19T15:00:48.458] debug2: Got 2 of 2 rolled up
[2022-07-19T15:00:48.458] debug2: Everything rolled up
[2022-07-19T15:01:05.000] debug2: Opened connection 9 from 10.0.1.51
[2022-07-19T15:01:05.001] debug: REQUEST_PERSIST_INIT: CLUSTER:cluster
VERSION:7936 UID:64030 IP:10.0.1.51 CONN:9
[2022-07-19T15:01:05.001] debug2: acct_storage_p_get_connection: request
new connection 1
[2022-07-19T15:06:52.329] Terminate signal (SIGINT or SIGTERM) received
[2022-07-19T15:06:52.330] debug: rpc_mgr shutting down
[2022-07-19T15:06:52.331] debug2: Closed connection 9 uid(64030)
[2022-07-19T15:06:52.332] debug3: starting mysql cleaning up
[2022-07-19T15:06:52.332] debug3: finished mysql cleaning up
[2022-07-19T15:11:13.288] debug3: Trying to load plugin
/usr/local/lib/slurm/auth_munge.so
[2022-07-19T15:11:13.301] debug: Munge authentication plugin loaded
[2022-07-19T15:11:13.301] debug3: Success.
[2022-07-19T15:11:13.301] debug3: Trying to load plugin
/usr/local/lib/slurm/accounting_storage_mysql.so
[2022-07-19T15:11:13.362] debug2: mysql_connect() called for db
slurm_acct_db
[2022-07-19T15:11:15.447] debug2: It appears the table conversions have
already taken place, hooray!
[2022-07-19T15:11:40.975] Accounting storage MYSQL plugin loaded
[2022-07-19T15:11:40.975] debug3: Success.
[2022-07-19T15:11:40.978] debug2: ArchiveDir = /home/slurm
[2022-07-19T15:11:40.978] debug2: ArchiveScript = (null)
[2022-07-19T15:11:40.978] debug2: AuthInfo =
/var/run/munge/munge.socket.2
[2022-07-19T15:11:40.978] debug2: AuthType = auth/munge
[2022-07-19T15:11:40.978] debug2: CommitDelay = 0
[2022-07-19T15:11:40.978] debug2: DbdAddr = localhost
[2022-07-19T15:11:40.978] debug2: DbdBackupHost = (null)
[2022-07-19T15:11:40.978] debug2: DbdHost = localhost
[2022-07-19T15:11:40.978] debug2: DbdPort = 6819
[2022-07-19T15:11:40.978] debug2: DebugFlags = (null)
[2022-07-19T15:11:40.978] debug2: DebugLevel = 7
[2022-07-19T15:11:40.978] debug2: DefaultQOS = (null)
[2022-07-19T15:11:40.978] debug2: LogFile =
/var/log/slurm-llnl/slurmdbd.log
[2022-07-19T15:11:40.978] debug2: MessageTimeout = 10
[2022-07-19T15:11:40.978] debug2: PidFile =
/var/run/slurm-llnl/slurmdbd.pid
[2022-07-19T15:11:40.978] debug2: PluginDir = /usr/local/lib/slurm
[2022-07-19T15:11:40.978] debug2: PrivateData = none
[2022-07-19T15:11:40.978] debug2: PurgeEventAfter = 730 days
[2022-07-19T15:11:40.978] debug2: PurgeJobAfter = 730 days
[2022-07-19T15:11:40.978] debug2: PurgeResvAfter = 730 days
[2022-07-19T15:11:40.978] debug2: PurgeStepAfter = 730 days
[2022-07-19T15:11:40.978] debug2: PurgeSuspendAfter = 730 days
[2022-07-19T15:11:40.978] debug2: PurgeTXNAfter = NONE
[2022-07-19T15:11:40.978] debug2: PurgeUsageAfter = NONE
[2022-07-19T15:11:40.979] debug2: SlurmUser = slurm(64030)
[2022-07-19T15:11:40.979] debug2: StorageBackupHost = (null)
[2022-07-19T15:11:40.979] debug2: StorageHost = localhost
[2022-07-19T15:11:40.979] debug2: StorageLoc = slurm_acct_db
[2022-07-19T15:11:40.979] debug2: StoragePort = 3306
[2022-07-19T15:11:40.979] debug2: StorageType =
accounting_storage/mysql
[2022-07-19T15:11:40.979] debug2: StorageUser = slurm
[2022-07-19T15:11:40.979] debug2: TCPTimeout = 2
[2022-07-19T15:11:40.979] debug2: TrackWCKey = 0
[2022-07-19T15:11:40.979] debug2: TrackSlurmctldDown= 0
[2022-07-19T15:11:40.979] debug2: acct_storage_p_get_connection: request
new connection 1
[2022-07-19T15:11:41.168] slurmdbd version 17.02.11 started
[2022-07-19T15:11:41.168] debug2: running rollup at Tue Jul 19 15:11:41 2022
[2022-07-19T15:11:41.170] debug2: No need to roll cluster clusterdev
this hour 1658235600 <= 1658235600
[2022-07-19T15:11:41.170] debug2: No need to roll cluster clusterdev
this day 1658181600 <= 1658181600
[2022-07-19T15:11:41.170] debug2: No need to roll cluster clusterdev
this month 1656626400 <= 1656626400
[2022-07-19T15:11:41.170] debug2: No need to roll cluster cluster this
hour 1658235600 <= 1658235600
[2022-07-19T15:11:41.170] debug2: No need to roll cluster cluster this
day 1658181600 <= 1658181600
[2022-07-19T15:11:41.170] debug2: No need to roll cluster cluster this
month 1656626400 <= 1656626400
[2022-07-19T15:11:41.170] debug2: Got 2 of 2 rolled up
[2022-07-19T15:11:41.170] debug2: Everything rolled up
[2022-07-19T15:11:58.000] debug2: Opened connection 9 from 10.0.1.51
[2022-07-19T15:11:58.003] debug: REQUEST_PERSIST_INIT: CLUSTER:cluster
VERSION:7936 UID:64030 IP:10.0.1.51 CONN:9
[2022-07-19T15:11:58.003] debug2: acct_storage_p_get_connection: request
new connection 1
[2022-07-19T15:15:28.671] debug2: Opened connection 10 from 10.0.1.51
[2022-07-19T15:15:28.672] debug: REQUEST_PERSIST_INIT: CLUSTER:cluster
VERSION:7936 UID:0 IP:10.0.1.51 CONN:10
[2022-07-19T15:15:28.672] debug2: acct_storage_p_get_connection: request
new connection 1
[2022-07-19T15:15:28.710] debug2: DBD_FINI: CLOSE:0 COMMIT:0
[2022-07-19T15:15:28.790] debug2: DBD_GET_USERS: called
[2022-07-19T15:15:28.847] debug2: DBD_FINI: CLOSE:1 COMMIT:0
[2022-07-19T15:15:28.847] debug2: persistant connection is closed
[2022-07-19T15:15:28.847] debug2: Closed connection 10 uid(0)
[2022-07-19T15:17:19.421] debug2: Closed connection 9 uid(64030)
[2022-07-19T15:17:53.635] debug2: Opened connection 12 from 10.0.1.51
[2022-07-19T15:17:53.636] debug: REQUEST_PERSIST_INIT: CLUSTER:cluster
VERSION:7936 UID:64030 IP:10.0.1.51 CONN:12
[2022-07-19T15:17:53.636] debug2: acct_storage_p_get_connection: request
new connection 1
[2022-07-19T15:17:53.674] debug2: DBD_GET_TRES: called
[2022-07-19T15:17:53.754] debug2: DBD_GET_QOS: called
[2022-07-19T15:17:53.834] debug2: DBD_GET_USERS: called
[2022-07-19T15:17:54.150] debug2: DBD_GET_ASSOCS: called
[2022-07-19T15:17:58.304] debug2: DBD_GET_RES: called
[2022-07-19T16:00:00.171] debug2: running rollup at Tue Jul 19 16:00:00 2022
[2022-07-19T16:00:00.318] debug2: No need to roll cluster clusterdev
this day 1658181600 <= 1658181600
[2022-07-19T16:00:00.318] debug2: No need to roll cluster clusterdev
this month 1656626400 <= 1656626400
[2022-07-19T16:00:00.320] debug2: Got 1 of 2 rolled up
[2022-07-19T16:00:01.603] error: We have more time than is possible
(1576800+2160000+0)(3736800) > 3456000 for cluster cluster(960) from
2022-07-19T15:00:00 - 2022-07-19T16:00:00 tres 1
[2022-07-19T16:00:01.693] debug2: No need to roll cluster cluster this
day 1658181600 <= 1658181600
[2022-07-19T16:00:01.694] debug2: No need to roll cluster cluster this
month 1656626400 <= 1656626400
[2022-07-19T16:00:01.711] debug2: Got 2 of 2 rolled up
[2022-07-19T16:00:01.711] debug2: Everything rolled up
--
Julien Rey
Plate-forme RPBS
Unité BFA - CMPLI
Université de Paris
tel: 01 57 27 83 95