[slurm-users] slurmctld up and running but not really working

Julien Rey julien.rey at univ-paris-diderot.fr
Tue Jul 19 14:28:46 UTC 2022


Hello,

I am currently facing an issue with an old install of slurm (17.02.11). 
However, I cannot upgrade this version because I had trouble with the 
database migration in the past (when upgrading to 17.11), and this 
install is due to be replaced in the coming months. For the time 
being I have to keep it running because some of our services still rely 
on it.

This issue occurred after a power outage.

slurmctld is up and running; however, when I run "sinfo", I end up 
with this message after a few minutes:

slurm_load_partitions: Unable to contact slurm controller (connect failure)
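
For reference, whether the controller is reachable at all can be 
checked with something like the following (assuming the default 
SlurmctldPort of 6817; adjust if slurm.conf sets a different one):

# is slurmctld still listening on its port?
ss -tlnp | grep 6817
# ask the controller to respond directly
scontrol ping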

I set SlurmctldDebug=7 in slurm.conf and DebugLevel=7 in slurmdbd.conf; 
however, the logs do not show any specific error that would explain 
why the slurm controller stops responding.
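
The relevant settings look roughly like this (a sketch of my config; 
numeric level 7 corresponds to debug3, and the log file paths match 
the excerpts below, though SlurmctldLogFile is inferred from the 
excerpt path rather than quoted verbatim):

# /etc/slurm-llnl/slurm.conf
SlurmctldDebug=7
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log

# /etc/slurm-llnl/slurmdbd.conf
DebugLevel=7
LogFile=/var/log/slurm-llnl/slurmdbd.log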

Any help would be greatly appreciated.

/var/log/slurm-llnl/slurmctld.log:

[2022-07-19T15:17:58.342] debug3: Version in assoc_usage header is 7936
[2022-07-19T15:17:58.345] debug3: Version in qos_usage header is 7936
[2022-07-19T15:17:58.345] debug:  Reading slurm.conf file: 
/etc/slurm-llnl/slurm.conf
[2022-07-19T15:17:58.347] debug:  Ignoring obsolete SchedulerPort option.
[2022-07-19T15:17:58.347] debug3: layouts: layouts_init()...
[2022-07-19T15:17:58.347] layouts: no layout to initialize
[2022-07-19T15:17:58.347] debug3: Trying to load plugin 
/usr/local/lib/slurm/topology_none.so
[2022-07-19T15:17:58.347] topology NONE plugin loaded
[2022-07-19T15:17:58.347] debug3: Success.
[2022-07-19T15:17:58.348] debug:  No DownNodes
[2022-07-19T15:17:58.348] debug3: Version in last_conf_lite header is 7936
[2022-07-19T15:17:58.349] debug3: Trying to load plugin 
/usr/local/lib/slurm/jobcomp_none.so
[2022-07-19T15:17:58.349] debug3: Success.
[2022-07-19T15:17:58.349] debug3: Trying to load plugin 
/usr/local/lib/slurm/sched_backfill.so
[2022-07-19T15:17:58.349] sched: Backfill scheduler plugin loaded
[2022-07-19T15:17:58.349] debug3: Success.
[2022-07-19T15:17:58.350] debug3: Trying to load plugin 
/usr/local/lib/slurm/route_default.so
[2022-07-19T15:17:58.350] route default plugin loaded
[2022-07-19T15:17:58.350] debug3: Success.
[2022-07-19T15:17:58.355] layouts: loading entities/relations information
[2022-07-19T15:17:58.355] debug3: layouts: loading node node0
[2022-07-19T15:17:58.356] debug3: layouts: loading node node1
[2022-07-19T15:17:58.356] debug3: layouts: loading node node2
[2022-07-19T15:17:58.356] debug3: layouts: loading node node3
[2022-07-19T15:17:58.356] debug3: layouts: loading node node4
[2022-07-19T15:17:58.356] debug3: layouts: loading node node5
[2022-07-19T15:17:58.356] debug3: layouts: loading node node6
[2022-07-19T15:17:58.356] debug3: layouts: loading node node7
[2022-07-19T15:17:58.356] debug3: layouts: loading node node8
[2022-07-19T15:17:58.356] debug3: layouts: loading node node9
[2022-07-19T15:17:58.356] debug3: layouts: loading node node10
[2022-07-19T15:17:58.356] debug3: layouts: loading node node11
[2022-07-19T15:17:58.356] debug3: layouts: loading node node12
[2022-07-19T15:17:58.356] debug3: layouts: loading node node13
[2022-07-19T15:17:58.356] debug3: layouts: loading node node14
[2022-07-19T15:17:58.356] debug3: layouts: loading node node15
[2022-07-19T15:17:58.356] debug3: layouts: loading node node16
[2022-07-19T15:17:58.356] debug3: layouts: loading node node17
[2022-07-19T15:17:58.356] debug3: layouts: loading node node18
[2022-07-19T15:17:58.356] debug3: layouts: loading node node19
[2022-07-19T15:17:58.356] debug3: layouts: loading node node20
[2022-07-19T15:17:58.356] debug3: layouts: loading node node21
[2022-07-19T15:17:58.356] debug3: layouts: loading node node22
[2022-07-19T15:17:58.356] debug3: layouts: loading node node23
[2022-07-19T15:17:58.356] debug3: layouts: loading node node24
[2022-07-19T15:17:58.356] debug3: layouts: loading node node25
[2022-07-19T15:17:58.356] debug3: layouts: loading node node26
[2022-07-19T15:17:58.356] debug3: layouts: loading node node27
[2022-07-19T15:17:58.356] debug3: layouts: loading node node28
[2022-07-19T15:17:58.356] debug3: layouts: loading node node29
[2022-07-19T15:17:58.356] debug3: layouts: loading node node30
[2022-07-19T15:17:58.356] debug3: layouts: loading node node31
[2022-07-19T15:17:58.356] debug3: layouts: loading node node42
[2022-07-19T15:17:58.356] debug3: layouts: loading node node43
[2022-07-19T15:17:58.356] debug3: layouts: loading node node44
[2022-07-19T15:17:58.356] debug3: layouts: loading node node45
[2022-07-19T15:17:58.356] debug3: layouts: loading node node46
[2022-07-19T15:17:58.356] debug3: layouts: loading node node47
[2022-07-19T15:17:58.356] debug3: layouts: loading node node49
[2022-07-19T15:17:58.356] debug3: layouts: loading node node50
[2022-07-19T15:17:58.356] debug3: layouts: loading node node51
[2022-07-19T15:17:58.356] debug3: layouts: loading node node52
[2022-07-19T15:17:58.356] debug3: layouts: loading node node53
[2022-07-19T15:17:58.356] debug3: layouts: loading node node54
[2022-07-19T15:17:58.356] debug3: layouts: loading node node55
[2022-07-19T15:17:58.356] debug3: layouts: loading node node56
[2022-07-19T15:17:58.356] debug3: layouts: loading node node60
[2022-07-19T15:17:58.356] debug3: layouts: loading node node61
[2022-07-19T15:17:58.356] debug3: layouts: loading node node62
[2022-07-19T15:17:58.356] debug3: layouts: loading node node63
[2022-07-19T15:17:58.356] debug3: layouts: loading node node64
[2022-07-19T15:17:58.356] debug3: layouts: loading node node65
[2022-07-19T15:17:58.356] debug3: layouts: loading node node66
[2022-07-19T15:17:58.356] debug3: layouts: loading node node67
[2022-07-19T15:17:58.356] debug3: layouts: loading node node68
[2022-07-19T15:17:58.356] debug3: layouts: loading node node73
[2022-07-19T15:17:58.356] debug3: layouts: loading node node74
[2022-07-19T15:17:58.356] debug3: layouts: loading node node75
[2022-07-19T15:17:58.356] debug3: layouts: loading node node76
[2022-07-19T15:17:58.356] debug3: layouts: loading node node77
[2022-07-19T15:17:58.356] debug3: layouts: loading node node78
[2022-07-19T15:17:58.356] debug3: layouts: loading node node100
[2022-07-19T15:17:58.356] debug3: layouts: loading node node101
[2022-07-19T15:17:58.356] debug3: layouts: loading node node102
[2022-07-19T15:17:58.356] debug3: layouts: loading node node103
[2022-07-19T15:17:58.356] debug3: layouts: loading node node104
[2022-07-19T15:17:58.356] debug3: layouts: loading node node105
[2022-07-19T15:17:58.356] debug3: layouts: loading node node106
[2022-07-19T15:17:58.356] debug3: layouts: loading node node107
[2022-07-19T15:17:58.356] debug3: layouts: loading node node108
[2022-07-19T15:17:58.356] debug3: layouts: loading node node109
[2022-07-19T15:17:58.356] debug:  layouts: 71/71 nodes in hash table, rc=0
[2022-07-19T15:17:58.356] debug:  layouts: loading stage 1
[2022-07-19T15:17:58.356] debug:  layouts: loading stage 1.1 (restore state)
[2022-07-19T15:17:58.356] debug:  layouts: loading stage 2
[2022-07-19T15:17:58.356] debug:  layouts: loading stage 3
[2022-07-19T15:17:58.356] error: Node state file 
/var/lib/slurm-llnl/slurmctld/node_state too small
[2022-07-19T15:17:58.356] error: NOTE: Trying backup state save file. 
Information may be lost!
[2022-07-19T15:17:58.356] debug3: Version string in node_state header is 
PROTOCOL_VERSION
[2022-07-19T15:17:58.357] Recovered state of 71 nodes
[2022-07-19T15:17:58.357] error: Job state file 
/var/lib/slurm-llnl/slurmctld/job_state too small
[2022-07-19T15:17:58.357] error: NOTE: Trying backup state save file. 
Jobs may be lost!
[2022-07-19T15:17:58.357] error: Incomplete job state save file
[2022-07-19T15:17:58.357] Recovered information about 0 jobs
[2022-07-19T15:17:58.357] cons_res: select_p_node_init
[2022-07-19T15:17:58.357] cons_res: preparing for 7 partitions
[2022-07-19T15:17:58.357] debug:  Ports available for reservation 
10000-30000
[2022-07-19T15:17:58.359] debug2: init_requeue_policy: 
kill_invalid_depend is set to 0
[2022-07-19T15:17:58.359] debug:  Updating partition uid access list
[2022-07-19T15:17:58.359] debug3: Version string in resv_state header is 
PROTOCOL_VERSION
[2022-07-19T15:17:58.359] Recovered state of 0 reservations
[2022-07-19T15:17:58.359] State of 0 triggers recovered
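
For reference, the "too small" errors above point at the state save 
files under /var/lib/slurm-llnl/slurmctld (the path shown in the log); 
their sizes, including any .old backups that the "Trying backup state 
save file" messages refer to, can be checked with something like:

ls -l /var/lib/slurm-llnl/slurmctld/node_state* /var/lib/slurm-llnl/slurmctld/job_state*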


/var/log/slurm-llnl/slurmdbd.log:

[2022-07-19T15:00:45.265] debug3: Trying to load plugin 
/usr/local/lib/slurm/auth_munge.so
[2022-07-19T15:00:45.265] debug:  Munge authentication plugin loaded
[2022-07-19T15:00:45.265] debug3: Success.
[2022-07-19T15:00:45.265] debug3: Trying to load plugin 
/usr/local/lib/slurm/accounting_storage_mysql.so
[2022-07-19T15:00:45.268] debug2: mysql_connect() called for db 
slurm_acct_db
[2022-07-19T15:00:45.402] debug2: It appears the table conversions have 
already taken place, hooray!
[2022-07-19T15:00:48.146] Accounting storage MYSQL plugin loaded
[2022-07-19T15:00:48.147] debug3: Success.
[2022-07-19T15:00:48.153] debug2: ArchiveDir        = /home/slurm
[2022-07-19T15:00:48.153] debug2: ArchiveScript     = (null)
[2022-07-19T15:00:48.153] debug2: AuthInfo          = 
/var/run/munge/munge.socket.2
[2022-07-19T15:00:48.154] debug2: AuthType          = auth/munge
[2022-07-19T15:00:48.154] debug2: CommitDelay       = 0
[2022-07-19T15:00:48.154] debug2: DbdAddr           = localhost
[2022-07-19T15:00:48.154] debug2: DbdBackupHost     = (null)
[2022-07-19T15:00:48.154] debug2: DbdHost           = localhost
[2022-07-19T15:00:48.154] debug2: DbdPort           = 6819
[2022-07-19T15:00:48.154] debug2: DebugFlags        = (null)
[2022-07-19T15:00:48.154] debug2: DebugLevel        = 7
[2022-07-19T15:00:48.154] debug2: DefaultQOS        = (null)
[2022-07-19T15:00:48.154] debug2: LogFile           = 
/var/log/slurm-llnl/slurmdbd.log
[2022-07-19T15:00:48.154] debug2: MessageTimeout    = 10
[2022-07-19T15:00:48.154] debug2: PidFile           = 
/var/run/slurm-llnl/slurmdbd.pid
[2022-07-19T15:00:48.154] debug2: PluginDir         = /usr/local/lib/slurm
[2022-07-19T15:00:48.154] debug2: PrivateData       = none
[2022-07-19T15:00:48.154] debug2: PurgeEventAfter   = 730 days
[2022-07-19T15:00:48.154] debug2: PurgeJobAfter     = 730 days
[2022-07-19T15:00:48.154] debug2: PurgeResvAfter    = 730 days
[2022-07-19T15:00:48.154] debug2: PurgeStepAfter    = 730 days
[2022-07-19T15:00:48.154] debug2: PurgeSuspendAfter = 730 days
[2022-07-19T15:00:48.154] debug2: PurgeTXNAfter = NONE
[2022-07-19T15:00:48.154] debug2: PurgeUsageAfter = NONE
[2022-07-19T15:00:48.154] debug2: SlurmUser         = slurm(64030)
[2022-07-19T15:00:48.154] debug2: StorageBackupHost = (null)
[2022-07-19T15:00:48.154] debug2: StorageHost       = localhost
[2022-07-19T15:00:48.154] debug2: StorageLoc        = slurm_acct_db
[2022-07-19T15:00:48.154] debug2: StoragePort       = 3306
[2022-07-19T15:00:48.154] debug2: StorageType       = 
accounting_storage/mysql
[2022-07-19T15:00:48.154] debug2: StorageUser       = slurm
[2022-07-19T15:00:48.154] debug2: TCPTimeout        = 2
[2022-07-19T15:00:48.154] debug2: TrackWCKey        = 0
[2022-07-19T15:00:48.154] debug2: TrackSlurmctldDown= 0
[2022-07-19T15:00:48.154] debug2: acct_storage_p_get_connection: request 
new connection 1
[2022-07-19T15:00:48.430] slurmdbd version 17.02.11 started
[2022-07-19T15:00:48.431] debug2: running rollup at Tue Jul 19 15:00:48 2022
[2022-07-19T15:00:48.435] debug2: No need to roll cluster clusterdev 
this day 1658181600 <= 1658181600
[2022-07-19T15:00:48.435] debug2: No need to roll cluster clusterdev 
this month 1656626400 <= 1656626400
[2022-07-19T15:00:48.436] debug2: Got 1 of 2 rolled up
[2022-07-19T15:00:48.454] error: We have more time than is possible 
(1576800+2160000+0)(3736800) > 3456000 for cluster cluster(960) from 
2022-07-19T14:00:00 - 2022-07-19T15:00:00 tres 1
[2022-07-19T15:00:48.456] debug2: No need to roll cluster cluster this 
day 1658181600 <= 1658181600
[2022-07-19T15:00:48.457] debug2: No need to roll cluster cluster this 
month 1656626400 <= 1656626400
[2022-07-19T15:00:48.458] debug2: Got 2 of 2 rolled up
[2022-07-19T15:00:48.458] debug2: Everything rolled up
[2022-07-19T15:01:05.000] debug2: Opened connection 9 from 10.0.1.51
[2022-07-19T15:01:05.001] debug:  REQUEST_PERSIST_INIT: CLUSTER:cluster 
VERSION:7936 UID:64030 IP:10.0.1.51 CONN:9
[2022-07-19T15:01:05.001] debug2: acct_storage_p_get_connection: request 
new connection 1
[2022-07-19T15:06:52.329] Terminate signal (SIGINT or SIGTERM) received
[2022-07-19T15:06:52.330] debug:  rpc_mgr shutting down
[2022-07-19T15:06:52.331] debug2: Closed connection 9 uid(64030)
[2022-07-19T15:06:52.332] debug3: starting mysql cleaning up
[2022-07-19T15:06:52.332] debug3: finished mysql cleaning up
[2022-07-19T15:11:13.288] debug3: Trying to load plugin 
/usr/local/lib/slurm/auth_munge.so
[2022-07-19T15:11:13.301] debug:  Munge authentication plugin loaded
[2022-07-19T15:11:13.301] debug3: Success.
[2022-07-19T15:11:13.301] debug3: Trying to load plugin 
/usr/local/lib/slurm/accounting_storage_mysql.so
[2022-07-19T15:11:13.362] debug2: mysql_connect() called for db 
slurm_acct_db
[2022-07-19T15:11:15.447] debug2: It appears the table conversions have 
already taken place, hooray!
[2022-07-19T15:11:40.975] Accounting storage MYSQL plugin loaded
[2022-07-19T15:11:40.975] debug3: Success.
[2022-07-19T15:11:40.978] debug2: ArchiveDir        = /home/slurm
[2022-07-19T15:11:40.978] debug2: ArchiveScript     = (null)
[2022-07-19T15:11:40.978] debug2: AuthInfo          = 
/var/run/munge/munge.socket.2
[2022-07-19T15:11:40.978] debug2: AuthType          = auth/munge
[2022-07-19T15:11:40.978] debug2: CommitDelay       = 0
[2022-07-19T15:11:40.978] debug2: DbdAddr           = localhost
[2022-07-19T15:11:40.978] debug2: DbdBackupHost     = (null)
[2022-07-19T15:11:40.978] debug2: DbdHost           = localhost
[2022-07-19T15:11:40.978] debug2: DbdPort           = 6819
[2022-07-19T15:11:40.978] debug2: DebugFlags        = (null)
[2022-07-19T15:11:40.978] debug2: DebugLevel        = 7
[2022-07-19T15:11:40.978] debug2: DefaultQOS        = (null)
[2022-07-19T15:11:40.978] debug2: LogFile           = 
/var/log/slurm-llnl/slurmdbd.log
[2022-07-19T15:11:40.978] debug2: MessageTimeout    = 10
[2022-07-19T15:11:40.978] debug2: PidFile           = 
/var/run/slurm-llnl/slurmdbd.pid
[2022-07-19T15:11:40.978] debug2: PluginDir         = /usr/local/lib/slurm
[2022-07-19T15:11:40.978] debug2: PrivateData       = none
[2022-07-19T15:11:40.978] debug2: PurgeEventAfter   = 730 days
[2022-07-19T15:11:40.978] debug2: PurgeJobAfter     = 730 days
[2022-07-19T15:11:40.978] debug2: PurgeResvAfter    = 730 days
[2022-07-19T15:11:40.978] debug2: PurgeStepAfter    = 730 days
[2022-07-19T15:11:40.978] debug2: PurgeSuspendAfter = 730 days
[2022-07-19T15:11:40.978] debug2: PurgeTXNAfter = NONE
[2022-07-19T15:11:40.978] debug2: PurgeUsageAfter = NONE
[2022-07-19T15:11:40.979] debug2: SlurmUser         = slurm(64030)
[2022-07-19T15:11:40.979] debug2: StorageBackupHost = (null)
[2022-07-19T15:11:40.979] debug2: StorageHost       = localhost
[2022-07-19T15:11:40.979] debug2: StorageLoc        = slurm_acct_db
[2022-07-19T15:11:40.979] debug2: StoragePort       = 3306
[2022-07-19T15:11:40.979] debug2: StorageType       = 
accounting_storage/mysql
[2022-07-19T15:11:40.979] debug2: StorageUser       = slurm
[2022-07-19T15:11:40.979] debug2: TCPTimeout        = 2
[2022-07-19T15:11:40.979] debug2: TrackWCKey        = 0
[2022-07-19T15:11:40.979] debug2: TrackSlurmctldDown= 0
[2022-07-19T15:11:40.979] debug2: acct_storage_p_get_connection: request 
new connection 1
[2022-07-19T15:11:41.168] slurmdbd version 17.02.11 started
[2022-07-19T15:11:41.168] debug2: running rollup at Tue Jul 19 15:11:41 2022
[2022-07-19T15:11:41.170] debug2: No need to roll cluster clusterdev 
this hour 1658235600 <= 1658235600
[2022-07-19T15:11:41.170] debug2: No need to roll cluster clusterdev 
this day 1658181600 <= 1658181600
[2022-07-19T15:11:41.170] debug2: No need to roll cluster clusterdev 
this month 1656626400 <= 1656626400
[2022-07-19T15:11:41.170] debug2: No need to roll cluster cluster this 
hour 1658235600 <= 1658235600
[2022-07-19T15:11:41.170] debug2: No need to roll cluster cluster this 
day 1658181600 <= 1658181600
[2022-07-19T15:11:41.170] debug2: No need to roll cluster cluster this 
month 1656626400 <= 1656626400
[2022-07-19T15:11:41.170] debug2: Got 2 of 2 rolled up
[2022-07-19T15:11:41.170] debug2: Everything rolled up
[2022-07-19T15:11:58.000] debug2: Opened connection 9 from 10.0.1.51
[2022-07-19T15:11:58.003] debug:  REQUEST_PERSIST_INIT: CLUSTER:cluster 
VERSION:7936 UID:64030 IP:10.0.1.51 CONN:9
[2022-07-19T15:11:58.003] debug2: acct_storage_p_get_connection: request 
new connection 1
[2022-07-19T15:15:28.671] debug2: Opened connection 10 from 10.0.1.51
[2022-07-19T15:15:28.672] debug:  REQUEST_PERSIST_INIT: CLUSTER:cluster 
VERSION:7936 UID:0 IP:10.0.1.51 CONN:10
[2022-07-19T15:15:28.672] debug2: acct_storage_p_get_connection: request 
new connection 1
[2022-07-19T15:15:28.710] debug2: DBD_FINI: CLOSE:0 COMMIT:0
[2022-07-19T15:15:28.790] debug2: DBD_GET_USERS: called
[2022-07-19T15:15:28.847] debug2: DBD_FINI: CLOSE:1 COMMIT:0
[2022-07-19T15:15:28.847] debug2: persistant connection is closed
[2022-07-19T15:15:28.847] debug2: Closed connection 10 uid(0)
[2022-07-19T15:17:19.421] debug2: Closed connection 9 uid(64030)
[2022-07-19T15:17:53.635] debug2: Opened connection 12 from 10.0.1.51
[2022-07-19T15:17:53.636] debug:  REQUEST_PERSIST_INIT: CLUSTER:cluster 
VERSION:7936 UID:64030 IP:10.0.1.51 CONN:12
[2022-07-19T15:17:53.636] debug2: acct_storage_p_get_connection: request 
new connection 1
[2022-07-19T15:17:53.674] debug2: DBD_GET_TRES: called
[2022-07-19T15:17:53.754] debug2: DBD_GET_QOS: called
[2022-07-19T15:17:53.834] debug2: DBD_GET_USERS: called
[2022-07-19T15:17:54.150] debug2: DBD_GET_ASSOCS: called
[2022-07-19T15:17:58.304] debug2: DBD_GET_RES: called
[2022-07-19T16:00:00.171] debug2: running rollup at Tue Jul 19 16:00:00 2022
[2022-07-19T16:00:00.318] debug2: No need to roll cluster clusterdev 
this day 1658181600 <= 1658181600
[2022-07-19T16:00:00.318] debug2: No need to roll cluster clusterdev 
this month 1656626400 <= 1656626400
[2022-07-19T16:00:00.320] debug2: Got 1 of 2 rolled up
[2022-07-19T16:00:01.603] error: We have more time than is possible 
(1576800+2160000+0)(3736800) > 3456000 for cluster cluster(960) from 
2022-07-19T15:00:00 - 2022-07-19T16:00:00 tres 1
[2022-07-19T16:00:01.693] debug2: No need to roll cluster cluster this 
day 1658181600 <= 1658181600
[2022-07-19T16:00:01.694] debug2: No need to roll cluster cluster this 
month 1656626400 <= 1656626400
[2022-07-19T16:00:01.711] debug2: Got 2 of 2 rolled up
[2022-07-19T16:00:01.711] debug2: Everything rolled up

-- 
Julien Rey

Plate-forme RPBS
Unité BFA - CMPLI
Université de Paris
tel: 01 57 27 83 95
