[slurm-users] slurmctld up and running but not really working

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Tue Jul 19 18:29:53 UTC 2022


Hi Julien,

Apparently your slurmdbd is quite happy, but it seems that your 
slurmctld StateSaveLocation has been corrupted:

> [2022-07-19T15:17:58.356] error: Node state file /var/lib/slurm-llnl/slurmctld/node_state too small
> [2022-07-19T15:17:58.356] error: NOTE: Trying backup state save file. Information may be lost!
> [2022-07-19T15:17:58.356] debug3: Version string in node_state header is PROTOCOL_VERSION
> [2022-07-19T15:17:58.357] Recovered state of 71 nodes
> [2022-07-19T15:17:58.357] error: Job state file /var/lib/slurm-llnl/slurmctld/job_state too small
> [2022-07-19T15:17:58.357] error: NOTE: Trying backup state save file. Jobs may be lost!
> [2022-07-19T15:17:58.357] error: Incomplete job state save file 
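
A quick sanity check is to look at the file sizes in the 
StateSaveLocation; after a power outage the state files are often 
truncated or zero bytes, and slurmctld also keeps *.old backup copies 
next to them, for example:

   ls -l /var/lib/slurm-llnl/slurmctld/
   ls -l /var/lib/slurm-llnl/slurmctld/node_state* \
         /var/lib/slurm-llnl/slurmctld/job_state*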

Did something bad happen to the storage holding 
/var/lib/slurm-llnl/slurmctld/?  Could you possibly restore this folder 
from your latest backup?
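
If you do have a backup, the restore itself is simple; a rough sketch 
(stop the controller first so it doesn't overwrite the restored files, 
adjust the restore step to your backup system, and I'm assuming 
SlurmUser=slurm):

   systemctl stop slurmctld          # or: service slurmctld stop
   # ... restore /var/lib/slurm-llnl/slurmctld/ from backup here ...
   chown -R slurm: /var/lib/slurm-llnl/slurmctld
   systemctl start slurmctld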

I don't know if it's possible to recover from a corrupted slurmctld 
StateSaveLocation - perhaps others on the list have experience with this?

Even if you could restore it, the Slurm database probably needs to be 
consistent with your slurmctld StateSaveLocation, and I don't know if 
this is feasible...
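
Since the job state was lost ("Recovered information about 0 jobs"), the 
database may still list jobs as running that the controller no longer 
knows about.  If your sacctmgr version supports it, those can be found 
and cleaned up with:

   sacctmgr show runawayjobs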

Alternatively, could you reinitialize your Slurm 17.02.11 state and 
start it from scratch?
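
If that is acceptable (all pending and running jobs on the cluster would 
be lost), a minimal sketch would be to move the broken state directory 
aside and let slurmctld rebuild it (again assuming SlurmUser=slurm; see 
also the slurmctld -c option, which starts with a clean state):

   systemctl stop slurmctld
   mv /var/lib/slurm-llnl/slurmctld /var/lib/slurm-llnl/slurmctld.broken
   mkdir -p /var/lib/slurm-llnl/slurmctld
   chown slurm: /var/lib/slurm-llnl/slurmctld
   systemctl start slurmctld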

Regarding an upgrade from 17.02 or 17.11, you may find some useful notes 
on my Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
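
The general order described there is: back up the accounting database, 
upgrade and start slurmdbd first (it migrates the database schema), then 
slurmctld, and finally the slurmd's.  Note that Slurm only supports 
upgrading across at most two major releases at a time.  Roughly (the 
database name slurm_acct_db is the default, adjust to your setup):

   systemctl stop slurmdbd
   mysqldump -u root -p slurm_acct_db > slurm_acct_db.sql
   # install the new Slurm packages, then:
   systemctl start slurmdbd      # schema migration, may take a while
   systemctl restart slurmctld
   # then restart slurmd on the compute nodes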

/Ole


On 19-07-2022 16:28, Julien Rey wrote:
> I am currently facing an issue with an old install of slurm (17.02.11). 
> However, I cannot upgrade this version because I had trouble with 
> database migration in the past (when upgrading to 17.11), and this 
> install is set to be replaced in the coming months. For the time 
> being I have to keep it running because some of our services still rely 
> on it.
> 
> This issue occurred after a power outage.
> 
> slurmctld is up and running; however, when I enter "sinfo", I end up 
> with this message after a few minutes:
> 
> slurm_load_partitions: Unable to contact slurm controller (connect failure)
> 
> I set SlurmctldDebug=7 in slurm.conf and DebugLevel=7 in slurmdbd.conf; 
> however, I don't see any specific error in the logs that would explain 
> what is preventing the slurm controller from working.
> 
> Any help would be greatly appreciated.
> 
> /var/log/slurm-llnl/slurmctld.log:
> 
> [2022-07-19T15:17:58.342] debug3: Version in assoc_usage header is 7936
> [2022-07-19T15:17:58.345] debug3: Version in qos_usage header is 7936
> [2022-07-19T15:17:58.345] debug:  Reading slurm.conf file: 
> /etc/slurm-llnl/slurm.conf
> [2022-07-19T15:17:58.347] debug:  Ignoring obsolete SchedulerPort option.
> [2022-07-19T15:17:58.347] debug3: layouts: layouts_init()...
> [2022-07-19T15:17:58.347] layouts: no layout to initialize
> [2022-07-19T15:17:58.347] debug3: Trying to load plugin 
> /usr/local/lib/slurm/topology_none.so
> [2022-07-19T15:17:58.347] topology NONE plugin loaded
> [2022-07-19T15:17:58.347] debug3: Success.
> [2022-07-19T15:17:58.348] debug:  No DownNodes
> [2022-07-19T15:17:58.348] debug3: Version in last_conf_lite header is 7936
> [2022-07-19T15:17:58.349] debug3: Trying to load plugin 
> /usr/local/lib/slurm/jobcomp_none.so
> [2022-07-19T15:17:58.349] debug3: Success.
> [2022-07-19T15:17:58.349] debug3: Trying to load plugin 
> /usr/local/lib/slurm/sched_backfill.so
> [2022-07-19T15:17:58.349] sched: Backfill scheduler plugin loaded
> [2022-07-19T15:17:58.349] debug3: Success.
> [2022-07-19T15:17:58.350] debug3: Trying to load plugin 
> /usr/local/lib/slurm/route_default.so
> [2022-07-19T15:17:58.350] route default plugin loaded
> [2022-07-19T15:17:58.350] debug3: Success.
> [2022-07-19T15:17:58.355] layouts: loading entities/relations information
> [2022-07-19T15:17:58.355] debug3: layouts: loading node node0
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node1
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node2
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node3
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node4
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node5
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node6
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node7
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node8
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node9
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node10
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node11
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node12
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node13
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node14
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node15
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node16
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node17
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node18
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node19
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node20
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node21
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node22
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node23
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node24
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node25
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node26
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node27
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node28
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node29
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node30
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node31
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node42
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node43
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node44
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node45
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node46
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node47
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node49
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node50
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node51
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node52
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node53
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node54
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node55
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node56
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node60
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node61
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node62
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node63
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node64
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node65
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node66
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node67
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node68
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node73
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node74
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node75
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node76
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node77
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node78
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node100
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node101
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node102
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node103
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node104
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node105
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node106
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node107
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node108
> [2022-07-19T15:17:58.356] debug3: layouts: loading node node109
> [2022-07-19T15:17:58.356] debug:  layouts: 71/71 nodes in hash table, rc=0
> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 1
> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 1.1 (restore 
> state)
> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 2
> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 3
> [2022-07-19T15:17:58.356] error: Node state file 
> /var/lib/slurm-llnl/slurmctld/node_state too small
> [2022-07-19T15:17:58.356] error: NOTE: Trying backup state save file. 
> Information may be lost!
> [2022-07-19T15:17:58.356] debug3: Version string in node_state header is 
> PROTOCOL_VERSION
> [2022-07-19T15:17:58.357] Recovered state of 71 nodes
> [2022-07-19T15:17:58.357] error: Job state file 
> /var/lib/slurm-llnl/slurmctld/job_state too small
> [2022-07-19T15:17:58.357] error: NOTE: Trying backup state save file. 
> Jobs may be lost!
> [2022-07-19T15:17:58.357] error: Incomplete job state save file
> [2022-07-19T15:17:58.357] Recovered information about 0 jobs
> [2022-07-19T15:17:58.357] cons_res: select_p_node_init
> [2022-07-19T15:17:58.357] cons_res: preparing for 7 partitions
> [2022-07-19T15:17:58.357] debug:  Ports available for reservation 
> 10000-30000
> [2022-07-19T15:17:58.359] debug2: init_requeue_policy: 
> kill_invalid_depend is set to 0
> [2022-07-19T15:17:58.359] debug:  Updating partition uid access list
> [2022-07-19T15:17:58.359] debug3: Version string in resv_state header is 
> PROTOCOL_VERSION
> [2022-07-19T15:17:58.359] Recovered state of 0 reservations
> [2022-07-19T15:17:58.359] State of 0 triggers recovered



