[slurm-users] slurmctld up and running but not really working

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Wed Jul 20 12:45:07 UTC 2022


Hi Julien,

You could dump the current database so that you can load it on another 
server outside the cluster, while you reinitialize Slurm with a fresh 
database.
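
A minimal sketch of such a dump, assuming a MySQL/MariaDB backend and the 
common accounting database name slurm_acct_db (check StorageLoc in your 
slurmdbd.conf if yours differs):

mysqldump --single-transaction slurm_acct_db > slurm_acct_db.sql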

So the database thinks that you have 253 running jobs?  I guess that 
slurmctld is not working; otherwise you could check with:

squeue -t running

This command can report current jobs that have been orphaned on the local 
cluster and are now runaway:

sacctmgr show runawayjobs

Read the sacctmgr manual page.
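
Recent sacctmgr versions will also offer to fix the runaway jobs they 
find; a sketch of the interaction (prompt wording varies by version):

sacctmgr show runawayjobs
# lists any orphaned jobs, then asks whether to mark them completed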

I hope this helps.

/Ole

On 7/20/22 14:19, Julien Rey wrote:
> I don't mind losing job information, but I certainly don't want to clear 
> the Slurm database altogether.
> 
> The /var/lib/slurm-llnl/slurmctld/node_state and 
> /var/lib/slurm-llnl/slurmctld/node_state.old files do indeed look empty. 
> I then entered the following command:
> 
> sacct | grep RUNNING
> 
> and found about 253 jobs.
> 
> Is there any elegant way to remove these jobs from the database?
> 
> J.
> 
> On 19/07/2022 at 20:29, Ole Holm Nielsen wrote:
>> Hi Julien,
>>
>> Apparently your slurmdbd is quite happy, but it seems that your 
>> slurmctld StateSaveLocation has been corrupted:
>>
>>> [2022-07-19T15:17:58.356] error: Node state file 
>>> /var/lib/slurm-llnl/slurmctld/node_state too small
>>> [2022-07-19T15:17:58.356] error: NOTE: Trying backup state save file. 
>>> Information may be lost!
>>> [2022-07-19T15:17:58.356] debug3: Version string in node_state header 
>>> is PROTOCOL_VERSION
>>> [2022-07-19T15:17:58.357] Recovered state of 71 nodes
>>> [2022-07-19T15:17:58.357] error: Job state file 
>>> /var/lib/slurm-llnl/slurmctld/job_state too small
>>> [2022-07-19T15:17:58.357] error: NOTE: Trying backup state save file. 
>>> Jobs may be lost!
>>> [2022-07-19T15:17:58.357] error: Incomplete job state save file 
>>
>> Did something bad happen to your storage of 
>> /var/lib/slurm-llnl/slurmctld/ ?  Could you possibly restore this folder 
>> from the last backup?
>>
>> I don't know if it's possible to recover from a corrupted slurmctld 
>> StateSaveLocation; perhaps others here have experience with this?
>>
>> Even if you could restore it, the Slurm database probably needs to be 
>> consistent with your slurmctld StateSaveLocation, and I don't know if 
>> this is feasible...
>>
>> Could you reinitialize your Slurm 17.02.11 installation and start it 
>> from scratch?
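>>
>> (A minimal sketch of that, assuming the StateSaveLocation is backed up 
>> first: slurmctld can be started once with its -c option,
>>
>> slurmctld -c
>>
>> which discards all previously saved state, so any running and queued 
>> jobs would be lost.)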
>>
>> Regarding an upgrade from 17.02 or 17.11, you may find some useful notes 
>> in my Wiki page 
>> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
>>
>> /Ole
>>
>>
>> On 19-07-2022 16:28, Julien Rey wrote:
>>> I am currently facing an issue with an old install of Slurm (17.02.11). 
>>> However, I cannot upgrade this version because I had trouble with 
>>> database migration in the past (when upgrading to 17.11), and this 
>>> install is set to be replaced in the coming months. For the time 
>>> being I have to keep it running because some of our services still rely 
>>> on it.
>>>
>>> This issue occurred after a power outage.
>>>
>>> slurmctld is up and running; however, when I enter "sinfo", I end up 
>>> with this message after a few minutes:
>>>
>>> slurm_load_partitions: Unable to contact slurm controller (connect 
>>> failure)
>>>
>>> I set SlurmctldDebug=7 in slurm.conf and DebugLevel=7 in slurmdbd.conf; 
>>> however, the logs don't show any specific error that would prevent the 
>>> slurm controller from working.
>>>
>>> Any help would be greatly appreciated.
>>>
>>> /var/log/slurm-llnl/slurmctld.log:
>>>
>>> [2022-07-19T15:17:58.342] debug3: Version in assoc_usage header is 7936
>>> [2022-07-19T15:17:58.345] debug3: Version in qos_usage header is 7936
>>> [2022-07-19T15:17:58.345] debug:  Reading slurm.conf file: 
>>> /etc/slurm-llnl/slurm.conf
>>> [2022-07-19T15:17:58.347] debug:  Ignoring obsolete SchedulerPort option.
>>> [2022-07-19T15:17:58.347] debug3: layouts: layouts_init()...
>>> [2022-07-19T15:17:58.347] layouts: no layout to initialize
>>> [2022-07-19T15:17:58.347] debug3: Trying to load plugin 
>>> /usr/local/lib/slurm/topology_none.so
>>> [2022-07-19T15:17:58.347] topology NONE plugin loaded
>>> [2022-07-19T15:17:58.347] debug3: Success.
>>> [2022-07-19T15:17:58.348] debug:  No DownNodes
>>> [2022-07-19T15:17:58.348] debug3: Version in last_conf_lite header is 7936
>>> [2022-07-19T15:17:58.349] debug3: Trying to load plugin 
>>> /usr/local/lib/slurm/jobcomp_none.so
>>> [2022-07-19T15:17:58.349] debug3: Success.
>>> [2022-07-19T15:17:58.349] debug3: Trying to load plugin 
>>> /usr/local/lib/slurm/sched_backfill.so
>>> [2022-07-19T15:17:58.349] sched: Backfill scheduler plugin loaded
>>> [2022-07-19T15:17:58.349] debug3: Success.
>>> [2022-07-19T15:17:58.350] debug3: Trying to load plugin 
>>> /usr/local/lib/slurm/route_default.so
>>> [2022-07-19T15:17:58.350] route default plugin loaded
>>> [2022-07-19T15:17:58.350] debug3: Success.
>>> [2022-07-19T15:17:58.355] layouts: loading entities/relations information
>>> [2022-07-19T15:17:58.355] debug3: layouts: loading node node0
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node1
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node2
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node3
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node4
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node5
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node6
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node7
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node8
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node9
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node10
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node11
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node12
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node13
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node14
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node15
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node16
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node17
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node18
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node19
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node20
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node21
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node22
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node23
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node24
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node25
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node26
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node27
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node28
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node29
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node30
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node31
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node42
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node43
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node44
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node45
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node46
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node47
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node49
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node50
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node51
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node52
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node53
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node54
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node55
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node56
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node60
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node61
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node62
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node63
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node64
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node65
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node66
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node67
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node68
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node73
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node74
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node75
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node76
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node77
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node78
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node100
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node101
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node102
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node103
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node104
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node105
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node106
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node107
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node108
>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node109
>>> [2022-07-19T15:17:58.356] debug:  layouts: 71/71 nodes in hash table, rc=0
>>> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 1
>>> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 1.1 (restore 
>>> state)
>>> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 2
>>> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 3
>>> [2022-07-19T15:17:58.356] error: Node state file 
>>> /var/lib/slurm-llnl/slurmctld/node_state too small
>>> [2022-07-19T15:17:58.356] error: NOTE: Trying backup state save file. 
>>> Information may be lost!
>>> [2022-07-19T15:17:58.356] debug3: Version string in node_state header 
>>> is PROTOCOL_VERSION
>>> [2022-07-19T15:17:58.357] Recovered state of 71 nodes
>>> [2022-07-19T15:17:58.357] error: Job state file 
>>> /var/lib/slurm-llnl/slurmctld/job_state too small
>>> [2022-07-19T15:17:58.357] error: NOTE: Trying backup state save file. 
>>> Jobs may be lost!
>>> [2022-07-19T15:17:58.357] error: Incomplete job state save file
>>> [2022-07-19T15:17:58.357] Recovered information about 0 jobs
>>> [2022-07-19T15:17:58.357] cons_res: select_p_node_init
>>> [2022-07-19T15:17:58.357] cons_res: preparing for 7 partitions
>>> [2022-07-19T15:17:58.357] debug:  Ports available for reservation 
>>> 10000-30000
>>> [2022-07-19T15:17:58.359] debug2: init_requeue_policy: 
>>> kill_invalid_depend is set to 0
>>> [2022-07-19T15:17:58.359] debug:  Updating partition uid access list
>>> [2022-07-19T15:17:58.359] debug3: Version string in resv_state header 
>>> is PROTOCOL_VERSION
>>> [2022-07-19T15:17:58.359] Recovered state of 0 reservations
>>> [2022-07-19T15:17:58.359] State of 0 triggers recovered


