[slurm-users] slurmctld up and running but not really working
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Wed Jul 20 12:45:07 UTC 2022
Hi Julien,
You could make a dump of the current database so that you can
load it on another server outside the cluster, while you reinitialize
Slurm with a fresh database.
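For example, assuming the default MySQL/MariaDB backend and the default
database name slurm_acct_db (check StorageLoc in your slurmdbd.conf),
something like this should work:

mysqldump -u root -p slurm_acct_db > slurm_acct_db_backup.sql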
So the database thinks that you have 253 running jobs? I guess that
slurmctld is not working; otherwise you could check with:

squeue -t running
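You could also test whether the controller responds at all with:

scontrol ping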
The following command reports jobs that have been orphaned on the local
cluster and are now runaway:
sacctmgr show runawayjobs
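If any runaway jobs are found, sacctmgr will ask whether you want to fix
them; as far as I remember, it sets an end time on each job, which should
clear the stuck RUNNING entries from the database.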
Read the sacctmgr manual page.
I hope this helps.
/Ole
On 7/20/22 14:19, Julien Rey wrote:
> I don't mind losing job information, but I certainly don't want to clear
> the slurm database altogether.
>
> The /var/lib/slurm-llnl/slurmctld/node_state and
> /var/lib/slurm-llnl/slurmctld/node_state.old files do indeed look empty.
> I then entered the following command:
>
> sacct | grep RUNNING
>
> and found about 253 jobs.
>
> Is there any elegant way to remove these jobs from the database?
>
> J.
>
> Le 19/07/2022 à 20:29, Ole Holm Nielsen a écrit :
>> Hi Julien,
>>
>> Apparently your slurmdbd is quite happy, but it seems that your
>> slurmctld StateSaveLocation has been corrupted:
>>
>>> [2022-07-19T15:17:58.356] error: Node state file
>>> /var/lib/slurm-llnl/slurmctld/node_state too small
>>> [2022-07-19T15:17:58.356] error: NOTE: Trying backup state save file.
>>> Information may be lost!
>>> [2022-07-19T15:17:58.356] debug3: Version string in node_state header
>>> is PROTOCOL_VERSION
>>> [2022-07-19T15:17:58.357] Recovered state of 71 nodes
>>> [2022-07-19T15:17:58.357] error: Job state file
>>> /var/lib/slurm-llnl/slurmctld/job_state too small
>>> [2022-07-19T15:17:58.357] error: NOTE: Trying backup state save file.
>>> Jobs may be lost!
>>> [2022-07-19T15:17:58.357] error: Incomplete job state save file
>>
>> Did something bad happen to your storage of
>> /var/lib/slurm-llnl/slurmctld/? Could you possibly restore this folder
>> from the last backup?
>>
>> I don't know if it's possible to recover from a corrupted slurmctld
>> StateSaveLocation; perhaps others here have experience with this?
>>
>> Even if you could restore it, the Slurm database probably needs to be
>> consistent with your slurmctld StateSaveLocation, and I don't know if
>> this is feasible...
>>
>> Could you reinitialize your Slurm 17.02.11 installation and start from scratch?
>>
>> Regarding an upgrade from 17.02 or 17.11, you may find some useful notes
>> on my wiki page:
>> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
>>
>> /Ole
>>
>>
>> On 19-07-2022 16:28, Julien Rey wrote:
>>> I am currently facing an issue with an old install of Slurm (17.02.11).
>>> However, I cannot upgrade this version because I had trouble with
>>> database migration in the past (when upgrading to 17.11), and this
>>> install is set to be replaced in the coming months. For the time
>>> being I have to keep it running because some of our services still rely
>>> on it.
>>>
>>> This issue occurred after a power outage.
>>>
>>> slurmctld is up and running; however, when I enter "sinfo", I end up
>>> with this message after a few minutes:
>>>
>>> slurm_load_partitions: Unable to contact slurm controller (connect
>>> failure)
>>>
>>> I set SlurmctldDebug=7 in slurm.conf and DebugLevel=7 in slurmdbd.conf;
>>> however, I don't see much info in the logs about any specific error that
>>> would prevent the slurm controller from working.
>>>
>>> Any help would be greatly appreciated.
>>>
>>> /var/log/slurm-llnl/slurmctld.log:
>>>
>>> [2022-07-19T15:17:58.342] debug3: Version in assoc_usage header is 7936
>>> [2022-07-19T15:17:58.345] debug3: Version in qos_usage header is 7936
>>> [2022-07-19T15:17:58.345] debug: Reading slurm.conf file:
>>> /etc/slurm-llnl/slurm.conf
>>> [2022-07-19T15:17:58.347] debug: Ignoring obsolete SchedulerPort option.
>>> [2022-07-19T15:17:58.347] debug3: layouts: layouts_init()...
>>> [2022-07-19T15:17:58.347] layouts: no layout to initialize
>>> [2022-07-19T15:17:58.347] debug3: Trying to load plugin
>>> /usr/local/lib/slurm/topology_none.so
>>> [2022-07-19T15:17:58.347] topology NONE plugin loaded
>>> [2022-07-19T15:17:58.347] debug3: Success.
>>> [2022-07-19T15:17:58.348] debug: No DownNodes
>>> [2022-07-19T15:17:58.348] debug3: Version in last_conf_lite header is 7936
>>> [2022-07-19T15:17:58.349] debug3: Trying to load plugin
>>> /usr/local/lib/slurm/jobcomp_none.so
>>> [2022-07-19T15:17:58.349] debug3: Success.
>>> [2022-07-19T15:17:58.349] debug3: Trying to load plugin
>>> /usr/local/lib/slurm/sched_backfill.so
>>> [2022-07-19T15:17:58.349] sched: Backfill scheduler plugin loaded
>>> [2022-07-19T15:17:58.349] debug3: Success.
>>> [2022-07-19T15:17:58.350] debug3: Trying to load plugin
>>> /usr/local/lib/slurm/route_default.so
>>> [2022-07-19T15:17:58.350] route default plugin loaded
>>> [2022-07-19T15:17:58.350] debug3: Success.
>>> [2022-07-19T15:17:58.355] layouts: loading entities/relations information
>>> [2022-07-19T15:17:58.355] debug3: layouts: loading node node0
>>> [... 70 similar "layouts: loading node" lines for node1 through node109 ...]
>>> [2022-07-19T15:17:58.356] debug: layouts: 71/71 nodes in hash table, rc=0
>>> [2022-07-19T15:17:58.356] debug: layouts: loading stage 1
>>> [2022-07-19T15:17:58.356] debug: layouts: loading stage 1.1 (restore
>>> state)
>>> [2022-07-19T15:17:58.356] debug: layouts: loading stage 2
>>> [2022-07-19T15:17:58.356] debug: layouts: loading stage 3
>>> [2022-07-19T15:17:58.356] error: Node state file
>>> /var/lib/slurm-llnl/slurmctld/node_state too small
>>> [2022-07-19T15:17:58.356] error: NOTE: Trying backup state save file.
>>> Information may be lost!
>>> [2022-07-19T15:17:58.356] debug3: Version string in node_state header
>>> is PROTOCOL_VERSION
>>> [2022-07-19T15:17:58.357] Recovered state of 71 nodes
>>> [2022-07-19T15:17:58.357] error: Job state file
>>> /var/lib/slurm-llnl/slurmctld/job_state too small
>>> [2022-07-19T15:17:58.357] error: NOTE: Trying backup state save file.
>>> Jobs may be lost!
>>> [2022-07-19T15:17:58.357] error: Incomplete job state save file
>>> [2022-07-19T15:17:58.357] Recovered information about 0 jobs
>>> [2022-07-19T15:17:58.357] cons_res: select_p_node_init
>>> [2022-07-19T15:17:58.357] cons_res: preparing for 7 partitions
>>> [2022-07-19T15:17:58.357] debug: Ports available for reservation
>>> 10000-30000
>>> [2022-07-19T15:17:58.359] debug2: init_requeue_policy:
>>> kill_invalid_depend is set to 0
>>> [2022-07-19T15:17:58.359] debug: Updating partition uid access list
>>> [2022-07-19T15:17:58.359] debug3: Version string in resv_state header
>>> is PROTOCOL_VERSION
>>> [2022-07-19T15:17:58.359] Recovered state of 0 reservations
>>> [2022-07-19T15:17:58.359] State of 0 triggers recovered