[slurm-users] slurmctld up and running but not really working

Julien Rey julien.rey at univ-paris-diderot.fr
Wed Jul 20 15:17:04 UTC 2022


Actually, I was able to fix the problem by starting slurmctld with the 
-c option and then clearing the runaway jobs with sacctmgr.
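For the archives, the recovery sequence was roughly the following. This is a sketch, not a verbatim transcript; the service name is an assumption and may differ on a Debian-style slurm-llnl install like the one in the logs below:

```shell
# Stop the controller first (service name may vary, e.g. slurm-llnl).
systemctl stop slurmctld

# Cold-start the controller: per slurmctld(8), -c clears all previously
# saved state, discarding the corrupted job_state/node_state files.
slurmctld -c

# The accounting database still lists the lost jobs as RUNNING; sacctmgr
# detects these orphans and offers to mark them as completed.
sacctmgr show runawayjobs
```

Note that `slurmctld -c` purges all pending and running jobs, so it is only appropriate when the job state is already lost, as it was here.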

Thanks for your help.

J.

Le 20/07/2022 à 17:06, Julien Rey a écrit :
> Hello,
>
> Unfortunately, sacctmgr show runawayjobs returns the 
> following error:
>
> sacctmgr: error: Slurmctld running on cluster cluster is not up, can't 
> check running jobs
>
> J.
>
> Le 20/07/2022 à 14:45, Ole Holm Nielsen a écrit :
>> Hi Julien,
>>
>> You could dump the current database so that you can load it on 
>> another server outside the cluster, while you reinitialize Slurm 
>> with a fresh database.
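A minimal dump sketch, assuming the default MariaDB/MySQL backend and the default accounting database name slurm_acct_db (verify the actual name and credentials against StorageLoc and StorageUser in slurmdbd.conf):

```shell
# Dump the Slurm accounting database for safekeeping or offline inspection.
# Database name and user are assumptions; check slurmdbd.conf.
mysqldump --single-transaction -u slurm -p slurm_acct_db > slurm_acct_db.sql

# Load it on another server:
mysql -u slurm -p slurm_acct_db < slurm_acct_db.sql
```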
>>
>> So the database thinks that you have 253 running jobs?  I guess that 
>> slurmctld is not working, otherwise you could do: squeue -t running
>>
>> This command can report current jobs that have been orphaned on the 
>> local cluster and are now runaway:
>>
>> sacctmgr show runawayjobs
>>
>> Read the sacctmgr manual page.
>>
>> I hope this helps.
>>
>> /Ole
>>
>> On 7/20/22 14:19, Julien Rey wrote:
>>> I don't mind losing jobs information but I certainly don't want to 
>>> clear the slurm database altogether.
>>>
>>> The /var/lib/slurm-llnl/slurmctld/node_state and 
>>> /var/lib/slurm-llnl/slurmctld/node_state.old files look effectively 
>>> empty. I then entered the following command:
>>>
>>> sacct | grep RUNNING
>>>
>>> and found about 253 jobs.
>>>
>>> Is there any elegant way to remove these jobs from the database?
>>>
>>> J.
>>>
>>> Le 19/07/2022 à 20:29, Ole Holm Nielsen a écrit :
>>>> Hi Julien,
>>>>
>>>> Apparently your slurmdbd is quite happy, but it seems that your 
>>>> slurmctld StateSaveLocation has been corrupted:
>>>>
>>>>> [2022-07-19T15:17:58.356] error: Node state file 
>>>>> /var/lib/slurm-llnl/slurmctld/node_state too small
>>>>> [2022-07-19T15:17:58.356] error: NOTE: Trying backup state save 
>>>>> file. Information may be lost!
>>>>> [2022-07-19T15:17:58.356] debug3: Version string in node_state 
>>>>> header is PROTOCOL_VERSION
>>>>> [2022-07-19T15:17:58.357] Recovered state of 71 nodes
>>>>> [2022-07-19T15:17:58.357] error: Job state file 
>>>>> /var/lib/slurm-llnl/slurmctld/job_state too small
>>>>> [2022-07-19T15:17:58.357] error: NOTE: Trying backup state save 
>>>>> file. Jobs may be lost!
>>>>> [2022-07-19T15:17:58.357] error: Incomplete job state save file 
>>>>
>>>> Did something bad happen to your storage of 
>>>> /var/lib/slurm-llnl/slurmctld/ ?  Could you possibly restore this 
>>>> folder from the last backup?
>>>>
>>>> I don't know if it's possible to recover from a corrupted slurmctld 
>>>> StateSaveLocation; perhaps others here have experience with this?
>>>>
>>>> Even if you could restore it, the Slurm database probably needs to 
>>>> be consistent with your slurmctld StateSaveLocation, and I don't 
>>>> know if this is feasible...
>>>>
>>>> Could you initialize your slurm 17.02.11 and start it from scratch?
>>>>
>>>> Regarding an upgrade from 17.02 or 17.11, you may find some useful 
>>>> notes in my Wiki page 
>>>> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
>>>>
>>>> /Ole
>>>>
>>>>
>>>> On 19-07-2022 16:28, Julien Rey wrote:
>>>>> I am currently facing an issue with an old install of slurm 
>>>>> (17.02.11). However, I cannot upgrade this version because I had 
>>>>> trouble with database migration in the past (when upgrading to 
>>>>> 17.11) and this install is set to be replaced in the coming 
>>>>> months. For the time being I have to keep it running because some 
>>>>> of our services still rely on it.
>>>>>
>>>>> This issue occurred after a power outage.
>>>>>
>>>>> slurmctld is up and running, however, when I enter "sinfo", I end 
>>>>> up with this message after a few minutes:
>>>>>
>>>>> slurm_load_partitions: Unable to contact slurm controller (connect 
>>>>> failure)
>>>>>
>>>>> I set SlurmctldDebug=7 in slurm.conf and DebugLevel=7 in 
>>>>> slurmdbd.conf, however I don't get much info about any specific 
>>>>> error that would prevent the slurm controller from working in the 
>>>>> logs.
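For reference, those debug settings are plain configuration lines; a sketch follows, with the controller log path taken from this thread and the slurmdbd log path an assumption. Numeric levels are accepted by 17.02; newer Slurm releases prefer named levels such as debug2:

```
# slurm.conf
SlurmctldDebug=7
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log

# slurmdbd.conf
DebugLevel=7
LogFile=/var/log/slurm-llnl/slurmdbd.log
```

Both daemons must be restarted (or sent a reconfigure) for the new levels to take effect.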
>>>>>
>>>>> Any help would be greatly appreciated.
>>>>>
>>>>> /var/log/slurm-llnl/slurmctld.log:
>>>>>
>>>>> [2022-07-19T15:17:58.342] debug3: Version in assoc_usage header is 
>>>>> 7936
>>>>> [2022-07-19T15:17:58.345] debug3: Version in qos_usage header is 7936
>>>>> [2022-07-19T15:17:58.345] debug:  Reading slurm.conf file: 
>>>>> /etc/slurm-llnl/slurm.conf
>>>>> [2022-07-19T15:17:58.347] debug:  Ignoring obsolete SchedulerPort 
>>>>> option.
>>>>> [2022-07-19T15:17:58.347] debug3: layouts: layouts_init()...
>>>>> [2022-07-19T15:17:58.347] layouts: no layout to initialize
>>>>> [2022-07-19T15:17:58.347] debug3: Trying to load plugin 
>>>>> /usr/local/lib/slurm/topology_none.so
>>>>> [2022-07-19T15:17:58.347] topology NONE plugin loaded
>>>>> [2022-07-19T15:17:58.347] debug3: Success.
>>>>> [2022-07-19T15:17:58.348] debug:  No DownNodes
>>>>> [2022-07-19T15:17:58.348] debug3: Version in last_conf_lite header 
>>>>> is 7936
>>>>> [2022-07-19T15:17:58.349] debug3: Trying to load plugin 
>>>>> /usr/local/lib/slurm/jobcomp_none.so
>>>>> [2022-07-19T15:17:58.349] debug3: Success.
>>>>> [2022-07-19T15:17:58.349] debug3: Trying to load plugin 
>>>>> /usr/local/lib/slurm/sched_backfill.so
>>>>> [2022-07-19T15:17:58.349] sched: Backfill scheduler plugin loaded
>>>>> [2022-07-19T15:17:58.349] debug3: Success.
>>>>> [2022-07-19T15:17:58.350] debug3: Trying to load plugin 
>>>>> /usr/local/lib/slurm/route_default.so
>>>>> [2022-07-19T15:17:58.350] route default plugin loaded
>>>>> [2022-07-19T15:17:58.350] debug3: Success.
>>>>> [2022-07-19T15:17:58.355] layouts: loading entities/relations 
>>>>> information
>>>>> [2022-07-19T15:17:58.355] debug3: layouts: loading node node0
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node1
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node2
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node3
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node4
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node5
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node6
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node7
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node8
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node9
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node10
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node11
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node12
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node13
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node14
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node15
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node16
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node17
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node18
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node19
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node20
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node21
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node22
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node23
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node24
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node25
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node26
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node27
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node28
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node29
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node30
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node31
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node42
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node43
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node44
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node45
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node46
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node47
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node49
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node50
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node51
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node52
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node53
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node54
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node55
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node56
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node60
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node61
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node62
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node63
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node64
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node65
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node66
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node67
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node68
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node73
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node74
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node75
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node76
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node77
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node78
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node100
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node101
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node102
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node103
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node104
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node105
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node106
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node107
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node108
>>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node109
>>>>> [2022-07-19T15:17:58.356] debug:  layouts: 71/71 nodes in hash 
>>>>> table, rc=0
>>>>> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 1
>>>>> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 1.1 
>>>>> (restore state)
>>>>> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 2
>>>>> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 3
>>>>> [2022-07-19T15:17:58.356] error: Node state file 
>>>>> /var/lib/slurm-llnl/slurmctld/node_state too small
>>>>> [2022-07-19T15:17:58.356] error: NOTE: Trying backup state save 
>>>>> file. Information may be lost!
>>>>> [2022-07-19T15:17:58.356] debug3: Version string in node_state 
>>>>> header is PROTOCOL_VERSION
>>>>> [2022-07-19T15:17:58.357] Recovered state of 71 nodes
>>>>> [2022-07-19T15:17:58.357] error: Job state file 
>>>>> /var/lib/slurm-llnl/slurmctld/job_state too small
>>>>> [2022-07-19T15:17:58.357] error: NOTE: Trying backup state save 
>>>>> file. Jobs may be lost!
>>>>> [2022-07-19T15:17:58.357] error: Incomplete job state save file
>>>>> [2022-07-19T15:17:58.357] Recovered information about 0 jobs
>>>>> [2022-07-19T15:17:58.357] cons_res: select_p_node_init
>>>>> [2022-07-19T15:17:58.357] cons_res: preparing for 7 partitions
>>>>> [2022-07-19T15:17:58.357] debug:  Ports available for reservation 
>>>>> 10000-30000
>>>>> [2022-07-19T15:17:58.359] debug2: init_requeue_policy: 
>>>>> kill_invalid_depend is set to 0
>>>>> [2022-07-19T15:17:58.359] debug:  Updating partition uid access list
>>>>> [2022-07-19T15:17:58.359] debug3: Version string in resv_state 
>>>>> header is PROTOCOL_VERSION
>>>>> [2022-07-19T15:17:58.359] Recovered state of 0 reservations
>>>>> [2022-07-19T15:17:58.359] State of 0 triggers recovered
>>
-- 
Julien Rey

Plate-forme RPBS
Unité BFA - CMPLI
Université de Paris
tel: 01 57 27 83 95



