[slurm-users] slurmctld up and running but not really working
Julien Rey
julien.rey at univ-paris-diderot.fr
Wed Jul 20 15:06:16 UTC 2022
Hello,
Unfortunately, sacctmgr show runawayjobs returns the following error:
sacctmgr: error: Slurmctld running on cluster cluster is not up, can't
check running jobs
J.
Le 20/07/2022 à 14:45, Ole Holm Nielsen a écrit :
> Hi Julien,
>
> You could make a dump of the current database so that you can load it
> on another server outside the cluster, while you reinitialize Slurm
> with a fresh database.
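>
> For example (just a sketch, assuming the conventional database name
> slurm_acct_db and a MySQL/MariaDB backend; adjust names and credentials
> to your setup):
>
> mysqldump --single-transaction slurm_acct_db > slurm_acct_db_backup.sql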
>
> So the database thinks that you have 253 running jobs? I guess that
> slurmctld is not working, otherwise you could do: squeue -t running
>
> This command can report current jobs that have been orphaned on the
> local cluster and are now runaway:
>
> sacctmgr show runawayjobs
>
> Read the sacctmgr manual page.
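>
> Note that sacctmgr needs to be able to contact slurmctld to perform this
> check. If it does report runaway jobs, it will ask whether you want to
> fix them (mark them as finished in the database), which is usually the
> cleanest way to clear them.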
>
> I hope this helps.
>
> /Ole
>
> On 7/20/22 14:19, Julien Rey wrote:
>> I don't mind losing job information, but I certainly don't want to
>> clear the slurm database altogether.
>>
>> The /var/lib/slurm-llnl/slurmctld/node_state and
>> /var/lib/slurm-llnl/slurmctld/node_state.old files do indeed look
>> empty. I then ran the following command:
>>
>> sacct | grep RUNNING
>>
>> and found about 253 jobs.
>>
>> Is there any elegant way to remove these jobs from the database?
>>
>> J.
>>
>> Le 19/07/2022 à 20:29, Ole Holm Nielsen a écrit :
>>> Hi Julien,
>>>
>>> Apparently your slurmdbd is quite happy, but it seems that your
>>> slurmctld StateSaveLocation has been corrupted:
>>>
>>>> [2022-07-19T15:17:58.356] error: Node state file
>>>> /var/lib/slurm-llnl/slurmctld/node_state too small
>>>> [2022-07-19T15:17:58.356] error: NOTE: Trying backup state save
>>>> file. Information may be lost!
>>>> [2022-07-19T15:17:58.356] debug3: Version string in node_state
>>>> header is PROTOCOL_VERSION
>>>> [2022-07-19T15:17:58.357] Recovered state of 71 nodes
>>>> [2022-07-19T15:17:58.357] error: Job state file
>>>> /var/lib/slurm-llnl/slurmctld/job_state too small
>>>> [2022-07-19T15:17:58.357] error: NOTE: Trying backup state save
>>>> file. Jobs may be lost!
>>>> [2022-07-19T15:17:58.357] error: Incomplete job state save file
>>>
>>> Did something bad happen to your storage of
>>> /var/lib/slurm-llnl/slurmctld/ ? Could you possibly restore this
>>> folder from the last backup?
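>>>
>>> As a quick sanity check (just a sketch, assuming the Debian-style
>>> paths visible in your logs), you could confirm which directory
>>> slurmctld is actually using and what is left in it:
>>>
>>> grep -i StateSaveLocation /etc/slurm-llnl/slurm.conf
>>> ls -l /var/lib/slurm-llnl/slurmctld/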
>>>
>>> I don't know if it's possible to recover from a corrupted slurmctld
>>> StateSaveLocation; maybe others here have experience with that?
>>>
>>> Even if you could restore it, the Slurm database probably needs to
>>> be consistent with your slurmctld StateSaveLocation, and I don't
>>> know if this is feasible...
>>>
>>> Could you initialize your slurm 17.02.11 and start it from scratch?
>>>
>>> Regarding an upgrade from 17.02 or 17.11, you may find some useful
>>> notes on my Wiki page:
>>> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
>>>
>>> /Ole
>>>
>>>
>>> On 19-07-2022 16:28, Julien Rey wrote:
>>>> I am currently facing an issue with an old install of Slurm
>>>> (17.02.11). I cannot upgrade this version because I had trouble with
>>>> database migration in the past (when upgrading to 17.11), and this
>>>> install is set to be replaced in the coming months. For the time
>>>> being I have to keep it running because some of our services still
>>>> rely on it.
>>>>
>>>> This issue occurred after a power outage.
>>>>
>>>> slurmctld is up and running; however, when I run "sinfo", I end up
>>>> with this message after a few minutes:
>>>>
>>>> slurm_load_partitions: Unable to contact slurm controller (connect
>>>> failure)
>>>>
>>>> I set SlurmctldDebug=7 in slurm.conf and DebugLevel=7 in
>>>> slurmdbd.conf; however, the logs don't give much information about
>>>> any specific error that would prevent the slurm controller from
>>>> working.
>>>>
>>>> Any help would be greatly appreciated.
>>>>
>>>> /var/log/slurm-llnl/slurmctld.log:
>>>>
>>>> [2022-07-19T15:17:58.342] debug3: Version in assoc_usage header is
>>>> 7936
>>>> [2022-07-19T15:17:58.345] debug3: Version in qos_usage header is 7936
>>>> [2022-07-19T15:17:58.345] debug: Reading slurm.conf file:
>>>> /etc/slurm-llnl/slurm.conf
>>>> [2022-07-19T15:17:58.347] debug: Ignoring obsolete SchedulerPort
>>>> option.
>>>> [2022-07-19T15:17:58.347] debug3: layouts: layouts_init()...
>>>> [2022-07-19T15:17:58.347] layouts: no layout to initialize
>>>> [2022-07-19T15:17:58.347] debug3: Trying to load plugin
>>>> /usr/local/lib/slurm/topology_none.so
>>>> [2022-07-19T15:17:58.347] topology NONE plugin loaded
>>>> [2022-07-19T15:17:58.347] debug3: Success.
>>>> [2022-07-19T15:17:58.348] debug: No DownNodes
>>>> [2022-07-19T15:17:58.348] debug3: Version in last_conf_lite header
>>>> is 7936
>>>> [2022-07-19T15:17:58.349] debug3: Trying to load plugin
>>>> /usr/local/lib/slurm/jobcomp_none.so
>>>> [2022-07-19T15:17:58.349] debug3: Success.
>>>> [2022-07-19T15:17:58.349] debug3: Trying to load plugin
>>>> /usr/local/lib/slurm/sched_backfill.so
>>>> [2022-07-19T15:17:58.349] sched: Backfill scheduler plugin loaded
>>>> [2022-07-19T15:17:58.349] debug3: Success.
>>>> [2022-07-19T15:17:58.350] debug3: Trying to load plugin
>>>> /usr/local/lib/slurm/route_default.so
>>>> [2022-07-19T15:17:58.350] route default plugin loaded
>>>> [2022-07-19T15:17:58.350] debug3: Success.
>>>> [2022-07-19T15:17:58.355] layouts: loading entities/relations
>>>> information
>>>> [2022-07-19T15:17:58.355] debug3: layouts: loading node node0
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node1
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node2
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node3
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node4
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node5
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node6
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node7
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node8
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node9
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node10
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node11
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node12
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node13
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node14
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node15
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node16
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node17
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node18
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node19
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node20
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node21
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node22
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node23
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node24
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node25
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node26
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node27
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node28
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node29
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node30
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node31
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node42
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node43
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node44
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node45
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node46
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node47
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node49
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node50
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node51
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node52
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node53
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node54
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node55
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node56
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node60
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node61
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node62
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node63
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node64
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node65
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node66
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node67
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node68
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node73
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node74
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node75
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node76
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node77
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node78
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node100
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node101
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node102
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node103
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node104
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node105
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node106
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node107
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node108
>>>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node109
>>>> [2022-07-19T15:17:58.356] debug: layouts: 71/71 nodes in hash
>>>> table, rc=0
>>>> [2022-07-19T15:17:58.356] debug: layouts: loading stage 1
>>>> [2022-07-19T15:17:58.356] debug: layouts: loading stage 1.1
>>>> (restore state)
>>>> [2022-07-19T15:17:58.356] debug: layouts: loading stage 2
>>>> [2022-07-19T15:17:58.356] debug: layouts: loading stage 3
>>>> [2022-07-19T15:17:58.356] error: Node state file
>>>> /var/lib/slurm-llnl/slurmctld/node_state too small
>>>> [2022-07-19T15:17:58.356] error: NOTE: Trying backup state save
>>>> file. Information may be lost!
>>>> [2022-07-19T15:17:58.356] debug3: Version string in node_state
>>>> header is PROTOCOL_VERSION
>>>> [2022-07-19T15:17:58.357] Recovered state of 71 nodes
>>>> [2022-07-19T15:17:58.357] error: Job state file
>>>> /var/lib/slurm-llnl/slurmctld/job_state too small
>>>> [2022-07-19T15:17:58.357] error: NOTE: Trying backup state save
>>>> file. Jobs may be lost!
>>>> [2022-07-19T15:17:58.357] error: Incomplete job state save file
>>>> [2022-07-19T15:17:58.357] Recovered information about 0 jobs
>>>> [2022-07-19T15:17:58.357] cons_res: select_p_node_init
>>>> [2022-07-19T15:17:58.357] cons_res: preparing for 7 partitions
>>>> [2022-07-19T15:17:58.357] debug: Ports available for reservation
>>>> 10000-30000
>>>> [2022-07-19T15:17:58.359] debug2: init_requeue_policy:
>>>> kill_invalid_depend is set to 0
>>>> [2022-07-19T15:17:58.359] debug: Updating partition uid access list
>>>> [2022-07-19T15:17:58.359] debug3: Version string in resv_state
>>>> header is PROTOCOL_VERSION
>>>> [2022-07-19T15:17:58.359] Recovered state of 0 reservations
>>>> [2022-07-19T15:17:58.359] State of 0 triggers recovered
>
--
Julien Rey
Plate-forme RPBS
Unité BFA - CMPLI
Université de Paris
tel: 01 57 27 83 95