[slurm-users] slurmctld up and running but not really working
Julien Rey
julien.rey at univ-paris-diderot.fr
Wed Jul 20 12:19:54 UTC 2022
Hello,
Thanks for your quick reply.
I don't mind losing job information, but I certainly don't want to clear
the Slurm database altogether.
The /var/lib/slurm-llnl/slurmctld/node_state and
/var/lib/slurm-llnl/slurmctld/node_state.old files are indeed empty. I
then ran the following command:
sacct | grep RUNNING
and found about 253 jobs.
Is there any elegant way to remove these jobs from the database?
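One possibly relevant tool, assuming the runaway-jobs support in sacctmgr
is available in this release: slurmdbd can report jobs it still records as
running but which slurmctld no longer knows about, and offer to close them
out in the accounting database. A rough sketch, untested on 17.02:

# list jobs recorded as RUNNING in the database but unknown to slurmctld
sacctmgr show runawayjobs
# sacctmgr then asks whether to "fix" them, i.e. set an end time and state
# on the stale records; this only touches the accounting records, not the
# rest of the database.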
J.
On 19/07/2022 at 20:29, Ole Holm Nielsen wrote:
> Hi Julien,
>
> Apparently your slurmdbd is quite happy, but it seems that your
> slurmctld StateSaveLocation has been corrupted:
>
>> [2022-07-19T15:17:58.356] error: Node state file
>> /var/lib/slurm-llnl/slurmctld/node_state too small
>> [2022-07-19T15:17:58.356] error: NOTE: Trying backup state save file.
>> Information may be lost!
>> [2022-07-19T15:17:58.356] debug3: Version string in node_state header
>> is PROTOCOL_VERSION
>> [2022-07-19T15:17:58.357] Recovered state of 71 nodes
>> [2022-07-19T15:17:58.357] error: Job state file
>> /var/lib/slurm-llnl/slurmctld/job_state too small
>> [2022-07-19T15:17:58.357] error: NOTE: Trying backup state save file.
>> Jobs may be lost!
>> [2022-07-19T15:17:58.357] error: Incomplete job state save file
>
> Did something bad happen to your storage of
> /var/lib/slurm-llnl/slurmctld/ ? Could you possibly restore this
> folder from the last backup?
>
> I don't know if it's possible to recover from a corrupted slurmctld
> StateSaveLocation; maybe others here have experience with this?
>
> Even if you could restore it, the Slurm database probably needs to be
> consistent with your slurmctld StateSaveLocation, and I don't know if
> this is feasible...
>
> Could you initialize your slurm 17.02.11 and start it from scratch?
>
> Regarding an upgrade from 17.02 or 17.11, you may find some useful
> notes in my Wiki page
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
>
> /Ole
>
>
> On 19-07-2022 16:28, Julien Rey wrote:
>> I am currently facing an issue with an old install of Slurm
>> (17.02.11). However, I cannot upgrade this version because I had
>> trouble with database migration in the past (when upgrading to
>> 17.11), and this install is set to be replaced in the coming
>> months. For the time being I have to keep it running because some of
>> our services still rely on it.
>>
>> This issue occurred after a power outage.
>>
>> slurmctld is up and running, however, when I enter "sinfo", I end up
>> with this message after a few minutes:
>>
>> slurm_load_partitions: Unable to contact slurm controller (connect
>> failure)
>>
>> I set SlurmctldDebug=7 in slurm.conf and DebugLevel=7 in
>> slurmdbd.conf; however, the logs don't show any specific error that
>> would prevent the slurm controller from working.
>>
>> Any help would be greatly appreciated.
>>
>> /var/log/slurm-llnl/slurmctld.log:
>>
>> [2022-07-19T15:17:58.342] debug3: Version in assoc_usage header is 7936
>> [2022-07-19T15:17:58.345] debug3: Version in qos_usage header is 7936
>> [2022-07-19T15:17:58.345] debug: Reading slurm.conf file:
>> /etc/slurm-llnl/slurm.conf
>> [2022-07-19T15:17:58.347] debug: Ignoring obsolete SchedulerPort
>> option.
>> [2022-07-19T15:17:58.347] debug3: layouts: layouts_init()...
>> [2022-07-19T15:17:58.347] layouts: no layout to initialize
>> [2022-07-19T15:17:58.347] debug3: Trying to load plugin
>> /usr/local/lib/slurm/topology_none.so
>> [2022-07-19T15:17:58.347] topology NONE plugin loaded
>> [2022-07-19T15:17:58.347] debug3: Success.
>> [2022-07-19T15:17:58.348] debug: No DownNodes
>> [2022-07-19T15:17:58.348] debug3: Version in last_conf_lite header is
>> 7936
>> [2022-07-19T15:17:58.349] debug3: Trying to load plugin
>> /usr/local/lib/slurm/jobcomp_none.so
>> [2022-07-19T15:17:58.349] debug3: Success.
>> [2022-07-19T15:17:58.349] debug3: Trying to load plugin
>> /usr/local/lib/slurm/sched_backfill.so
>> [2022-07-19T15:17:58.349] sched: Backfill scheduler plugin loaded
>> [2022-07-19T15:17:58.349] debug3: Success.
>> [2022-07-19T15:17:58.350] debug3: Trying to load plugin
>> /usr/local/lib/slurm/route_default.so
>> [2022-07-19T15:17:58.350] route default plugin loaded
>> [2022-07-19T15:17:58.350] debug3: Success.
>> [2022-07-19T15:17:58.355] layouts: loading entities/relations
>> information
>> [2022-07-19T15:17:58.355] debug3: layouts: loading node node0
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node1
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node2
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node3
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node4
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node5
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node6
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node7
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node8
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node9
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node10
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node11
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node12
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node13
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node14
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node15
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node16
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node17
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node18
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node19
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node20
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node21
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node22
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node23
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node24
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node25
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node26
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node27
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node28
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node29
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node30
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node31
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node42
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node43
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node44
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node45
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node46
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node47
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node49
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node50
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node51
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node52
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node53
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node54
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node55
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node56
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node60
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node61
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node62
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node63
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node64
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node65
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node66
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node67
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node68
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node73
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node74
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node75
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node76
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node77
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node78
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node100
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node101
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node102
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node103
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node104
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node105
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node106
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node107
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node108
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node109
>> [2022-07-19T15:17:58.356] debug: layouts: 71/71 nodes in hash table,
>> rc=0
>> [2022-07-19T15:17:58.356] debug: layouts: loading stage 1
>> [2022-07-19T15:17:58.356] debug: layouts: loading stage 1.1 (restore
>> state)
>> [2022-07-19T15:17:58.356] debug: layouts: loading stage 2
>> [2022-07-19T15:17:58.356] debug: layouts: loading stage 3
>> [2022-07-19T15:17:58.356] error: Node state file
>> /var/lib/slurm-llnl/slurmctld/node_state too small
>> [2022-07-19T15:17:58.356] error: NOTE: Trying backup state save file.
>> Information may be lost!
>> [2022-07-19T15:17:58.356] debug3: Version string in node_state header
>> is PROTOCOL_VERSION
>> [2022-07-19T15:17:58.357] Recovered state of 71 nodes
>> [2022-07-19T15:17:58.357] error: Job state file
>> /var/lib/slurm-llnl/slurmctld/job_state too small
>> [2022-07-19T15:17:58.357] error: NOTE: Trying backup state save file.
>> Jobs may be lost!
>> [2022-07-19T15:17:58.357] error: Incomplete job state save file
>> [2022-07-19T15:17:58.357] Recovered information about 0 jobs
>> [2022-07-19T15:17:58.357] cons_res: select_p_node_init
>> [2022-07-19T15:17:58.357] cons_res: preparing for 7 partitions
>> [2022-07-19T15:17:58.357] debug: Ports available for reservation
>> 10000-30000
>> [2022-07-19T15:17:58.359] debug2: init_requeue_policy:
>> kill_invalid_depend is set to 0
>> [2022-07-19T15:17:58.359] debug: Updating partition uid access list
>> [2022-07-19T15:17:58.359] debug3: Version string in resv_state header
>> is PROTOCOL_VERSION
>> [2022-07-19T15:17:58.359] Recovered state of 0 reservations
>> [2022-07-19T15:17:58.359] State of 0 triggers recovered
>
>
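Regarding the suggestion to start slurmctld from scratch: if the state
files really are beyond repair, one rough outline (a sketch only, assuming
the stock 17.02 slurmctld options and that the daemon runs as the "slurm"
user) would be to set the corrupted StateSaveLocation aside and restart
the controller with its clean-start option:

# stop the controller (or use whatever init script manages it on this host)
systemctl stop slurmctld
# move the corrupted state directory aside and recreate it
mv /var/lib/slurm-llnl/slurmctld /var/lib/slurm-llnl/slurmctld.broken
mkdir -p /var/lib/slurm-llnl/slurmctld
chown slurm: /var/lib/slurm-llnl/slurmctld
# start with -c to ignore any previously saved job/node state
# (all running and pending jobs are lost)
slurmctld -c

The accounting database is untouched by this; any stale RUNNING entries
would still need the runaway-jobs cleanup mentioned above.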
--
Julien Rey
Plate-forme RPBS
Unité BFA - CMPLI
Université de Paris
tel: 01 57 27 83 95