[slurm-users] slurmctld up and running but not really working

Julien Rey julien.rey at univ-paris-diderot.fr
Wed Jul 20 12:19:54 UTC 2022


Hello,

Thanks for your quick reply.

I don't mind losing job information, but I certainly don't want to clear 
the slurm database altogether.

The /var/lib/slurm-llnl/slurmctld/node_state and 
/var/lib/slurm-llnl/slurmctld/node_state.old files do indeed look 
empty. I then ran the following command:

sacct | grep RUNNING

and found about 253 jobs.
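
To narrow the listing down to just those jobs (instead of grepping the 
default output), I suppose something along these lines should also work; 
the start date below is only a placeholder, set far enough back to catch 
them all:

sacct --state=RUNNING --starttime=2020-01-01 --noheader \
      --format=JobID,JobName,Partition,State,Start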

Is there any elegant way to remove these jobs from the database?
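
(I was also wondering whether sacctmgr's runaway jobs check could be the 
elegant way, if that subcommand already exists in 17.02; as far as I 
understand, something like

sacctmgr show runawayjobs

should list the jobs that the database still considers running but that 
slurmctld no longer knows about, and offer to fix them.)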

J.

On 19/07/2022 at 20:29, Ole Holm Nielsen wrote:
> Hi Julien,
>
> Apparently your slurmdbd is quite happy, but it seems that your 
> slurmctld StateSaveLocation has been corrupted:
>
>> [2022-07-19T15:17:58.356] error: Node state file 
>> /var/lib/slurm-llnl/slurmctld/node_state too small
>> [2022-07-19T15:17:58.356] error: NOTE: Trying backup state save file. 
>> Information may be lost!
>> [2022-07-19T15:17:58.356] debug3: Version string in node_state header 
>> is PROTOCOL_VERSION
>> [2022-07-19T15:17:58.357] Recovered state of 71 nodes
>> [2022-07-19T15:17:58.357] error: Job state file 
>> /var/lib/slurm-llnl/slurmctld/job_state too small
>> [2022-07-19T15:17:58.357] error: NOTE: Trying backup state save file. 
>> Jobs may be lost!
>> [2022-07-19T15:17:58.357] error: Incomplete job state save file 
>
> Did something bad happen to your storage of 
> /var/lib/slurm-llnl/slurmctld/?  Could you possibly restore this 
> folder from the last backup?
>
> I don't know if it's possible to recover from a corrupted slurmctld 
> StateSaveLocation; perhaps someone else here has experience with this?
>
> Even if you could restore it, the Slurm database probably needs to be 
> consistent with your slurmctld StateSaveLocation, and I don't know if 
> this is feasible...
>
> Could you reinitialize your Slurm 17.02.11 installation and start it 
> from scratch?
>
> Regarding an upgrade from 17.02 or 17.11, you may find some useful 
> notes in my Wiki page 
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
>
> /Ole
>
>
> On 19-07-2022 16:28, Julien Rey wrote:
>> I am currently facing an issue with an old install of slurm 
>> (17.02.11). However, I cannot upgrade this version because I had 
>> trouble with database migration in the past (when upgrading to 
>> 17.11), and this install is set to be replaced in the coming 
>> months. For the time being I have to keep it running because some of 
>> our services still rely on it.
>>
>> This issue occurred after a power outage.
>>
>> slurmctld is up and running, however, when I enter "sinfo", I end up 
>> with this message after a few minutes:
>>
>> slurm_load_partitions: Unable to contact slurm controller (connect 
>> failure)
>>
>> I set SlurmctldDebug=7 in slurm.conf and DebugLevel=7 in 
>> slurmdbd.conf; however, I don't see much information in the logs 
>> about any specific error that would prevent the slurm controller 
>> from working.
>>
>> Any help would be greatly appreciated.
>>
>> /var/log/slurm-llnl/slurmctld.log:
>>
>> [2022-07-19T15:17:58.342] debug3: Version in assoc_usage header is 7936
>> [2022-07-19T15:17:58.345] debug3: Version in qos_usage header is 7936
>> [2022-07-19T15:17:58.345] debug:  Reading slurm.conf file: 
>> /etc/slurm-llnl/slurm.conf
>> [2022-07-19T15:17:58.347] debug:  Ignoring obsolete SchedulerPort 
>> option.
>> [2022-07-19T15:17:58.347] debug3: layouts: layouts_init()...
>> [2022-07-19T15:17:58.347] layouts: no layout to initialize
>> [2022-07-19T15:17:58.347] debug3: Trying to load plugin 
>> /usr/local/lib/slurm/topology_none.so
>> [2022-07-19T15:17:58.347] topology NONE plugin loaded
>> [2022-07-19T15:17:58.347] debug3: Success.
>> [2022-07-19T15:17:58.348] debug:  No DownNodes
>> [2022-07-19T15:17:58.348] debug3: Version in last_conf_lite header is 
>> 7936
>> [2022-07-19T15:17:58.349] debug3: Trying to load plugin 
>> /usr/local/lib/slurm/jobcomp_none.so
>> [2022-07-19T15:17:58.349] debug3: Success.
>> [2022-07-19T15:17:58.349] debug3: Trying to load plugin 
>> /usr/local/lib/slurm/sched_backfill.so
>> [2022-07-19T15:17:58.349] sched: Backfill scheduler plugin loaded
>> [2022-07-19T15:17:58.349] debug3: Success.
>> [2022-07-19T15:17:58.350] debug3: Trying to load plugin 
>> /usr/local/lib/slurm/route_default.so
>> [2022-07-19T15:17:58.350] route default plugin loaded
>> [2022-07-19T15:17:58.350] debug3: Success.
>> [2022-07-19T15:17:58.355] layouts: loading entities/relations 
>> information
>> [2022-07-19T15:17:58.355] debug3: layouts: loading node node0
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node1
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node2
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node3
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node4
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node5
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node6
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node7
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node8
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node9
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node10
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node11
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node12
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node13
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node14
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node15
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node16
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node17
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node18
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node19
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node20
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node21
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node22
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node23
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node24
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node25
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node26
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node27
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node28
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node29
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node30
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node31
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node42
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node43
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node44
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node45
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node46
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node47
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node49
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node50
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node51
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node52
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node53
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node54
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node55
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node56
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node60
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node61
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node62
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node63
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node64
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node65
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node66
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node67
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node68
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node73
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node74
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node75
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node76
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node77
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node78
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node100
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node101
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node102
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node103
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node104
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node105
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node106
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node107
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node108
>> [2022-07-19T15:17:58.356] debug3: layouts: loading node node109
>> [2022-07-19T15:17:58.356] debug:  layouts: 71/71 nodes in hash table, 
>> rc=0
>> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 1
>> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 1.1 (restore 
>> state)
>> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 2
>> [2022-07-19T15:17:58.356] debug:  layouts: loading stage 3
>> [2022-07-19T15:17:58.356] error: Node state file 
>> /var/lib/slurm-llnl/slurmctld/node_state too small
>> [2022-07-19T15:17:58.356] error: NOTE: Trying backup state save file. 
>> Information may be lost!
>> [2022-07-19T15:17:58.356] debug3: Version string in node_state header 
>> is PROTOCOL_VERSION
>> [2022-07-19T15:17:58.357] Recovered state of 71 nodes
>> [2022-07-19T15:17:58.357] error: Job state file 
>> /var/lib/slurm-llnl/slurmctld/job_state too small
>> [2022-07-19T15:17:58.357] error: NOTE: Trying backup state save file. 
>> Jobs may be lost!
>> [2022-07-19T15:17:58.357] error: Incomplete job state save file
>> [2022-07-19T15:17:58.357] Recovered information about 0 jobs
>> [2022-07-19T15:17:58.357] cons_res: select_p_node_init
>> [2022-07-19T15:17:58.357] cons_res: preparing for 7 partitions
>> [2022-07-19T15:17:58.357] debug:  Ports available for reservation 
>> 10000-30000
>> [2022-07-19T15:17:58.359] debug2: init_requeue_policy: 
>> kill_invalid_depend is set to 0
>> [2022-07-19T15:17:58.359] debug:  Updating partition uid access list
>> [2022-07-19T15:17:58.359] debug3: Version string in resv_state header 
>> is PROTOCOL_VERSION
>> [2022-07-19T15:17:58.359] Recovered state of 0 reservations
>> [2022-07-19T15:17:58.359] State of 0 triggers recovered
>
>
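
PS: regarding the suggestion above to reinitialize 17.02.11 from 
scratch, if restoring /var/lib/slurm-llnl/slurmctld/ from a backup turns 
out not to be an option, I imagine the procedure would be roughly the 
following (untested on my side; the path is just our StateSaveLocation, 
and I'm assuming SlurmUser=slurm):

service slurmctld stop
mv /var/lib/slurm-llnl/slurmctld /var/lib/slurm-llnl/slurmctld.broken
mkdir -p /var/lib/slurm-llnl/slurmctld
chown slurm:slurm /var/lib/slurm-llnl/slurmctld
service slurmctld start

With an empty StateSaveLocation the controller should simply come back 
with no saved jobs, and the accounting side would then have to be 
cleaned up separately (hence my question about the 253 RUNNING jobs 
above).
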
-- 
Julien Rey

Plate-forme RPBS
Unité BFA - CMPLI
Université de Paris
tel: 01 57 27 83 95



