[slurm-users] slurmctld segfaulting

Mon Mar 1 21:23:28 UTC 2021

Hello,
I have a ticket open with SchedMD, but the community may have seen this issue before and have a quick answer.

Slurmctld segfaulted (signal 11) on us and now segfaults on restart. I'm not aware of an obvious trigger for this behavior.
We upgraded this cluster from 20.02.5 to 20.11.4 a week ago (Feb 23rd).
Slurmdbd is running on a different machine than the scheduler and seems to be OK: there are no obvious errors, and sacct returns information.

The last log lines before the crash were:

[2021-03-01T08:28:20.944] error: The modification time of /no_backup/shared/slurm/slurmstate/job_state moved backwards by 31 seconds
[2021-03-01T08:28:20.944] error: The clock of the file system and this computer appear to not be synchronized
[2021-03-01T08:28:30.072] error: Nodes un1 not responding
[2021-03-01T08:30:33.208] error: Nodes un1 not responding, setting DOWN
[2021-03-01T08:31:02.240] error: job_resources_node_inx_to_cpu_inx: no job_resrcs or node_bitmap
[2021-03-01T08:31:02.241] error: job_update_tres_cnt: problem getting offset of JobId=2386112_2091(2386112)
[2021-03-01T08:31:02.241] cleanup_completing: JobId=2386112_2091(2386112) completion process took 478645 seconds

The modification time error looks like it has been there for a while, and we need to check the NTP service on the file server (the Slurm state directory is NFS-mounted). The ntpd service is working on the scheduler and its time seems correct, though someone may have fixed it after the crash and before I got on site.
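
One rough way to measure the offset between the file server's clock and the scheduler's is to create a file in the state directory and compare its mtime with the local clock. This is a minimal sketch, not a Slurm tool: the function name `mtime_skew_seconds` is made up for illustration, and it assumes the directory is writable and that the NFS server stamps the mtime of newly written files. NFS attribute caching and timestamp granularity can blur small offsets.

```python
import os
import tempfile
import time


def mtime_skew_seconds(directory):
    """Return (file mtime - local clock) for a file freshly created in `directory`.

    On an NFS mount the mtime is normally stamped by the file server, so a
    value far from zero hints that the two clocks disagree.
    """
    before = time.time()
    fd, path = tempfile.mkstemp(dir=directory, prefix="clock_check_")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(b"x")
            handle.flush()
            os.fsync(handle.fileno())  # push the write out to the server
        mtime = os.stat(path).st_mtime
    finally:
        os.unlink(path)  # never leave the probe file behind
    after = time.time()
    # Compare against the midpoint of the local timestamps taken around the write.
    return mtime - (before + after) / 2.0
```

Calling `mtime_skew_seconds("/no_backup/shared/slurm/slurmstate")` on the scheduler should return a value close to zero if the clocks agree. A result near -31 would match the "moved backwards by 31 seconds" message.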

A restart attempt gives similar errors:

[2021-03-01T13:39:00.054] _sync_nodes_to_comp_job: JobId=2386112_2091(2386112) in completing state
<CUT list of debug2 lines with reasonable usage values, including 900 tres cpu seconds, for job 2386112_2091(2386112)>
[2021-03-01T13:39:00.055] debug2: We have already ran the job_fini for JobId=2386112_2091(2386112)
[2021-03-01T13:39:00.055] select/cons_tres: job_res_rm_job: plugin still initializing
[2021-03-01T13:39:00.055] cleanup_completing: JobId=2386112_2091(2386112) completion process took 497131 seconds
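
The "completion process took N seconds" figures in both log excerpts can be converted back into the time at which the job apparently entered the completing state. This is a minimal sketch that assumes the logged timestamps and durations can be trusted, which the clock warnings make uncertain. The helper name `completion_start` is illustrative and not part of Slurm.

```python
import re
from datetime import datetime, timedelta

# Matches slurmctld "cleanup_completing" lines as quoted in this message.
PATTERN = re.compile(
    r"^\[(?P<ts>[^\]]+)\] cleanup_completing: JobId=(?P<job>\S+) "
    r"completion process took (?P<secs>\d+) seconds"
)


def completion_start(line):
    """Return (job, implied start of completion), or None if the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    logged_at = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
    return match.group("job"), logged_at - timedelta(seconds=int(match.group("secs")))


LINES = [
    "[2021-03-01T08:31:02.241] cleanup_completing: JobId=2386112_2091(2386112) "
    "completion process took 478645 seconds",
    "[2021-03-01T13:39:00.055] cleanup_completing: JobId=2386112_2091(2386112) "
    "completion process took 497131 seconds",
]

for line in LINES:
    job, started = completion_start(line)
    print(job, started.isoformat(timespec="seconds"))
# 2386112_2091(2386112) 2021-02-23T19:33:37
# 2386112_2091(2386112) 2021-02-23T19:33:29
```

Both lines point to about 19:33 on Feb 23rd, the day of the upgrade, so this job may have been stuck in the completing state since then. Whether that is related to the segfault is not established by the logs shown here.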

My guess is that the spooled state for the job mentioned above is corrupt, but I don't know the proper way to verify or fix that.
Has anyone run into this before, and do you have any suggestions on how to solve it?

Thanks.
