[slurm-users] job restart :: how to find the reason

Adrian Sevcenco Adrian.Sevcenco at spacescience.ro
Wed Dec 2 15:58:41 UTC 2020


On 12/2/20 4:18 PM, Paul Edmon wrote:
> You can dig through the slurmctld log and search for the JobID. That should tell you what Slurm was doing at the time.
Aha, thanks a lot! Searching the slurmctld log for the JobID found the culprit:
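For anyone hitting the same thing, a plain grep does it (the log path below is just an example; the real one is whatever SlurmctldLogFile in slurm.conf points to):

  grep 'JobId=29594' /var/log/slurm/slurmctld.log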

[2020-12-02T06:45:14.200] error: Nodes issaf-0-1 not responding
[2020-12-02T06:45:28.212] requeue job JobId=29594 due to failure of node issaf-0-1
[2020-12-02T06:45:28.212] Requeuing JobId=29594
.......

[2020-12-02T06:45:28.213] error: Nodes issaf-0-1 not responding, setting DOWN
[2020-12-02T06:45:28.248] Node issaf-0-1 now responding
[2020-12-02T06:45:28.248] node_did_resp: node issaf-0-1 returned to service
[2020-12-02T06:45:28.700] _job_complete: JobId=29594 WTERMSIG 15
[2020-12-02T06:45:28.700] _job_complete: JobId=29594 cancelled by interactive user
......

[2020-12-02T06:47:30.304] sched: Allocate JobId=29594 NodeList=issaf-0-1 #CPUs=1 Partition=CLUSTER
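As a side note, the Reason for a node being set DOWN can usually also be checked from the node side, e.g. with something like "sinfo -R" or "scontrol show node issaf-0-1", although here the node returned to service almost immediately.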

The weird thing is that I have continuous monitoring (ganglia) data for that node, but that is beyond the scope of this list.

Thanks a lot!
Adrian


> 
> -Paul Edmon-
> 
> On 12/2/2020 6:27 AM, Adrian Sevcenco wrote:
>> Hi! I encountered a situation where a bunch of jobs were restarted,
>> which shows up as Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
>>
>> So, I would like to know how I can find out why there was a Requeue
>> (when there is only one partition defined) and why there was a restart.
>>
>> Thanks a lot!!!
>> Adrian
>>
> 



