[slurm-users] Job failure issue in Slurm

navin srivastava navin.altair at gmail.com
Mon Jun 8 16:11:36 UTC 2020


Thanks, Sathish.

All other jobs are running fine across the cluster, so I don't think it is
related to any PAM module issue. I am investigating the issue further and
will come back to you with more details.
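
For reference, the usual way to pull more detail on job 1357498 would be
along these lines (assuming accounting via slurmdbd is enabled; scontrol
only works while slurmctld still holds the job record):

  sacct -j 1357498 --format=JobID,State,ExitCode,Elapsed,MaxRSS,NodeList
  scontrol show job 1357498 | grep -E 'StdOut|StdErr|Command'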

Regards
Navin


On Mon, Jun 8, 2020, 19:24 sathish <sathish.sathishkumar at gmail.com> wrote:

> Hi Navin,
>
> Was this working earlier, or is this the first time you are trying it?
> Are you using a PAM module? If yes, try disabling it and see if it
> works.
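>
> The places to check would be roughly the following (typical paths, which
> depend on how Slurm was packaged on your systems):
>
>   grep -i UsePAM /etc/slurm/slurm.conf
>   ls /etc/pam.d/slurm           # PAM service used by slurmd when UsePAM=1
>   grep pam_slurm /etc/pam.d/*   # pam_slurm / pam_slurm_adopt entries, if any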
>
> Thanks
> Sathish
>
> On Thu, Jun 4, 2020 at 10:47 PM navin srivastava <navin.altair at gmail.com>
> wrote:
>
>> Hi Team,
>>
>> I am seeing a weird issue in my environment.
>> One of the Gaussian jobs is failing under Slurm within a minute of
>> starting execution, without writing anything, and I am unable to figure
>> out the reason.
>> The same job works fine without Slurm on the same node.
>>
>> slurmctld.log
>>
>> [2020-06-03T19:14:33.170] debug:  Job 1357498 has more than one partition
>> (normal)(21052)
>> [2020-06-03T19:14:33.170] debug:  Job 1357498 has more than one partition
>> (normalGPUsmall)(21052)
>> [2020-06-03T19:14:33.170] debug:  Job 1357498 has more than one partition
>> (normalGPUbig)(21052)
>> [2020-06-03T19:15:12.497] debug:  sched: JobId=1357498. State=PENDING.
>> Reason=Priority, Priority=21052.
>> Partition=normal,normalGPUsmall,normalGPUbig.
>> [2020-06-03T19:15:12.497] debug:  sched: JobId=1357498. State=PENDING.
>> Reason=Priority, Priority=21052.
>> Partition=normal,normalGPUsmall,normalGPUbig.
>> [2020-06-03T19:15:12.497] debug:  sched: JobId=1357498. State=PENDING.
>> Reason=Priority, Priority=21052.
>> Partition=normal,normalGPUsmall,normalGPUbig.
>> [2020-06-03T19:16:12.626] debug:  sched: JobId=1357498. State=PENDING.
>> Reason=Priority, Priority=21052.
>> Partition=normal,normalGPUsmall,normalGPUbig.
>> [2020-06-03T19:17:12.753] debug:  sched: JobId=1357498. State=PENDING.
>> Reason=Priority, Priority=21052.
>> Partition=normal,normalGPUsmall,normalGPUbig.
>> [2020-06-03T19:18:12.882] debug:  sched: JobId=1357498. State=PENDING.
>> Reason=Priority, Priority=21052.
>> Partition=normal,normalGPUsmall,normalGPUbig.
>> [2020-06-03T19:19:13.633] sched: Allocate JobID=1357498 NodeList=oled4
>> #CPUs=4 Partition=normal
>> [2020-06-04T12:25:36.961] _job_complete: JobID=1357498 State=0x1
>> NodeCnt=1 WEXITSTATUS 2
>> [2020-06-04T12:25:36.961]  SLURM Job_id=1357498 Name=job1 Ended, Run time
>> 17:06:23, FAILED, ExitCode 2
>> [2020-06-04T12:25:36.962] _job_complete: JobID=1357498 State=0x8005
>> NodeCnt=1 done
>>
>> slurmd.log
>>
>> [2020-06-04T12:22:43.712] [1357498.batch] debug:  jag_common_poll_data:
>> Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time
>> 164642.840000(164537+105)
>> [2020-06-04T12:23:13.712] [1357498.batch] debug:  jag_common_poll_data:
>> Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time
>> 164762.820000(164657+105)
>> [2020-06-04T12:23:43.712] [1357498.batch] debug:  jag_common_poll_data:
>> Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time
>> 164882.810000(164777+105)
>> [2020-06-04T12:24:13.712] [1357498.batch] debug:  jag_common_poll_data:
>> Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time
>> 165002.790000(164897+105)
>> [2020-06-04T12:24:43.712] [1357498.batch] debug:  jag_common_poll_data:
>> Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time
>> 165122.770000(165016+105)
>> [2020-06-04T12:25:13.713] [1357498.batch] debug:  jag_common_poll_data:
>> Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time
>> 165242.750000(165136+105)
>> [2020-06-04T12:25:36.955] [1357498.batch] task 0 (64084) exited with exit
>> code 2.
>> [2020-06-04T12:25:36.955] [1357498.batch] debug:  task_p_post_term:
>> affinity 1357498.4294967294, task 0
>> [2020-06-04T12:25:36.960] [1357498.batch] debug:
>>  step_terminate_monitor_stop signaling condition
>> [2020-06-04T12:25:36.960] [1357498.batch] job 1357498 completed with
>> slurm_rc = 0, job_rc = 512
>> [2020-06-04T12:25:36.960] [1357498.batch] sending
>> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 512
>> [2020-06-04T12:25:36.961] [1357498.batch] debug:  Message thread exited
>> [2020-06-04T12:25:36.962] [1357498.batch] done with job
>> [2020-06-04T12:25:36.962] debug:  task_p_slurmd_release_resources:
>> affinity jobid 1357498
>> [2020-06-04T12:25:36.962] debug:  credential for job 1357498 revoked
>> [2020-06-04T12:25:36.963] debug:  Waiting for job 1357498's prolog to
>> complete
>> [2020-06-04T12:25:36.963] debug:  Finished wait for job 1357498's prolog
>> to complete
>> [2020-06-04T12:25:36.963] debug:  [job 1357498] attempting to run epilog
>> [/etc/slurm/slurm.epilog.clean]
>> [2020-06-04T12:25:37.254] debug:  completed epilog for jobid 1357498
>> [2020-06-04T12:25:37.254] debug:  Job 1357498: sent epilog complete msg:
>> rc = 0
>>
>> Any suggestions for troubleshooting this issue further would be welcome.
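>>
>> One comparison that might be relevant, since the job only misbehaves
>> under Slurm: the limits seen inside a job step versus a plain login on
>> the same node (oled4 from the allocation above; ulimit is a shell
>> builtin, hence the bash -c):
>>
>>   srun -w oled4 bash -c 'ulimit -a'
>>   ssh oled4 'ulimit -a'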
>>
>> Regards
>> Navin.
>>
>>
>>
>>
>
> --
> Regards.....
> Sathish
>
