<div dir="auto">Thanks sathish.<div dir="auto"><br></div><div dir="auto">All other jobs are running fine across the cluster so I don't think it is related to any pam module issue. I am investigating issue further.i will come back to you with more details</div><div dir="auto"><br></div><div dir="auto">Regards </div><div dir="auto">Navin </div><div dir="auto"><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jun 8, 2020, 19:24 sathish <<a href="mailto:sathish.sathishkumar@gmail.com">sathish.sathishkumar@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Navin, <div><br></div><div>Was this working earlier or is this the first time are you trying ? </div><div>Are you using pam module ? if yes, try disabling the pam module and see if it works. </div><div><br></div><div>Thanks</div><div>Sathish</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 4, 2020 at 10:47 PM navin srivastava <<a href="mailto:navin.altair@gmail.com" target="_blank" rel="noreferrer">navin.altair@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi Team,<br><div><br></div><div>i am seeing a weird issue in my environment.</div><div>one of the gaussian job is failing with the slurm within a minute after it go for the execution without writing anything and unable to figure out the reason.</div><div>The same job works fine without slurm on the same node.<br></div><div><br></div><div>slurmctld.log </div><div><br></div><div>[2020-06-03T19:14:33.170] debug: Job 1357498 has more than one partition (normal)(21052)<br>[2020-06-03T19:14:33.170] debug: Job 1357498 has more than one partition (normalGPUsmall)(21052)<br>[2020-06-03T19:14:33.170] debug: Job 1357498 has more than one partition (normalGPUbig)(21052)<br>[2020-06-03T19:15:12.497] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.<br>[2020-06-03T19:15:12.497] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.<br>[2020-06-03T19:15:12.497] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.<br>[2020-06-03T19:16:12.626] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.<br>[2020-06-03T19:17:12.753] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.<br>[2020-06-03T19:18:12.882] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. 
On Thu, Jun 4, 2020 at 10:47 PM navin srivastava <navin.altair@gmail.com> wrote:

Hi Team,

I am seeing a weird issue in my environment.
One of the Gaussian jobs is failing under Slurm within a minute after it goes into execution, without writing anything, and I am unable to figure out the reason.
The same job works fine without Slurm on the same node.

slurmctld.log

[2020-06-03T19:14:33.170] debug: Job 1357498 has more than one partition (normal)(21052)
[2020-06-03T19:14:33.170] debug: Job 1357498 has more than one partition (normalGPUsmall)(21052)
[2020-06-03T19:14:33.170] debug: Job 1357498 has more than one partition (normalGPUbig)(21052)
[2020-06-03T19:15:12.497] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.
[2020-06-03T19:15:12.497] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.
[2020-06-03T19:15:12.497] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.
[2020-06-03T19:16:12.626] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.
[2020-06-03T19:17:12.753] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.
[2020-06-03T19:18:12.882] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.
[2020-06-03T19:19:13.633] sched: Allocate JobID=1357498 NodeList=oled4 #CPUs=4 Partition=normal
[2020-06-04T12:25:36.961] _job_complete: JobID=1357498 State=0x1 NodeCnt=1 WEXITSTATUS 2
[2020-06-04T12:25:36.961] SLURM Job_id=1357498 Name=job1 Ended, Run time 17:06:23, FAILED, ExitCode 2
[2020-06-04T12:25:36.962] _job_complete: JobID=1357498 State=0x8005 NodeCnt=1 done

slurmd.log

[2020-06-04T12:22:43.712] [1357498.batch] debug: jag_common_poll_data: Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time 164642.840000(164537+105)
[2020-06-04T12:23:13.712] [1357498.batch] debug: jag_common_poll_data: Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time 164762.820000(164657+105)
[2020-06-04T12:23:43.712] [1357498.batch] debug: jag_common_poll_data: Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time 164882.810000(164777+105)
[2020-06-04T12:24:13.712] [1357498.batch] debug: jag_common_poll_data: Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time 165002.790000(164897+105)
[2020-06-04T12:24:43.712] [1357498.batch] debug: jag_common_poll_data: Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time 165122.770000(165016+105)
[2020-06-04T12:25:13.713] [1357498.batch] debug: jag_common_poll_data: Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time 165242.750000(165136+105)
[2020-06-04T12:25:36.955] [1357498.batch] task 0 (64084) exited with exit code 2.
[2020-06-04T12:25:36.955] [1357498.batch] debug: task_p_post_term: affinity 1357498.4294967294, task 0
[2020-06-04T12:25:36.960] [1357498.batch] debug: step_terminate_monitor_stop signaling condition
[2020-06-04T12:25:36.960] [1357498.batch] job 1357498 completed with slurm_rc = 0, job_rc = 512
[2020-06-04T12:25:36.960] [1357498.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 512
[2020-06-04T12:25:36.961] [1357498.batch] debug: Message thread exited
[2020-06-04T12:25:36.962] [1357498.batch] done with job
[2020-06-04T12:25:36.962] debug: task_p_slurmd_release_resources: affinity jobid 1357498
[2020-06-04T12:25:36.962] debug: credential for job 1357498 revoked
[2020-06-04T12:25:36.963] debug: Waiting for job 1357498's prolog to complete
[2020-06-04T12:25:36.963] debug: Finished wait for job 1357498's prolog to complete
[2020-06-04T12:25:36.963] debug: [job 1357498] attempting to run epilog [/etc/slurm/slurm.epilog.clean]
[2020-06-04T12:25:37.254] debug: completed epilog for jobid 1357498
[2020-06-04T12:25:37.254] debug: Job 1357498: sent epilog complete msg: rc = 0

Any suggestions to troubleshoot this issue further are welcome.

Regards
Navin.
--
Regards.....
Sathish
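Since the slurmd log above only shows the batch task exiting with code 2 and nothing being written, one way to narrow this down is to instrument the batch script itself and compare a batch run with an interactive run on oled4. The sketch below assumes the job is a plain sbatch script that loads a Gaussian module and runs g16 on input.com; the module name, binary, and input file are placeholders for whatever the real job script uses.

#!/bin/bash
#SBATCH --job-name=g16-debug
#SBATCH --partition=normal
#SBATCH --ntasks=4
#SBATCH --output=g16-debug-%j.out   # force stdout into a known file
#SBATCH --error=g16-debug-%j.err    # keep stderr separate

set -x                                  # trace every command into the .err file
env | sort  > env-$SLURM_JOB_ID.txt     # environment as seen inside the job
ulimit -a   > limits-$SLURM_JOB_ID.txt  # resource limits applied to the job

# Placeholder invocation: replace with the real module name and input file.
module load gaussian
g16 < input.com > input.log
echo "gaussian exit status: $?"

Comparing env-*.txt and limits-*.txt against the same commands run in an interactive shell on oled4 (where the job works) usually shows what differs. Running sacct -j 1357498 --format=JobID,State,ExitCode,DerivedExitCode,MaxRSS also shows whether the exit code 2 came from the batch script itself or from the application.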