<div dir="ltr">Hey,<div><br></div><div>You can use the 'defer' scheduler parameter (<a href="https://slurm.schedmd.com/sched_config.html">https://slurm.schedmd.com/sched_config.html</a>) if you don't require jobs to start immediately.</div><div><br></div><div>Best regards,</div><div>Maciej Pawlik</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 28 Aug 2020 at 12:32, navin srivastava <<a href="mailto:navin.altair@gmail.com">navin.altair@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi Team,<br><div><br></div><div>We are facing an issue. Several users are submitting 20000 jobs in a single batch, and the jobs are very short (say 1-2 seconds each). While the jobs are being submitted, slurmctld becomes unresponsive and sbatch starts reporting errors:</div><div><br></div><div><div style="box-sizing:border-box;font-family:"Segoe UI",system-ui,"Apple Color Emoji","Segoe UI Emoji",sans-serif;font-size:14px">Sending job 6e508a88155d9bec40d752c8331d7ae8 to queue.<br>
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)<br>
Sending job 6e51ed0e322c87802b0f3a2f23a7967f to queue.<br>
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)<br>
Sending job 6e638939f90cd59e60c23b8450af9839 to queue.<br>
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)<br>
Sending job 6e6acf36bc7e1394a92155a95feb1c92 to queue.<br>
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)<br>
Sending job 6e6c646a29f0ad4e9df35001c367a9f5 to queue.<br>
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)<br>
Sending job 6ebcecb4c27d88f0f48d402e2b079c52 to queue.</div></div><div style="box-sizing:border-box;font-family:"Segoe UI",system-ui,"Apple Color Emoji","Segoe UI Emoji",sans-serif;font-size:14px"><br></div><div style="box-sizing:border-box;font-family:"Segoe UI",system-ui,"Apple Color Emoji","Segoe UI Emoji",sans-serif;font-size:14px">At the same time, the CPU usage of the slurmctld process rose above 100%.</div><div style="box-sizing:border-box;font-family:"Segoe UI",system-ui,"Apple Color Emoji","Segoe UI Emoji",sans-serif;font-size:14px">I found that the nodes are not able to acknowledge the server immediately; they are moving from comp to idle.<br></div><div style="box-sizing:border-box;font-family:"Segoe UI",system-ui,"Apple Color Emoji","Segoe UI Emoji",sans-serif;font-size:14px">I think delaying the scheduling cycle would help here. Any idea how this can be done?</div><div style="box-sizing:border-box;font-family:"Segoe UI",system-ui,"Apple Color Emoji","Segoe UI Emoji",sans-serif;font-size:14px"><br></div><div style="box-sizing:border-box;font-family:"Segoe UI",system-ui,"Apple Color Emoji","Segoe UI Emoji",sans-serif;font-size:14px">Or is there any other solution available for this kind of issue?</div><div style="box-sizing:border-box;font-family:"Segoe UI",system-ui,"Apple Color Emoji","Segoe UI Emoji",sans-serif;font-size:14px"><br></div><div style="box-sizing:border-box;font-family:"Segoe UI",system-ui,"Apple Color Emoji","Segoe UI Emoji",sans-serif;font-size:14px">Regards,</div><div style="box-sizing:border-box;font-family:"Segoe UI",system-ui,"Apple Color Emoji","Segoe UI Emoji",sans-serif;font-size:14px">Navin.</div><div style="box-sizing:border-box;font-family:"Segoe UI",system-ui,"Apple Color Emoji","Segoe UI Emoji",sans-serif;font-size:14px"><br></div><div style="box-sizing:border-box;font-family:"Segoe UI",system-ui,"Apple Color Emoji","Segoe UI Emoji",sans-serif;font-size:14px"><br></div><div style="box-sizing:border-box;font-family:"Segoe UI",system-ui,"Apple Color Emoji","Segoe UI Emoji",sans-serif;font-size:14px"><br></div></div>
</blockquote></div>
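<div dir="ltr"><div><br></div><div>For reference, a minimal sketch of how the 'defer' option from the reply at the top of this thread might be enabled. It assumes a standard slurm.conf that does not yet set SchedulerParameters; if your site already sets it, add defer to the existing comma-separated list instead of adding a second line. The comments are illustrative and not taken from the original messages.</div><div><br></div><pre>
# slurm.conf (fragment, illustrative only)
# With 'defer', slurmctld does not try to schedule each job individually
# at submit time; jobs are started by a later scheduling pass instead.
# This can help responsiveness during bursts of submissions, at the cost
# of a delay before individual jobs start.
SchedulerParameters=defer
</pre><div>After editing the file, running "scontrol reconfigure" should make slurmctld pick up the change, and "scontrol show config | grep SchedulerParameters" can be used to check the active value. Test on a non-production controller first if possible.</div></div>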