<div dir="ltr">Hey Raj,<div><br></div><div>At a high level, this sounds to me like a job for some kind of lightweight middleware on top of Slurm, e.g. makefiles or something similar. Each pipeline would be managed outside of Slurm: it might submit one job to install some software, a second job to run something on that node, and a third job to clean up or remove the software. It would have to interact with several of the Slurm features that have been mentioned in this thread, such as features, licenses, job dependencies, or GRES.</div><div><br></div><div>Snakemake might be an example, but there are many others.</div><div><br></div><div>Regards,</div><div>Alex</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jul 10, 2020 at 11:14 AM Raj Sahae <<a href="mailto:rsahae@tesla.com">rsahae@tesla.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div lang="EN-US">
<div class="gmail-m_-6673707000695490390WordSection1">
<p class="MsoNormal"><span style="font-size:10pt;font-family:Arial,sans-serif">Hi Paddy,<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:Arial,sans-serif"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:Arial,sans-serif">Yes, this is a CI/CD pipeline. We currently use Jenkins pipelines, but Jenkins has some significant drawbacks that Slurm solves out of the box, which makes Slurm an attractive alternative.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:Arial,sans-serif">You noted some of them already, such as good real-time queue management, preemption, node weighting, and high-resolution priority queueing.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:Arial,sans-serif">Jenkins also doesn’t scale as well w.r.t. node management; it’s quite resource-heavy.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:Arial,sans-serif"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:Arial,sans-serif">My original email was a bit wordy but I should emphasize that if we want Slurm to do the exact same thing as our current Jenkins pipeline, we can already do that and it works
reasonably well.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:Arial,sans-serif">Now I’m trying to move beyond feature parity and am having trouble doing so.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:Arial,sans-serif"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:Arial,sans-serif">Thanks,<u></u><u></u></span></p>
<div>
<p class="MsoNormal"><span style="font-size:10pt;font-family:Arial,sans-serif;color:rgb(192,0,0)"><u></u> <u></u></span></p>
</div>
<p class="MsoNormal"><b><span style="font-size:10pt;font-family:Arial,sans-serif;color:rgb(127,127,127)">Raj Sahae | </span></b><span style="font-size:10pt;font-family:Arial,sans-serif;color:rgb(127,127,127)">m. +1 (408) 230-8531</span><span style="font-size:10pt;font-family:Arial,sans-serif"><u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:Arial,sans-serif"><u></u> <u></u></span></p>
<div style="border-right:none;border-bottom:none;border-left:none;border-top:1pt solid rgb(181,196,223);padding:3pt 0in 0in">
<p class="MsoNormal"><b><span style="font-size:12pt;color:black">From: </span></b><span style="font-size:12pt;color:black">slurm-users <<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>> on behalf of Paddy Doyle <<a href="mailto:paddy@tchpc.tcd.ie" target="_blank">paddy@tchpc.tcd.ie</a>><br>
<b>Reply-To: </b>Slurm User Community List <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>><br>
<b>Date: </b>Friday, July 10, 2020 at 10:31 AM<br>
<b>To: </b>Slurm User Community List <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>><br>
<b>Subject: </b>Re: [slurm-users] How to queue jobs based on non-existent features<u></u><u></u></span></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<p class="MsoNormal">Hi Raj,<br>
<br>
It sounds like you might be coming from a CI/CD pipeline setup, but just in<br>
case you're not, would you consider something like Jenkins or Gitlab CI<br>
instead of Slurm?<br>
<br>
The users could create multi-stage pipelines, with the 'build' stage<br>
installing the required software version, and then multiple 'test' stages<br>
to run the tests.<br>
<br>
It's not the same idea as queuing up multiple jobs. Nor do you get queue<br>
priorities or weighting and all of that good stuff from Slurm that you are<br>
looking for.<br>
<br>
Within Slurm, yeah writing custom JobSubmitPlugins and NodeFeaturesPlugins<br>
might be required.<br>
<br>
Paddy<br>
<br>
On Thu, Jul 09, 2020 at 11:15:57PM +0000, Raj Sahae wrote:<br>
<br>
> Hi all,<br>
> <br>
> My apologies if this is sent twice. The first time I sent it without my subscription to the list being complete.<br>
> <br>
> I am attempting to use Slurm as a test automation system for its fairly advanced queueing and job control abilities, and also because it scales very well.<br>
> However, since our use case is a bit outside the standard usage of Slurm, we are hitting some issues that don’t appear to have obvious solutions.<br>
> <br>
> In our current setup, the Slurm nodes are hosts attached to a test system. Our pipeline (greatly simplified) would be to install some software on the test system and then run sets of tests against it.<br>
> In our old pipeline, this was done in a single job, however with Slurm I was hoping to decouple these two actions as it makes the entire pipeline more robust to update failures and would give us more finely grained job control for the actual test run.<br>
> <br>
> I would like to allow users to queue jobs with constraints indicating which software version they need. Then separately some automated job would scan the queue, see jobs that are not being allocated due to missing resources, and queue software installs appropriately.
We attempted to do this using the Active/Available Features configuration. We use HealthCheck and Epilog scripts to scrape the test system for software properties (version, commit, etc.) and assign them as Features. Once an install is complete and the Features
are updated, queued jobs would start to be allocated on those nodes.<br>
> <br>
> Herein lies the conundrum. If a user submits a job, constraining to run on Version A, but all nodes in the cluster are currently configured with Features=Version-B, Slurm will fail to queue the job, indicating an invalid feature specification. I completely
understand why Features are implemented this way, so my question is, is there some workaround or other Slurm capabilities that I could use to achieve this behavior? Otherwise my options seem to be:<br>
> <br>
> 1. Go back to how we did it before. The pipeline would have the same level of robustness as before but at least we would still be able to leverage other queueing capabilities of Slurm.<br>
> 2. Write our own Feature or Job Submit plugin that customizes this behavior just for us. Seems possible but adds lead time and complexity to the situation.<br>
> <br>
> It's not feasible to update the config for all branches/versions/commits to be AvailableFeatures, as our branch ecosystem is quite large and the maintenance of that approach would not scale well.<br>
> <br>
> Thanks,<br>
> <br>
> Raj Sahae | Manager, Software QA<br>
> 3500 Deer Creek Rd, Palo Alto, CA 94304<br>
> m. +1 (408) 230-8531 | <a href="mailto:rsahae@tesla.com" target="_blank">rsahae@tesla.com</a><br>
> <br>
<br>
<br>
<br>
-- <br>
Paddy Doyle<br>
Research IT / Trinity Centre for High Performance Computing,<br>
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.<br>
Phone: +353-1-896-3725<br>
<a href="https://www.tchpc.tcd.ie" target="_blank">https://www.tchpc.tcd.ie/</a><br>
<br>
<u></u><u></u></p>
</div>
</div>
</blockquote></div>
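<div>As a rough illustration of the middleware idea in Alex's reply at the top of this thread (install job, then test job, then cleanup job, chained with job dependencies), here is a minimal Python sketch. It is only one possible shape for such a wrapper, not anything proposed in the thread. The names <code>sbatch_submit</code> and <code>run_pipeline</code>, the script names, and the node name are hypothetical. The sketch assumes the standard <code>sbatch</code> options <code>--parsable</code>, <code>--nodelist</code> and <code>--dependency</code>.</div>
<pre>
```python
import subprocess
from typing import Callable, List


def sbatch_submit(args: List[str]) -> str:
    """Submit a job with the real sbatch CLI and return its job ID."""
    result = subprocess.run(
        ["sbatch", "--parsable", *args],
        check=True,
        capture_output=True,
        text=True,
    )
    # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
    return result.stdout.strip().split(";")[0]


def run_pipeline(
    node: str,
    install_script: str,
    test_script: str,
    cleanup_script: str,
    submit: Callable[[List[str]], str] = sbatch_submit,
) -> List[str]:
    """Queue install, test and cleanup jobs on one node, chained by dependencies.

    The submit callable is injectable so the chain can be exercised
    without a Slurm cluster (hypothetical design choice for this sketch).
    """
    # Step 1: install the software on the chosen node.
    install_id = submit([f"--nodelist={node}", install_script])
    # Step 2: run the tests only if the install job completed successfully.
    test_id = submit(
        [f"--nodelist={node}", f"--dependency=afterok:{install_id}", test_script]
    )
    # Step 3: clean up once the test job has terminated, whatever its result.
    cleanup_id = submit(
        [f"--nodelist={node}", f"--dependency=afterany:{test_id}", cleanup_script]
    )
    return [install_id, test_id, cleanup_id]
```
</pre>
<div>A call such as <code>run_pipeline("node01", "install.sh", "test.sh", "cleanup.sh")</code> would queue all three jobs up front and leave the ordering to Slurm. If the install job fails, the test job's <code>afterok</code> dependency can never be satisfied, and how such jobs are handled depends on the site's scheduler configuration, so a real wrapper would need to account for that case.</div>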