Hello,

We wish to have a scheduling integration with Slurm. Our own application has a backend system which will decide the placement of jobs across hosts & CPU cores. The backend takes its own time to come back with a placement (which may take a few seconds) & we expect Slurm to update it regularly about any change in the current state of available resources.

For this we believe we have 3 options, broadly:

a> We use the cons_tres select plugin & modify it to let it query our backend system for job placements.
b> We write our own select plugin, avoiding any other select plugin.
c> We use an existing select plugin & also register our own plugin. The idea is that our plugin will cater to 'our' jobs (a specific partition, say) while all other jobs would be taken up by the default plugin.

The problem with a> is that it leads to modifying existing plugin code & calling (our) library code from inside the select plugin lib.

With b> the issue is that unless we have the full Slurm cluster to ourselves this isn't viable. Any insight on how to proceed with this? Where should our select plugin, assuming we need to make one, fit in the Slurm integration?

We are not sure whether c> is allowed in Slurm.

We went through the existing select plugins, linear & cons_tres. However, we are not able to figure out how to use them or write something along similar lines to suit our purpose. Any help in this regard is appreciated.

Apologies if this question (or any very similar one) has already been answered; please point to the relevant thread in that case.
Thanks in advance for any pointers.
Regards, Bhaskar.
I'm not sure I understand why your app must decide the placement, rather than telling Slurm about the requirements (this sounds suspiciously like Not Invented Here syndrome), but Slurm does have the '-w' flag to salloc, sbatch and srun.
I just don't understand: if you don't have an entire cluster to yourselves, how can you do a>, not to mention b> or c>? Any change to Slurm's select mechanism is always site-wide.
I might be going out on a limb here, but I think Slurm would probably make better placement choices than your self-developed app, if you can communicate the requirements well enough.
How does your app choose placement and cores? Why can't it communicate those requirements to Slurm instead of making the decision itself?
I can guess at some reasons, and there can be many, including but not limited to: topology, heterogeneous HW with different parts of the app having different HW requirements, results placed on some nodes requiring follow-up jobs to run on the same nodes, NUMA considerations for acceleration cards (including custom, mostly FPGA, cards) etc.
If you describe the placement algorithm (in broad strokes), perhaps we can find a Slurm solution that doesn't require breaking existing sites. And if such a solution exists, how much would it cost to 'degrade' your app to communicate those requirements to Slurm instead of making the placement decisions itself?
It's possible that you would be better off investing in a monitoring solution that covers the 'update it regularly about any change in the current state of available resources' part.
Again, that is also ruled out if you use a site without total ownership - no site will allow you to place jobs without first allocating you the resources, no matter the scheduling solution, which brings us back to using `salloc -w`.
That said, --nodelist has the downside of requesting nodes that might not be available, causing your jobs to starve even while other resources sit idle.
Imagine the following scenario:
1. Your app gets resource availability from Slurm.
2. Your app starts calculating the placement.
3. Meanwhile Slurm allocates those resources.
4. The plugin communicates the need to recalculate placement.
5. Your app restarts its calculation.
6. Meanwhile Slurm allocates the resources your app was now going to use, since it was never told to reserve anything for you.
...
On highly active clusters, with pending queues in the millions, such a starvation scenario is not that far fetched.
Best,
--Dani_L.
Hi Daniel, Thanks for picking up this query. Let me try to briefly describe my problem.
As you rightly guessed, we have some hardware on the backend which would be used for our jobs to run. The app which manages the h/w has its own set of resource placement/remapping rules to place a job. So, e.g., if only 3 hosts h1, h2, h3 (2 cores available each) are available at some point for a 4-core job, then only a few combinations of cores from these hosts can be allowed for the job. Also, there is a preference order of the placements decided by our app.
It's in this respect that we want our backend app to produce the placement for the job. Slurm would then dispatch the job accordingly, honoring the exact resource distribution asked for. In case preemption is needed as well, our backend would decide the placement, which in turn decides which preemptable job candidates to preempt.
So, how should we proceed then? We may not have the whole site/cluster to ourselves. There may be other jobs which we don't care about & hence they should go through the usual route via whichever select plugin is there (linear, cons_tres etc.).
Is there scope for a separate partition which will encompass our resources only & trigger our plugin only for our jobs? How do the options a>, b>, c> stand (as described in my 1st message) now that I have mentioned our requirement?
A 4th option which comes to my mind: is there some API interface in Slurm which can inform a separate process P (say) about resource availability on a real-time basis? P would talk to our backend app, obtain a placement & then ask Slurm to place our job.
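Roughly, what I imagine P doing is something along these lines (only a sketch; the partition name, job script & the backend command are placeholders from our side, not anything Slurm provides):

    # hypothetical outer loop of process P
    while true; do
        # ask Slurm for per-node CPU availability in our partition
        # (%n = node hostname, %C = allocated/idle/other/total CPUs)
        sinfo -h -p our_part -o "%n %C" > avail.txt

        # hand the availability to our backend & get back a host list, e.g. "h1,h2"
        placement=$(our_backend_place --avail avail.txt --job next_job)

        # submit the job pinned to the hosts the backend chose
        sbatch -p our_part -w "$placement" -n 4 our_job.sh

        sleep 5
    done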
Your concern about ever-changing resources (being allocated before our backend comes back) doesn't apply here, as the hosts are segregated as far as our system is concerned. Our hosts will run only our jobs & other Slurm jobs would run on different hosts.
Hope I have made myself a little clearer! Any help would be appreciated.
(Note: We already have a working solution with LSF! LSF does provide an option for custom scheduler plugins that lets one hook into the decision-making loop during scheduling. This led us to believe Slurm would also have some similar possibility.)
Regards, Bhaskar.
Hi Daniel,

Appreciate your response. I think you may be feeling that since we take the placement part of the scheduling to ourselves, Slurm has no other role to play! That's not quite true. Below, in brief, are other important roles which Slurm must perform that presently come to my mind (this may not be exhaustive):

1> Slurm should inform us of the job scheduling priority order. Admins can impose policies (fairshare, user group preference etc.) which can prioritize more recent jobs over older ones & we would like to see them as computed by Slurm on a real-time basis.

2> Slurm should update us with the info on any configured resource limits. Limits can exist on resources like CPU cores, number of running jobs, host group CPU limits etc. Our backend app needs to be updated from time to time about these so that unnecessary allocations are avoided right away.

3> Preemptable job candidates. Admins can mark certain jobs from certain users as preemptable. Our app needs to be informed about that should the need arise to preempt running jobs.

4> Specified host resources for job start. Users may want their jobs to start on specific hosts & the same should be communicated back by Slurm. Similarly, the same applies if a user wants their job to run on a certain set of hosts.

5> Preferential hosts for scheduling. If there is some preferential order of hosts or backfill scheduling enabled, the same needs to be communicated to us.

6> Regular intimation of job events, like dispatch, suspend, finish, re-submission etc., so that we can take appropriate action.

Hope this clears up our requirements & expectations.

Regards,
Bhaskar.

On Thursday, 18 July, 2024 at 04:47:51 am IST, Daniel Letai via slurm-users slurm-users@lists.schedmd.com wrote:
In the scenario you provide, you don't need anything special.
You just have to configure a partition that is available only to you, and to no other account on the cluster. This partition will only include your hosts. All other partitions will not include any of your hosts.
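For illustration only (node names and the account are made up; your admin would adapt this to the site's slurm.conf), the partition could look something like:

    # slurm.conf sketch: a partition restricted to your account, containing only your hosts
    PartitionName=our_part Nodes=h[1-3] AllowAccounts=our_account Default=NO
    # every other partition simply leaves h[1-3] out of its Nodes= list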
Then use your own implementation to do whatever you want with the hosts. As long as you are the exclusive owners of the hosts, Slurm is not really part of the equation.
You don't even have to allocate the hosts using Slurm, as there is no contention.
If you want to use Slurm to apply your placement, instead of directly starting the app on the nodes, just use the -w (--nodelist) option with the hosts requested. Make sure to only request your partition.
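Something like this (partition, hosts and script names are placeholders):

    # batch job pinned to the hosts your app chose, restricted to your partition
    sbatch -p our_part -w h1,h2 -n 4 our_job.sh
    # or, interactively
    salloc -p our_part -w h1,h2 -n 4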
You really don't need anything special from Slurm, you don't really need Slurm for this.
-- Regards,
Daniel Letai +972 (0)505 870 456
Hi Bhaskar,
I think I'm missing something here.
All the below points are valid, but mainly of concern for shared resources. You have your private, non-sharing resources, dedicated to the app, or so I understood.
Does your app compete against itself? Do users of the app interfere with jobs of each other? Are the nodes mutable? Do you expect resources to be added to your dedicated subcluster/partition unannounced?
On 18/07/2024 16:39:25, Bhaskar Chakraborty wrote:
Hi Daniel,
Appreciate your response.
I think you may be feeling that since we take the placement part of the scheduling to ourselves, Slurm has no other role to play!
That's not quite true. Below, in brief, are other important roles which Slurm must perform that presently come to my mind (this may not be exhaustive):
1> Slurm should inform us of the job scheduling priority order. Admins can impose policies (fairshare, user group preference etc.) which can prioritize more recent jobs over older ones & we would like to see them as computed by Slurm on a real-time basis.
Such policies are cluster-wide and usually immutable unless there is a good reason to modify them. As owners of just some of the resources, the most you can usually expect is to set priority internally (only between your own jobs) via flags or qos.
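For what it's worth, the priorities Slurm computes are visible with sprio, and among your own jobs you can nudge the relative order with e.g. --nice or a qos (sketch only, script name made up):

    sprio -l                       # priority factors Slurm computed for pending jobs
    sbatch --nice=100 our_job.sh   # deprioritize this job relative to your other jobs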
2> Slurm should update us with the info on any configured resource limits. Limits can exist on resources like CPU cores, number of running jobs, host group CPU limits etc. Our backend app needs to be updated from time to time about these so that unnecessary allocations are avoided right away.
Resource limits shouldn't change. The number of running jobs is a known quantity, as those jobs are a product of the app, and the app is the only thing that produces jobs on those nodes.
In the case of node failures, any monitoring solution will do the job as well as Slurm. Slurm is convenient in that regard, but not essential.
3> Preemptable job candidates. Admin can mark certain jobs from certain users as preemptable ones. Our app needs to be informed about that should the need arise to preempt running jobs.
The nodes are yours. What would preempt your jobs aside from you?
4> Specified host resources for job start. Users may want their jobs to start on specific hosts & the same should be communicated back by Slurm. Similarly, the same applies if a user wants their job to run on a certain set of hosts.
The impetus for this discussion was that your app is the arbiter of placement, and only your app is running on those nodes.
I think you haven't decided (programmatically speaking) on an exact flow for all edge cases. If your app chooses placement based on its own algorithm, the users SHOULD be able to give that kind of input to the app directly, rather than some ping-pong between the app, Slurm and the user.
5> Preferential hosts for scheduling. If there is some preferential order of hosts or backfill scheduling enabled the same needs to be communicated to us.
Backfill is valid, but the rest is not, considering your app chooses placement. Or does your app choose placements that overlap previous placements? Does it not preserve state?
6> Regular intimation of job events, like dispatch, suspend, finish, re-submission etc., so that we can take appropriate action.
All valid, but doesn't invalidate my statement. You don't need anything special from Slurm for these.
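Plain polling with squeue or sacct, for instance, already covers most of the job-event tracking (the job id is just an example):

    squeue -h -j 12345 -o "%T"                # current state (PENDING/RUNNING/SUSPENDED/...)
    sacct -j 12345 --format=JobID,State,End   # state & end time once the job has left the queue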
Hope this clears our requirements & expectations.
Almost any solution that I can think of for your requirements requires admin-level changes to Slurm - a HealthCheckProgram, prologs, submit plugins - all are cluster-wide modifications that would affect other users of the cluster, not just your nodes.
That is why I suggested you might be better off using your own private solution, since Slurm really is not designed to work with external placement. It can be done but would be suboptimal.
I still believe the best option is to rewrite the app to communicate the placement requirements (based on the algorithm and previous runs as input) to Slurm as a simple string of sbatch flags, and just let Slurm do its thing. That is simpler than forcing all other users of the cluster to adhere to your particular needs, and it avoids introducing unnecessary complexity to the cluster.
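For example, something as plain as (all names hypothetical):

    # requirements only - let Slurm pick the placement inside your partition
    sbatch -p our_part -N 2 -n 5 --mem-per-cpu=4G our_job.sh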
Regards, Bhaskar.
Regards,
--Dani_L.
Hi Daniel,

Our jobs would have mutually exclusive sets of resources, though they may share hosts. Resources are mainly CPU cores on hosts. Private trackable resources can be defined according to needs & they would be exclusive to jobs too.

Jobs, as I mentioned, can share hosts. Their resource granularity can vary from 1 CPU core/slot on 1 host to multiple hosts with a varying number of CPU cores per host. Jobs can launch parallel tasks on their allocated hosts with the appropriate utility from Slurm.

If I take the smallest possible piece of hardware as 1 unit, then a job requiring 5 such units would need on the order of 5 CPU cores. The hardware itself is physically connected to hosts & hence a job mapped to a certain group of h/w units would accordingly run on the hosts to which they are connected. More than 1 h/w unit may be connected to a single host. If 5 such units are connected to 2 hosts (3 + 2), then the job would be dispatched on those 2 hosts, taking 3 & 2 CPU cores respectively.

Jobs may require other resources like memory, and those should be managed by Slurm according to a job's needs. So, even if h/w is free & CPU cores are accordingly available for a job to be placed, it may still be denied a placement because its memory requirements are not met by the host(s). This should be communicated by Slurm to our process.

As far as private resources are concerned, this will only depend on how Slurm is able to manage the basic requirements, namely hosts & CPU cores. The CPU cores' identification is mainly to deal with preemption. For e.g., a pending job jP may need to preempt only 1 of jR1 & jR2 (say jR1), both running on a host h1 with 1 CPU core each. This is governed by the h/w placement jP gets from the backend. Now, how to distinguish between jR1 & jR2 in order to suspend jR1 & not jR2? This gets solved by a markable resource which gets incorporated into jP and which will trigger the preemption. This markable resource can be a uniquely identifiable CPU core or a custom trackable resource. We may create custom trackable resources, should the need arise and should Slurm allow such, which map to individual CPU cores, in turn mapping to the h/w units on which the job is placed at the other end, & let jobs possess them when running.

Hope things are clearer.

Regards,
Bhaskar
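P.S. Just to make the 'custom trackable resource' idea a bit more concrete, what I am imagining is roughly the following - only a sketch, all names made up, & whether Slurm supports exactly this kind of per-unit marking is precisely what we are unsure about:

    # slurm.conf sketch: a countable custom GRES per host, one per attached h/w unit
    GresTypes=hwunit
    NodeName=h1 Gres=hwunit:3
    NodeName=h2 Gres=hwunit:2

    # gres.conf sketch
    NodeName=h1 Name=hwunit Count=3
    NodeName=h2 Name=hwunit Count=2

    # a job that must occupy the h/w units on h1 would then request them explicitly
    sbatch -p our_part -w h1 --gres=hwunit:2 our_job.sh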
Just to add some more ideas in response to your other comments, which I realized & read a little later.

It's not practical for a user to write rules for priority order or job resource limits etc. other than through the configuration of an existing scheduling solution. Also, regarding such config about job limits, user priorities etc., I wonder: can't they be done partition-wise?

We don't want to write a scheduling solution ourselves but rather plan to use one, like Slurm. Writing a full scheduling solution is beyond our scope.
Hence, our backend app would also depend on Slurm to update it about job submission, finish, the available resource limit margins etc. The state of jobs (pending/running/suspended) at any time would be communicated through Slurm, which our app would store & act on accordingly.

I don't entirely rule out the possibility of having a dedicated Slurm cluster to ourselves for our jobs, but that would again be a decision for higher management and the customer in question.

Regards,
Bhaskar.