Hi Daniel,

Our jobs would have mutually exclusive sets of resources, though they may share hosts.
Resources are mainly CPU cores on hosts.
Private trackable resources can be defined according to our needs, and they too would be exclusive to individual jobs.

Jobs, as I mentioned, can share hosts. Their resource granularity can vary from 1 CPU core/slot on 1 host to multiple hosts with
a varying number of CPU cores per host. Jobs can launch parallel tasks on their allocated hosts using the appropriate Slurm
utility (srun).
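
Just to illustrate what I mean (a rough sketch; the script name and the counts are made up), a job granted 2 hosts with
2 cores each could launch its parallel tasks inside the allocation like this:

    #!/bin/bash
    #SBATCH --nodes=2             # 2 allocated hosts
    #SBATCH --ntasks=4            # 4 tasks in total, spread over the allocation
    srun ./our_parallel_task      # srun starts one task per allocated slot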

If I take the smallest possible piece of hardware as 1 unit, then a job requiring 5 such units would correspondingly
need 5 CPU cores.

The hardware itself is physically connected to hosts, hence a job mapped to a certain group of h/w units would
accordingly run on the hosts to which those units are connected.
More than 1 h/w unit may be connected to a single host.

If 5 such units are connected to 2 hosts (3 + 2), then the job would be dispatched on those 2 hosts, taking 3 and 2 CPU cores respectively.
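
If Slurm's heterogeneous job support can be used for such uneven splits (this is an assumption on my part; job.sh and
the host names are placeholders), the request could look roughly like:

    # one job with two components: 3 cores on h1 plus 2 cores on h2
    sbatch -w h1 -n 3 : -w h2 -n 2 job.sh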

Jobs may require other resources like memory, and these should be managed by Slurm according to a job's needs.
So, even if the h/w is free and CPU cores are accordingly available, a job may still be denied placement because
its memory requirements are not met by the host(s).
This should be communicated by Slurm to our process.
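
A rough illustration of what I mean (the 4G figure is just an example): if memory is requested alongside the cores,
Slurm itself enforces it, and the job remains pending when the chosen hosts cannot satisfy it:

    sbatch -n 5 --mem-per-cpu=4G job.sh   # stays pending if 4G per allocated core is not available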

As far as private resources are concerned, this will only depend on how Slurm manages the basic requirements, namely hosts
and CPU cores. Identifying individual CPU cores is mainly needed to deal with preemption.

For example, a pending job jP may need to preempt only 1 of jR1 and jR2 (say jR1), both running on a host h1 with 1 CPU core each.
This is governed by the h/w placement jP gets from the backend. Now, how do we distinguish between jR1 and jR2 in order to
suspend jR1 and not jR2? This is solved by a markable resource which gets incorporated into jP and which triggers the preemption.
This markable resource can be a uniquely identifiable CPU core or a custom trackable resource.

Should the need arise, and should Slurm allow it, we may define custom trackable resources which map to individual CPU cores
(in turn mapping to the h/w units on which the job is placed at the other end) and let jobs hold them while running.
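
If Slurm's generic resources (GRES) can act as these markers (an assumption on my part; the name "hwunit" is made up,
and whether a plain countable GRES gives per-core identification or only counts is something we would need to confirm),
the definition might look roughly like:

    # slurm.conf
    GresTypes=hwunit
    NodeName=h1 CPUs=2 Gres=hwunit:2

    # gres.conf on h1
    Name=hwunit Count=2

    # the preempting job jP then carries the marker:
    sbatch -w h1 --gres=hwunit:1 jP.sh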

Hope things are clearer.

Regards,
Bhaskar
On Friday, 19 July, 2024 at 02:06:35 pm IST, Daniel Letai via slurm-users <slurm-users@lists.schedmd.com> wrote:


Hi Bhaskar,


I think I'm missing something here.


All the below points are valid, but mainly of concern for shared
resources. You have your private, non-sharing resources, dedicated to
the app, or so I understood.


Does your app compete against itself? Do users of the app interfere with
each other's jobs? Are the nodes mutable? Do you expect resources to be
added to your dedicated subcluster/partition unannounced?



On 18/07/2024 16:39:25, Bhaskar Chakraborty wrote:
> Hi Daniel,
>
> Appreciate your response.
>
> I think you may feel that since we take the placement part of
> the scheduling upon ourselves, Slurm has no other role to play!
>
> That's not quite true. Below, in brief, are the other important roles which
> Slurm must perform that presently come to my mind
> (this may not be exhaustive):
>
> 1> Slurm should inform us of the job scheduling priority order.
>      Admins can impose policies (fairshare, usergroup preference, etc.)
> which can prioritize more recent jobs over older ones, and
>      we would like to see them as computed by Slurm on a real-time basis.
Such policies are cluster wide and usually immutable unless there is a
good reason to modify them. As owners of just some of the resources, the
most you can usually expect is to set priority internally (only between
your own jobs) via flags or qos.
>
> 2> Slurm should keep us updated on any configured resource limits.
>       There can be limits on resources such as CPU cores, number of
> running jobs, host-group CPU limits, etc.
>       Our backend app needs to be updated from time to time about the
> same so that unnecessary allocations are avoided right away.

Resource limits shouldn't change. The number of running jobs is a known
quantity, as the jobs are a product of the app, and the app is the only thing
that produces jobs on those nodes.

In the case of node failures, any monitoring solution will do the job as
well as Slurm. Slurm is convenient in that regard, but not essential.

>
> 3> Preemptable job candidates.
>      Admins can mark certain jobs from certain users as preemptable.
>      Our app needs to be informed about that, should the need arise to
> preempt running jobs.
The nodes are yours. What would preempt them aside from you?
>
> 4> Specified host resources for job start.
>      Users may want their jobs to start on specific hosts, and the same
> should be communicated back by Slurm.
>      Similarly, the same applies if a user wants their job to run on a
> certain set of hosts.

The impetus for this discussion was that your app is the arbiter of
placement, and only your app is running on those nodes.

I think you haven't decided (programmatically speaking) on an exact flow
for all edge cases. If your app chooses placement based on its own
algorithm, the users SHOULD be able to add that kind of input to the app
directly, rather than some ping-pong between the app, Slurm and the user.

>
> 5> Preferential hosts for scheduling. If there is some preferential
> order of hosts, or backfill scheduling is enabled, the same
>      needs to be communicated to us.
Backfill is valid, but the rest is not, considering your app chooses
placement. Or does your app choose placement overlapping previous
placements? Does it not preserve state?
>
> 6> Regular intimation of job events, like dispatch, suspension,
> finish, re-submission, etc., so that we can take appropriate action.
All valid, but doesn't invalidate my statement. You don't need anything
special from Slurm for these.
>
> Hope this clarifies our requirements & expectations.


Almost any solution that I can think of for your requirements requires
admin-level changes to Slurm - using a HealthCheckProgram, prologs,
submit plugins - all are cluster-wide modifications that would affect
other users of the cluster, not just your nodes.

That is why I suggested you might be better off using your own private
solution, since Slurm really is not designed to work with external
placement. It can be done but would be suboptimal.


I still believe the best option is to rewrite the app to communicate the
placement requirements (based on the algorithm and previous runs as
input) to Slurm as a simple string of sbatch flags, and just let Slurm
do its thing. It sounds simpler than forcing all other users of the
cluster to adhere to your particular needs without introducing
unnecessary complexity to the cluster.


>
> Regards,
> Bhaskar.

Regards,

--Dani_L.


> On Thursday, 18 July, 2024 at 04:47:51 am IST, Daniel Letai via
> slurm-users <slurm-users@lists.schedmd.com> wrote:
>
>
> In the scenario you provide, you don't need anything special.
>
>
> You just have to configure a partition that is available only to you,
> and to no other account on the cluster. This partition will only
> include your hosts. All other partitions will not include any of your
> hosts.
>
> Then use your own implementation to do whatever you want with the
> hosts. As long as you are the exclusive owners of the hosts, Slurm is
> not really part of the equation.
>
>
> You don't even have to allocate the hosts using Slurm, as there is no
> contention.
>
>
> If you want to use Slurm to apply your placement, instead of directly
> starting the app on the nodes, just use the -w (--nodelist) option
> with the hosts requested. Make sure to request only your partition.
>
>
> You really don't need anything special from Slurm, you don't really
> need Slurm for this.
>
>
>
> On 15/07/2024 19:26:10, jubhaskar--- via slurm-users wrote:
>> Hi Daniel,
>> Thanks for picking up this query. Let me try to briefly describe my problem.
>>
>> As you rightly guessed, we have some hardware on the backend on which our jobs
>> would run. The app which manages the h/w has its own set of resource placement/remapping
>> rules to place a job.
>> So, for example, if only 3 hosts h1, h2, h3 (2 cores available each) are available at some point for a
>> 4-core job, then only a few combinations of cores from these hosts can be allowed for
>> the job. Also, there is a preference order among the placements, decided by our app.
>>
>> It's in this respect that we want our backend app to determine the placement for the job.
>> Slurm would then dispatch the job accordingly, honoring the exact resource distribution
>> asked for. Should preemption be needed as well, our backend would decide the placement,
>> which in turn determines which preemptable job candidates to preempt.
>>
>> So, how should we proceed then?
>> We may not have the whole site/cluster to ourselves. There may be other jobs which we don't
>> care about, and hence they should go through the usual route via the existing select plugin (linear, cons_tres, etc.).
>>
>> Is there scope for a separate partition which will encompass only our resources and trigger our
>> plugin only for our jobs?
>> How do the options a>, b>, c> stand (as described in my 1st message), now that I have explained our requirement?
>>
>> A 4th option which comes to my mind is whether there's a possibility, through some API interface from Slurm,
>> of informing a separate process P (say) about resource availability on a real-time basis.
>> P would talk to our backend app, obtain a placement, and then ask Slurm to place our job.
>>
>> Your concern about ever-changing resources (being allocated before our backend comes up) is unwarranted,
>> as the hosts are segregated as far as our system is concerned. Our hosts will run only our jobs, and other Slurm
>> jobs would run on different hosts.
>>
>> Hope I have made myself a little clearer! Any help would be appreciated.
>>
>> (Note: We already have a working solution with LSF! LSF does provide an option for custom scheduler plugins
>> which lets one hook into the decision-making loop during scheduling. This led us to believe Slurm would also
>> have some possibilities.)
>>
>> Regards,
>> Bhaskar.
>>
> --
> Regards,
>
> Daniel Letai
> +972 (0)505 870 456
>


--
Regards,

Daniel Letai
+972 (0)505 870 456


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com