[slurm-users] Help with preemtion based on licenses
Reuti
reuti at staff.uni-marburg.de
Wed Nov 6 16:56:31 UTC 2019
> On 06.11.2019 at 16:36, Oytun Peksel <Oytun.Peksel at semcon.com> wrote:
>
> Thanks for the information Mark.
>
> I understand. GPU part of the discussion is beyond my knowledge so I assumed it would be possible to release it.
>
> But as for the licenses it is always possible to leave it to the system admin. It is possible to take care of license release and reacquire using scripts instead of assuming it is not possible. At least there should be an easy configuration option to configure generic or trackable resources to be releasable.
To name some obstacles in addition to Mark's notes:
The interaction between any queuing system and the license tracking mechanism inside each application can certainly be improved in many ways. But it already starts with the constraint that, to my knowledge, no license daemon offers a mechanism to "check and reserve/acquire a license if available" as an atomic operation, so that the queuing system knows a license is available and can schedule a job to use it. What might come close is to borrow a license during a scheduling run and use this information for an upcoming job. But even here the limitations might differ from vendor to vendor: some allow a borrowed license to be released prematurely, while others don't, and one has to wait for the specified time frame to elapse.
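To illustrate why the missing atomic "check and acquire" matters, here is a minimal sketch. The `LicenseDaemon` class is a toy stand-in invented for this example, not any vendor's API; it only shows how a license that looked free during the scheduling run can be gone by the time the job starts.

```python
class LicenseDaemon:
    """Toy stand-in for a license daemon (hypothetical, not a vendor API)."""

    def __init__(self, total):
        self.total = total
        self.in_use = 0

    def available(self):
        # What a scheduler could query: a snapshot, not a reservation.
        return self.total - self.in_use

    def checkout(self):
        # What the application does when it actually starts.
        if self.in_use >= self.total:
            return False
        self.in_use += 1
        return True


def schedule_then_start(daemon, outside_checkout):
    """Scheduler sees a free license; the job checks it out only later."""
    if daemon.available() < 1:
        return "not scheduled"
    # Gap between the scheduling decision and the application start:
    # someone outside the queuing system may take the license here.
    if outside_checkout:
        daemon.checkout()
    return "job runs" if daemon.checkout() else "job fails: no license"
```

With one license and no competing checkout the job runs; with a competing checkout in the gap, the job was scheduled but still fails to get its license.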
Then there is the application itself: when does it check for an available license? Only when the application starts, periodically after a certain amount of elapsed time, or in each iteration while it is running? Or will it hold the license for the whole run and release it only when it finishes? What happens if the application was suspended for some time? When it continues, it might discover that there were X minutes without a license daemon response, and so it might quit. If one is lucky, the results achieved up to this point can still be saved.
To make things worse: what type of license does a particular application use? One license per core/thread, per CPU, per job, per machine; or per machine per user, or for each group on this machine?
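As a rough illustration of how much the license type changes what a scheduler would have to account for, the following sketch counts licenses for a job under three of these models. The model names and the function are hypothetical, and the per-user and per-group variants are left out because they depend on who else is already on the machine.

```python
def licenses_needed(model, cores_per_node, nodes):
    """Licenses one job would consume under a given (simplified) model."""
    if model == "per_core":
        return cores_per_node * nodes
    if model == "per_machine":
        return nodes
    if model == "per_job":
        return 1
    raise ValueError(f"unknown license model: {model}")
```

The same job on 2 nodes with 16 cores each would need 32, 2, or 1 license depending on the model.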
One favorable case could be a job that consists of several instances of a program, like a compiler when compiling a large application: the job could be stopped exactly when no compiler instance is active and only the job script is running.
Sure, for some applications it might be possible to script this in some way. So in my opinion the first goal for such a proposal would be to get this working outside of any queuing system: stop the application on a local machine with SIGSTOP and attempt to use the license with another instance of this application, be it on the same or another machine. Often the state of the license daemon can be checked, and stopping the application should make the counter of available licenses increment again in the license daemon's state output.
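A minimal sketch of such an experiment on a POSIX system follows. The helper name and the `count_free_licenses` callback are assumptions made for this example: the callback would have to be written per vendor, for instance by parsing the output of the license daemon's status tool.

```python
import os
import signal
import time


def compare_while_stopped(pid, count_free_licenses, settle_seconds=0.0):
    """Stop `pid`, sample the free-license count, then resume the process.

    `count_free_licenses` is a caller-supplied function (hypothetical)
    returning the number of free licenses reported by the license daemon.
    """
    before = count_free_licenses()
    os.kill(pid, signal.SIGSTOP)
    try:
        # Give the daemon time to notice, if it ever does.
        time.sleep(settle_seconds)
        during = count_free_licenses()
    finally:
        # Always resume the application, even if the query failed.
        os.kill(pid, signal.SIGCONT)
    return before, during
```

If the second number is not higher than the first, the stopped application is still holding its license, and preempting it would not free anything for another job.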
-- Reuti
> After all software licenses might be the most expensive resource to utilize where preemption might sometimes be inevitable.
>
> For now I have no better plan than to dig in the source code to find an easy way to change this behavior.
>
> Oytun Peksel
> oytun.peksel at semcon.com
> Mobile +46739205917
>
>
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Mark Hahn
> Sent: den 6 november 2019 16:23
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] Help with preemtion based on licenses
>
>> This does not make sense to me. If gpu is my generic resource why would it not release the gpu resources if a job is suspended?
>
> how would that be implemented? how would the scheduler reach into the application and cause the license to be released and reacquired?
> after all, the license server is otherwise oblivious to whether the job it has granted a license to has been suspended or resumed.
> this applies to other gres as well - for instance GPUs, since there's no mechanism to free up GPU resources allocated to a suspended process.
>
> *that* is the problem - merely adding and subtracting is not.
>
> regards, mark hahn.
>