[slurm-users] License management and invoking scontrol in the prolog
Davide DelVento
davide.quantum at gmail.com
Mon Sep 12 14:51:04 UTC 2022
For other poor souls coming to this conversation, here is the conclusion.
$ sbatch --version
slurm 21.08.5
$ # irrelevant parts omitted from copy-paste for brevity
$ cat /opt/slurm/job_submit.lua
log_prefix = 'slurm_job_submit'
function slurm_job_submit(job_desc, part_list, submit_uid)
    slurm.log_info("LOOPING")
    for key, value in pairs(job_desc) do
        slurm.log_info("%s: key=%s value=%s", log_prefix, key, value)
    end
    slurm.log_info("END LOOPING")
    slurm.log_info("%s: user %s(%u) job_name=%s", log_prefix,
                   job_desc.user_name, submit_uid, job_desc.name)
    return slurm.SUCCESS
end
$ sudo tail /var/log/slurm/slurmctld.log
[2022-09-12T08:13:16.310] _slurm_rpc_submit_batch_job: JobId=16028
InitPrio=4294887729 usec=198
[2022-09-12T08:13:17.213] sched: Allocate JobId=16028
NodeList=co49svnode13 #CPUs=16 Partition=compute512
[2022-09-12T08:15:01.090] lua: LOOPING
[2022-09-12T08:15:01.090] lua: END LOOPING
[2022-09-12T08:15:01.090] lua: slurm_job_submit: user
ddvento(57002254) job_name=lic.slurm
[2022-09-12T08:15:01.091] _slurm_rpc_submit_batch_job: JobId=16029
InitPrio=4294887728 usec=204
[2022-09-12T08:15:01.289] sched/backfill: _start_job: Started
JobId=16029 in compute512 on co49svnode14
[2022-09-12T08:15:17.330] _job_complete: JobId=16028 WEXITSTATUS 0
[2022-09-12T08:15:17.330] _job_complete: JobId=16028 done
Note how the loop ends without logging a single key. This is probably
just my naivety and ignorance of how Lua works.
In conclusion, I am now able to see job_desc.licenses for example with:
slurm.log_info("%s: licenses=%s", log_prefix, job_desc.licenses)
So now I just need to implement my logic.
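
To give an idea of what that logic boils down to, here is a minimal
sketch, written in shell so it can be tried outside slurmctld. It only
answers "is this license name in the string that job_desc.licenses
holds?". The function name has_license and the license names are made up
for illustration, and it assumes each comma-separated entry looks like
name[@server][:count]; check that against what your site actually logs.

```shell
#!/bin/sh
# has_license LICENSES NAME
# Succeeds (exit status 0) if NAME is one of the entries in LICENSES.
# Assumption: LICENSES is comma-separated and each entry has the form
# name[@server][:count], e.g. "matlab:2,ansys".
has_license() {
    # Wrap the list in commas so every entry starts with "," and ends
    # with ",", ":" or "@"; the quoted parts are matched literally.
    case ",$1," in
        *",$2,"* | *",$2:"* | *",$2@"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, has_license "matlab:2,ansys" ansys succeeds, while
has_license "matlab:2,ansys" mat does not. Inside job_submit.lua the
same test would have to be written in Lua.
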
Thank you everybody in this conversation for helping to sort out this
issue. Greatly appreciated!
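
For anyone retracing the thread quoted below: the two pieces of advice
that mattered were to syntax-check the script with "luac -p" and to run
"scontrol reconfigure" so that slurmctld rereads it. They can be chained
so that a script with a syntax error never reaches the reconfigure step.
This is only a sketch, assuming luac and scontrol are on the PATH; the
function name and the LUAC and SCONTROL override variables are invented
here for illustration. Keep in mind that "luac -p" catches syntax errors
only, not runtime ones such as the string.format failure quoted below.

```shell
#!/bin/sh
# reload_job_submit [SCRIPT]
# Syntax-check SCRIPT with "luac -p" and, only if that passes, ask
# slurmctld to reconfigure so it rereads the script.
# LUAC and SCONTROL are hypothetical overrides, handy for a dry run.
reload_job_submit() {
    script=${1:-/opt/slurm/job_submit.lua}
    if ! "${LUAC:-luac}" -p "$script"; then
        echo "syntax check failed, not reconfiguring: $script" >&2
        return 1
    fi
    "${SCONTROL:-scontrol}" reconfigure
}
```

Usage would be something like "reload_job_submit
/opt/slurm/job_submit.lua", followed by a look at slurmctld.log to
confirm that no job_submit/lua error was logged.
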
On Thu, Sep 8, 2022 at 9:22 AM Davide DelVento <davide.quantum at gmail.com> wrote:
>
> Thanks Ole, for this clarification, this is very good to know.
>
> However, the problem is that the very example provided by slurm itself
> is the one that has the error. I removed the unpack part with the
> variable arguments and that fixed that part.
>
> Unfortunately, the job_desc table is always empty so the whole
> job_submit.lua seems like a moot point? Or the example is so outdated
> (given that it cannot even log correctly) that this is now performed
> in a different way??
> Davide
>
> On Thu, Sep 8, 2022 at 12:23 AM Ole Holm Nielsen
> <Ole.H.Nielsen at fysik.dtu.dk> wrote:
> >
> > Hi Davide,
> >
> > In your slurmctld log you see an entry "error: job_submit/lua:
> > /opt/slurm/job_submit.lua".
> >
> > What I think happens is that when slurmctld encounters an error in
> > job_submit.lua, it will revert to the last known good script cached by
> > slurmctld and ignore the file on disk from now on, even if it has been
> > corrected. An "scontrol reconfig" may make slurmctld reread the
> > job_submit.lua, please try it.
> >
> > I believe that this slurmctld behavior is undocumented at present. Please
> > see https://bugs.schedmd.com/show_bug.cgi?id=14472#c15 for a description:
> >
> > > And, if after the reconfigure, the job_submit.lua is wrong formatted (or missing), it will use the previous version of the script (which we have stored backup previously):
> >
> > /Ole
> >
> >
> > On 9/7/22 14:21, Davide DelVento wrote:
> > > Thanks Ole, your wiki page sheds some light on this mystery.
> > > Very frustrating that even the simple example provided in the release
> > > fails, and it fails at the most basic logging functionality.
> > >
> > > Note that "my" job_submit.lua is now the unmodified, slurm-provided
> > > one.... and that the luac command returns nothing in my case (this is
> > > Lua 5.3.4) so syntax seems correct?
> > >
> > > Yet the logs report the problem I mentioned rather than the actual
> > > content that the plugin is attempting to log.
> > >
> > > On Wed, Sep 7, 2022 at 2:13 AM Ole Holm Nielsen
> > > <Ole.H.Nielsen at fysik.dtu.dk> wrote:
> > >>
> > >> Hi Davide,
> > >>
> > >> I suggest that you check your job_submit.lua script with the Lua compiler:
> > >>
> > >> luac -p /etc/slurm/job_submit.lua
> > >>
> > >> I have written some more details in my Wiki page
> > >> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#job-submit-plugins
> > >>
> > >> Best regards,
> > >> Ole
> > >>
> > >> On 9/7/22 01:51, Davide DelVento wrote:
> > >>> Thanks again to both of you.
> > >>>
> > >>> I actually did not build Slurm myself, otherwise I'd keep extensive
> > >>> logs of what I did. Other people did, so I don't know. However, I get
> > >>> the same grep'ing results as yours.
> > >>>
> > >>> Looking at the logs reveals some info, but it's cryptic.
> > >>>
> > >>> [2022-09-06T17:33:56.513] debug3: job_submit/lua:
> > >>> slurm_lua_loadscript: skipping loading Lua script:
> > >>> /opt/slurm/job_submit.lua
> > >>> [2022-09-06T17:33:56.513] error: job_submit/lua:
> > >>> /opt/slurm/job_submit.lua: [string "slurm.user_msg
> > >>> (string.format(table.unpack({...})))"]:1: bad argument #2 to 'format'
> > >>> (no value)
> > >>>
> > >>> As you can see, there is no line number and there is nothing like
> > >>> user_msg in this code. There is indeed an "unpack" which is used in
> > >>> the SchedMD-defined logging helper function which has a comment
> > >>> "Implicit definition of arg was removed in Lua 5.2" and that's where I
> > >>> speculate the error occurs.
> > >>>
> > >>> I should stress, this is with their own example, not my code. I guess
> > >>> I could forgo the logging and move forward, but that probably won't
> > >>> lead me very far.
> > >>>
> > >>> I am contemplating submitting a github issue about it? I did check
> > >>> that the version of the job_submit.lua I have is the same currently in
> > >>> the repo at https://github.com/SchedMD/slurm/blob/master/etc/job_submit.lua.example
> > >>>
> > >>> On Thu, Sep 1, 2022 at 11:55 PM Ole Holm Nielsen
> > >>> <Ole.H.Nielsen at fysik.dtu.dk> wrote:
> > >>>>
> > >>>> Did you install all prerequisite packages (including lua) on the server
> > >>>> where you built the Slurm packages?
> > >>>>
> > >>>> On my system I get:
> > >>>>
> > >>>> $ strings `which slurmctld ` | grep HAVE_LUA
> > >>>> HAVE_LUA 1
> > >>>>
> > >>>> /Ole
> > >>>>
> > >>>> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-prerequisites
> > >>>>
> > >>>> On 9/2/22 05:15, Davide DelVento wrote:
> > >>>>> Thanks.
> > >>>>>
> > >>>>> I did try a lua script as soon as I got your first email, but that
> > >>>>> never worked (yes, I enabled it in slurm.conf and ran "scontrol
> > >>>>> reconfigure" after). Slurm simply acted as if there was no job_submit script.
> > >>>>>
> > >>>>> After various tests, all unsuccessful, today I found that link which I
> > >>>>> mentioned saying that lua might not be compiled in, hence all my most
> > >>>>> recent messages of this thread.
> > >>>>>
> > >>>>> That file is indeed there, so that's good news that I don't need to recompile.
> > >>>>> However I'm puzzled on what might be missing...
> > >>>>>
> > >>>>>
> > >>>>> On Thu, Sep 1, 2022 at 6:33 PM Brian Andrus <toomuchit at gmail.com> wrote:
> > >>>>>>
> > >>>>>> lua is the language you can use with the job_submit plugin.
> > >>>>>>
> > >>>>>> I was showing a quick way to see that job_submit capability is indeed in
> > >>>>>> there.
> > >>>>>>
> > >>>>>> You can see if lua support is there by checking whether the
> > >>>>>> job_submit_lua.so file is there.
> > >>>>>> It would be part of the slurm rpm (not the slurm-slurmctld rpm)
> > >>>>>>
> > >>>>>> Usually it would be found at /usr/lib64/slurm/job_submit_lua.so
> > >>>>>>
> > >>>>>> If that is there, you should be good with trying out a job_submit lua
> > >>>>>> script.
> > >>>>>>
> > >>>>>> Brian Andrus
> > >>>>>>
> > >>>>>> On 9/1/2022 1:24 PM, Davide DelVento wrote:
> > >>>>>>> Thanks again, Brian, indeed that grep returns many hits, but none of
> > >>>>>>> them includes lua, i.e.
> > >>>>>>>
> > >>>>>>> strings `which slurmctld ` | grep -i job_submit | grep -i lua
> > >>>>>>>
> > >>>>>>> returns nothing. So I should use the C rather than the more convenient
> > >>>>>>> lua interface, unless I recompile or am I missing something?
> > >>>>>>>
> > >>>>>>> On Thu, Sep 1, 2022 at 12:30 PM Brian Andrus <toomuchit at gmail.com> wrote:
> > >>>>>>>> I would be surprised if it were compiled without the support. However,
> > >>>>>>>> you could check and run something like:
> > >>>>>>>>
> > >>>>>>>> strings /sbin/slurmctld | grep job_submit
> > >>>>>>>>
> > >>>>>>>> (or wherever your slurmctld binary is). There should be quite a few
> > >>>>>>>> lines with that in it.
> > >>>>>>>>
> > >>>>>>>> Brian Andrus
> > >>>>>>>>
> > >>>>>>>> On 9/1/2022 10:54 AM, Davide DelVento wrote:
> > >>>>>>>>> Thanks Brian for the suggestion, which I am now exploring.
> > >>>>>>>>>
> > >>>>>>>>> The documentation is a bit cryptic for me, but exploring a few things
> > >>>>>>>>> and checking https://funinit.wordpress.com/2018/06/07/how-to-use-job_submit_lua-with-slurm/
> > >>>>>>>>> I suspect my slurm install (provided by cluster vendor) was not
> > >>>>>>>>> compiled with the lua plugin installed. Do you know how to verify if
> > >>>>>>>>> that is the case or if it's something else? I don't see a way to show
> > >>>>>>>>> if the plugin is actually being "seen" by slurm, and I suspect it's
> > >>>>>>>>> not.
> > >>>>>>>>>
> > >>>>>>>>> Does anyone else have other suggestions or comment on either the
> > >>>>>>>>> plugin or the prolog workaround?
> > >>>>>>>>>
> > >>>>>>>>> Thanks!
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Tue, Aug 30, 2022 at 3:01 PM Brian Andrus <toomuchit at gmail.com> wrote:
> > >>>>>>>>>> Not sure if you can do all the things you intend, but the job_submit
> > >>>>>>>>>> script is precisely where you want to check submission options.
> > >>>>>>>>>>
> > >>>>>>>>>> https://slurm.schedmd.com/job_submit_plugins.html
> > >>>>>>>>>>
> > >>>>>>>>>> Brian Andrus
> > >>>>>>>>>>
> > >>>>>>>>>> On 8/30/2022 12:58 PM, Davide DelVento wrote:
> > >>>>>>>>>>> Hi,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I would like to soft-enforce license utilization only when the -L is
> > >>>>>>>>>>> set. My idea: check in the prolog if the license was requested and
> > >>>>>>>>>>> only if it were, set the environmental variables needed for the
> > >>>>>>>>>>> license.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I looked at all environmental variables set by slurm and did not find
> > >>>>>>>>>>> any related to the license as I was hoping.
> > >>>>>>>>>>>
> > >>>>>>>>>>> As a workaround, I could check
> > >>>>>>>>>>>
> > >>>>>>>>>>> scontrol show job $SLURM_JOB_ID | grep License
> > >>>>>>>>>>>
> > >>>>>>>>>>> and that would work, but (as discussed in other messages in this list)
> > >>>>>>>>>>> the documentation at https://slurm.schedmd.com/prolog_epilog.html says
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Prolog and Epilog scripts should be designed to be as short as possible
> > >>>>>>>>>>>> and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr,
> > >>>>>>>>>>>> etc). [...] Slurm commands in these scripts can potentially lead to performance
> > >>>>>>>>>>>> issues and should not be used.
> > >>>>>>>>>>> This is a bit of a concern, since the prolog would be invoked for
> > >>>>>>>>>>> every job on the cluster, and it's a prolog (rather than the epilogue
> > >>>>>>>>>>> like discussed in earlier messages).
> > >>>>>>>>>>>
> > >>>>>>>>>>> So two questions:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1) is there a better workaround to check in the prolog if the current
> > >>>>>>>>>>> job requested a license and/or
> > >>>>>>>>>>> 2) would this kind of use of scontrol be okay or is indeed a concern
> >