[slurm-users] enabling job script archival

Paul Edmon pedmon at cfa.harvard.edu
Tue Oct 3 13:42:18 UTC 2023


You will probably need to.

The way we handle it is that we add users when they first submit a job 
via the job_submit.lua script. This way the database autopopulates with 
active users.
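
For a site with no accounts or associations yet, a manual equivalent 
might look like the sketch below. This is not the job_submit.lua hook 
itself. The account name "general", the DRY_RUN switch and the helper 
function names are illustrative assumptions; "sacctmgr add account", 
"sacctmgr add user" and the -i (immediate) flag are standard sacctmgr 
usage. Adding a user under an account is what creates the association 
that slurmdbd complains is missing.

```shell
#!/bin/sh
# Sketch: give each listed user an association under one catch-all account.
# ASSUMPTIONS: the account name "general" and the DRY_RUN switch are
# illustrative only. With DRY_RUN=1 (the default here) the sacctmgr
# commands are printed instead of executed.

ACCOUNT="${ACCOUNT:-general}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create the catch-all account (-i commits without prompting).
add_account() {
    run sacctmgr -i add account "$ACCOUNT" Description="catch-all"
}

# Add one user under that account and make it the user's default account.
add_user() {
    run sacctmgr -i add user "$1" Account="$ACCOUNT" DefaultAccount="$ACCOUNT"
}

# Usage: sh add_users.sh alice bob ...   (the script name is hypothetical)
if [ "$#" -gt 0 ]; then
    add_account
    for u in "$@"; do
        add_user "$u"
    done
fi
```

Review the printed commands first, then rerun with DRY_RUN=0 as a Slurm 
administrator to apply them. Afterwards "sacctmgr show associations" 
should list the new users.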

-Paul Edmon-

On 10/3/23 9:01 AM, Davide DelVento wrote:
> By increasing the slurmdbd verbosity level, I got additional 
> information, namely the following:
>
> slurmdbd: error: couldn't get information for this user (null)(xxxxxx)
> slurmdbd: debug: accounting_storage/as_mysql: 
> as_mysql_jobacct_process_get_jobs: User xxxxxx  has no associations, 
> and is not admin, so not returning any jobs.
>
> Again, xxxxxx is the POSIX ID of the user running the query; the lines 
> above are from the slurmdbd logs.
>
> I suspect this is because our userbase is small enough (we are a 
> department HPC) that we don't need allocations and the like, so I 
> have not configured any associations (nor even studied how to 
> configure them, since at a previous site that did use associations, 
> someone else took care of Slurm administration).
>
> Anyway, I read the fantastic document by our own member at 
> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_accounting/#associations 
> and in fact I have not even configured slurm users:
>
> # sacctmgr show user
>       User   Def Acct     Admin
> ---------- ---------- ---------
>       root       root Administ+
> #
>
> So is that the issue? Should I just add all users? Any suggestions on 
> the minimal (but robust) way to do that?
>
> Thanks!
>
>
> On Mon, Oct 2, 2023 at 9:20 AM Davide DelVento 
> <davide.quantum at gmail.com> wrote:
>
>     Thanks Paul, this helps.
>
>     I don't have any PrivateData line in either config file. According
>     to the docs, "By default, all information is visible to all users"
>     so this should not be an issue. I tried to add a line with
>     "PrivateData=jobs" to the conf files, just in case, but that
>     didn't change the behavior.
>
>     On Mon, Oct 2, 2023 at 9:10 AM Paul Edmon <pedmon at cfa.harvard.edu>
>     wrote:
>
>         At least in our setup, users can see their own scripts by
>         doing sacct -B -j JOBID
>
>         I would make sure that the scripts are being stored and how
>         you have PrivateData set.
>
>         -Paul Edmon-
>
>         On 10/2/2023 10:57 AM, Davide DelVento wrote:
>>         I deployed the job_script archival and it is working, however
>>         it can be queried only by root.
>>
>>         A regular user can run sacct -lj against any job (even those
>>         of other users, and that's okay in our setup) with no
>>         problem. However, if they run sacct -j job_id --batch-script,
>>         even against a job they own themselves, nothing is returned
>>         and the slurmdbd logs show
>>
>>         slurmdbd: error: couldn't get information for this user
>>         (null)(xxxxxx)
>>
>>         where xxxxxx is the POSIX ID of the user running the query.
>>
>>         Neither config file (slurmdbd.conf, slurm.conf) has any
>>         "permission" setting. FWIW, we use LDAP.
>>
>>         Is that the expected behavior, in that by default only root
>>         can see the job scripts? I was assuming the users themselves
>>         should be able to debug their own jobs... Any hint on what
>>         could be changed to achieve this?
>>
>>         Thanks!
>>
>>
>>
>>         On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento
>>         <davide.quantum at gmail.com> wrote:
>>
>>             Fantastic, this is really helpful, thanks!
>>
>>             On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon
>>             <pedmon at cfa.harvard.edu> wrote:
>>
>>                 Yes, it was later than that. If you are on 23.02 you
>>                 are good.  We've been running with job_script storage
>>                 on for years at this point, and that part of the
>>                 database only uses 8.4G.  Our entire database takes up
>>                 29G on disk, so it's about 1/3 of the database.  We
>>                 also have database compression, which helps with the
>>                 on-disk size; raw and uncompressed, our database is
>>                 about 90G.  We keep 6 months of data in our active
>>                 database.
>>
>>                 -Paul Edmon-
>>
>>                 On 9/28/2023 1:57 PM, Ryan Novosielski wrote:
>>>                 Sorry for the duplicate e-mail in a short time: do
>>>                 you know (or anyone) when the hashing was added? Was
>>>                 planning to enable this on 21.08, but we then had to
>>>                 delay our upgrade to it. I’m assuming later than
>>>                 that, as I believe that’s when the feature was added.
>>>
>>>>                 On Sep 28, 2023, at 13:55, Ryan Novosielski
>>>>                 <novosirj at rutgers.edu> wrote:
>>>>
>>>>                 Thank you; we’ll put in a feature request for
>>>>                 improvements in that area. Thanks also for the
>>>>                 warning. I thought of that in passing, but the
>>>>                 real-world experience is really useful. I could
>>>>                 easily see wanting that stuff to be retained for a
>>>>                 shorter time than the main records, which is what
>>>>                 I’d ask for.
>>>>
>>>>                 I assume that archiving, in general, would also
>>>>                 remove this stuff, since old jobs themselves will
>>>>                 be removed?
>>>>
>>>>                 --
>>>>                 #BlackLivesMatter
>>>>                  ____
>>>>                 || \\UTGERS,     |---------------------------*O*---------------------------
>>>>                 ||_// the State  |         Ryan Novosielski - novosirj at rutgers.edu
>>>>                 || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>>>>                 ||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
>>>>                      `'
>>>>
>>>>>                 On Sep 28, 2023, at 13:48, Paul Edmon
>>>>>                 <pedmon at cfa.harvard.edu> wrote:
>>>>>
>>>>>                 Slurm should take care of it when you add it.
>>>>>
>>>>>                 As for horror stories: under previous versions
>>>>>                 our database size ballooned to be so massive that
>>>>>                 it actually prevented us from upgrading and we had
>>>>>                 to drop the columns containing the job_script and
>>>>>                 job_env.  This was back before slurm started
>>>>>                 hashing the scripts so that it would only store
>>>>>                 one copy of duplicate scripts.  After this point
>>>>>                 we found that the job_script database stayed at a
>>>>>                 fairly reasonable size as most users use
>>>>>                 functionally the same script each time. However
>>>>>                 the job_env continued to grow like crazy as there
>>>>>                 are variables in our environment that change
>>>>>                 fairly consistently depending on where the user
>>>>>                 is. Thus job_envs ended up being too massive to
>>>>>                 keep around and so we had to drop them. Frankly we
>>>>>                 never really used them for debugging. The
>>>>>                 job_scripts though are super useful and not that
>>>>>                 much overhead.
>>>>>
>>>>>                 In summary my recommendation is to only store
>>>>>                 job_scripts. job_envs add too much storage for
>>>>>                 little gain, unless your job_envs are basically
>>>>>                 the same for each user in each location.
>>>>>
>>>>>                 Also it should be noted that there is no way to
>>>>>                 prune out job_scripts or job_envs right now. So
>>>>>                 the only way to get rid of them if they get large
>>>>>                 is to 0 out the column in the table. You can ask
>>>>>                 SchedMD for the mysql command to do this as we had
>>>>>                 to do it here to our job_envs.
>>>>>
>>>>>                 -Paul Edmon-
>>>>>
>>>>>                 On 9/28/2023 1:40 PM, Davide DelVento wrote:
>>>>>>                 In my current slurm installation (recently
>>>>>>                 upgraded to slurm v23.02.3), I only have
>>>>>>
>>>>>>                 AccountingStoreFlags=job_comment
>>>>>>
>>>>>>                 I now intend to add both job_script and job_env:
>>>>>>
>>>>>>                 AccountingStoreFlags=job_comment,job_script,job_env
>>>>>>
>>>>>>                 leaving the default 4MB value for max_script_size
>>>>>>
>>>>>>                 Do I need to do anything on the DB myself, or
>>>>>>                 will slurm take care of the additional tables if
>>>>>>                 needed?
>>>>>>
>>>>>>                 Any
>>>>>>                 comments/suggestions/gotcha/pitfalls/horror_stories
>>>>>>                 to share? I know about the additional disk space
>>>>>>                 and potential extra load, and with our
>>>>>>                 resources and typical workload I should be okay
>>>>>>                 with that.
>>>>>>
>>>>>>                 Thanks!
>>>>>
>>>>
>>>