[slurm-users] enabling job script archival

Davide DelVento davide.quantum at gmail.com
Tue Oct 3 23:43:46 UTC 2023


For others potentially finding this via a mailing list search: yes, I needed
that, which of course required creating an account to charge jobs to, which I
wasn't using before. So I ran

sacctmgr add account default_account
sacctmgr add -i user $user Accounts=default_account

with appropriate looping over $user, and everything is working fine
now.
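
(In case it helps, the loop was along these lines; the UID >= 1000 filter is
only an example of how one might pick out regular users, so adjust it for
your site:

    # enumerate regular users and attach each one to the shared account
    for user in $(getent passwd | awk -F: '$3 >= 1000 {print $1}'); do
        sacctmgr -i add user "$user" Account=default_account
    done
)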

Thanks everybody!

On Tue, Oct 3, 2023 at 7:44 AM Paul Edmon <pedmon at cfa.harvard.edu> wrote:

> You will probably need to.
>
> The way we handle it is that we add users when they first submit a job via
> the job_submit.lua script. This way the database autopopulates with active
> users.
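>
> (The plugin logic itself is site-specific, but conceptually it boils down
> to a check-and-add for each submitting user; a rough shell equivalent, with
> $user and the account name as placeholders, would be:
>
>     # create an association only if the user doesn't have one yet
>     sacctmgr -n show assoc user=$user format=account | grep -q . || \
>         sacctmgr -i add user $user Account=default_account
> )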
>
> -Paul Edmon-
> On 10/3/23 9:01 AM, Davide DelVento wrote:
>
> By increasing the slurmdbd verbosity level, I got additional information,
> namely the following:
>
> slurmdbd: error: couldn't get information for this user (null)(xxxxxx)
> slurmdbd: debug: accounting_storage/as_mysql:
> as_mysql_jobacct_process_get_jobs: User  xxxxxx  has no associations, and
> is not admin, so not returning any jobs.
>
> again, where xxxxxx is the POSIX ID of the user running the query, as seen
> in the slurmdbd logs.
>
> I suspect this is because our user base is small enough (we are a
> departmental HPC shop) that we don't need allocations and the like, so I
> have not configured any associations (and have not even studied their
> configuration, since at my previous site, which did use associations,
> someone else took care of Slurm administration).
>
> Anyway, I read the fantastic document by our own member at
> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_accounting/#associations
> and in fact I have not even configured any Slurm users:
>
> # sacctmgr show user
>       User   Def Acct     Admin
> ---------- ---------- ---------
>       root       root Administ+
> #
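>
> (For completeness, the associations themselves can be listed with something
> like "sacctmgr show assoc format=cluster,account,user,partition".)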
>
> So is that the issue? Should I just add all users? Any suggestions on the
> minimal (but robust) way to do that?
>
> Thanks!
>
>
> On Mon, Oct 2, 2023 at 9:20 AM Davide DelVento <davide.quantum at gmail.com>
> wrote:
>
>> Thanks Paul, this helps.
>>
>> I don't have a PrivateData line in either config file. According to the
>> docs, "By default, all information is visible to all users", so this should
>> not be an issue. I tried adding a "PrivateData=jobs" line to the conf files,
>> just in case, but that didn't change the behavior.
>>
>> On Mon, Oct 2, 2023 at 9:10 AM Paul Edmon <pedmon at cfa.harvard.edu> wrote:
>>
>>> At least in our setup, users can see their own scripts by doing sacct -B
>>> -j JOBID
>>>
>>> I would make sure that the scripts are actually being stored, and check
>>> how you have PrivateData set.
>>>
>>> -Paul Edmon-
>>> On 10/2/2023 10:57 AM, Davide DelVento wrote:
>>>
>>> I deployed job_script archival and it is working; however, it can only be
>>> queried by root.
>>>
>>> A regular user can run sacct -lj against any job (even jobs owned by other
>>> users, and that's okay in our setup) with no problem. However, if they run
>>> sacct -j job_id --batch-script, even against a job they own themselves,
>>> nothing is returned and I get a
>>>
>>> slurmdbd: error: couldn't get information for this user (null)(xxxxxx)
>>>
>>> where xxxxxx is the POSIX ID of the user running the query, as seen in the
>>> slurmdbd logs.
>>>
>>> Neither configuration file (slurmdbd.conf or slurm.conf) has any
>>> "permission" setting. FWIW, we use LDAP.
>>>
>>> Is that the expected behavior, in that by default only root can see the
>>> job scripts? I was assuming the users themselves should be able to debug
>>> their own jobs... Any hint on what could be changed to achieve this?
>>>
>>> Thanks!
>>>
>>>
>>>
>>> On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento <
>>> davide.quantum at gmail.com> wrote:
>>>
>>>> Fantastic, this is really helpful, thanks!
>>>>
>>>> On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon <pedmon at cfa.harvard.edu>
>>>> wrote:
>>>>
>>>>> Yes, it was later than that. If you are on 23.02 you are good.  We've
>>>>> been running with job_script storage on for years at this point, and that
>>>>> part of the database only uses up 8.4G.  Our entire database takes up 29G
>>>>> on disk, so it's about 1/3 of the database.  We also have database
>>>>> compression, which helps with the on-disk size; raw and uncompressed, our
>>>>> database is about 90G.  We keep 6 months of data in our active database.
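>>>>>
>>>>> (For reference, that retention window is set with the usual purge options
>>>>> in slurmdbd.conf; a minimal sketch with illustrative values rather than
>>>>> our exact config:
>>>>>
>>>>>     PurgeJobAfter=6month
>>>>>     PurgeStepAfter=6month
>>>>>     PurgeResvAfter=6month
>>>>>     PurgeSuspendAfter=6month
>>>>> )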
>>>>>
>>>>> -Paul Edmon-
>>>>> On 9/28/2023 1:57 PM, Ryan Novosielski wrote:
>>>>>
>>>>> Sorry for the duplicate e-mail in such a short time: do you (or anyone
>>>>> else) know when the hashing was added? We were planning to enable this on
>>>>> 21.08, but we then had to delay our upgrade to it. I'm assuming it was
>>>>> later than that, as I believe that's when the feature was added.
>>>>>
>>>>> On Sep 28, 2023, at 13:55, Ryan Novosielski <novosirj at rutgers.edu>
>>>>> <novosirj at rutgers.edu> wrote:
>>>>>
>>>>> Thank you; we'll put in a feature request for improvements in that area,
>>>>> and also thanks for the warning! I had thought of that in passing, but the
>>>>> real-world experience is really useful. I could easily see wanting that
>>>>> stuff to be retained for less time than the main records, which is what
>>>>> I'd ask for.
>>>>>
>>>>> I assume that archiving, in general, would also remove this stuff,
>>>>> since old jobs themselves will be removed?
>>>>>
>>>>> --
>>>>> #BlackLivesMatter
>>>>> ____
>>>>> || \\UTGERS,
>>>>> |---------------------------*O*---------------------------
>>>>> ||_// the State  |         Ryan Novosielski - novosirj at rutgers.edu
>>>>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~
>>>>> RBHS Campus
>>>>> ||  \\    of NJ  | Office of Advanced Research Computing - MSB
>>>>> A555B, Newark
>>>>>      `'
>>>>>
>>>>> On Sep 28, 2023, at 13:48, Paul Edmon <pedmon at cfa.harvard.edu>
>>>>> <pedmon at cfa.harvard.edu> wrote:
>>>>>
>>>>> Slurm should take care of it when you add it.
>>>>>
>>>>> As far as horror stories go, under previous versions our database size
>>>>> ballooned so massively that it actually prevented us from upgrading, and
>>>>> we had to drop the columns containing the job_script and job_env.  This
>>>>> was back before Slurm started hashing the scripts so that it would only
>>>>> store one copy of duplicate scripts.  After that point we found that the
>>>>> job_script data stayed at a fairly reasonable size, as most users run
>>>>> functionally the same script each time. However, the job_env data
>>>>> continued to grow like crazy, as there are variables in our environment
>>>>> that change fairly consistently depending on where the user is. Thus the
>>>>> job_envs ended up being too massive to keep around, and we had to drop
>>>>> them. Frankly, we never really used them for debugging. The job_scripts,
>>>>> though, are super useful and not that much overhead.
>>>>>
>>>>> In summary, my recommendation is to store only job_scripts. job_envs add
>>>>> too much storage for little gain, unless your job_envs are basically the
>>>>> same for each user in each location.
>>>>>
>>>>> Also, it should be noted that there is currently no way to prune out
>>>>> job_scripts or job_envs. So the only way to get rid of them if they get
>>>>> large is to zero out the column in the table. You can ask SchedMD for the
>>>>> MySQL command to do this, as we had to do that here for our job_envs.
>>>>>
>>>>> -Paul Edmon-
>>>>>
>>>>> On 9/28/2023 1:40 PM, Davide DelVento wrote:
>>>>>
>>>>> In my current Slurm installation (recently upgraded to Slurm v23.02.3),
>>>>> I only have
>>>>>
>>>>> AccountingStoreFlags=job_comment
>>>>>
>>>>> I now intend to add both
>>>>>
>>>>> AccountingStoreFlags=job_script
>>>>> AccountingStoreFlags=job_env
>>>>>
>>>>> leaving the default 4MB value for max_script_size
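>>>>>
>>>>> (My understanding is that AccountingStoreFlags takes a comma-separated
>>>>> list, so in slurm.conf this would actually end up as a single line:
>>>>>
>>>>>     AccountingStoreFlags=job_comment,job_script,job_env
>>>>> )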
>>>>>
>>>>> Do I need to do anything on the DB myself, or will slurm take care of
>>>>> the additional tables if needed?
>>>>>
>>>>> Any comments/suggestions/gotchas/pitfalls/horror_stories to share? I know
>>>>> about the additional disk space and potential load needed, and with our
>>>>> resources and typical workload I should be okay with that.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>

