[slurm-users] enabling job script archival

Davide DelVento davide.quantum at gmail.com
Mon Oct 2 15:20:40 UTC 2023


Thanks Paul, this helps.

I don't have any PrivateData line in either config file. According to the
docs, "By default, all information is visible to all users," so this should
not be an issue. I tried adding a "PrivateData=jobs" line to the conf
files, just in case, but that didn't change the behavior.
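
For reference, the line I tried looked like this in both files (a minimal
sketch, shown just so it's clear what I added):

    # slurm.conf and slurmdbd.conf -- added just in case, no change in behavior
    PrivateData=jobs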

On Mon, Oct 2, 2023 at 9:10 AM Paul Edmon <pedmon at cfa.harvard.edu> wrote:

> At least in our setup, users can see their own scripts by doing sacct -B
> -j JOBID
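>
> For example (hypothetical job ID):
>
>     sacct -B -j 123456     # -B is the short form of --batch-script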
>
> I would make sure that the scripts are being stored, and check how you have
> PrivateData set.
>
> -Paul Edmon-
> On 10/2/2023 10:57 AM, Davide DelVento wrote:
>
> I deployed the job_script archival and it is working, however it can be
> queried only by root.
>
> A regular user can run sacct -lj against any job (even jobs by other
> users, and that's okay in our setup) with no problem. However, if they run
> sacct -j job_id --batch-script, even against a job they own themselves,
> nothing is returned, and I get a
>
> slurmdbd: error: couldn't get information for this user (null)(xxxxxx)
>
> in the slurmdbd logs, where xxxxxx is the POSIX ID of the user running the
> query.
>
> Neither configuration file (slurmdbd.conf or slurm.conf) has any
> "permission"-type setting. FWIW, we use LDAP.
>
> Is that the expected behavior, in that by default only root can see the
> job scripts? I was assuming the users themselves should be able to debug
> their own jobs... Any hint on what could be changed to achieve this?
>
> Thanks!
>
>
>
> On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento <davide.quantum at gmail.com>
> wrote:
>
>> Fantastic, this is really helpful, thanks!
>>
>> On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon <pedmon at cfa.harvard.edu>
>> wrote:
>>
>>> Yes, it was later than that. If you are on 23.02 you are good.  We've been
>>> running with job_script storage on for years at this point, and that part
>>> of the database only uses up 8.4G.  Our entire database takes up 29G on
>>> disk, so it's about 1/3 of the database.  We also have database compression
>>> enabled, which helps with the on-disk size. Raw and uncompressed, our
>>> database is about 90G.  We keep 6 months of data in our active database.
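>>>
>>> If you want to check your own numbers, something along these lines against
>>> the accounting database works (a rough sketch; it assumes a MySQL/MariaDB
>>> backend and the default slurm_acct_db schema name):
>>>
>>>     SELECT table_name,
>>>            ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
>>>     FROM information_schema.tables
>>>     WHERE table_schema = 'slurm_acct_db'
>>>     ORDER BY (data_length + index_length) DESC;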
>>>
>>> -Paul Edmon-
>>> On 9/28/2023 1:57 PM, Ryan Novosielski wrote:
>>>
>>> Sorry for the duplicate e-mail in a short time: do you (or anyone) know
>>> when the hashing was added? We were planning to enable this on 21.08, but we
>>> then had to delay our upgrade to it. I’m assuming it was later than that, as
>>> I believe that’s when the feature was added.
>>>
>>> On Sep 28, 2023, at 13:55, Ryan Novosielski <novosirj at rutgers.edu> wrote:
>>>
>>> Thank you; we’ll put in a feature request for improvements in that area,
>>> and also thanks for the warning. I had thought of that in passing, but the
>>> real-world experience is really useful. I could easily see wanting that
>>> stuff to be retained for less time than the main records, which is what I’d
>>> ask for.
>>>
>>> I assume that archiving, in general, would also remove this stuff, since
>>> old jobs themselves will be removed?
>>>
>>> --
>>> #BlackLivesMatter
>>> ____
>>> || \\UTGERS,     |---------------------------*O*---------------------------
>>> ||_// the State  |         Ryan Novosielski - novosirj at rutgers.edu
>>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>>> ||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
>>>      `'
>>>
>>> On Sep 28, 2023, at 13:48, Paul Edmon <pedmon at cfa.harvard.edu> wrote:
>>>
>>> Slurm should take care of it when you add it.
>>>
>>> As far as horror stories go: under previous versions our database size
>>> ballooned to be so massive that it actually prevented us from upgrading, and
>>> we had to drop the columns containing the job_script and job_env.  This was
>>> back before Slurm started hashing the scripts so that it would only store
>>> one copy of duplicate scripts.  After that point we found that the
>>> job_script storage stayed at a fairly reasonable size, as most users use
>>> functionally the same script each time. However, the job_env data continued
>>> to grow like crazy, as there are variables in our environment that change
>>> fairly consistently depending on where the user is. Thus the job_envs ended
>>> up being too massive to keep around, and we had to drop them. Frankly, we
>>> never really used them for debugging. The job_scripts, though, are super
>>> useful and not that much overhead.
>>>
>>> In summary, my recommendation is to store only job_scripts. job_envs add
>>> too much storage for little gain, unless your job_envs are basically the
>>> same for each user in each location.
>>>
>>> Also, it should be noted that there is currently no way to prune out
>>> job_scripts or job_envs. So the only way to get rid of them, if they get
>>> large, is to zero out the column in the table. You can ask SchedMD for the
>>> mysql command to do this, as we had to do it here for our job_envs.
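>>>
>>> Purely as an illustration of what "zeroing out" means (the table and column
>>> names below are made up -- the real ones vary by cluster name and Slurm
>>> version, which is why getting the exact statement from SchedMD, and taking a
>>> backup first, is the way to go):
>>>
>>>     -- hypothetical: blank the stored environment blobs for cluster "mycluster"
>>>     UPDATE mycluster_job_env_table SET env_vars = '';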
>>>
>>> -Paul Edmon-
>>>
>>> On 9/28/2023 1:40 PM, Davide DelVento wrote:
>>>
>>> In my current Slurm installation (recently upgraded to Slurm v23.02.3),
>>> I only have
>>>
>>> AccountingStoreFlags=job_comment
>>>
>>> I now intend to add both
>>>
>>> AccountingStoreFlags=job_script
>>> AccountingStoreFlags=job_env
>>>
>>> leaving the default 4MB value for max_script_size
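>>>
>>> i.e., in slurm.conf something like this (assuming the flags can be combined
>>> as a comma-separated list, which is my reading of the docs):
>>>
>>>     AccountingStoreFlags=job_comment,job_script,job_env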
>>>
>>> Do I need to do anything on the DB myself, or will Slurm take care of
>>> creating the additional tables if needed?
>>>
>>> Any comments/suggestions/gotchas/pitfalls/horror stories to share? I know
>>> about the additional disk space and load potentially needed, and with our
>>> resources and typical workload I should be okay with that.
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>>