[slurm-users] Slurm Jobscript Archiver

Lech Nieroda lech.nieroda at uni-koeln.de
Mon Jun 17 08:25:20 UTC 2019


Hi Chris,

you’ll find the patch for our version attached. Integrate it as you see fit; personally I’d recommend a branch, since the two-log-files approach isn’t really reconcilable with the idea of having separate job files accessible to their respective owners.
All filenames and directories are defined with "#define" directives, as it was more convenient to have them all in one place.

Kind regards,
Lech

-------------- next part --------------
A non-text attachment was scrubbed...
Name: job_archive.patch.gz
Type: application/x-gzip
Size: 8718 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20190617/440fbf3d/attachment.bin>
-------------- next part --------------


> On 15 Jun 2019, at 00:47, Christopher Benjamin Coffey <Chris.Coffey at nau.edu> wrote:
> 
> Hi Lech,
> 
> I'm glad that it is working out well with the modifications you've put in place! Yes, there can be a huge volume of jobscripts out there, and that's a pretty good way of dealing with it! We've backed up 1.1M jobscripts since the archiver's inception 1.5 months ago and aren't too worried yet about the inode/space usage. We haven't settled on how we will keep the archive clean yet. My thought was:
> 
> - keep two months (directories) of jobscripts for each user, leaving the jobscripts intact for easy user access
> - tar up the month directories that are older than two months
> - keep four tarred months
> 
> That way there would be 6 months of jobscript archive to match our 6 month job accounting retention in the slurm db.
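For what it's worth, that rotation scheme is easy to put in a monthly cron job. This is only a sketch: the $ARCHIVE/<user>/<YYYY-MM> layout and the explicit cutoff arguments are assumptions for illustration, not anything your tool actually ships.

```shell
#!/usr/bin/env bash
# Sketch of the retention plan above: tar month directories older than the
# first cutoff, delete tarballs older than the second (2 live + 4 tarred
# months = the 6-month accounting window). Layout is hypothetical:
#   <root>/<user>/<YYYY-MM>/<jobid>.sh
rotate_archive() {   # usage: rotate_archive <root> <tar-cutoff> <rm-cutoff>
    local root=$1 cutoff_tar=$2 cutoff_rm=$3 userdir monthdir ball month
    for userdir in "$root"/*/; do
        [ -d "$userdir" ] || continue
        for monthdir in "$userdir"*/; do
            [ -d "$monthdir" ] || continue
            month=$(basename "$monthdir")
            # YYYY-MM sorts lexicographically, so plain < works as "older".
            if [[ $month < $cutoff_tar ]]; then
                tar -czf "${monthdir%/}.tar.gz" -C "$userdir" "$month" &&
                    rm -rf "$monthdir"
            fi
        done
        for ball in "$userdir"*.tar.gz; do
            [ -e "$ball" ] || continue
            month=$(basename "$ball" .tar.gz)
            if [[ $month < $cutoff_rm ]]; then
                rm -f "$ball"
            fi
        done
    done
}

# Demo on a throwaway directory:
demo=$(mktemp -d)
mkdir -p "$demo/alice/2019-01" "$demo/alice/2019-05"
touch "$demo/alice/2019-01/123.sh" "$demo/alice/2019-05/456.sh"
rotate_archive "$demo" 2019-04 2018-12
ls "$demo/alice"
```

In production you'd derive the cutoffs with `date -d '2 months ago' +%Y-%m` and `date -d '6 months ago' +%Y-%m` instead of passing them in.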
> 
> I'd be interested in your version however, please do send it along! And please keep in touch with how everything goes!
> 
> Best,
> Chris
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
> 
> 
> On 6/14/19, 2:22 AM, "slurm-users on behalf of Lech Nieroda" <slurm-users-bounces at lists.schedmd.com on behalf of lech.nieroda at uni-koeln.de> wrote:
> 
>    Hello Chris,
> 
>    we’ve tried out your archiver and adapted it to our needs, it works quite well.
>    The changes:
>    - we get lots of jobs per day, approx. 3k-5k, so storing them as individual files would waste too many inodes and 4k blocks. Instead, everything is written into two log files (job_script.log and job_env.log) with the prefix "<timestamp> <user> <jobid>" on each line. That way one can easily grep and cut out the corresponding job script or environment. Long-term storage and compression are handled by logrotate with standard compression settings.
>    - the parsing part can fail to produce a username, so we have introduced a custom environment variable that stores the username and can be read directly by the archiver
>    - most of the program’s output, including debug output, is handled by the logger and stored in a jobarchive.log file with an appropriate timestamp
>    - the logger uses a va_list to make multi-argument log one-liners possible
>    - signal handling is reduced to the debug-level increase/decrease
>    - file handling is mostly relegated to HelperFn; directory trees are now created automatically
>    - the binary header of the env file and the binary footer of the script file are filtered out, so the resulting files are recognized as ASCII files
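To make the grep-and-cut retrieval concrete, here is a small demonstration against a fabricated log. It assumes the prefix is exactly three space-separated tokens (e.g. an epoch timestamp), which is a guess at the format; the sample jobids, usernames, and script lines are invented:

```shell
#!/usr/bin/env bash
# Fabricated job_script.log in the "<timestamp> <user> <jobid> <line>" format
# described above (timestamps, users, and scripts are made up).
log=$(mktemp)
cat > "$log" <<'EOF'
1560500000 alice 1001 #!/bin/bash
1560500000 alice 1001 #SBATCH --time=01:00:00
1560500000 alice 1001 srun ./a.out
1560500050 bob 1002 #!/bin/bash
1560500050 bob 1002 srun ./b.out
EOF

# Recover alice's job 1001: select its lines, strip the 3-token prefix.
grep "^[0-9]* alice 1001 " "$log" | cut -d' ' -f4- > job1001.sh
cat job1001.sh
```

The same one-liner works against job_env.log to recover a job's environment.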
> 
>    If you are interested in our modified version, let me know.
> 
>    Kind regards,
>    Lech
> 
> 
>> On 9 May 2019, at 17:37, Christopher Benjamin Coffey <Chris.Coffey at nau.edu> wrote:
>> 
>> Hi All,
>> 
>> We created a slurm job script archiver which you may find handy. We initially attempted to do this through Slurm with a slurmctld prolog, but it really bogged the scheduler down. This new solution is a custom C++ program that uses inotify to watch for job scripts and environment files to appear in /var/spool/slurm/hash.* on the head node. When they do, the program copies the jobscript and environment out to a local archive directory. The program is multithreaded, with a dedicated thread watching each hash directory. It is fast and lightweight and has no side effects on the scheduler. By default it applies ACLs to the archived job scripts so that only the owner of a jobscript can read the files. Feel free to try it out and let us know how it works for you!
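(The copy-and-restrict step Chris describes can be sketched in a few lines of shell. The real tool is C++ driven by inotify events and uses ACLs; in this sketch the event handling is omitted, the hash-dir and archive layouts are invented, and `chmod 600` stands in for setfacl so it runs anywhere.)

```shell
#!/usr/bin/env bash
# Sketch of the archive step only: once a jobscript appears in a spool
# hash directory, copy it to a per-user, per-month archive dir and make
# it owner-readable only. All paths here are hypothetical.
archive_root=$(mktemp -d)
spool=$(mktemp -d)

# Simulate a freshly submitted jobscript appearing in a hash dir:
mkdir -p "$spool/hash.3/job.1001"
printf '#!/bin/bash\nsrun ./a.out\n' > "$spool/hash.3/job.1001/script"

archive_job() {   # usage: archive_job <script-path> <user> <jobid>
    local script=$1 user=$2 jobid=$3
    local dest="$archive_root/$user/$(date +%Y-%m)"
    mkdir -p "$dest"
    cp "$script" "$dest/$jobid.sh"
    chmod 600 "$dest/$jobid.sh"   # owner-only; the real tool uses ACLs
}

archive_job "$spool/hash.3/job.1001/script" alice 1001
ls -l "$archive_root/alice/$(date +%Y-%m)/1001.sh"
```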
>> 
>> https://github.com/nauhpc/job_archive
>> 
>> Best,
>> Chris
>> 
>> Christopher Coffey
>> High-Performance Computing
>> Northern Arizona University
>> 928-523-1167


