Hey Jeffrey,
thanks for this suggestion! This is probably the way to go if one can find a way to access the job's GRES in the prolog. I read somewhere that people were calling scontrol to get this information, but that seems a bit unclean. Anyway, if I find some time, I will try it out.
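For what it's worth, here is a rough sketch of what that scontrol-based lookup could look like in the prolog. The GRES name "tmpdisk" is made up, and the exact formatting of the GRES/TRES fields in the scontrol output differs between Slurm versions, so treat this as untested pseudo-shell rather than a working recipe:

#!/bin/bash
# Prolog fragment (sketch): recover the job's local-scratch request.
# Assumes a GRES named "tmpdisk" has been declared (GresTypes=tmpdisk in
# slurm.conf, a matching Name=tmpdisk Count=... line in gres.conf) and that
# jobs request it with e.g. --gres=tmpdisk:10G.
JOBID="${SLURM_JOB_ID:?prolog requires SLURM_JOB_ID}"

# Pull something like "tmpdisk:10G" out of the job record; adjust the
# pattern to whatever your Slurm version actually prints.
LIMIT=$(scontrol show job "$JOBID" \
          | grep -oE 'tmpdisk[:=][0-9]+[KMGTkmgt]?' \
          | head -n1 | sed 's/^tmpdisk[:=]//')
LIMIT="${LIMIT:-1G}"   # fall back to a default cap if no tmpdisk was requested

# $LIMIT can then be fed into the xfs_quota commands from Jeffrey's example below.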
Best,
Tim
On 2/6/24 16:30, Jeffrey T Frey wrote:
Most of my ideas have revolved around creating file systems on-the-fly as part of the job prolog and destroying them in the epilog. The issue with that mechanism is that formatting a file system (e.g. mkfs.<type>) can be time-consuming. E.g. if you format your local scratch SSD as an LVM PV+VG and allocate per-job logical volumes, you'd still need to run mkfs.xfs and mount the new file system for every job.
ZFS file system creation is much quicker (basically combines the LVM + mkfs steps above) but I don't know of any clusters using ZFS to manage local file systems on the compute nodes :-)
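For concreteness, a rough prolog/epilog sketch of that LVM route -- untested, with a made-up volume group name (vg_scratch), mount point (/tmp-alloc) and a fixed 100G size that a real setup would derive from the job's request:

# --- prolog (sketch): carve out, format and mount a per-job volume ---
lvcreate -y -L 100G -n "slurm-${SLURM_JOB_ID}" vg_scratch
mkfs.xfs -q "/dev/vg_scratch/slurm-${SLURM_JOB_ID}"      # <- the slow step
mkdir -p "/tmp-alloc/slurm-${SLURM_JOB_ID}"
mount "/dev/vg_scratch/slurm-${SLURM_JOB_ID}" "/tmp-alloc/slurm-${SLURM_JOB_ID}"

# --- epilog (sketch): tear it all down again ---
umount "/tmp-alloc/slurm-${SLURM_JOB_ID}"
lvremove -y "vg_scratch/slurm-${SLURM_JOB_ID}"
rmdir "/tmp-alloc/slurm-${SLURM_JOB_ID}"

# The ZFS analogue collapses create+format+mount into a single fast command,
# e.g. "zfs create -o quota=100G -o mountpoint=/tmp-alloc/slurm-${SLURM_JOB_ID}
# scratch/slurm-${SLURM_JOB_ID}" (assuming a pool named "scratch"), with
# "zfs destroy" in the epilog.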
One /could/ leverage XFS project quotas. E.g. for Slurm job 2147483647:
[root@r00n00 /]# mkdir /tmp-alloc/slurm-2147483647
[root@r00n00 /]# xfs_quota -x -c 'project -s -p /tmp-alloc/slurm-2147483647 2147483647' /tmp-alloc
Setting up project 2147483647 (path /tmp-alloc/slurm-2147483647)...
Processed 1 (/etc/projects and cmdline) paths for project 2147483647 with recursion depth infinite (-1).
[root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=1g 2147483647' /tmp-alloc
[root@r00n00 /]# cd /tmp-alloc/slurm-2147483647
[root@r00n00 slurm-2147483647]# dd if=/dev/zero of=zeroes bs=5M count=1000
dd: error writing ‘zeroes’: No space left on device
205+0 records in
204+0 records out
1073741824 bytes (1.1 GB) copied, 2.92232 s, 367 MB/s
 :
[root@r00n00 /]# rm -rf /tmp-alloc/slurm-2147483647
[root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=0 2147483647' /tmp-alloc
Since Slurm jobids max out at 0x03FFFFFF (and 2147483647 = 0x7FFFFFFF), we have an easy on-demand project id to use on the file system. Slurm tmpfs plugins already have to do a mkdir to create the per-job directory, so adding two xfs_quota commands (which run in more or less O(1) time) won't extend the prolog by much. Likewise, Slurm tmpfs plugins have to scrub the directory at job cleanup, so adding one more xfs_quota command will not do much to change their epilog execution times. The main question is "where does the tmpfs plugin find the quota limit for the job?"
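Mapped onto prolog/epilog scripts, that could look roughly as follows (untested; assumes /tmp-alloc is an XFS file system mounted with prjquota, and that $LIMIT comes from wherever the site decides to store the per-job cap -- e.g. the scontrol-based GRES lookup sketched near the top of this thread):

# --- prolog (sketch): create the per-job directory and cap it ---
DIR="/tmp-alloc/slurm-${SLURM_JOB_ID}"
mkdir -p "$DIR"
xfs_quota -x -c "project -s -p $DIR ${SLURM_JOB_ID}" /tmp-alloc
xfs_quota -x -c "limit -p bhard=${LIMIT:-1g} ${SLURM_JOB_ID}" /tmp-alloc

# --- epilog (sketch): scrub the directory and release the project quota ---
rm -rf "$DIR"
xfs_quota -x -c "limit -p bhard=0 ${SLURM_JOB_ID}" /tmp-alloc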
On Feb 6, 2024, at 08:39, Tim Schneider via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hi,
In our SLURM cluster, we are using the job_container/tmpfs plugin to ensure that each user can use /tmp and that it gets cleaned up after them. Currently, we are mapping /tmp into the node's RAM, which means that cgroups make sure users can only use a certain amount of storage inside /tmp.
Now we would like to use the node's local SSD instead of its RAM to hold the files in /tmp. I have seen people define local storage as a GRES, but I am wondering how to make sure that users do not exceed the storage space they requested in a job. Does anyone have an idea how to configure local storage as a properly tracked resource?
Thanks a lot in advance!
Best,
Tim
The native job_container/tmpfs plugin would certainly have access to the job record, so modifying it (or a forked variant) would be possible. A SPANK plugin should be able to fetch the full job record [1] and could then inspect the "gres" list (as a C string), which means I could modify UD's auto_tmpdir accordingly. Having a compiled plugin shell out to xfs_quota to run the commands illustrated above wouldn't be a great idea -- luckily, Linux XFS has an API for this. It's seemingly not the simplest one, but xfsprogs is a working example of its use.
[1] https://gitlab.hpc.cineca.it/dcesari1/slurm-msrsafe
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com