[slurm-users] Custom Gres for SSD

Matthias Loose m.loose at mindcode.de
Mon Jul 24 08:06:57 UTC 2023


On 2023-07-24 09:50, Matthias Loose wrote:

Hi Shunran,

I just read your question again. If you don't want users to share the SSD at
all, even if both have requested it, you can basically skip the quota part of
my answer.

If you really only want one user per SSD per node, set the gres count in the
node configuration to 1 just like you did, and then implement the
prolog/epilog solution (without quotas). If the mounted SSD can only be
written to by root, no one else can use it, and the job that requested it
gets a folder created by the prolog.
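
A minimal sketch of such a prolog/epilog pair without the quota handling
(assuming the SSD is mounted root-owned at /local as in our setup; for
brevity it creates the folder for every job instead of parsing the requested
gres like the full prolog quoted below). The node prolog:

   #!/bin/bash
   # create a per-job scratch folder on the local SSD and hand it to the job
   local_dir="/local"
   SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"

   mkdir "${SLURM_TMPDIR}"
   chown "${SLURM_JOB_USER}":0 "${SLURM_TMPDIR}"
   chmod 750 "${SLURM_TMPDIR}"

   exit 0

And the matching node epilog:

   #!/bin/bash
   # remove the per-job scratch folder again
   local_dir="/local"
   SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"

   if [[ -d ${SLURM_TMPDIR} ]]; then
     rm -rf --one-file-system "${SLURM_TMPDIR}"
   fi

   exit 0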

What we also do is export the folder name to the environment in the
user/task prolog, so the user can easily use it.

Our task prolog:

   #!/bin/bash
   #PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

   local_dir="/local"

   SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"

   # check for /local job dir
   if [[ -d ${SLURM_TMPDIR} ]]; then
     # set tempdir env vars
     echo "export SLURM_TMPDIR=${SLURM_TMPDIR}"
     echo "export TMPDIR=${SLURM_TMPDIR}"
     echo "export JAVA_TOOL_OPTIONS=\"-Djava.io.tmpdir=${SLURM_TMPDIR}\""
   fi
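
A job that reserved the gres can then just use those variables; a rough
example batch script (the 100 GB count and the program name are only
illustrative):

   #!/bin/bash
   # reserve 100 GB of the "local" gres for this job (count is illustrative)
   #SBATCH --gres=local:100
   #SBATCH --time=01:00:00

   # SLURM_TMPDIR and TMPDIR are exported by the task prolog above
   cd "${SLURM_TMPDIR}"

   # hypothetical I/O-heavy program writing its scratch data locally
   ./my_io_job --tmpdir "${TMPDIR}"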

Kind regards, Matt

> Hi Shunran,
> 
> we do something very similar. I have nodes with 2 SSDs in a RAID 1
> mounted on /local. We defined a gres resource just like you did and
> called it local. We define the resource in the gres.conf like this:
> 
>   # LOCAL
>   NodeName=hpc-node[01-10] Name=local
> 
> and add the resource in counts of GB to the slurm.nodes.conf:
> 
>   NodeName=hpc-node01  CPUs=256 RealMemory=... Gres=local:3370
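> 
> For Slurm to accept the gres at all, the type also has to be declared in
> slurm.conf; with a gres named like ours that would be something along the
> lines of (merged into whatever GresTypes line you already have):
> 
>   GresTypes=local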
> 
> So in this case node01 has 3370 counts, i.e. GB, of the gres "local"
> available for reservation. Now Slurm tracks that resource for you and
> users can reserve counts of /local space. But there is still one big
> problem: Slurm has no idea what local actually is and, as you correctly
> noted, others can just use it. I solved this the following way:
> 
> - /local is owned by root, so no user can just write to it
> - the node prolog creates a folder in /local in this form:
> /local/job_<SLURM_JOB_ID> and makes the job's user the owner of it
> - the node epilog deletes that folder
> 
> This way you have already solved the problem of people/jobs using local
> without having reserved any of it. But there is still no enforcement of
> limits. For that I use quotas.
> My /local is XFS formatted and XFS has a nifty feature called project
> quotas, where you can set the quota for a folder.
> 
> This is my node prolog script for this purpose:
> 
>   #!/bin/bash
>   PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
> 
>   local_dir="/local"
>   local_job=0
> 
>   ## DETERMINE GRES:LOCAL
>   # get job gres
>   JOB_TRES=$(scontrol show JobID=${SLURM_JOBID} | grep "TresPerNode=" |
>     cut -d '=' -f 2 | tr ',' ' ')
> 
>   # parse for local
>   for gres in ${JOB_TRES}; do
>     key=$(echo ${gres} | cut -d ':' -f 2 | tr '[:upper:]' '[:lower:]')
>     if [[ ${key} == "local" ]]; then
>       local_job=$(echo ${gres} | cut -d ':' -f 3)
>       break
>     fi
>   done
> 
>   # make job local-dir if requested
>   if [[ ${local_job} -ne 0 ]]; then
>     # make local-dir for job
>     SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"
>     mkdir ${SLURM_TMPDIR}
> 
>     # conversion
>     local_job=$((local_job * 1024 * 1024))
> 
>     # set hard limit to requested size + 5%
>     hard_limit=$((local_job * 105 / 100))
> 
>     # create project quota and set limits
>     xfs_quota -x -c "project -s -p ${SLURM_TMPDIR} ${SLURM_JOBID}" ${local_dir}
>     xfs_quota -x -c "limit -p bsoft=${local_job}k bhard=${hard_limit}k ${SLURM_JOBID}" ${local_dir}
> 
>     chown ${SLURM_JOB_USER}:0 ${SLURM_TMPDIR}
>     chmod 750 ${SLURM_TMPDIR}
>   fi
> 
>   exit 0
> 
> This is my epilog:
> 
>   #!/bin/bash
>   PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
> 
>   local_dir="/local"
>   SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"
> 
>   # remove the quota
>   xfs_quota -x -c "limit -p bsoft=0m bhard=0m ${SLURM_JOBID}" ${local_dir}
> 
>   # remove the folder
>   if [[ -d ${SLURM_TMPDIR} ]]; then
>     rm -rf --one-file-system ${SLURM_TMPDIR}
>   fi
> 
>   exit 0
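> 
> Both scripts get hooked in via the usual slurm.conf prolog/epilog settings;
> the paths here are just examples, put them wherever you keep your scripts:
> 
>   Prolog=/etc/slurm/prolog_local.sh
>   Epilog=/etc/slurm/epilog_local.sh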
> 
> In order to use project quotas you need to activate them with the pquota
> mount flag in the fstab.
> I give the user 5% more than they requested. You just have to make sure
> that you configure the available space minus 5% in the nodes.conf.
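> 
> An fstab entry for that could look roughly like this (the device path is
> only a placeholder for whatever your RAID device is called):
> 
>   /dev/md0  /local  xfs  defaults,pquota  0  0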
> 
> This is what we do and it works great.
> 
> Kind regards, Matt
> 
> 
> On 2023-07-24 05:48, Shunran Zhang wrote:
>> Hi all,
>> 
>> I am attempting to setup a gres to manage jobs that need a
>> scratch space, but only a few of our computational nodes are
>> equipped with SSD for such scratch space. Originally I setup a new
>> partition for those IO-bound jobs, but it ended up that those jobs
>> might be allocated to the same node thus fighting each other for
>> IO.
>> 
>> With a look over other settings it appears that the gres setting
>> looks promising. However I was having some difficulties figuring
>> out how to limit access to such space to those who requested
>> --gres=ssd:1.
>> 
>> For now I am using Flags=CountOnly and trusting users who use the SSD
>> to request it, but apparently any job submitted to a node with an SSD
>> can just use that space. Our scratch space implementation is 2 disks
>> (sda and sdb) formatted as btrfs in RAID 0. What should I do to
>> enforce a limit on which jobs can use that space?
>> 
>> Related configurations for ref:
>> 
>>   gres.conf:   NodeName=scratch-1 Name=ssd Flags=CountOnly
>>   cgroup.conf: ConstrainDevices=yes
>>   slurm.conf:  GresTypes=gpu,ssd
>>                NodeName=scratch-1 CPUs=88 Sockets=2 CoresPerSocket=22
>>                ThreadsPerCore=2 RealMemory=180000 Gres=ssd:1 State=UNKNOWN
>> 
>> Sincerely,
>> S. Zhang


