[slurm-users] Custom Gres for SSD

Mon Jul 24 08:26:01 UTC 2023

Hi Matthias,

Thank you for your info. The prolog/epilog way of managing it does look 
quite promising.

Indeed, in my setup I only want one job per node per SSD set. Our tasks
that require the scratch space are IO bound: we are more worried about
the IO load than the actual disk space usage, which is why we define the
ssd gres with a count of 1 per 2-disk RAID 0. For those IO-bound
operations, even if each job only uses 5% of the available disk space,
the IO on the disk would become the bottleneck, resulting in both jobs
running 2x slower and processes stuck in D state, which is what I am
trying to prevent. Also, as those IO-bound jobs are usually submitted by
a single user in a batch, a user-based approach might not be adequate
either.

I am considering modifying your script so that by default the scratch
space is world writable but everyone except root has a quota of 0, and
the prolog lifts that quota. This way, when a user forgets to specify
--gres=ssd:1, the job would fail with an IO error and they would
immediately know what went wrong.
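
A minimal sketch of that default-deny idea, under assumptions that are not from this thread: the scratch file system is XFS mounted with user quotas (uquota) rather than btrfs, the mount point is /scratch, and the helper name user_quota_cmd is hypothetical. One caveat: in xfs_quota a limit of 0 means "no limit" (the epilog quoted further down relies on this to clear a quota), so the lock-out would need a tiny non-zero limit such as 1k instead of a literal 0. The helper only prints the commands so they can be reviewed before a prolog or epilog runs them:

```shell
#!/bin/bash
# Hypothetical helper: print (do not run) the xfs_quota command that sets
# the block limit of one user on one mount point.
user_quota_cmd() {
  local user="$1" limit="$2" mount="$3"
  printf 'xfs_quota -x -c "limit -u bsoft=%s bhard=%s %s" %s\n' \
    "${limit}" "${limit}" "${user}" "${mount}"
}

# Default state: a tiny non-zero limit locks the user out.
user_quota_cmd alice 1k /scratch
# Prolog of a job that requested --gres=ssd:1: 0 removes the limit.
user_quota_cmd alice 0 /scratch
```

Because the limit is per user, lifting it would also unlock every other job of the same user on that node, which matters for the batch-submission case described above.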

I am also thinking of a GPU-like, cgroup-based solution. Maybe if I limit
access to, let's say, /dev/sda, it would also stop the user from
accessing the mount point of /dev/sda. I am not sure, so I will test
this approach as well.

I will look into it a little more.

Sincerely,

S. Zhang

On 2023/07/24 17:06, Matthias Loose wrote:
> On 2023-07-24 09:50, Matthias Loose wrote:
>
> Hi Shunran,
>
> I just read your question again. If you don't want users to share the
> SSD at all, even if both have requested it, you can basically skip
> the quota part of my answer.
>
> If you really only want one user per SSD per node, you should set the
> gres count in the node configuration to 1, just like you did, and
> then implement the prolog/epilog solution (without quotas). If the
> mounted SSD can only be written to by root, no one else can use it, and
> the job that requested it gets a folder created by the prolog.
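
The quota-free variant could look roughly like this. It is a sketch, not the script from this thread: the function name gres_count is hypothetical, and it assumes the TresPerNode value has the gres:NAME:COUNT form that the node prolog further down parses (a bare gres:NAME is treated as a count of 1). A real prolog would feed it the scontrol output and create the job folder when the count is non-zero.

```shell
#!/bin/bash
# Hypothetical helper: print the count requested for gres NAME in a
# TresPerNode value such as "gres:gpu:2,gres:ssd:1" (0 if absent).
gres_count() {
  local tres="$1" name="$2" entry
  local IFS=','
  for entry in ${tres}; do
    case "${entry}" in
      gres:"${name}":*) echo "${entry##*:}"; return 0 ;;
      gres:"${name}")   echo 1; return 0 ;;
    esac
  done
  echo 0
}

# Dry run: print the decision instead of touching the SSD mount.
if [[ "$(gres_count "gres:gpu:2,gres:ssd:1" ssd)" -ne 0 ]]; then
  echo "would create the job folder on the SSD"
fi
```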
>
> What we also do is export the folder name to the environment in the
> user/task prolog so the user can easily use it.
>
> Our task prolog:
>
>   #!/bin/bash
>   #PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
>
>   local_dir="/local"
>
>   SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"
>
>   # check for /local job dir
>   if [[ -d ${SLURM_TMPDIR} ]]; then
>     # set tempdir env vars
>     echo "export SLURM_TMPDIR=${SLURM_TMPDIR}"
>     echo "export TMPDIR=${SLURM_TMPDIR}"
>     echo "export JAVA_TOOL_OPTIONS=\"-Djava.io.tmpdir=${SLURM_TMPDIR}\""
>   fi
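
slurmd reads the task prolog's standard output and applies each "export NAME=value" line to the task's environment, which is why the script echoes its exports instead of setting them. A job could then use the folder along these lines. This is only a sketch: the gres request of 100 and the file names are made up, and the mktemp fallback exists solely so that the snippet also runs outside Slurm.

```shell
#!/bin/bash
#SBATCH --gres=local:100   # hypothetical request: 100 GB of the "local" gres

# Use the job folder exported by the task prolog; fall back to a
# throwaway directory when SLURM_TMPDIR is not set (e.g. outside Slurm).
scratch="${SLURM_TMPDIR:-$(mktemp -d)}"

# Stage data on the local disk, work on it there, then report the line count.
printf 'hello scratch\n' > "${scratch}/input.dat"
sort "${scratch}/input.dat" > "${scratch}/output.dat"
wc -l < "${scratch}/output.dat"
```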
>
> Kind regards, Matt
>
>> Hi Shunran,
>>
>> we do something very similar. I have nodes with 2 SSDs in a RAID 1
>> mounted on /local. We defined a gres resource just like you and
>> called it local. We define the resource in the gres.conf like this:
>>
>>   # LOCAL
>>   NodeName=hpc-node[01-10] Name=local
>>
>> and add the resource in counts of GB to the slurm.nodes.conf:
>>
>>   NodeName=hpc-node01  CPUs=256 RealMemory=... Gres=local:3370
>>
>> So in this case node01 has 3370 counts, i.e. GB, of the gres "local"
>> available for reservation. Now Slurm tracks that resource for you and
>> users can reserve counts of /local space. But there is still one big
>> problem: Slurm has no idea what local is and, as you correctly noted,
>> others can just use it. I solved this the following way:
>>
>> - /local is owned by root, so no user can just write to it
>> - the node prolog creates a folder in /local of the form
>>   /local/job_<SLURM_JOB_ID> and makes the job's user its owner
>> - the node epilog deletes that folder
>>
>> This already solves the problem of jobs that have not reserved any
>> local space using it. But there is still no enforcement of
>> limits. For that I use quotas.
>> My /local is XFS formatted, and XFS has a nifty feature called project
>> quotas, which lets you set a quota for a folder.
>>
>> This is my node prolog script for this purpose:
>>
>>   #!/bin/bash
>>   PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
>>
>>   local_dir="/local"
>>   local_job=0
>>
>>   ## DETERMINE GRES:LOCAL
>>   # get job gres
>>   JOB_TRES=$(scontrol show JobID=${SLURM_JOBID} | grep "TresPerNode=" |
>>     cut -d '=' -f 2 | tr ',' ' ')
>>
>>   # parse for local
>>   for gres in ${JOB_TRES}; do
>>     key=$(echo ${gres} | cut -d ':' -f 2 | tr '[:upper:]' '[:lower:]')
>>     if [[ ${key} == "local" ]]; then
>>       local_job=$(echo ${gres} | cut -d ':' -f 3)
>>       break
>>     fi
>>   done
>>
>>   # make job local-dir if requested
>>   if [[ ${local_job} -ne 0 ]]; then
>>     # make local-dir for job
>>     SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"
>>     mkdir ${SLURM_TMPDIR}
>>
>>     # conversion
>>     local_job=$((local_job * 1024 * 1024))
>>
>>     # set hard limit to requested size + 5%
>>     hard_limit=$((local_job * 105 / 100))
>>
>>     # create project quota and set limits
>>     xfs_quota -x -c "project -s -p ${SLURM_TMPDIR} ${SLURM_JOBID}" \
>>       ${local_dir}
>>     xfs_quota -x -c "limit -p bsoft=${local_job}k bhard=${hard_limit}k ${SLURM_JOBID}" \
>>       ${local_dir}
>>
>>     chown ${SLURM_JOB_USER}:0 ${SLURM_TMPDIR}
>>     chmod 750 ${SLURM_TMPDIR}
>>   fi
>>
>>   exit 0
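
To make the unit handling in the prolog concrete: the gres count is in GB, xfs_quota is given the limits in KB (the k suffix), and the hard limit is 5% above the soft limit. The sketch below repeats that arithmetic for a hypothetical request of --gres=local:100; the function name quota_limits_kb is an illustrative assumption.

```shell
#!/bin/bash
# Print the soft and hard block limits in KB for a request given in GB,
# using the same integer arithmetic as the prolog above.
quota_limits_kb() {
  local request_gb="$1"
  local soft_kb=$((request_gb * 1024 * 1024))
  local hard_kb=$((soft_kb * 105 / 100))
  echo "${soft_kb} ${hard_kb}"
}

quota_limits_kb 100   # prints: 104857600 110100480
```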
>>
>> This is my epilog:
>>
>>   #!/bin/bash
>>   PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
>>
>>   local_dir="/local"
>>   SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"
>>
>>   # remove the quota
>>   xfs_quota -x -c "limit -p bsoft=0m bhard=0m ${SLURM_JOBID}" \
>>     ${local_dir}
>>
>>   # remove the folder
>>   if [[ -d ${SLURM_TMPDIR} ]]; then
>>     rm -rf --one-file-system ${SLURM_TMPDIR}
>>   fi
>>
>>   exit 0
>>
>> In order to use project quotas you need to enable them with the
>> pquota mount option in the fstab.
>> I give the user 5% more than they requested. You just have to make sure
>> that you configure the available space minus 5% in the nodes.conf.
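
As a worked example of that sizing rule (a sketch with made-up numbers; the function name gres_count_for_disk is hypothetical): advertising 95% of the usable space means that, even if every job grows to its hard limit of request + 5%, the total stays below the disk size, since 0.95 × 1.05 = 0.9975.

```shell
#!/bin/bash
# Print the gres count (in GB) to configure for a file system of the
# given usable size in GB: the usable size minus 5%, rounded down.
gres_count_for_disk() {
  local usable_gb="$1"
  echo $((usable_gb * 95 / 100))
}

gres_count_for_disk 1000   # prints: 950
```

With 950 GB advertised on a 1000 GB file system, the hard limits add up to at most 950 × 1.05 = 997.5 GB.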
>>
>> This is what we do and it works great.
>>
>> Kind regards, Matt
>>
>>
>> On 2023-07-24 05:48, Shunran Zhang wrote:
>>> Hi all,
>>>
>>> I am attempting to set up a gres to manage jobs that need
>>> scratch space, but only a few of our compute nodes are
>>> equipped with SSDs for such scratch space. Originally I set up a new
>>> partition for those IO-bound jobs, but it turned out that those jobs
>>> might be allocated to the same node and end up fighting each other
>>> for IO.
>>>
>>> After looking over the other settings, a gres
>>> looks promising. However, I am having some difficulty figuring
>>> out how to limit access to such space to those who requested
>>> --gres=ssd:1.
>>>
>>> For now I am using Flags=CountOnly and trusting users who use the SSD
>>> to request it, but apparently any job submitted to a node with
>>> an SSD can just use that space. Our scratch space implementation is 2
>>> disks (sda and sdb) formatted as btrfs in RAID 0. What should I
>>> do to enforce which jobs can use that space?
>>>
>>> Related configurations for ref:
>>>
>>> gres.conf:
>>>   NodeName=scratch-1 Name=ssd Flags=CountOnly
>>>
>>> cgroup.conf:
>>>   ConstrainDevices=yes
>>>
>>> slurm.conf:
>>>   GresTypes=gpu,ssd
>>>   NodeName=scratch-1 CPUs=88 Sockets=2 CoresPerSocket=22 ThreadsPerCore=2 RealMemory=180000 Gres=ssd:1 State=UNKNOWN
>>>
>>> Sincerely,
>>> S. Zhang
>