[slurm-users] Custom Gres for SSD
Shunran Zhang
szhang at ngs.gen-info.osaka-u.ac.jp
Mon Jul 24 08:26:01 UTC 2023
Hi Matthias,
Thank you for your info. The prolog/epilog way of managing it does look
quite promising.
Indeed, in my setup I only want one job per node per SSD set. Our tasks
that require the scratch space are IO bound - we are more worried about
IO usage than actual disk space usage, and that is the reason why we
define the ssd gres with a count of 1 per 2-disk RAID 0. For those
IO-bound operations, even if each job only uses 5% of the available
disk space, the IO on the disk becomes the bottleneck, resulting in
both jobs running 2x slower and processes stuck in D state, which is
what I am trying to prevent. Also, as those IO-bound jobs are usually
submitted in a batch by a single user, a user-based approach might not
be adequate either.
I am considering modifying your script so that by default the scratch
space is world-writable but everyone except root has a quota of 0, and
the prolog lifts that quota. This way, when a user forgets to specify
--gres=ssd:1, the job fails with an IO error and they immediately know
what went wrong.
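As a rough sketch (untested, and assuming I either reformat the scratch
space to XFS like yours or find a btrfs equivalent; the mount options
and the tiny default limit below are my assumptions - as your epilog
shows, a limit of 0 in xfs_quota means "no limit", so a near-zero
default has to stand in for "quota of 0"):

#!/bin/bash
# Assumed one-time setup outside the prolog: /local mounted with user
# quotas enabled (uquota) and a near-zero default user limit, e.g.
#   xfs_quota -x -c "limit -u bsoft=4k bhard=4k -d" /local
# so users who did not request the gres cannot write anything useful.

local_dir="/local"

# lift the per-user limit only for jobs that requested the ssd gres
# (crude parsing of the same TresPerNode field your prolog uses)
if scontrol show JobID=${SLURM_JOBID} | grep -q "TresPerNode=.*ssd"; then
    # 0 removes the block limits for this job's user
    xfs_quota -x -c "limit -u bsoft=0 bhard=0 ${SLURM_JOB_USER}" ${local_dir}
fi

exit 0

(The epilog would then re-apply the default limit for that user.)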
I am also thinking of a GPU-like, cgroup-based solution. Maybe if I
limit file access to, let's say, /dev/sda, it would also stop the user
from accessing the mount point of /dev/sda - I am not sure, so I would
also test this approach out...
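For reference, I imagine the gres.conf entry for that experiment would
look roughly like the GPU case - something like this (untested; whether
constraining the raw device also blocks writes through its mount point
is exactly what I need to verify):

# gres.conf on scratch-1 (experimental): bind the ssd gres to the
# device file instead of Flags=CountOnly, so cgroup.conf's
# ConstrainDevices=yes can restrict it like a GPU device
NodeName=scratch-1 Name=ssd File=/dev/sda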
I will investigate it a little bit more.
Sincerely,
S. Zhang
On 2023/07/24 17:06, Matthias Loose wrote:
> On 2023-07-24 09:50, Matthias Loose wrote:
>
> Hi Shunran,
>
> just read your question again. If you don't want users to share the
> SSD at all, even if both have requested it, you can basically skip
> the quota part of my answer.
>
> If you really only want one user per SSD per node, you should set the
> gres count in the node configuration to 1, just like you did, and
> then implement the prolog/epilog solution (without quotas). If the
> mounted SSD can only be written to by root, no one else can use it,
> and the job that requested it gets a folder created by the prolog.
>
> What we also do is export the folder name in the user/task prolog to
> the environment, so the user can easily use it.
>
> Our task prolog:
>
> #!/bin/bash
> #PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
>
> local_dir="/local"
>
> SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"
>
> # check for /local job dir
> if [[ -d ${SLURM_TMPDIR} ]]; then
>     # set tempdir env vars
>     echo "export SLURM_TMPDIR=${SLURM_TMPDIR}"
>     echo "export TMPDIR=${SLURM_TMPDIR}"
>     echo "export JAVA_TOOL_OPTIONS=\"-Djava.io.tmpdir=${SLURM_TMPDIR}\""
> fi
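>
> A job script can then just pick it up from the environment, for
> example (tool and file names purely illustrative):
>
> #!/bin/bash
> #SBATCH --gres=local:100    # or --gres=ssd:1 in your setup
>
> # SLURM_TMPDIR points at /local/job_<SLURM_JOB_ID> on the node
> cp input.dat ${SLURM_TMPDIR}/
> ./my_analysis --tmpdir ${SLURM_TMPDIR} ${SLURM_TMPDIR}/input.dat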
>
> Kind regards, Matt
>
>> Hi Shunran,
>>
>> we do something very similar. I have nodes with 2 SSDs in a RAID 1
>> mounted on /local. We defined a gres resource just like you and
>> called it local. We define the resource in gres.conf like this:
>>
>> # LOCAL
>> NodeName=hpc-node[01-10] Name=local
>>
>> and add the resource in counts of GB to the slurm.nodes.conf:
>>
>> NodeName=hpc-node01 CPUs=256 RealMemory=... Gres=local:3370
>>
>> So in this case node01 has 3370 counts, i.e. GB, of the gres "local"
>> available for reservation. Now Slurm tracks that resource for you and
>> users can reserve counts of /local space. But there is still one big
>> problem: Slurm has no idea what local is, and as you correctly noted,
>> others can just use it. I solved this the following way:
>>
>> - /local is owned by root, so no user can just write to it
>> - the node prolog creates a folder in /local of the form
>> /local/job_<SLURM_JOB_ID> and makes the job's user its owner
>> - the node epilog deletes that folder
>>
>> This way you have already solved the problem of people/jobs using
>> /local without having reserved any of it. But there is still no
>> enforcement of limits. For that I use quotas.
>> My /local is XFS formatted, and XFS has a nifty feature called
>> project quotas, where you can set a quota for a folder.
>>
>> This is my node prolog script for this purpose:
>>
>> #!/bin/bash
>> PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
>>
>> local_dir="/local"
>> local_job=0
>>
>> ## DETERMINE GRES:LOCAL
>> # get job gres
>> JOB_TRES=$(scontrol show JobID=${SLURM_JOBID} | grep "TresPerNode=" |
>>     cut -d '=' -f 2 | tr ',' ' ')
>>
>> # parse for local
>> for gres in ${JOB_TRES}; do
>>     key=$(echo ${gres} | cut -d ':' -f 2 | tr '[:upper:]' '[:lower:]')
>>     if [[ ${key} == "local" ]]; then
>>         local_job=$(echo ${gres} | cut -d ':' -f 3)
>>         break
>>     fi
>> done
>>
>> # make job local-dir if requested
>> if [[ ${local_job} -ne 0 ]]; then
>>     # make local-dir for job
>>     SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"
>>     mkdir ${SLURM_TMPDIR}
>>
>>     # convert requested GB to KB for xfs_quota
>>     local_job=$((local_job * 1024 * 1024))
>>
>>     # set hard limit to requested size + 5%
>>     hard_limit=$((local_job * 105 / 100))
>>
>>     # create project quota and set limits
>>     xfs_quota -x -c "project -s -p ${SLURM_TMPDIR} ${SLURM_JOBID}" ${local_dir}
>>     xfs_quota -x -c "limit -p bsoft=${local_job}k bhard=${hard_limit}k ${SLURM_JOBID}" ${local_dir}
>>
>>     chown ${SLURM_JOB_USER}:0 ${SLURM_TMPDIR}
>>     chmod 750 ${SLURM_TMPDIR}
>> fi
>>
>> exit 0
>>
>> This is my epilog:
>>
>> #!/bin/bash
>> PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
>>
>> local_dir="/local"
>> SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"
>>
>> # remove the quota
>> xfs_quota -x -c "limit -p bsoft=0m bhard=0m ${SLURM_JOBID}" ${local_dir}
>>
>> # remove the folder
>> if [[ -d ${SLURM_TMPDIR} ]]; then
>>     rm -rf --one-file-system ${SLURM_TMPDIR}
>> fi
>>
>> exit 0
>>
>> In order to use project quotas you need to enable them with the
>> pquota mount flag in the fstab.
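>> For example (the device name here is just a placeholder for whatever
>> your RAID device is; the important part is the pquota option):
>>
>> # /etc/fstab - example entry, device name is site-specific
>> /dev/md127  /local  xfs  defaults,pquota  0 0
>>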
>> I give the user 5% more than they requested. You just have to make
>> sure that you configure the available space minus 5% in the nodes.conf.
>>
>> This is what we do and it works great.
>>
>> Kind regards, Matt
>>
>>
>> On 2023-07-24 05:48, Shunran Zhang wrote:
>>> Hi all,
>>>
>>> I am attempting to set up a gres to manage jobs that need a
>>> scratch space, but only a few of our computational nodes are
>>> equipped with an SSD for such scratch space. Originally I set up a
>>> new partition for those IO-bound jobs, but it turned out that those
>>> jobs might be allocated to the same node and thus fight each other
>>> for IO.
>>>
>>> Looking over other settings, the gres mechanism appears promising.
>>> However, I am having some difficulty figuring out how to limit
>>> access to that space to jobs that requested --gres=ssd:1.
>>>
>>> For now I am using Flags=CountOnly and trusting users who use the
>>> SSD to request it, but apparently any job submitted to a node with
>>> an SSD can just use that space. Our scratch space implementation is
>>> 2 disks (sda and sdb) in RAID 0, formatted with btrfs. What should
>>> I do to enforce a limit on which jobs can use that space?
>>>
>>> Related configurations for reference:
>>>
>>> gres.conf:   NodeName=scratch-1 Name=ssd Flags=CountOnly
>>> cgroup.conf: ConstrainDevices=yes
>>> slurm.conf:  GresTypes=gpu,ssd
>>>              NodeName=scratch-1 CPUs=88 Sockets=2 CoresPerSocket=22
>>>              ThreadsPerCore=2 RealMemory=180000 Gres=ssd:1 State=UNKNOWN
>>> Sincerely,
>>> S. Zhang
>