Hi all,
On our setup we are using job_container/tmpfs to give each job its own
temp space. Since our compute nodes have reasonably sized disks, for
tasks that do a lot of disk I/O on users' data we have asked users to
copy their data to the local disk at the beginning of the task and (if
needed) copy it back at the end. This saves a lot of NFS thrashing that
slows down both the task and the NFS servers.
However, some of our users are having problems with this: their initial
sbatch script will create a temp directory in their private /tmp, copy
their data to it and then try to srun a program. The srun will fall over
as it doesn't seem to have access to the copied data. I suspect this is
because the srun task is getting its own private /tmp.
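A quick way to check whether that is what is happening (just a sketch, run
from inside the batch script) is to drop a marker file into /tmp and see
whether a fresh srun step still sees it:
touch /tmp/marker_${SLURM_JOB_ID}
ls /tmp                  # the marker file is visible to the batch step
srun --ntasks=1 ls /tmp  # if the marker is missing here, the step got its own private /tmp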
So my question is: is there a way to have the srun task inherit the /tmp
of the initial sbatch?
I'll include a sample of the script our user is using below.
If any further information is required please feel free to ask.
Cheers.
Phill.
#!/usr/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:00:10
#SBATCH --mem-per-cpu=3999
#SBATCH --output=script_out.log
#SBATCH --error=script_error.log
# The above options put the STDOUT and STDERR of sbatch into
# log files prefixed with 'script_'.
# Create a randomly-named directory under /tmp
jobtmpdir=$(mktemp -d)
# Register a function to try and clean up in case of job failure
cleanup_handler()
{
    echo "Cleaning up ${jobtmpdir}"
    rm -rf "${jobtmpdir}"
}
trap 'cleanup_handler' SIGTERM EXIT
# Change working directory to this directory
cd ${jobtmpdir}
# Copy the executable and input files from
# where the job was submitted to the temporary directory.
cp ${SLURM_SUBMIT_DIR}/a.out .
cp ${SLURM_SUBMIT_DIR}/input.txt .
# Run the executable, handling the collection of stdout
# and stderr ourselves by redirecting to file
srun ./a.out 2> task_error.log > task_out.log
# Copy output data back to the submit directory.
cp output.txt ${SLURM_SUBMIT_DIR}
cp task_out.log ${SLURM_SUBMIT_DIR}
cp task_error.log ${SLURM_SUBMIT_DIR}
# Cleanup
cd ${SLURM_SUBMIT_DIR}
cleanup_handler
Hello,
Long time SGE admin, new SLURM admin here.
I recently started the transition of all my clusters from SGE to SLURM and everything was great until I hit the "Taco Bell" cluster (fake name).
Taco Bell supports 4 projects, and under SGE we had a priority system set up using projects to balance the queue.
For the life of me I have been unable to replicate this in SLURM.
We are looking to configure guaranteed resources based on the project.
I had thought we could accomplish this with QOS and accounts, but so far we have failed.
What we would like to end up with is:
When project Gordita is running uncontested, 100% of the cluster is available.
While Gordita is running, if Crunchwrap submits their jobs we want the scheduler to prioritize those jobs until a 75% Gordita / 25% Crunchwrap balance is reached.
No preempting or priority overriding; just as a Gordita job finishes, if Crunchwrap is below 25%, start a Crunchwrap job, and then maintain that balance until one project's jobs are 100% complete.
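For reference, the direction we have been trying looks roughly like this (a
sketch only; the account names and weights are placeholders, and it assumes
the standard multifactor priority plugin with fair-share):
# slurm.conf
PriorityType=priority/multifactor
PriorityWeightFairshare=10000
PriorityDecayHalfLife=7-0
# give the two projects fair-share weights in the intended 75/25 ratio
sacctmgr add account gordita Fairshare=75
sacctmgr add account crunchwrap Fairshare=25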
Any assistance or guidance is greatly appreciated.
Hello,
Is there an existing Slurm plugin for FPGA allocation? If not, can someone
please point me in the right direction for how to approach it?
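For context, a minimal sketch of what a GRES-based approach might look like,
in case that is the right direction (node name, count and device paths are
made up):
# slurm.conf
GresTypes=fpga
NodeName=fpga-node01 Gres=fpga:2 ...
# gres.conf on the node
NodeName=fpga-node01 Name=fpga File=/dev/fpga[0-1]
# job submission
sbatch --gres=fpga:1 myjob.sh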
Many thanks
Hi all,
I have a problem with sending mail on Rocky 9 via Slurm.
One needs to install s-nail to have "/bin/mail" available.
There are some caveats in smail. In the second part (for the message sent
when the job begins) one needs to pipe something (e.g. echo "") into $MAIL:
even in a script with no input, s-nail wants to be interactive, but it
suffices to echo an empty text to s-nail.
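Roughly, the change I mean looks like this (a sketch only; the actual smail
lines and variable names differ between Slurm versions, and $SUBJECT and
$RECIPIENT here are placeholders):
# before: s-nail hangs waiting for interactive input
# $MAIL -s "$SUBJECT" "$RECIPIENT"
# after: give it an empty stdin
echo "" | $MAIL -s "$SUBJECT" "$RECIPIENT"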
Nonetheless, I don't get any mail through. It seems the MailProg for some
reason gets killed or errors out. Yet it works perfectly when run from the
console :/
I keep getting the following in slurmctld.log:
27212:[2024-12-19T15:54:54.935] slurmscriptd: error: run_command:
killing MailProg operation on shutdown
27213:[2024-12-19T15:54:54.945] slurmscriptd: _run_script: JobId=0
MailProg killed by signal 9
27214:[2024-12-19T15:54:54.945] error: MailProg returned error, it's
output was ''
27395:[2024-12-19T15:55:55.540] slurmscriptd: error: run_command:
killing MailProg operation on shutdown
27396:[2024-12-19T15:55:55.551] slurmscriptd: _run_script: JobId=0
MailProg killed by signal 9
27397:[2024-12-19T15:55:55.551] error: MailProg returned error, it's
output was ''
27438:[2024-12-19T15:56:55.981] slurmscriptd: error: run_command:
killing MailProg operation on shutdown
27439:[2024-12-19T15:56:55.981] slurmscriptd: error: run_command:
killing MailProg operation on shutdown
27440:[2024-12-19T15:56:55.992] slurmscriptd: _run_script: JobId=0
MailProg killed by signal 9
27441:[2024-12-19T15:56:55.992] slurmscriptd: _run_script: JobId=0
MailProg killed by signal 9
27442:[2024-12-19T15:56:55.992] error: MailProg returned error, it's
output was ''
27443:[2024-12-19T15:56:55.992] error: MailProg returned error, it's
output was ''
27450:[2024-12-19T15:56:58.849] slurmscriptd: error: run_command:
killing MailProg operation on shutdown
27451:[2024-12-19T15:56:58.859] slurmscriptd: _run_script: JobId=0
MailProg killed by signal 0
any hints?
Best
Marcus
--
Dipl.-Inf. Marcus Wagner
stellv. Gruppenleitung
IT Center
Gruppe: Server, Storage, HPC
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80 24383
wagner(a)itc.rwth-aachen.de
www.itc.rwth-aachen.de
Social-Media-Kanäle des IT Centers:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/c/ITCenterRWTHAachen
Dear all,
I tried to rpmbuild slurm-24.11.0 for AlmaLinux 8. The build failed
because some installed packages are not found by Slurm's configure script:
rdkafka, glib, gtk and lua.
But all these packages are installed, and they are found by
slurm-24.05.x:
librdkafka-1.6.1-1.el8.x86_64
librdkafka-devel-1.6.1-1.el8.x86_64
lua-5.3.4-12.el8.x86_64
lua-devel-5.3.4-12.el8.x86_64
glib2-2.56.4-165.el8_10.x86_64
glib2-devel-2.56.4-165.el8_10.x86_64
gtk2-2.24.32-5.el8.x86_64
gtk2-devel-2.24.32-5.el8.x86_64
gtk3-3.22.30-12.el8_10.x86_64
gtk3-devel-3.22.30-12.el8_10.x86_64
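For reference, a generic way to see why configure rejected a package that is
installed is to look at the config.log left behind by the failed build (the
path is only an example and depends on the rpmbuild topdir):
grep -i -B2 -A10 'lua' ~/rpmbuild/BUILD/slurm-24.11.0/config.log | less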
Best regards
Bernd Melchers
--
Archiv- und Backup-Service | fab-service(a)zedat.fu-berlin.de
Freie Universität Berlin | Tel. +49-30-838-55905
Hello,
I have multiple questions about the usage of job_container/tmpfs and the
TmpFS and TmpDisk variables:
1) If my job_container.conf file contains:
```
BasePath=/mnt/slurm_tmp Shared=true
```
is it important what I set TmpFS to in slurm.conf? Should I set it to
'/mnt/slurm_tmp' or '/tmp'?
2) What size should I put in TmpDisk? The size advertised by df?
3) Finally, is there any recommended file system for the partition used
as the job_container/tmpfs BasePath?
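For context, the combination I am currently leaning towards looks like this
in slurm.conf (the node name and size are placeholders, and this is exactly
the part I am unsure about):
TmpFS=/mnt/slurm_tmp
NodeName=nodeXX ... TmpDisk=900000   # in MB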
Best regards,
Paul Musset
Max Planck Institute for Brain Research
Hi all,
I have observed a significant discrepancy in CPU usage time calculations
between sreport and sacct, and I would like to understand the underlying
cause. Let me share the specific case I encountered when calculating CPU
usage time for user zt23132881r from November 1, 2024, to November 30, 2024.
1. sreport Results (995,171 minutes):
[root@master ~]# sreport Cluster UserUtilizationByAccount user=zt23132881r start=2024-11-01 end=2024-11-30
--------------------------------------------------------------------------------
Cluster/User/Account Utilization 2024-11-01T00:00:00 - 2024-11-29T23:59:59 (2505600 secs)
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account     Used    Energy
--------- --------- --------------- --------------- -------- ---------
djhpc-po+ zt231328+ zt23132881r zt+ zt23132881r_ba+   995171   6294875
2. sacct Results:
# Without truncate (1,019,927 minutes / 61,195,668 seconds)
[root@master ~]# sacct -u zt23132881r -S 2024-11-01 -E 2024-11-30 -o "jobid,partition,account,user,alloccpus,cputimeraw,state" -X | awk 'BEGIN{total=0}{total+=$6}END{print total}'
61195668

# With truncate (967,165 minutes / 58,029,908 seconds)
[root@master ~]# sacct -u zt23132881r -S 2024-11-01 -E 2024-11-30 -o "jobid,partition,account,user,alloccpus,cputimeraw,state" -X --truncate | awk 'BEGIN{total=0}{total+=$6}END{print total}'
58029908

# Without -X
[root@master ~]# sacct -u zt23132881r -S 2024-11-01 -E 2024-11-30 -o "jobid,partition,account,user,alloccpus,cputimeraw,state" | awk 'BEGIN{total=0}{total+=$6}END{print total}'
61195668
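(cputimeraw sums are in CPU-seconds; dividing by 60 gives the minute figures
quoted in the summary below, e.g.:)
echo $((61195668 / 60))   # -> 1019927
echo $((58029908 / 60))   # -> 967165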
The results show three different values:
- sreport: 995,171 minutes
- sacct (without truncate): 1,019,927 minutes
- sacct (with truncate): 967,165 minutes
I would appreciate if someone could explain:
- Which of these results is more accurate?
- How does sreport calculate CPU usage time?
- Why does the --truncate option in sacct lead to different results?
Thank you for your assistance in clarifying these discrepancies.
Best regards
Hi all,
I'm seeing some odd behavior when using the --mem-per-gpu flag instead of
the --mem flag to request memory when also requesting all available CPUs on
a node (in this example, all available nodes have 32 CPUs):
$ srun --ntasks-per-node=8 --cpus-per-task=4 --gpus-per-node=gtx1080ti:1
--mem-per-gpu=1g --pty bash
srun: error: Unable to allocate resources: Requested node configuration is
not available
$ srun --ntasks-per-node=8 --cpus-per-task=4 --gpus-per-node=gtx1080ti:1
--mem=1g --pty bash
srun: job 3479971 queued and waiting for resources
srun: job 3479971 has been allocated resources
$
The nodes in this partition have a mix of gtx1080ti and rtx2080ti GPUs, but
only one type of GPU is in any one node. The same behavior does not occur
when requesting a (node with a) rtx2080ti instead.
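For reference, a way to compare how the two node types are configured (the
node name is a placeholder):
scontrol show node gpu-node01 | grep -E 'RealMemory|CfgTRES|Gres'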
Is there something I'm missing that would cause the --mem-per-gpu flag to
not be working in this example?
Thanks,
Matthew