On Fri, May 24, 2024 at 9:32 PM Hongyi Zhao hongyi.zhao@gmail.com wrote:
Dear Slurm Users,
I am seeing a significant performance discrepancy when running the same VASP job through the Slurm scheduler compared to running it directly with mpirun: the Slurm run takes roughly 60% longer (about 176 s versus 110 s of real time for the LOOP+ timing shown below). I am hoping for some insights or advice on how to resolve this issue.
System Information:
Slurm Version: 21.08.5
OS: Ubuntu 22.04.4 LTS (Jammy)
Job Submission Script:
#!/usr/bin/env bash
#SBATCH -N 1
#SBATCH -D .
#SBATCH --output=%j.out
#SBATCH --error=%j.err
##SBATCH --time=2-00:00:00
#SBATCH --ntasks=36
#SBATCH --mem=64G

echo '#######################################################'
echo "date = $(date)"
echo "hostname = $(hostname -s)"
echo "pwd = $(pwd)"
echo "sbatch = $(which sbatch | xargs realpath -e)"
echo ""
echo "WORK_DIR = $WORK_DIR"
echo "SLURM_SUBMIT_DIR = $SLURM_SUBMIT_DIR"
echo "SLURM_JOB_NUM_NODES = $SLURM_JOB_NUM_NODES"
echo "SLURM_NTASKS = $SLURM_NTASKS"
echo "SLURM_NTASKS_PER_NODE = $SLURM_NTASKS_PER_NODE"
echo "SLURM_CPUS_PER_TASK = $SLURM_CPUS_PER_TASK"
echo "SLURM_JOBID = $SLURM_JOBID"
echo "SLURM_JOB_NODELIST = $SLURM_JOB_NODELIST"
echo "SLURM_NNODES = $SLURM_NNODES"
echo "SLURMTMPDIR = $SLURMTMPDIR"
echo '#######################################################'
echo ""

module purge > /dev/null 2>&1
module load vasp
ulimit -s unlimited
mpirun vasp_std
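One extra diagnostic that could be added just before the mpirun line is a report of which CPUs the job is actually allowed to use. This is only a sketch and was not part of the runs reported here. The function name show_cpu_binding is hypothetical, and the snippet assumes a Linux node where /proc/self/status is readable.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original script): print the CPU set
# visible to the current shell, so the Slurm run and the direct run
# can be compared side by side.
show_cpu_binding() {
    # nproc counts only the processing units available to this process.
    echo "nproc = $(nproc)"
    # Cpus_allowed_list is the kernel's affinity list for this process.
    echo "Cpus_allowed_list = $(awk '/^Cpus_allowed_list/ { print $2 }' /proc/self/status)"
    # Set by Slurm inside a job; prints "unset" in an interactive shell.
    echo "SLURM_CPUS_ON_NODE = ${SLURM_CPUS_ON_NODE:-unset}"
}

show_cpu_binding
```

Calling it once inside the batch script and once in the interactive shell used for the direct mpirun run would show whether the two runs see the same set of cores.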
Performance Observation:
When running the job through Slurm:
werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$ grep LOOP OUTCAR
LOOP: cpu time 14.4893: real time 14.5049
LOOP: cpu time 14.3538: real time 14.3621
LOOP: cpu time 14.3870: real time 14.3568
LOOP: cpu time 15.9722: real time 15.9018
LOOP: cpu time 16.4527: real time 16.4370
LOOP: cpu time 16.7918: real time 16.7781
LOOP: cpu time 16.9797: real time 16.9961
LOOP: cpu time 15.9762: real time 16.0124
LOOP: cpu time 16.8835: real time 16.9008
LOOP: cpu time 15.2828: real time 15.2921
LOOP+: cpu time 176.0917: real time 176.0755
When running the job directly with mpirun:
werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$ mpirun -n 36 vasp_std
werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$ grep LOOP OUTCAR
LOOP: cpu time 9.0072: real time 9.0074
LOOP: cpu time 9.0515: real time 9.0524
LOOP: cpu time 9.1896: real time 9.1907
LOOP: cpu time 10.1467: real time 10.1479
LOOP: cpu time 10.2691: real time 10.2705
LOOP: cpu time 10.4330: real time 10.4340
LOOP: cpu time 10.9049: real time 10.9055
LOOP: cpu time 9.9718: real time 9.9714
LOOP: cpu time 10.4511: real time 10.4470
LOOP: cpu time 9.4621: real time 9.4584
LOOP+: cpu time 110.0790: real time 110.0739
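To put a number on the gap, the two LOOP+ real times can be pulled out of the OUTCAR files and divided. The sketch below assumes each run's OUTCAR has been saved under a separate name. The function names loop_plus_real_time and slowdown_ratio and the file names OUTCAR.slurm and OUTCAR.direct are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the real time (last field) of the LOOP+
# line in a VASP OUTCAR file given as $1.
loop_plus_real_time() {
    awk '/LOOP\+/ { print $NF }' "$1"
}

# Hypothetical helper: print slow/fast to two decimal places.
slowdown_ratio() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

# Example with the LOOP+ real times reported in this message:
#   slowdown_ratio 176.0755 110.0739    # prints 1.60
# Example with saved output files (file names are assumptions):
#   slowdown_ratio "$(loop_plus_real_time OUTCAR.slurm)" \
#                  "$(loop_plus_real_time OUTCAR.direct)"
```

With the numbers above, the run under Slurm takes about 1.6 times as long as the direct mpirun run.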
Could you provide any insights or suggestions on what might be causing this performance issue? Are there any specific configurations or settings in Slurm that I should check or adjust to align the performance more closely with the direct mpirun execution?
Thank you for your time and assistance.
The test example used above is attached.
Best regards,
Zhao
--
Assoc. Prof. Hongsheng Zhao
hongyi.zhao@gmail.com
Theory and Simulation of Materials
Hebei Vocational University of Technology and Engineering
No. 473, Quannan West Street, Xindu District, Xingtai, Hebei province