[slurm-users] [slurm-dev] SLURM 16.05.10-2 jobacct_gather/linux inconsistencies?

John DeSantis desantis at usf.edu
Tue Dec 12 11:11:11 MST 2017


Hello all,

> A closer look revealed that jag_common_poll_data isn't consistent in
> logging the total memory usage, depending on the application in
> question.  

Ugh.  I'm embarrassed to say that the issue was traced back to using
proctrack/pgid instead of proctrack/linuxproc; sadly, this was an
artifact of our testing days and was never changed after our move to
production!
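
For anyone else who trips over the same thing, the relevant slurm.conf
settings now look something like the sketch below (an excerpt only; the
30 second gather interval is just our site's value):

# slurm.conf (excerpt) - sketch, not a complete configuration
ProctrackType=proctrack/linuxproc
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

proctrack/pgid only follows the job's process group, so an MPI launcher
that puts its children into a new process group or session can escape
accounting entirely; proctrack/linuxproc walks /proc and follows the
parent chain instead, which would explain why the vasp_std processes
under pmi_proxy were being missed.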

RTFM, dude!

John DeSantis

On Tue, 5 Sep 2017 11:40:15 -0600
John DeSantis <desantis at usf.edu> wrote:

> 
> Hello all,
> 
> We were recently alerted by a user whose long-running jobs (>= 6
> days) were being killed by the OOM killer.
> 
> A closer look revealed that jag_common_poll_data isn't consistent in
> logging the total memory usage, depending on the application in
> question.  For example, consider the following job submission:
> 
> #!/bin/bash
> #
> #SBATCH --job-name=sif14_1co2
> #SBATCH --output=runlog.log
> #SBATCH --time=48:00:00
> #SBATCH --nodes=8 --ntasks-per-node=6
> #SBATCH --partition=p2016 --qos=p16
> #SBATCH --mail-type=END
> #SBATCH --mail-user=user
> 
> module purge
> module add apps/vasp/5.4.1
> 
> # For standard vasp
> mpirun vasp_std
> 
> 
> Since no memory request was specified, the DefMemPerCPU value of 512
> MB takes effect.
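> 
> For reference, assuming the default of one CPU per task, that works
> out to a per-node limit of:
> 
> 6 tasks/node * 512 MB/CPU = 3072 MB per node
> 
> while, as the stats below show, each vasp_std process alone has an
> RSS of roughly 1 GB.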
> 
> Looking at the batch host stats, here is what is actually being
> reported:
> 
>   PPID    LWP PSR     TIME S   RSS     ELAPSED COMMAND            VSZ
>  28388  28394   1 00:00:00 S  1332    17:47:24 slurm_script    106116
>  28394  28403   1 00:00:00 S  1404    17:47:24 mpirun          106248
>  28403  28416  10 00:00:00 S  1664    17:47:24 mpiexec.hydra    17180
>  28416  28417   5 00:00:00 S  4996    17:47:24 srun            708424
>  28417  28421   3 00:00:00 S   668    17:47:24 srun             37712
>  28453  28458   0 00:00:00 S  1928    17:47:24 pmi_proxy        16384
>  28458  28471   4 15:55:13 R 1059004  17:47:24 vasp_std        2953636
>  28458  28472  10 17:01:43 R 1057672  17:47:24 vasp_std        2950168
>  28458  28473  20 17:35:54 R 1055412  17:47:24 vasp_std        2947896
>  28458  28474   3 17:53:39 R 1060060  17:47:24 vasp_std        2952712
>  28458  28475  11 17:51:52 R 1055168  17:47:24 vasp_std        2947772
>  28458  28476  23 17:53:42 R 1063888  17:47:24 vasp_std        2956572
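> 
> (The columns above are straight from ps; something along the lines of
> 
> ps -Lo ppid,lwp,psr,time,s,rss,etime,comm,vsz -u $USER
> 
> will reproduce them; -L lists individual threads as LWPs, and
> rss/vsz are reported in KB.)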
> 
> [2017-09-05T13:11:09.124] [10212120] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28394 mem size 4400 229544 time 0.020000(0+0)
> [2017-09-05T13:11:09.280] [10212120.0] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28458 mem size 1928 205988 time 0.020000(0+0)
> [2017-09-05T13:11:31.729] [10212120] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28394 mem size 4400 229544 time 0.020000(0+0)
> [2017-09-05T13:11:31.757] [10212120.0] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28458 mem size 1928 205988 time 0.020000(0+0)
> [2017-09-05T13:11:39.126] [10212120] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28394 mem size 4400 229544 time 0.020000(0+0)
> [2017-09-05T13:11:39.283] [10212120.0] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28458 mem size 1928 205988 time 0.020000(0+0)
> 
> Note that the mem size logged for pid 28458 (1928 KB) matches the RSS
> of pmi_proxy itself; the six vasp_std processes, each with an RSS of
> roughly 1 GB, are never reflected in the accounting.
> 
> However, if a slightly older version of the same software is used,
> memory seems to be reported correctly.  Consider the submission
> script below.
> 
> #!/bin/sh
> 
> #SBATCH --time=04:00:00
> #SBATCH -N 2 --ntasks-per-node=2 -J "vasp_run"
> #SBATCH --mem-per-cpu=1800
> #SBATCH --reservation=blah
> #SBATCH -p blah
> #SBATCH -w nodes-[1-2]
> 
> module purge
> module load apps/vasp/5.3.3
> 
> mpirun vasp
> 
> And the reported stats:
> 
>  PPID   LWP PSR     TIME S   RSS     ELAPSED COMMAND
> 22046 22050   0 00:00:00 S  1324       03:32 slurm_script
> 22050 22053   0 00:00:00 S  1476       03:32 mpirun
> 22053 22111   1 00:00:00 S  1428       03:32 mpiexec.hydra
> 22111 22112   2 00:00:00 S  4800       03:32 srun
> 22112 22113   3 00:00:00 S   668       03:32 srun
> 22121 22126   3 00:00:00 S  1352       03:32 pmi_proxy
> 22126 22127   5 00:03:30 R 1781544     03:32 vasp
> 22126 22128   0 00:03:30 R 1778052     03:32 vasp
> 
> 
> [2017-09-05T13:17:58.097] [10215464] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
> [2017-09-05T13:17:58.265] [10215464.0] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22126 mem size 3538576 7561020 time 59.630000(56+2)
> [2017-09-05T13:18:28.101] [10215464] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
> [2017-09-05T13:18:28.270] [10215464.0] debug:  jag_common_poll_data: Task average frequency = 2505 pid 22126 mem size 3560928 7655316 time 119.590000(116+3)
> [2017-09-05T13:18:58.105] [10215464] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
> [2017-09-05T13:18:58.275] [10215464.0] debug:  jag_common_poll_data: Task average frequency = 2498 pid 22126 mem size 3560932 7655316 time 179.570000(175+3)
> [2017-09-05T13:19:28.109] [10215464] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
> [2017-09-05T13:19:28.281] [10215464.0] debug:  jag_common_poll_data: Task average frequency = 2495 pid 22126 mem size 3560932 7655316 time 239.550000(234+4)
> [2017-09-05T13:19:58.113] [10215464] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
> [2017-09-05T13:19:58.286] [10215464.0] debug:  jag_common_poll_data: Task average frequency = 2493 pid 22126 mem size 3560936 7655316 time 299.510000(294+5)
> 
> 
> 
> In this case the mem size logged for pid 22126 (~3.5 GB and climbing)
> roughly matches the combined RSS of the two vasp processes, so the
> accounting is clearly picking them up.
> 
> Has anyone else noticed anything similar on their cluster(s)?  I
> cannot confirm whether this was happening before we upgraded from
> 15.08.4 to 16.05.10-2.
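> 
> (If it helps with comparisons, what the accounting plugin has actually
> stored for a running step can be pulled with something along the
> lines of
> 
> sstat -j <jobid> -o JobID,MaxRSS,AveRSS,MaxVMSize
> 
> where <jobid> is the job in question.)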
> 
> Thanks,
> John DeSantis



