[slurm-users] An observation on SLURM's logging

John Hearns hearnsj at googlemail.com
Tue Nov 27 04:09:36 MST 2018


https://twitter.com/systemdsucks
Is that my coat? Why, thank you.

On Tue, 27 Nov 2018 at 11:21, Kevin Buckley <Kevin.Buckley at pawsey.org.au>
wrote:

> Here are our usual Slurm log-related config settings
>
> (where we've increased the default "Syslog" level as we're trying
>   to get the Crays to funnel the slurmd log messages all the way up
>   to the machine that syslog messages get forwarded to)
>
>
> SlurmctldDebug=info
> SlurmctldLogFile=/var/spool/slurm/slurmctld.log
> SlurmctldSyslogDebug=info
> SlurmdDebug=info
> SlurmdLogFile=/var/spool/slurmd/%h.log
> SlurmdSyslogDebug=info
>
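> For completeness, a quick way to confirm which logging settings the
> daemons have actually picked up is plain "scontrol show config" (nothing
> Cray-specific here, and the grep pattern is just a convenience filter):
>
> scontrol show config | grep -iE 'Debug|LogFile|Syslog'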
>
> and here are the logged outputs for a particular job on both channels,
> the systemd journal and slurmd's dedicated log file:
>
>
> SYSTEMD JOURNAL
>
>
> Nov 27 16:57:32 nid00024 slurmd[18994]: task_p_slurmd_batch_request: 796
> Nov 27 16:57:32 nid00024 slurmd[18994]: task/affinity: job 796 CPU input
> mask for node: 0x00FF
> Nov 27 16:57:32 nid00024 slurmd[18994]: task/affinity: job 796 CPU final
> HW mask for node: 0x0F0F
> Nov 27 16:57:32 nid00024 slurmd[18994]: _run_prolog: run job script took
> usec=10517
> Nov 27 16:57:32 nid00024 slurmd[18994]: _run_prolog: prolog with lock for
> job 796 ran for 0 seconds
> Nov 27 16:57:33 nid00024 slurmd[18994]: Launching batch job 796 for UID
> 20480
> Nov 27 16:57:33 nid00024 slurmd[18994]: launch task 796.0 request from
> 20480.20480@10.128.0.25 (port 25773)
> Nov 27 16:57:33 nid00024 slurmd[18994]: lllp_distribution jobid [796] auto
> binding off: mask_cpu,one_thread
> Nov 27 16:57:34 nid00024 slurmd[18994]: launch task 796.1 request from
> 20480.20480@10.128.0.25 (port 30381)
> Nov 27 16:57:34 nid00024 slurmd[18994]: lllp_distribution jobid [796] auto
> binding off: mask_cpu,one_thread
>
>
> SLURMD LOG FILE
>
>
> [2018-11-27T16:57:32.850] task_p_slurmd_batch_request: 796
> [2018-11-27T16:57:32.850] task/affinity: job 796 CPU input mask for node:
> 0x00FF
> [2018-11-27T16:57:32.850] task/affinity: job 796 CPU final HW mask for
> node: 0x0F0F
> [2018-11-27T16:57:32.862] _run_prolog: run job script took usec=10517
> [2018-11-27T16:57:32.862] _run_prolog: prolog with lock for job 796 ran
> for 0 seconds
> [2018-11-27T16:57:32.874] [796.extern] core_spec/cray: init
> [2018-11-27T16:57:32.898] [796.extern] task/cgroup:
> /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB
> memsw.limit=26600MB
> [2018-11-27T16:57:32.898] [796.extern] task/cgroup:
> /slurm/uid_20480/job_796/step_extern: alloc=28000MB mem.limit=26600MB
> memsw.limit=26600MB
> [2018-11-27T16:57:33.666] Launching batch job 796 for UID 20480
> [2018-11-27T16:57:33.680] [796.batch] core_spec/cray: init
> [2018-11-27T16:57:33.690] [796.batch] task/cgroup:
> /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB
> memsw.limit=26600MB
> [2018-11-27T16:57:33.690] [796.batch] task/cgroup:
> /slurm/uid_20480/job_796/step_batch: alloc=28000MB mem.limit=26600MB
> memsw.limit=26600MB
> [2018-11-27T16:57:33.696] [796.batch] task_p_pre_launch: Using
> sched_affinity for tasks
> [2018-11-27T16:57:33.871] launch task 796.0 request from
> 20480.20480@10.128.0.25 (port 25773)
> [2018-11-27T16:57:33.871] lllp_distribution jobid [796] auto binding off:
> mask_cpu,one_thread
> [2018-11-27T16:57:33.882] [796.0] core_spec/cray: init
> [2018-11-27T16:57:33.918] [796.0] (switch_cray.c: 654: switch_p_job_init)
> gres_cnt: 0 1
> [2018-11-27T16:57:33.934] [796.0] task/cgroup: /slurm/uid_20480/job_796:
> alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> [2018-11-27T16:57:33.934] [796.0] task/cgroup:
> /slurm/uid_20480/job_796/step_0: alloc=28000MB mem.limit=26600MB
> memsw.limit=26600MB
> [2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity
> for tasks
> [2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity
> for tasks
> [2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity
> for tasks
> [2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity
> for tasks
> [2018-11-27T16:57:34.580] [796.0] done with job
> [2018-11-27T16:57:34.940] launch task 796.1 request from
> 20480.20480@10.128.0.25 (port 30381)
> [2018-11-27T16:57:34.941] lllp_distribution jobid [796] auto binding off:
> mask_cpu,one_thread
> [2018-11-27T16:57:34.951] [796.1] core_spec/cray: init
> [2018-11-27T16:57:34.994] [796.1] (switch_cray.c: 654: switch_p_job_init)
> gres_cnt: 0 1
> [2018-11-27T16:57:35.006] [796.1] task/cgroup: /slurm/uid_20480/job_796:
> alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> [2018-11-27T16:57:35.006] [796.1] task/cgroup:
> /slurm/uid_20480/job_796/step_1: alloc=28000MB mem.limit=26600MB
> memsw.limit=26600MB
> [2018-11-27T16:57:35.012] [796.1] task_p_pre_launch: Using sched_affinity
> for tasks
> [2018-11-27T16:57:35.012] [796.1] task_p_pre_launch: Using sched_affinity
> for tasks
> [2018-11-27T16:57:35.012] [796.1] task_p_pre_launch: Using sched_affinity
> for tasks
> [2018-11-27T16:57:35.013] [796.1] task_p_pre_launch: Using sched_affinity
> for tasks
> [2018-11-27T16:57:46.460] [796.1] done with job
> [2018-11-27T16:57:46.602] [796.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
> [2018-11-27T16:57:46.608] [796.batch] done with job
> [2018-11-27T16:57:46.760] [796.extern] done with job
>
>
> where we see that SOME, but not ALL, of the log messages appear on both of
> the channels.
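>
> (If anyone wants to reproduce the comparison, this is roughly how the two
> channels can be lined up; the syslog identifiers are the tags visible above,
> and the timestamp, job id and per-node log file are just the ones from this
> run:
>
> journalctl -t slurmd -t slurmstepd --since "2018-11-27 16:57" | grep 796
> grep 796 /var/spool/slurmd/nid00024.log
>
> which is what shows the slurmstepd-side lines sitting in the file but
> missing from the journal.)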
>
>
>
> Now let's comment out the use of a dedicated slurmd log file and run
> exactly the same job,
>
> FWIW:
>
> SlurmctldDebug=info
> SlurmctldLogFile=/var/spool/slurm/slurmctld.log
> SlurmctldSyslogDebug=info
> SlurmdDebug=info
> #SlurmdLogFile=/var/spool/slurmd/%h.log
> SlurmdSyslogDebug=info
>
>
> SYSTEMD JOURNAL
>
>
> Nov 27 16:51:51 nid00024 slurmd[16415]: task_p_slurmd_batch_request: 794
> Nov 27 16:51:51 nid00024 slurmd[16415]: task/affinity: job 794 CPU input
> mask for node: 0x00FF
> Nov 27 16:51:51 nid00024 slurmd[16415]: task/affinity: job 794 CPU final
> HW mask for node: 0x0F0F
> Nov 27 16:51:51 nid00024 slurmd[16415]: _run_prolog: run job script took
> usec=10472
> Nov 27 16:51:51 nid00024 slurmd[16415]: _run_prolog: prolog with lock for
> job 794 ran for 0 seconds
> Nov 27 16:51:51 nid00024 slurmstepd[18143]: core_spec/cray: init
> Nov 27 16:51:51 nid00024 slurmstepd[18143]: task/cgroup:
> /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB
> memsw.limit=26600MB
> Nov 27 16:51:51 nid00024 slurmstepd[18143]: task/cgroup:
> /slurm/uid_20480/job_794/step_extern: alloc=28000MB mem.limit=26600MB
> memsw.limit=26600MB
> Nov 27 16:51:52 nid00024 slurmd[16415]: Launching batch job 794 for UID
> 20480
> Nov 27 16:51:52 nid00024 slurmstepd[18411]: core_spec/cray: init
> Nov 27 16:51:52 nid00024 slurmstepd[18411]: task/cgroup:
> /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB
> memsw.limit=26600MB
> Nov 27 16:51:52 nid00024 slurmstepd[18411]: task/cgroup:
> /slurm/uid_20480/job_794/step_batch: alloc=28000MB mem.limit=26600MB
> memsw.limit=26600MB
> Nov 27 16:51:52 nid00024 slurmstepd[18419]: task_p_pre_launch: Using
> sched_affinity for tasks
> Nov 27 16:51:52 nid00024 slurmd[16415]: launch task 794.0 request from
> 20480.20480@10.128.0.25 (port 13485)
> Nov 27 16:51:52 nid00024 slurmd[16415]: lllp_distribution jobid [794] auto
> binding off: mask_cpu,one_thread
> Nov 27 16:51:52 nid00024 slurmstepd[18520]: core_spec/cray: init
> Nov 27 16:51:52 nid00024 slurmstepd[18520]: (switch_cray.c: 654:
> switch_p_job_init) gres_cnt: 0 1
> Nov 27 16:51:52 nid00024 slurmstepd[18520]: task/cgroup:
> /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB
> memsw.limit=26600MB
> Nov 27 16:51:52 nid00024 slurmstepd[18520]: task/cgroup:
> /slurm/uid_20480/job_794/step_0: alloc=28000MB mem.limit=26600MB
> memsw.limit=26600MB
> Nov 27 16:51:52 nid00024 slurmstepd[18527]: task_p_pre_launch: Using
> sched_affinity for tasks
> Nov 27 16:51:52 nid00024 slurmstepd[18529]: task_p_pre_launch: Using
> sched_affinity for tasks
> Nov 27 16:51:52 nid00024 slurmstepd[18528]: task_p_pre_launch: Using
> sched_affinity for tasks
> Nov 27 16:51:52 nid00024 slurmstepd[18530]: task_p_pre_launch: Using
> sched_affinity for tasks
> Nov 27 16:51:53 nid00024 slurmstepd[18520]: done with job
> Nov 27 16:51:53 nid00024 slurmd[16415]: launch task 794.1 request from
> 20480.20480@10.128.0.25 (port 18093)
> Nov 27 16:51:53 nid00024 slurmd[16415]: lllp_distribution jobid [794] auto
> binding off: mask_cpu,one_thread
> Nov 27 16:51:53 nid00024 slurmstepd[18551]: core_spec/cray: init
> Nov 27 16:51:53 nid00024 slurmstepd[18551]: (switch_cray.c: 654:
> switch_p_job_init) gres_cnt: 0 1
> Nov 27 16:51:53 nid00024 slurmstepd[18551]: task/cgroup:
> /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB
> memsw.limit=26600MB
> Nov 27 16:51:53 nid00024 slurmstepd[18551]: task/cgroup:
> /slurm/uid_20480/job_794/step_1: alloc=28000MB mem.limit=26600MB
> memsw.limit=26600MB
> Nov 27 16:51:53 nid00024 slurmstepd[18558]: task_p_pre_launch: Using
> sched_affinity for tasks
> Nov 27 16:51:53 nid00024 slurmstepd[18559]: task_p_pre_launch: Using
> sched_affinity for tasks
> Nov 27 16:51:53 nid00024 slurmstepd[18560]: task_p_pre_launch: Using
> sched_affinity for tasks
> Nov 27 16:51:53 nid00024 slurmstepd[18561]: task_p_pre_launch: Using
> sched_affinity for tasks
> Nov 27 16:52:04 nid00024 slurmstepd[18551]: done with job
> Nov 27 16:52:05 nid00024 slurmstepd[18411]: sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
> Nov 27 16:52:05 nid00024 slurmstepd[18411]: done with job
> Nov 27 16:52:05 nid00024 slurmstepd[18143]: done with job
>
>
> SLURMD LOG FILE
>
>
> Well, nothing: obviously.
>
>
> The thing to note here is that log messages that ONLY appeared in the
> dedicated log file, when there was one, now show up in the systemd
> journal (and of course, have that nice "slurmstepd" tag that makes
> identifying them much easier) along with the log messages that ALWAYS
> appear on BOTH channels.
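>
> (That slurmstepd tag is handy on its own, by the way; something like
>
> journalctl -t slurmstepd --since today
>
> pulls out just the step daemons' messages.)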
>
>
>
> So, whilst it seems to be the case that,
>
>     you will get SOME of the messages on ALL of the channels, ALL of the
> time
>
> and even though it seems that,
>
>     you can get ALL of the messages on SOME of the channels, SOME of the
> time
>
> sadly it would appear that,
>
>    you can't get ALL of the messages on ALL of the channels, ALL of the
> time
>
>
>
> Am I a fool for thinking that you should be able to?
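>
> The only brute-force workaround that springs to mind, completely untested,
> would be to keep the dedicated log file and mirror it back into syslog
> ourselves, so that the forwarded channel sees the slurmstepd-only messages
> too (the path is just the per-node file from the run above):
>
> tail -F /var/spool/slurmd/nid00024.log | logger -t slurmd-file -p local0.info
>
> but that feels like working around the logging, rather than configuring it.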
>
>
> Kevin Buckley
> --
> Supercomputing Systems Administrator
> Pawsey Supercomputing Centre
>
>