https://twitter.com/systemdsucks

Is that my coat? Why thank you.

On Tue, 27 Nov 2018 at 11:21, Kevin Buckley <Kevin.Buckley@pawsey.org.au> wrote:

Here are our usual Slurm log-related config settings

(where we've increased the default "Syslog" level, as we're trying
 to get the Crays to funnel the slurmd log messages all the way up
 to the machine that syslog messages get forwarded to on the Crays)

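(The syslog side of that funnelling is just the node forwarding its syslog
stream upward. As a rough sketch of the kind of rsyslog rule involved, with
the file path and loghost name being placeholders rather than anything from
our actual setup:

# /etc/rsyslog.d/forward.conf  (placeholder path and loghost)
# forward everything at "info" and above to the central log host over TCP
*.info  @@loghost.example.com:514

)
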
SlurmctldDebug=info
SlurmctldLogFile=/var/spool/slurm/slurmctld.log
SlurmctldSyslogDebug=info
SlurmdDebug=info
SlurmdLogFile=/var/spool/slurmd/%h.log
SlurmdSyslogDebug=info


and here are the "logged" outputs on both channels, the systemd journal
and slurmd's dedicated log file, for a particular job:
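
(If you want to pull the journal side of this out yourself, something along
these lines should do it; the identifiers are just the tags that show up in
the journal output, and the time window is the job's:

journalctl -t slurmd -t slurmstepd --since "2018-11-27 16:57" --until "2018-11-27 16:58"

)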


SYSTEMD JOURNAL


Nov 27 16:57:32 nid00024 slurmd[18994]: task_p_slurmd_batch_request: 796
Nov 27 16:57:32 nid00024 slurmd[18994]: task/affinity: job 796 CPU input mask for node: 0x00FF
Nov 27 16:57:32 nid00024 slurmd[18994]: task/affinity: job 796 CPU final HW mask for node: 0x0F0F
Nov 27 16:57:32 nid00024 slurmd[18994]: _run_prolog: run job script took usec=10517
Nov 27 16:57:32 nid00024 slurmd[18994]: _run_prolog: prolog with lock for job 796 ran for 0 seconds
Nov 27 16:57:33 nid00024 slurmd[18994]: Launching batch job 796 for UID 20480
Nov 27 16:57:33 nid00024 slurmd[18994]: launch task 796.0 request from 20480.20480@10.128.0.25 (port 25773)
Nov 27 16:57:33 nid00024 slurmd[18994]: lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread
Nov 27 16:57:34 nid00024 slurmd[18994]: launch task 796.1 request from 20480.20480@10.128.0.25 (port 30381)
Nov 27 16:57:34 nid00024 slurmd[18994]: lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread


SLURMD LOG FILE


[2018-11-27T16:57:32.850] task_p_slurmd_batch_request: 796
[2018-11-27T16:57:32.850] task/affinity: job 796 CPU input mask for node: 0x00FF
[2018-11-27T16:57:32.850] task/affinity: job 796 CPU final HW mask for node: 0x0F0F
[2018-11-27T16:57:32.862] _run_prolog: run job script took usec=10517
[2018-11-27T16:57:32.862] _run_prolog: prolog with lock for job 796 ran for 0 seconds
[2018-11-27T16:57:32.874] [796.extern] core_spec/cray: init
[2018-11-27T16:57:32.898] [796.extern] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:32.898] [796.extern] task/cgroup: /slurm/uid_20480/job_796/step_extern: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:33.666] Launching batch job 796 for UID 20480
[2018-11-27T16:57:33.680] [796.batch] core_spec/cray: init
[2018-11-27T16:57:33.690] [796.batch] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:33.690] [796.batch] task/cgroup: /slurm/uid_20480/job_796/step_batch: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:33.696] [796.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:33.871] launch task 796.0 request from 20480.20480@10.128.0.25 (port 25773)
[2018-11-27T16:57:33.871] lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread
[2018-11-27T16:57:33.882] [796.0] core_spec/cray: init
[2018-11-27T16:57:33.918] [796.0] (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
[2018-11-27T16:57:33.934] [796.0] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:33.934] [796.0] task/cgroup: /slurm/uid_20480/job_796/step_0: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:34.580] [796.0] done with job
[2018-11-27T16:57:34.940] launch task 796.1 request from 20480.20480@10.128.0.25 (port 30381)
[2018-11-27T16:57:34.941] lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread
[2018-11-27T16:57:34.951] [796.1] core_spec/cray: init
[2018-11-27T16:57:34.994] [796.1] (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
[2018-11-27T16:57:35.006] [796.1] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:35.006] [796.1] task/cgroup: /slurm/uid_20480/job_796/step_1: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:35.012] [796.1] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:35.012] [796.1] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:35.012] [796.1] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:35.013] [796.1] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:46.460] [796.1] done with job
[2018-11-27T16:57:46.602] [796.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2018-11-27T16:57:46.608] [796.batch] done with job
[2018-11-27T16:57:46.760] [796.extern] done with job


where we see that SOME, but not ALL, of the log messages appear on both channels.

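(One rough way to see which messages are log-file-only is to strip the
prefixes from each channel and compare what's left; a sketch, assuming the
%h log file path from the config above, so nid00024.log on this node, and a
journal query covering the same window:

journalctl -t slurmd -t slurmstepd -o cat --since "2018-11-27 16:57" --until "2018-11-27 16:58" | sort -u > /tmp/journal.msgs
sed -E 's/^\[[^]]*\] (\[[^]]*\] )?//' /var/spool/slurmd/nid00024.log | sort -u > /tmp/logfile.msgs
comm -13 /tmp/journal.msgs /tmp/logfile.msgs    # messages only in the dedicated log file

)
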

Now let's comment out the use of a dedicated slurmd log file and run
exactly the same job. FWIW, the config now reads:

SlurmctldDebug=info
SlurmctldLogFile=/var/spool/slurm/slurmctld.log
SlurmctldSyslogDebug=info
SlurmdDebug=info
#lurmdLogFile=/var/spool/slurmd/%h.log
SlurmdSyslogDebug=info
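
(For completeness: after flipping that line, the node's slurmd has to
re-read slurm.conf before the logging behaviour changes; an
"scontrol reconfigure" may be enough, but restarting the daemon is the sure
way, e.g., assuming the usual unit name:

systemctl restart slurmd

)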


SYSTEMD JOURNAL


Nov 27 16:51:51 nid00024 slurmd[16415]: task_p_slurmd_batch_request: 794
Nov 27 16:51:51 nid00024 slurmd[16415]: task/affinity: job 794 CPU input mask for node: 0x00FF
Nov 27 16:51:51 nid00024 slurmd[16415]: task/affinity: job 794 CPU final HW mask for node: 0x0F0F
Nov 27 16:51:51 nid00024 slurmd[16415]: _run_prolog: run job script took usec=10472
Nov 27 16:51:51 nid00024 slurmd[16415]: _run_prolog: prolog with lock for job 794 ran for 0 seconds
Nov 27 16:51:51 nid00024 slurmstepd[18143]: core_spec/cray: init
Nov 27 16:51:51 nid00024 slurmstepd[18143]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:51 nid00024 slurmstepd[18143]: task/cgroup: /slurm/uid_20480/job_794/step_extern: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:52 nid00024 slurmd[16415]: Launching batch job 794 for UID 20480
Nov 27 16:51:52 nid00024 slurmstepd[18411]: core_spec/cray: init
Nov 27 16:51:52 nid00024 slurmstepd[18411]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:52 nid00024 slurmstepd[18411]: task/cgroup: /slurm/uid_20480/job_794/step_batch: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:52 nid00024 slurmstepd[18419]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:52 nid00024 slurmd[16415]: launch task 794.0 request from 20480.20480@10.128.0.25 (port 13485)
Nov 27 16:51:52 nid00024 slurmd[16415]: lllp_distribution jobid [794] auto binding off: mask_cpu,one_thread
Nov 27 16:51:52 nid00024 slurmstepd[18520]: core_spec/cray: init
Nov 27 16:51:52 nid00024 slurmstepd[18520]: (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
Nov 27 16:51:52 nid00024 slurmstepd[18520]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:52 nid00024 slurmstepd[18520]: task/cgroup: /slurm/uid_20480/job_794/step_0: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:52 nid00024 slurmstepd[18527]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:52 nid00024 slurmstepd[18529]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:52 nid00024 slurmstepd[18528]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:52 nid00024 slurmstepd[18530]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:53 nid00024 slurmstepd[18520]: done with job
Nov 27 16:51:53 nid00024 slurmd[16415]: launch task 794.1 request from 20480.20480@10.128.0.25 (port 18093)
Nov 27 16:51:53 nid00024 slurmd[16415]: lllp_distribution jobid [794] auto binding off: mask_cpu,one_thread
Nov 27 16:51:53 nid00024 slurmstepd[18551]: core_spec/cray: init
Nov 27 16:51:53 nid00024 slurmstepd[18551]: (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
Nov 27 16:51:53 nid00024 slurmstepd[18551]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:53 nid00024 slurmstepd[18551]: task/cgroup: /slurm/uid_20480/job_794/step_1: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:53 nid00024 slurmstepd[18558]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:53 nid00024 slurmstepd[18559]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:53 nid00024 slurmstepd[18560]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:53 nid00024 slurmstepd[18561]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:52:04 nid00024 slurmstepd[18551]: done with job
Nov 27 16:52:05 nid00024 slurmstepd[18411]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
Nov 27 16:52:05 nid00024 slurmstepd[18411]: done with job
Nov 27 16:52:05 nid00024 slurmstepd[18143]: done with job


SLURMD LOG FILE


Well, nothing: obviously.

The thing to note here is that the log messages that ONLY appeared in the
dedicated log file, when there was one, now show up in the systemd
journal (and of course carry that nice "slurmstepd" tag, which makes
identifying them much easier), along with the log messages that ALWAYS
appear on BOTH channels.
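
(That tag also means the step-level messages can be picked out on their own,
with something like:

journalctl -t slurmstepd --since "2018-11-27 16:51" --until "2018-11-27 16:53"

)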


So, whilst it seems to be the case that,

    you will get SOME of the messages on ALL of the channels, ALL of the time

and even though it seems that,

    you can get ALL of the messages on SOME of the channels, SOME of the time

sadly it would appear that,

    you can't get ALL of the messages on ALL of the channels, ALL of the time



Am I a fool for thinking that you should be able to?


Kevin Buckley
-- 
Supercomputing Systems Administrator
Pawsey Supercomputing Centre