https://twitter.com/systemdsucks

Is that my coat? Why thank you.

On Tue, 27 Nov 2018 at 11:21, Kevin Buckley <Kevin.Buckley@pawsey.org.au> wrote:

Here are our usual Slurm log-related config settings

(where we've increased the default "Syslog" level, as we're trying
 to get the Crays to funnel the slurmd log messages all the way up
 to the machine that syslog messages get forwarded to on the Crays)

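(The syslog side of that funnelling is just the node forwarding its syslog
stream upward. As a rough sketch of the kind of rsyslog rule involved, with
the file path and loghost name being placeholders rather than anything from
our actual setup:

# /etc/rsyslog.d/forward.conf  (placeholder path and loghost)
# forward everything at "info" and above to the central log host over TCP
*.info  @@loghost.example.com:514

)
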
SlurmctldDebug=info
SlurmctldLogFile=/var/spool/slurm/slurmctld.log
SlurmctldSyslogDebug=info
SlurmdDebug=info
SlurmdLogFile=/var/spool/slurmd/%h.log
SlurmdSyslogDebug=info


and here are the "logged" outputs on both channels, the systemd journal
and slurmd's dedicated log file, for a particular job:
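
(If you want to pull the journal side of this out yourself, something along
these lines should do it; the identifiers are just the tags that show up in
the journal output, and the time window is the job's:

journalctl -t slurmd -t slurmstepd --since "2018-11-27 16:57" --until "2018-11-27 16:58"

)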


SYSTEMD JOURNAL


Nov 27 16:57:32 nid00024 slurmd[18994]: task_p_slurmd_batch_request: 796
Nov 27 16:57:32 nid00024 slurmd[18994]: task/affinity: job 796 CPU input mask for node: 0x00FF
Nov 27 16:57:32 nid00024 slurmd[18994]: task/affinity: job 796 CPU final HW mask for node: 0x0F0F
Nov 27 16:57:32 nid00024 slurmd[18994]: _run_prolog: run job script took usec=10517
Nov 27 16:57:32 nid00024 slurmd[18994]: _run_prolog: prolog with lock for job 796 ran for 0 seconds
Nov 27 16:57:33 nid00024 slurmd[18994]: Launching batch job 796 for UID 20480
Nov 27 16:57:33 nid00024 slurmd[18994]: launch task 796.0 request from 20480.20480@10.128.0.25 (port 25773)
Nov 27 16:57:33 nid00024 slurmd[18994]: lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread
Nov 27 16:57:34 nid00024 slurmd[18994]: launch task 796.1 request from 20480.20480@10.128.0.25 (port 30381)
Nov 27 16:57:34 nid00024 slurmd[18994]: lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread


SLURMD LOG FILE


[2018-11-27T16:57:32.850] task_p_slurmd_batch_request: 796
[2018-11-27T16:57:32.850] task/affinity: job 796 CPU input mask for node: 0x00FF
[2018-11-27T16:57:32.850] task/affinity: job 796 CPU final HW mask for node: 0x0F0F
[2018-11-27T16:57:32.862] _run_prolog: run job script took usec=10517
[2018-11-27T16:57:32.862] _run_prolog: prolog with lock for job 796 ran for 0 seconds
[2018-11-27T16:57:32.874] [796.extern] core_spec/cray: init
[2018-11-27T16:57:32.898] [796.extern] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:32.898] [796.extern] task/cgroup: /slurm/uid_20480/job_796/step_extern: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:33.666] Launching batch job 796 for UID 20480
[2018-11-27T16:57:33.680] [796.batch] core_spec/cray: init
[2018-11-27T16:57:33.690] [796.batch] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:33.690] [796.batch] task/cgroup: /slurm/uid_20480/job_796/step_batch: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:33.696] [796.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:33.871] launch task 796.0 request from 20480.20480@10.128.0.25 (port 25773)
[2018-11-27T16:57:33.871] lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread
[2018-11-27T16:57:33.882] [796.0] core_spec/cray: init
[2018-11-27T16:57:33.918] [796.0] (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
[2018-11-27T16:57:33.934] [796.0] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:33.934] [796.0] task/cgroup: /slurm/uid_20480/job_796/step_0: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:34.580] [796.0] done with job
[2018-11-27T16:57:34.940] launch task 796.1 request from 20480.20480@10.128.0.25 (port 30381)
[2018-11-27T16:57:34.941] lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread
[2018-11-27T16:57:34.951] [796.1] core_spec/cray: init
[2018-11-27T16:57:34.994] [796.1] (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
[2018-11-27T16:57:35.006] [796.1] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:35.006] [796.1] task/cgroup: /slurm/uid_20480/job_796/step_1: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:35.012] [796.1] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:35.012] [796.1] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:35.012] [796.1] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:35.013] [796.1] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:46.460] [796.1] done with job
[2018-11-27T16:57:46.602] [796.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2018-11-27T16:57:46.608] [796.batch] done with job
[2018-11-27T16:57:46.760] [796.extern] done with job


where we see that SOME, but not ALL, of the log messages appear on both channels.

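(One rough way to see which messages are log-file-only is to strip the
prefixes from each channel and compare what's left; a sketch, assuming the
%h log file path from the config above, so nid00024.log on this node, and a
journal query covering the same window:

journalctl -t slurmd -t slurmstepd -o cat --since "2018-11-27 16:57" --until "2018-11-27 16:58" | sort -u > /tmp/journal.msgs
sed -E 's/^\[[^]]*\] (\[[^]]*\] )?//' /var/spool/slurmd/nid00024.log | sort -u > /tmp/logfile.msgs
comm -13 /tmp/journal.msgs /tmp/logfile.msgs    # messages only in the dedicated log file

)
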

Now let's comment out the use of a dedicated slurmd log file and run
exactly the same job. FWIW, the config now reads:

SlurmctldDebug=info
SlurmctldLogFile=/var/spool/slurm/slurmctld.log
SlurmctldSyslogDebug=info
SlurmdDebug=info
#lurmdLogFile=/var/spool/slurmd/%h.log
SlurmdSyslogDebug=info
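
(For completeness: after flipping that line, the node's slurmd has to
re-read slurm.conf before the logging behaviour changes; an
"scontrol reconfigure" may be enough, but restarting the daemon is the sure
way, e.g., assuming the usual unit name:

systemctl restart slurmd

)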


SYSTEMD JOURNAL


Nov 27 16:51:51 nid00024 slurmd[16415]: task_p_slurmd_batch_request: 794
Nov 27 16:51:51 nid00024 slurmd[16415]: task/affinity: job 794 CPU input mask for node: 0x00FF
Nov 27 16:51:51 nid00024 slurmd[16415]: task/affinity: job 794 CPU final HW mask for node: 0x0F0F
Nov 27 16:51:51 nid00024 slurmd[16415]: _run_prolog: run job script took usec=10472
Nov 27 16:51:51 nid00024 slurmd[16415]: _run_prolog: prolog with lock for job 794 ran for 0 seconds
Nov 27 16:51:51 nid00024 slurmstepd[18143]: core_spec/cray: init
Nov 27 16:51:51 nid00024 slurmstepd[18143]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:51 nid00024 slurmstepd[18143]: task/cgroup: /slurm/uid_20480/job_794/step_extern: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:52 nid00024 slurmd[16415]: Launching batch job 794 for UID 20480
Nov 27 16:51:52 nid00024 slurmstepd[18411]: core_spec/cray: init
Nov 27 16:51:52 nid00024 slurmstepd[18411]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:52 nid00024 slurmstepd[18411]: task/cgroup: /slurm/uid_20480/job_794/step_batch: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:52 nid00024 slurmstepd[18419]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:52 nid00024 slurmd[16415]: launch task 794.0 request from 20480.20480@10.128.0.25 (port 13485)
Nov 27 16:51:52 nid00024 slurmd[16415]: lllp_distribution jobid [794] auto binding off: mask_cpu,one_thread
Nov 27 16:51:52 nid00024 slurmstepd[18520]: core_spec/cray: init
Nov 27 16:51:52 nid00024 slurmstepd[18520]: (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
Nov 27 16:51:52 nid00024 slurmstepd[18520]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:52 nid00024 slurmstepd[18520]: task/cgroup: /slurm/uid_20480/job_794/step_0: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:52 nid00024 slurmstepd[18527]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:52 nid00024 slurmstepd[18529]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:52 nid00024 slurmstepd[18528]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:52 nid00024 slurmstepd[18530]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:53 nid00024 slurmstepd[18520]: done with job
Nov 27 16:51:53 nid00024 slurmd[16415]: launch task 794.1 request from 20480.20480@10.128.0.25 (port 18093)
Nov 27 16:51:53 nid00024 slurmd[16415]: lllp_distribution jobid [794] auto binding off: mask_cpu,one_thread
Nov 27 16:51:53 nid00024 slurmstepd[18551]: core_spec/cray: init
Nov 27 16:51:53 nid00024 slurmstepd[18551]: (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
Nov 27 16:51:53 nid00024 slurmstepd[18551]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:53 nid00024 slurmstepd[18551]: task/cgroup: /slurm/uid_20480/job_794/step_1: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:53 nid00024 slurmstepd[18558]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:53 nid00024 slurmstepd[18559]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:53 nid00024 slurmstepd[18560]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:53 nid00024 slurmstepd[18561]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:52:04 nid00024 slurmstepd[18551]: done with job
Nov 27 16:52:05 nid00024 slurmstepd[18411]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
Nov 27 16:52:05 nid00024 slurmstepd[18411]: done with job
Nov 27 16:52:05 nid00024 slurmstepd[18143]: done with job


SLURMD LOG FILE


Well, nothing: obviously.

The thing to note here is that the log messages that ONLY appeared in the
dedicated log file, when there was one, now show up in the systemd
journal (and of course carry that nice "slurmstepd" tag, which makes
identifying them much easier), along with the log messages that ALWAYS
appear on BOTH channels.
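
(That tag also means the step-level messages can be picked out on their own,
with something like:

journalctl -t slurmstepd --since "2018-11-27 16:51" --until "2018-11-27 16:53"

)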


So, whilst it seems to be the case that,

    you will get SOME of the messages on ALL of the channels, ALL of the time

and even though it seems that,

    you can get ALL of the messages on SOME of the channels, SOME of the time

sadly it would appear that,

    you can't get ALL of the messages on ALL of the channels, ALL of the time



Am I a fool for thinking that you should be able to?


Kevin Buckley
-- 
Supercomputing Systems Administrator
Pawsey Supercomputing Centre