[slurm-users] [Long] Why are tasks started on a 30 second clock?
Kirill Katsnelson
kkm at pobox.com
Thu Jul 25 14:02:41 UTC 2019
I am setting up and debugging a little (up to 100 nodes) elastic cluster in
the Google Compute Engine (GCE). Our compute load is embarrassingly
parallel, and I simply pack nodes with either a task per core for CPU
loads or a task per node for GPU loads; the node VMs are started and
deleted on demand. I learned many of Slurm's timing features by reading its
source, but here is one that baffles me. Not a problem, just something I am
striving to understand.
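For context, the power-management side of my setup is the usual Slurm
cloud-scheduling configuration, roughly along these lines (the script paths
and values below are illustrative placeholders, not my exact configuration):

  SuspendProgram=/usr/local/sbin/gce-delete-node  # deletes the idle VM (placeholder path)
  ResumeProgram=/usr/local/sbin/gce-create-node   # creates the VM on demand (placeholder path)
  SuspendTime=300     # seconds a node sits idle before it is powered down
  ResumeTimeout=600   # how long slurmctld waits for a resumed node to register
  TreeWidth=128       # at least the node count, so the slurmd fan-out stays flat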
When I start an entirely homogeneous array batch, GCE starts a set of
identical VM nodes (I do it with the ResumeProgram script). There is a
natural variance in the time when a node becomes available to the
controller, of course, but normally it's just seconds apart, quite tight.
My topology is flat, and the GCE network is super-consistent. The
interesting Slurm behavior is that when the jobs reach the R state, some of
the nodes run tasks from the same batch exactly 30 seconds ahead of the
others (unless they are in perfect sync, which also happens). It's as if
slurmctld ran some loop once every 30 seconds to start tasks on the nodes
that have become available, and I cannot understand where this 30-second
figure comes from.
Here's an example from my debug run, with log snippets (I'm leaving in only
what is hopefully relevant). Job 1648 launched on xa-node-p100-3 at
14:14:21, and job 1646 on xa-node-p100-1 at 14:14:51. xa-control-1 is the
only controller. These are 2 array batches submitted simultaneously, with
different GRES (no GRES plugins, just generic resource IDs), so that Slurm
stages 5 nodes of each of the 2 types.
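The two array submissions were of this general shape (the GRES names and
script names are placeholders, not my actual command lines):

  sbatch --array=1-5 --nodes=1 --gres=p100:1 run_p100.sh
  sbatch --array=1-5 --nodes=1 --gres=v100:1 run_v100.sh

Each array task takes one node, so the two arrays together wake up 5 nodes
of each type.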
Recap, so it's easier to track:
Job: 1648, Start: 14:14:21, Node: xa-node-p100-3
Job: 1646, Start: 14:14:51, Node: xa-node-p100-1
Node log snippets first. Clearly, the controller sends in the launch RPCs
30 seconds apart, so the delta-t comes from the controller:
Jul 24 14:14:20 xa-node-p100-3 slurmd[587]: debug: _handle_node_reg_resp:
slurmctld sent back 8 TRES.
Jul 24 14:14:21 xa-node-p100-3 slurmd[587]: task_p_slurmd_batch_request:
1648
Jul 24 14:14:27 xa-node-p100-1 slurmd[573]: debug: _handle_node_reg_resp:
slurmctld sent back 8 TRES.
Jul 24 14:14:47 xa-node-p100-1 slurmd[573]: debug: _handle_node_reg_resp:
slurmctld sent back 8 TRES.
Jul 24 14:14:51 xa-node-p100-1 slurmd[573]: task_p_slurmd_batch_request:
1646
Let's now focus on these 2 jobs and their respective nodes on the controller:
Jul 24 14:13:37 xa-control-1 slurmctld[361]: Power save mode: 52 nodes
Jul 24 14:13:37 xa-control-1 slurmctld[361]: power_save: pid 866 waking
nodes xa-node-p100-[1-5],xa-node-v100-[1-5]
Jul 24 14:13:46 xa-control-1 slurmctld[361]: debug: sched: Running job
scheduler
. . .
Jul 24 14:14:20 xa-control-1 slurmctld[361]: backfill: beginning
Jul 24 14:14:20 xa-control-1 slurmctld[361]: backfill: no jobs to backfill
Jul 24 14:14:20 xa-control-1 slurmctld[361]: Node xa-node-p100-3 now
responding <<a>>
Jul 24 14:14:21 xa-control-1 slurmctld[361]: job_time_limit: Configuration
for JobId=1641_3(1648) complete <<b>>
Jul 24 14:14:21 xa-control-1 slurmctld[361]: Resetting JobId=1641_3(1648)
start time for node power up
. . . (snip same as above for already responded nodes)
Jul 24 14:14:21 xa-control-1 slurmctld[361]: debug: sched: Running job
scheduler
Jul 24 14:14:22 xa-control-1 slurmctld[361]: Node xa-node-p100-2 now
responding
Jul 24 14:14:22 xa-control-1 slurmctld[361]: Node xa-node-p100-1 now
responding <<c>>
Jul 24 14:14:25 xa-control-1 slurmctld[361]: backfill: beginning
Jul 24 14:14:25 xa-control-1 slurmctld[361]: backfill: no jobs to backfill
So, our xa-node-p100-3 responded <<a>> and got its slice of work after just
1 second <<b>>. xa-node-p100-1 is late to the party by just 1 second
<<c>>, but late nonetheless. Omitting a few log lines:
Jul 24 14:14:45 xa-control-1 slurmctld[361]: debug: Spawning ping agent
for xa-node-p100-[1-5],xa-node-v100-[1,4-5]
Jul 24 14:14:46 xa-control-1 slurmctld[361]: debug: sched: Running job
scheduler
Jul 24 14:14:47 xa-control-1 slurmctld[361]: debug: Spawning registration
agent for xa-node-p100-[1-5],xa-node-v100-[1-5] 10 hosts
Jul 24 14:14:50 xa-control-1 slurmctld[361]: backfill: beginning
Jul 24 14:14:50 xa-control-1 slurmctld[361]: backfill: no jobs to backfill
Jul 24 14:14:51 xa-control-1 slurmctld[361]: job_time_limit: Configuration
for JobId=1641_1(1646) complete <<d>>
Jul 24 14:14:51 xa-control-1 slurmctld[361]: Resetting JobId=1641_1(1646)
start time for node power up
... (a bunch of messages like the 2 above, for other jobs from the two
arrays) ...
This is where slurmctld started job 1646 on xa-node-p100-1 <<d>>, exactly
on the mysterious 30-second clock tick!
Reading the agent.c code, I understand that it tosses the waiting requests
between the defer_list and retry_list queues. The REQUEST_BATCH_JOB_LAUNCH
RPC is re-examined at most once every 10 seconds:
<https://github.com/SchedMD/slurm/blob/3a82eeaa3/src/slurmctld/agent.c#L2141-L2143>
if (difftime(now, queued_req_ptr->last_attempt) < 10) {
        /* Reduce overhead by only testing once every 10 secs */
        return 1;
}
But for the life of me I cannot grok where the additional x3 multiplier
comes from, assuming I am looking at the right code. DebugFlags=Agent output
shows that the agent is triggered once per second when nothing interesting
happens, and more often when there is some activity.
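To check my reading of that gating, here is a toy, standalone model of just
this check (not Slurm code; the once-per-second wake-up and the trivial
launch condition are my own simplifications):

  /* Toy model of the check quoted above: a queued launch request is
   * re-examined only if at least 10 seconds passed since the previous
   * attempt.  Not Slurm code, just an illustration of the quantization
   * this check alone can introduce. */
  #include <stdio.h>
  #include <time.h>

  #define RETRY_MIN_INTERVAL 10          /* seconds, as in agent.c */

  struct queued_req {
      time_t last_attempt;               /* when we last tried this request */
      int    node_ready;                 /* stand-in for "node responded"   */
  };

  /* Returns 1 if the request stays queued, 0 once it is "launched". */
  static int try_launch(struct queued_req *req, time_t now)
  {
      if (difftime(now, req->last_attempt) < RETRY_MIN_INTERVAL)
          return 1;                      /* too soon, keep it queued */
      req->last_attempt = now;
      if (!req->node_ready)
          return 1;                      /* node not up yet, retry later */
      printf("launched at t=%ld\n", (long)now);
      return 0;
  }

  int main(void)
  {
      struct queued_req req = { .last_attempt = 0, .node_ready = 0 };
      for (time_t t = 1; t <= 40; t++) { /* simulated 1 Hz agent wake-ups */
          if (t == 12)
              req.node_ready = 1;        /* node "responds" at t = 12 */
          if (!try_launch(&req, t))
              break;                     /* launches at t = 20, not t = 12 */
      }
      return 0;
  }

In this toy model the node becomes ready at t = 12 but the request only goes
out at t = 20, so the check by itself accounts for a delay of up to roughly
10 seconds, not 30; hence my puzzlement about the factor of 3.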
I started my software engineering life back in the day, collecting realtime
data from a physics experiment on a mini-PDP[1], then doing robotics and
embedded control; although mostly doing data science now[2], I still love
the near-realtime stuff. I am in awe trying to comprehend the sheer scale of
Slurm, and my question is driven by curiosity alone.
I'll be very grateful if anyone can explain where the 30-second clock hides!
-kkm
_____
[1] Did I just reveal my age? Ah, those PDP days!
[2] There is no science in "data science"; it should have been termed "data
witchcraft."