[slurm-users] [Long] Why are tasks started on a 30 second clock?

Kirill Katsnelson kkm at pobox.com
Thu Jul 25 18:11:07 UTC 2019


On Thu, Jul 25, 2019 at 8:16 AM Mark Hahn <hahn at mcmaster.ca> wrote:

> how about a timeout from elsewhere?  for instance, when I see a 30s delay,
> I normally at least check DNS, which can introduce such quantized delays.
>

Thanks, it's a good guess, but it's very unlikely the case here.

The Google Cloud is quite different from a physical cluster. Speaking of
DNS specifically, it is provided by what is called the "metadata service",
which is tightly integrated with the rest of the platform. When a VM
transitions into the STARTED state (as if "powered on" and beginning its
boot sequence), this service immediately starts answering DNS queries for
the VM's name (from other machines in the same "network"). It's kind of
magical: its IP is one and the same across all of GCE, there is no route
to it, it responds even if it is not whitelisted in the firewall, and it
sits in the link-local RFC 3927 169.254.0.0/16 range. I suspect the VM's
own "hardware" (the emulated network stack/offload engine, or maybe the
driver) answers these queries without any actual DNS packets going
anywhere. It never fails and responds immeasurably fast (<1 ms). (The same
IP also pretends to be a DHCP server, and in addition it can return a lot
of information about the VM's own configuration over HTTP on port 80 with
curl. The common theme is that it responds only from inside the VM, and
with different data depending on which VM is asking.)
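
(If anyone wants to poke at this from inside a VM, here is a rough Python
sketch; the metadata URL and the Metadata-Flavor header are GCE's documented
conventions, and the hostname is just one of my node names, so substitute
your own.)

import socket
import time
import urllib.request

# Time an internal-name lookup; it is answered by the metadata service at
# 169.254.169.254 without any DNS packet leaving the VM. The hostname here
# is just an example -- use whatever exists in your network.
t0 = time.monotonic()
addr = socket.getaddrinfo("xa-node-p100-1", None)[0][4][0]
print("resolved to %s in %.2f ms" % (addr, (time.monotonic() - t0) * 1e3))

# Fetch a bit of the VM's own metadata over HTTP; the URL and the
# Metadata-Flavor header are GCE's documented conventions.
req = urllib.request.Request(
    "http://metadata.google.internal/computeMetadata/v1/instance/name",
    headers={"Metadata-Flavor": "Google"})
print(urllib.request.urlopen(req, timeout=2).read().decode())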

All the internal communication in the virtual "LAN" is extremely reliable.
The only type of hiccup I am aware of is a VM freezing for roughly 500 ms
when it is live-migrated to a different physical host--and that normally
happens 1-2 times a month.

Besides, the logs show the node is talking to the controller during this
timeout. I'll post these 3 lines from the node with the "slacking" job
start again:

Jul 24 14:14:27 xa-node-p100-1 slurmd[573]: debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
Jul 24 14:14:47 xa-node-p100-1 slurmd[573]: debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
Jul 24 14:14:51 xa-node-p100-1 slurmd[573]: task_p_slurmd_batch_request: 1646

So in summary, the time sequence looks like:
14:14:20  Controller logs that node ...-3 is online
14:14:21  Node ...-3 starts its task (and logs the
          'task_p_slurmd_batch_request:' message)
14:14:22  Controller logs that node ...-1 is online
14:14:27, 14:14:47  Some background periodic exchanges happen with node
          ...-1 (the snippet above)
14:14:51  Exactly 30 s after the other node was given a task, the
          controller sends a similar request to node ...-1 (and to the
          other nodes that came up during this time)
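
(If you want to reproduce this kind of merged timeline yourself, here is a
rough Python sketch that interleaves syslog-formatted lines from the two
daemons by timestamp; the file paths are only placeholders for wherever the
slurmctld and slurmd messages actually end up on your systems.)

import re
from datetime import datetime

# Placeholder paths -- point these at wherever syslog puts the slurmctld
# and slurmd messages on your controller and node.
LOGS = {"ctld": "/var/log/slurm/slurmctld.log",
        "node": "/var/log/slurm/slurmd.log"}

# Matches the syslog-style prefix seen above: "Jul 24 14:14:27 host ...".
STAMP = re.compile(r"^([A-Z][a-z]{2}\s+\d+\s+\d\d:\d\d:\d\d)\s+(.*)")

events = []
for tag, path in LOGS.items():
    with open(path) as f:
        for line in f:
            m = STAMP.match(line)
            if m:
                # Syslog timestamps carry no year; good enough for
                # sorting events within a single day.
                ts = datetime.strptime(m.group(1), "%b %d %H:%M:%S")
                events.append((ts, tag, m.group(2).rstrip()))

for ts, tag, msg in sorted(events):
    print("%s [%s] %s" % (ts.strftime("%H:%M:%S"), tag, msg))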

I have never seen a discrepancy on an idle cluster if all the nodes
allocated to the batch are already up and running. But when they come
online at random times over 10-20 seconds, as they usually do, I always see
an exact 30 s difference between bunches of jobs of the same array in the R
state (provided there is a difference at all, but more often than not there
is when 5 or more nodes are staged at the same time).
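
(In case it helps anyone, a quick way to make the bunching visible is to
count running jobs by start time with squeue. A rough sketch; %i and %S are
standard squeue format specifiers, the rest is just counting.)

import subprocess
from collections import Counter

# %i = job id, %S = start time; -h drops the header, -t R keeps only
# running jobs.
out = subprocess.run(["squeue", "-h", "-t", "R", "-o", "%i %S"],
                     check=True, capture_output=True, text=True).stdout

starts = Counter(line.split()[1] for line in out.splitlines() if line.strip())
for start, count in sorted(starts.items()):
    print("%s  %d job(s)" % (start, count))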

 -kkm