[slurm-users] Regression with srun and task/affinity
Jason Bacon
bacon4000 at gmail.com
Tue May 14 14:49:23 UTC 2019
On 2019-05-14 09:24, Jason Bacon wrote:
> On 2018-12-16 09:02, Jason Bacon wrote:
>>
>> Good morning,
>>
>> We've been running 17.02.11 for a long time and upon testing an
>> upgrade to the 18 series, we discovered a regression. It appeared
>> somewhere between 17.02.11 and 17.11.7.
>>
>> Everything works fine under 17.02.11.
>>
>> Under later versions, everything is fine if I don't use srun or if I
>> use TaskPlugin=task/none.
>>
>> Just wondering if someone can suggest where to look in the source
>> code for this. If I can just pinpoint where the problem is, I'm sure
>> I can come up with a solution pretty quickly. I've poked around a
>> bit but have not spotted anything yet. If this doesn't look familiar
>> to anyone, I'll dig deeper and figure it out eventually. Just don't
>> want to duplicate someone's effort if this is something that's been
>> fixed already on other platforms.
>>
>> Below is output from a failed srun, followed by successful sbatch
>> --array and openmpi jobs.
>>
>> Thanks,
>>
>> Jason
>>
>> Failing job:
>>
>> FreeBSD login.wren bacon ~ 474: srun hostname
>> srun: error: slurm_receive_msgs: Zero Bytes were transmitted or received
>> srun: error: Task launch for 82.0 failed on node compute-001: Zero
>> Bytes were transmitted or received
>> srun: error: Application launch failed: Zero Bytes were transmitted
>> or received
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: Timed out waiting for job step to complete
>>
>> Tail of slurmctld log:
>>
>> [2018-12-01T16:29:09.873] debug2: got 1 threads to send out
>> [2018-12-01T16:29:09.874] debug2: Tree head got back 0 looking for 2
>> [2018-12-01T16:29:09.874] debug3: Tree sending to compute-001
>> [2018-12-01T16:29:09.874] debug3: Tree sending to compute-002
>> [2018-12-01T16:29:09.874] debug2: slurm_connect failed: Connection
>> refused
>> [2018-12-01T16:29:09.874] debug2: Error connecting slurm stream
>> socket at 192.168.1.13:6818: Connection refused
>> [2018-12-01T16:29:09.874] debug3: connect refused, retrying
>> [2018-12-01T16:29:09.874] debug4: orig_timeout was 10000 we have 0
>> steps and a timeout of 10000
>> [2018-12-01T16:29:10.087] debug2: Processing RPC:
>> MESSAGE_NODE_REGISTRATION_STATUS from uid=0
>> [2018-12-01T16:29:10.087] debug2: Tree head got back 1
>> [2018-12-01T16:29:10.087] debug2: _slurm_rpc_node_registration
>> complete for compute-002 usec=97
>> [2018-12-01T16:29:10.917] debug2: slurm_connect failed: Connection
>> refused
>> [2018-12-01T16:29:10.917] debug2: Error connecting slurm stream
>> socket at 192.168.1.13:6818: Connection refused
>> [2018-12-01T16:29:11.976] debug2: slurm_connect failed: Connection
>> refused
>> [2018-12-01T16:29:11.976] debug2: Error connecting slurm stream
>> socket at 192.168.1.13:6818: Connection refused
>> [2018-12-01T16:29:12.007] debug2: Testing job time limits and
>> checkpoints
>> [2018-12-01T16:29:13.011] debug2: slurm_connect failed: Connection
>> refused
>> [2018-12-01T16:29:13.011] debug2: Error connecting slurm stream
>> socket at 192.168.1.13:6818: Connection refused
>>
>> Successful sbatch --array:
>>
>> #!/bin/sh -e
>>
>> #SBATCH --array=1-8
>>
>> hostname
>>
>> FreeBSD login.wren bacon ~ 462: more slurm-69_8.out
>> cpu-bind=MASK - compute-002, task 0 0 [64261]: mask 0x8 set
>> compute-002.wren
>>
>> Successful openmpi:
>>
>> #!/bin/sh -e
>>
>> #SBATCH --ntasks=8
>>
>> mpirun --report-bindings ./mpi-bench 3
>>
>> FreeBSD login.wren bacon ~/Data/mpi-bench/trunk 468: more slurm-81.out
>> cpu-bind=MASK - compute-001, task 0 0 [64589]: mask 0xf set
>> CPU 0 is set
>> CPU 1 is set
>> CPU 2 is set
>> CPU 3 is set
>> CPU 0 is set
>> CPU 1 is set
>> [compute-001.wren:64590] MCW rank 0 bound to socket 0[core 0[hwt 0]],
>> socket 0[core 1[hwt 0]]: [B/B][./.]
>> [compute-001.wren:64590] MCW rank 1 bound to socket 1[core 2[hwt 0]],
>> socket 1[core 3[hwt 0]]: [./.][B/B]
>> [compute-001.wren:64590] MCW rank 2 bound to socket 0[core 0[hwt 0]],
>> socket 0[core 1[hwt 0]]: [B/B][./.]
>> [compute-001.wren:64590] MCW rank 3 bound to socket 1[core 2[hwt 0]],
>> socket 1[core 3[hwt 0]]: [./.][B/B]
>>
> I've been busy upgrading our CentOS clusters, but finally got a chance
> to dig into this.
>
> I couldn't find any clues in the logs, but I noticed that slurmd was
> dying every time I used srun, so I ran it manually under GDB:
>
> root at compute-001:~ # gdb slurmd
> GNU gdb 6.1.1 [FreeBSD]
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and
> you are
> welcome to change it and/or distribute copies of it under certain
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for
> details.
> This GDB was configured as "amd64-marcel-freebsd"...(no debugging
> symbols found)...
> (gdb) run -D
> Starting program: /usr/local/sbin/slurmd -D
> (no debugging symbols found)...(no debugging symbols found)...slurmd:
> debug: Log file re-opened
> slurmd: debug: CPUs:4 Boards:1 Sockets:2 CoresPerSocket:2
> ThreadsPerCore:1
> slurmd: Message aggregation disabled
> slurmd: debug: CPUs:4 Boards:1 Sockets:2 CoresPerSocket:2
> ThreadsPerCore:1
> slurmd: topology NONE plugin loaded
> slurmd: route default plugin loaded
> slurmd: CPU frequency setting not configured for this node
> slurmd: debug: Resource spec: No specialized cores configured by
> default on this node
> slurmd: debug: Resource spec: Reserved system memory limit not
> configured for this node
> slurmd: task affinity plugin loaded with CPU mask
> 000000000000000000000000000000000000000000000000000000000000000f
> slurmd: debug: Munge authentication plugin loaded
> slurmd: debug: spank: opening plugin stack /usr/local/etc/plugstack.conf
> slurmd: Munge cryptographic signature plugin loaded
> slurmd: slurmd version 18.08.7 started
> slurmd: debug: Job accounting gather LINUX plugin loaded
> slurmd: debug: job_container none plugin loaded
> slurmd: debug: switch NONE plugin loaded
> slurmd: slurmd started on Tue, 14 May 2019 09:17:12 -0500
> slurmd: CPUs=4 Boards=1 Sockets=2 Cores=2 Threads=1 Memory=16344
> TmpDisk=15853 Uptime=2815493 CPUSpecList=(null) FeaturesAvail=(null)
> FeaturesActive=(null)
> slurmd: debug: AcctGatherEnergy NONE plugin loaded
> slurmd: debug: AcctGatherProfile NONE plugin loaded
> slurmd: debug: AcctGatherInterconnect NONE plugin loaded
> slurmd: debug: AcctGatherFilesystem NONE plugin loaded
> slurmd: debug: _handle_node_reg_resp: slurmctld sent back 8 TRES.
> slurmd: launch task 33.0 request from UID:2001 GID:2001
> HOST:192.168.0.20 PORT:244
> slurmd: debug: Checking credential with 336 bytes of sig data
> slurmd: task affinity : enforcing 'verbose,cores' cpu bind method
> slurmd: debug: task affinity : before lllp distribution cpu bind
> method is 'verbose,cores' ((null))
> slurmd: lllp_distribution jobid [33] binding:
> verbose,cores,one_thread, dist 1
> slurmd: _task_layout_lllp_cyclic
> /usr/local/lib/slurm/task_affinity.so: Undefined symbol "slurm_strlcpy"
>
> Program exited with code 01.
> (gdb)
>
> Looks like a simple build issue. It seems a little odd that the build
> succeeded with an undefined symbol, but it should be pretty easy to
> track down in any case.
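>
> Looking closer at why the build could succeed anyway: the plugin is
> linked as a shared object, and the linker doesn't require every symbol
> in a .so to be resolved at build time, so the missing slurm_strlcpy
> only surfaces when slurmd loads task_affinity.so. A minimal sketch of
> the mechanism (the .so name here is hypothetical, and Slurm's own
> plugin loader may bind lazily rather than with RTLD_NOW, in which case
> the failure is deferred until the symbol is first called, as in the
> gdb session above):
>
> #include <dlfcn.h>
> #include <stdio.h>
>
> int main(void)
> {
>     /* RTLD_NOW forces all undefined symbols to resolve at load time;
>      * if one is missing, dlopen() fails and dlerror() reports it,
>      * e.g. an "Undefined symbol" message on FreeBSD. */
>     void *h = dlopen("./task_affinity_demo.so", RTLD_NOW);
>     if (h == NULL) {
>         fprintf(stderr, "dlopen: %s\n", dlerror());
>         return 1;
>     }
>     dlclose(h);
>     return 0;
> }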
>
Here's the culprit:

In src/common/slurm_xlator.h, strlcpy is unconditionally defined as
slurm_strlcpy:

/* strlcpy.[ch] functions */
#define strlcpy slurm_strlcpy

But in src/common/strlcpy.c, the definition of strlcpy() and the
slurm_strlcpy alias are masked by

#if (!HAVE_STRLCPY)

So this will cause failures on platforms that already have an strlcpy()
function.
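
To make the mismatch concrete, here's a small self-contained
illustration (a hypothetical standalone file, not the actual Slurm
sources; the names mirror the snippets above). Building it with
-DHAVE_STRLCPY=1 mimics FreeBSD and fails with an unresolved
slurm_strlcpy, while -DHAVE_STRLCPY=0 builds and runs cleanly:

#include <stdio.h>
#include <string.h>

/* What slurm_xlator.h does, unconditionally: */
#define strlcpy slurm_strlcpy

/* What strlcpy.h effectively declares (the macro renames it too): */
size_t strlcpy(char *dst, const char *src, size_t siz);

/* What strlcpy.c does: the definition only exists when the platform
 * lacks its own strlcpy(), so with HAVE_STRLCPY=1 no slurm_strlcpy
 * symbol is ever produced. */
#if (!HAVE_STRLCPY)
size_t slurm_strlcpy(char *dst, const char *src, size_t siz)
{
    size_t n = strlen(src);
    if (siz > 0) {
        size_t c = (n >= siz) ? siz - 1 : n;
        memcpy(dst, src, c);
        dst[c] = '\0';
    }
    return n;
}
#endif

int main(void)
{
    char buf[16];
    /* In a plugin .so this unresolved call isn't caught at link time,
     * only when slurmd loads the plugin. */
    strlcpy(buf, "compute-001", sizeof(buf));
    puts(buf);
    return 0;
}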
Here's a quick fix:
--- src/common/slurm_xlator.h.orig	2019-04-12 04:20:25 UTC
+++ src/common/slurm_xlator.h
@@ -299,7 +299,9 @@
  * The header file used only for #define values. */
 
 /* strlcpy.[ch] functions */
+#if (!HAVE_STRLCPY)	// Match this to src/common/strlcpy.c
 #define strlcpy slurm_strlcpy
+#endif
 
 /* switch.[ch] functions
  * None exported today.
--
Earth is a beta site.