[slurm-users] Regression with srun and task/affinity
Jason Bacon
bacon4000 at gmail.com
Tue May 14 14:49:23 UTC 2019
On 2019-05-14 09:24, Jason Bacon wrote:
> On 2018-12-16 09:02, Jason Bacon wrote:
>>
>> Good morning,
>>
>> We've been running 17.02.11 for a long time and upon testing an
>> upgrade to the 18 series, we discovered a regression. It appeared
>> somewhere between 17.02.11 and 17.11.7.
>>
>> Everything works fine under 17.02.11.
>>
>> Under later versions, everything is fine if I don't use srun or if I
>> use TaskPlugin=task/none.
>>
>> Just wondering if someone can suggest where to look in the source
>> code for this. If I can just pinpoint where the problem is, I'm sure
>> I can come up with a solution pretty quickly. I've poked around a
>> bit but have not spotted anything yet. If this doesn't look familiar
>> to anyone, I'll dig deeper and figure it out eventually. Just don't
>> want to duplicate someone's effort if this is something that's been
>> fixed already on other platforms.
>>
>> Below is output from a failed srun, followed by successful sbatch
>> --array and openmpi jobs.
>>
>> Thanks,
>>
>> Jason
>>
>> Failing job:
>>
>> FreeBSD login.wren bacon ~ 474: srun hostname
>> srun: error: slurm_receive_msgs: Zero Bytes were transmitted or received
>> srun: error: Task launch for 82.0 failed on node compute-001: Zero
>> Bytes were transmitted or received
>> srun: error: Application launch failed: Zero Bytes were transmitted
>> or received
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: Timed out waiting for job step to complete
>>
>> Tail of slurmctld log:
>>
>> [2018-12-01T16:29:09.873] debug2: got 1 threads to send out
>> [2018-12-01T16:29:09.874] debug2: Tree head got back 0 looking for 2
>> [2018-12-01T16:29:09.874] debug3: Tree sending to compute-001
>> [2018-12-01T16:29:09.874] debug3: Tree sending to compute-002
>> [2018-12-01T16:29:09.874] debug2: slurm_connect failed: Connection
>> refused
>> [2018-12-01T16:29:09.874] debug2: Error connecting slurm stream
>> socket at 192.168.1.13:6818: Connection refused
>> [2018-12-01T16:29:09.874] debug3: connect refused, retrying
>> [2018-12-01T16:29:09.874] debug4: orig_timeout was 10000 we have 0
>> steps and a timeout of 10000
>> [2018-12-01T16:29:10.087] debug2: Processing RPC:
>> MESSAGE_NODE_REGISTRATION_STATUS from uid=0
>> [2018-12-01T16:29:10.087] debug2: Tree head got back 1
>> [2018-12-01T16:29:10.087] debug2: _slurm_rpc_node_registration
>> complete for compute-002 usec=97
>> [2018-12-01T16:29:10.917] debug2: slurm_connect failed: Connection
>> refused
>> [2018-12-01T16:29:10.917] debug2: Error connecting slurm stream
>> socket at 192.168.1.13:6818: Connection refused
>> [2018-12-01T16:29:11.976] debug2: slurm_connect failed: Connection
>> refused
>> [2018-12-01T16:29:11.976] debug2: Error connecting slurm stream
>> socket at 192.168.1.13:6818: Connection refused
>> [2018-12-01T16:29:12.007] debug2: Testing job time limits and
>> checkpoints
>> [2018-12-01T16:29:13.011] debug2: slurm_connect failed: Connection
>> refused
>> [2018-12-01T16:29:13.011] debug2: Error connecting slurm stream
>> socket at 192.168.1.13:6818: Connection refused
>>
>> Successful sbatch --array:
>>
>> #!/bin/sh -e
>>
>> #SBATCH --array=1-8
>>
>> hostname
>>
>> FreeBSD login.wren bacon ~ 462: more slurm-69_8.out
>> cpu-bind=MASK - compute-002, task 0 0 [64261]: mask 0x8 set
>> compute-002.wren
>>
>> Successful openmpi:
>>
>> #!/bin/sh -e
>>
>> #SBATCH --ntasks=8
>>
>> mpirun --report-bindings ./mpi-bench 3
>>
>> FreeBSD login.wren bacon ~/Data/mpi-bench/trunk 468: more slurm-81.out
>> cpu-bind=MASK - compute-001, task 0 0 [64589]: mask 0xf set
>> CPU 0 is set
>> CPU 1 is set
>> CPU 2 is set
>> CPU 3 is set
>> CPU 0 is set
>> CPU 1 is set
>> [compute-001.wren:64590] MCW rank 0 bound to socket 0[core 0[hwt 0]],
>> socket 0[core 1[hwt 0]]: [B/B][./.]
>> [compute-001.wren:64590] MCW rank 1 bound to socket 1[core 2[hwt 0]],
>> socket 1[core 3[hwt 0]]: [./.][B/B]
>> [compute-001.wren:64590] MCW rank 2 bound to socket 0[core 0[hwt 0]],
>> socket 0[core 1[hwt 0]]: [B/B][./.]
>> [compute-001.wren:64590] MCW rank 3 bound to socket 1[core 2[hwt 0]],
>> socket 1[core 3[hwt 0]]: [./.][B/B]
>>
> I've been busy upgrading our CentOS clusters, but finally got a chance
> to dig into this.
>
> I couldn't find any clues in the logs, but I noticed that slurmd was
> dying every time I used srun, so I ran it manually under GDB:
>
> root at compute-001:~ # gdb slurmd
> GNU gdb 6.1.1 [FreeBSD]
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and
> you are
> welcome to change it and/or distribute copies of it under certain
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for
> details.
> This GDB was configured as "amd64-marcel-freebsd"...(no debugging
> symbols found)...
> (gdb) run -D
> Starting program: /usr/local/sbin/slurmd -D
> (no debugging symbols found)...(no debugging symbols found)...slurmd:
> debug: Log file re-opened
> slurmd: debug: CPUs:4 Boards:1 Sockets:2 CoresPerSocket:2
> ThreadsPerCore:1
> slurmd: Message aggregation disabled
> slurmd: debug: CPUs:4 Boards:1 Sockets:2 CoresPerSocket:2
> ThreadsPerCore:1
> slurmd: topology NONE plugin loaded
> slurmd: route default plugin loaded
> slurmd: CPU frequency setting not configured for this node
> slurmd: debug: Resource spec: No specialized cores configured by
> default on this node
> slurmd: debug: Resource spec: Reserved system memory limit not
> configured for this node
> slurmd: task affinity plugin loaded with CPU mask
> 000000000000000000000000000000000000000000000000000000000000000f
> slurmd: debug: Munge authentication plugin loaded
> slurmd: debug: spank: opening plugin stack /usr/local/etc/plugstack.conf
> slurmd: Munge cryptographic signature plugin loaded
> slurmd: slurmd version 18.08.7 started
> slurmd: debug: Job accounting gather LINUX plugin loaded
> slurmd: debug: job_container none plugin loaded
> slurmd: debug: switch NONE plugin loaded
> slurmd: slurmd started on Tue, 14 May 2019 09:17:12 -0500
> slurmd: CPUs=4 Boards=1 Sockets=2 Cores=2 Threads=1 Memory=16344
> TmpDisk=15853 Uptime=2815493 CPUSpecList=(null) FeaturesAvail=(null)
> FeaturesActive=(null)
> slurmd: debug: AcctGatherEnergy NONE plugin loaded
> slurmd: debug: AcctGatherProfile NONE plugin loaded
> slurmd: debug: AcctGatherInterconnect NONE plugin loaded
> slurmd: debug: AcctGatherFilesystem NONE plugin loaded
> slurmd: debug: _handle_node_reg_resp: slurmctld sent back 8 TRES.
> slurmd: launch task 33.0 request from UID:2001 GID:2001
> HOST:192.168.0.20 PORT:244
> slurmd: debug: Checking credential with 336 bytes of sig data
> slurmd: task affinity : enforcing 'verbose,cores' cpu bind method
> slurmd: debug: task affinity : before lllp distribution cpu bind
> method is 'verbose,cores' ((null))
> slurmd: lllp_distribution jobid [33] binding:
> verbose,cores,one_thread, dist 1
> slurmd: _task_layout_lllp_cyclic
> /usr/local/lib/slurm/task_affinity.so: Undefined symbol "slurm_strlcpy"
>
> Program exited with code 01.
> (gdb)
>
> Looks like a simple build issue. It seems a little odd that the build
> succeeded with an undefined symbol, but it should be pretty easy to
> track down in any case.
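>
> Looking closer at why the build could succeed anyway: the plugin is
> linked as a shared object, and the linker doesn't require every symbol
> in a .so to be resolved at build time, so the missing slurm_strlcpy
> only surfaces when slurmd loads task_affinity.so. A minimal sketch of
> the mechanism (the .so name here is hypothetical, and Slurm's own
> plugin loader may bind lazily rather than with RTLD_NOW, in which case
> the failure is deferred until the symbol is first called, as in the
> gdb session above):
>
> #include <dlfcn.h>
> #include <stdio.h>
>
> int main(void)
> {
>     /* RTLD_NOW forces all undefined symbols to resolve at load time;
>      * if one is missing, dlopen() fails and dlerror() reports it,
>      * e.g. an "Undefined symbol" message on FreeBSD. */
>     void *h = dlopen("./task_affinity_demo.so", RTLD_NOW);
>     if (h == NULL) {
>         fprintf(stderr, "dlopen: %s\n", dlerror());
>         return 1;
>     }
>     dlclose(h);
>     return 0;
> }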
>
Here's the culprit:

In src/common/slurm_xlator.h, strlcpy is unconditionally defined as
slurm_strlcpy:

/* strlcpy.[ch] functions */
#define strlcpy slurm_strlcpy

But in src/common/strlcpy.c, the definition of strlcpy() and the
slurm_strlcpy alias are masked by

#if (!HAVE_STRLCPY)

So this will cause failures on platforms that already have an strlcpy()
function.
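
To make the mismatch concrete, here's a small self-contained
illustration (a hypothetical standalone file, not the actual Slurm
sources; the names mirror the snippets above). Building it with
-DHAVE_STRLCPY=1 mimics FreeBSD and fails with an unresolved
slurm_strlcpy, while -DHAVE_STRLCPY=0 builds and runs cleanly:

#include <stdio.h>
#include <string.h>

/* What slurm_xlator.h does, unconditionally: */
#define strlcpy slurm_strlcpy

/* What strlcpy.h effectively declares (the macro renames it too): */
size_t strlcpy(char *dst, const char *src, size_t siz);

/* What strlcpy.c does: the definition only exists when the platform
 * lacks its own strlcpy(), so with HAVE_STRLCPY=1 no slurm_strlcpy
 * symbol is ever produced. */
#if (!HAVE_STRLCPY)
size_t slurm_strlcpy(char *dst, const char *src, size_t siz)
{
    size_t n = strlen(src);
    if (siz > 0) {
        size_t c = (n >= siz) ? siz - 1 : n;
        memcpy(dst, src, c);
        dst[c] = '\0';
    }
    return n;
}
#endif

int main(void)
{
    char buf[16];
    /* In a plugin .so this unresolved call isn't caught at link time,
     * only when slurmd loads the plugin. */
    strlcpy(buf, "compute-001", sizeof(buf));
    puts(buf);
    return 0;
}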
Here's a quick fix:
--- src/common/slurm_xlator.h.orig	2019-04-12 04:20:25 UTC
+++ src/common/slurm_xlator.h
@@ -299,7 +299,9 @@
  * The header file used only for #define values. */
 
 /* strlcpy.[ch] functions */
+#if (!HAVE_STRLCPY)	// Match this to src/common/strlcpy.c
 #define strlcpy slurm_strlcpy
+#endif
 
 /* switch.[ch] functions
  * None exported today.
--
Earth is a beta site.