[slurm-users] Task launch failure on cloud nodes (Address family '0' not supported)
Weaver, Christopher
weave132 at msu.edu
Tue May 23 20:02:37 UTC 2023
I'm working on setting up a cloud partition and running into some communication problems between my nodes. This looks like something I have misconfigured, or information I haven't supplied to slurm correctly, but the low-level nature of the error has made it hard for me to figure out what I've done wrong.
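For context, the relevant parts of my slurm.conf look roughly like this; the node names and CPU counts match what shows up in the logs below, but the script paths and timeouts are simplified, so treat this as an approximation of my setup rather than an exact copy:

# Cloud nodes have no fixed address until a VM is actually booted
NodeName=ec[0-1] CPUs=4 State=CLOUD
PartitionName=cloud Nodes=ec[0-1] State=UP

# Power saving / cloud bursting hooks (paths illustrative)
ResumeProgram=/usr/local/sbin/ec_resume.sh
SuspendProgram=/usr/local/sbin/ec_suspend.sh
ResumeTimeout=600
SuspendTime=300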
I have a batch script which is essentially:
#!/bin/sh
#SBATCH --time=2
#SBATCH --partition=cloud
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
srun -vvvvv --slurmd-debug=verbose singularity exec my-image.sif some args
When I submit this with `sbatch`, two 4-core VM nodes are started up as expected, the batch script is sent to one of them, and it begins executing the `srun`. The `srun` seems to allocate the necessary job step, but then fails when trying to communicate with the nodes in the allocation to start the tasks:
srun: jobid 320: nodes(2):`ec[0-1]', cpu counts: 4(x2)
srun: debug2: creating job with 8 tasks
srun: debug: requesting job 320, user 1000, nodes 2 including ((null))
srun: debug: cpus 8, tasks 8, name singularity, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi type = (null)
srun: debug: mpi/none: p_mpi_hook_client_prelaunch: Using mpi/none
srun: debug: Entering _msg_thr_create()
srun: debug4: eio: handling events for 2 objects
srun: debug3: eio_message_socket_readable: shutdown 0 fd 9
srun: debug3: eio_message_socket_readable: shutdown 0 fd 5
srun: debug: initialized stdio listening socket, port 43793
srun: debug: Started IO server thread (139796182816512)
srun: debug: Entering _launch_tasks
srun: debug3: IO thread pid = 1507
srun: debug4: eio: handling events for 4 objects
srun: debug2: Called _file_readable
srun: debug3: false, all ioservers not yet initialized
srun: launching StepId=320.0 on host ec0, 4 tasks: [0-3]
srun: debug3: uid:1000 gid:1000 cwd:/tmp/job-320 0
srun: launching StepId=320.0 on host ec1, 4 tasks: [4-7]
srun: debug3: uid:1000 gid:1000 cwd:/tmp/job-320 1
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: eof is false
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: eof is false
srun: debug3: Called _listening_socket_readable
srun: debug3: Trying to load plugin /usr/lib64/slurm/route_default.so
srun: route/default: init: route default plugin loaded
srun: debug3: Success.
srun: debug3: Tree sending to ec0
srun: debug2: Tree head got back 0 looking for 2
srun: debug3: Tree sending to ec1
srun: error: slurm_get_port: Address family '0' not supported
srun: error: Error connecting, bad data: family = 0, port = 0
srun: debug3: problems with ec1
srun: error: slurm_get_port: Address family '0' not supported
srun: error: Error connecting, bad data: family = 0, port = 0
srun: debug3: problems with ec0
srun: debug2: Tree head got back 2
srun: debug: launch returned msg_rc=1001 err=1001 type=9001
srun: error: Task launch for StepId=320.0 failed on node ec1: Communication connection failure
srun: debug: launch returned msg_rc=1001 err=1001 type=9001
srun: error: Task launch for StepId=320.0 failed on node ec0: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
It looks like the problem is an inability to obtain correct addresses for the nodes in order to send data to them. Rather than a failure to resolve the hostnames to addresses via DNS (which should work on these nodes), it appears that the slurm code in `srun` believes it already has addresses and attempts to use them, even though they are in some uninitialized or partially initialized state (`ss_family` == 0).
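In case it's relevant: my understanding is that for State=CLOUD nodes the controller only learns each node's address when it is registered at resume time, so my resume script pushes the freshly assigned IP with scontrol along these lines (the address lookup is specific to my cloud provider, so `my_cloud_cli` below is just a placeholder for that step):

# Illustrative excerpt from the ResumeProgram; $1 is the hostlist Slurm passes in
for node in $(scontrol show hostnames "$1"); do
    # placeholder for the provider-specific lookup of the VM's address
    ip=$(my_cloud_cli get-ip "$node")
    scontrol update NodeName="$node" NodeAddr="$ip" NodeHostName="$node"
done

While the VMs are up, `scontrol show node ec0` does seem like the place to confirm what NodeAddr/NodeHostName the controller has recorded; if there is a better way to register these addresses (or a knob I should be using instead), I'd welcome a pointer.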