[slurm-users] Task launch failure on cloud nodes (Address family '0' not supported)
Weaver, Christopher
weave132 at msu.edu
Tue May 23 20:02:37 UTC 2023
I'm working on setting up a cloud partition and running into some communication problems between my nodes. This looks like something I have misconfigured, or information I haven't supplied to slurm correctly, but the low-level nature of the error has made it hard for me to figure out what I've done wrong.
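For context, the relevant parts of my slurm.conf look roughly like this; the node names and CPU counts match what shows up in the logs below, but the script paths and timeouts are simplified, so treat this as an approximation of my setup rather than an exact copy:

# Cloud nodes have no fixed address until a VM is actually booted
NodeName=ec[0-1] CPUs=4 State=CLOUD
PartitionName=cloud Nodes=ec[0-1] State=UP

# Power saving / cloud bursting hooks (paths illustrative)
ResumeProgram=/usr/local/sbin/ec_resume.sh
SuspendProgram=/usr/local/sbin/ec_suspend.sh
ResumeTimeout=600
SuspendTime=300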
I have a batch script which is essentially:
#!/bin/sh
#SBATCH --time=2
#SBATCH --partition=cloud
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
srun -vvvvv --slurmd-debug=verbose singularity exec my-image.sif some args
When I submit this with `sbatch`, two 4-core VM nodes are started up as expected, the batch script is sent to one of them, and it begins executing the `srun`. The `srun` seems to allocate the necessary job step, but then fails when trying to communicate with the nodes in the allocation to start the tasks:
srun: jobid 320: nodes(2):`ec[0-1]', cpu counts: 4(x2)
srun: debug2: creating job with 8 tasks
srun: debug: requesting job 320, user 1000, nodes 2 including ((null))
srun: debug: cpus 8, tasks 8, name singularity, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi type = (null)
srun: debug: mpi/none: p_mpi_hook_client_prelaunch: Using mpi/none
srun: debug: Entering _msg_thr_create()
srun: debug4: eio: handling events for 2 objects
srun: debug3: eio_message_socket_readable: shutdown 0 fd 9
srun: debug3: eio_message_socket_readable: shutdown 0 fd 5
srun: debug: initialized stdio listening socket, port 43793
srun: debug: Started IO server thread (139796182816512)
srun: debug: Entering _launch_tasks
srun: debug3: IO thread pid = 1507
srun: debug4: eio: handling events for 4 objects
srun: debug2: Called _file_readable
srun: debug3: false, all ioservers not yet initialized
srun: launching StepId=320.0 on host ec0, 4 tasks: [0-3]
srun: debug3: uid:1000 gid:1000 cwd:/tmp/job-320 0
srun: launching StepId=320.0 on host ec1, 4 tasks: [4-7]
srun: debug3: uid:1000 gid:1000 cwd:/tmp/job-320 1
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: eof is false
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: eof is false
srun: debug3: Called _listening_socket_readable
srun: debug3: Trying to load plugin /usr/lib64/slurm/route_default.so
srun: route/default: init: route default plugin loaded
srun: debug3: Success.
srun: debug3: Tree sending to ec0
srun: debug2: Tree head got back 0 looking for 2
srun: debug3: Tree sending to ec1
srun: error: slurm_get_port: Address family '0' not supported
srun: error: Error connecting, bad data: family = 0, port = 0
srun: debug3: problems with ec1
srun: error: slurm_get_port: Address family '0' not supported
srun: error: Error connecting, bad data: family = 0, port = 0
srun: debug3: problems with ec0
srun: debug2: Tree head got back 2
srun: debug: launch returned msg_rc=1001 err=1001 type=9001
srun: error: Task launch for StepId=320.0 failed on node ec1: Communication connection failure
srun: debug: launch returned msg_rc=1001 err=1001 type=9001
srun: error: Task launch for StepId=320.0 failed on node ec0: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
It looks like the problem is an inability to obtain correct addresses for the nodes in order to send data to them. Rather than a failure to resolve the hostnames to addresses via DNS (which should work on these nodes), it appears that the slurm code in `srun` believes it already has addresses and attempts to use them, even though they are in some uninitialized or partially initialized state (`ss_family` == 0).
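In case it's relevant: my understanding is that for State=CLOUD nodes the controller only learns each node's address when it is registered at resume time, so my resume script pushes the freshly assigned IP with scontrol along these lines (the address lookup is specific to my cloud provider, so `my_cloud_cli` below is just a placeholder for that step):

# Illustrative excerpt from the ResumeProgram; $1 is the hostlist Slurm passes in
for node in $(scontrol show hostnames "$1"); do
    # placeholder for the provider-specific lookup of the VM's address
    ip=$(my_cloud_cli get-ip "$node")
    scontrol update NodeName="$node" NodeAddr="$ip" NodeHostName="$node"
done

While the VMs are up, `scontrol show node ec0` does seem like the place to confirm what NodeAddr/NodeHostName the controller has recorded; if there is a better way to register these addresses (or a knob I should be using instead), I'd welcome a pointer.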