[slurm-users] Problem launching interactive jobs using srun

Nicholas McCollum nmccollum at asc.edu
Fri Mar 9 13:58:04 MST 2018


Connection refused makes me think it's a firewall issue.

Assuming this is a test environment, could you try on the compute node:

# iptables-save > iptables.bak
# iptables -F && iptables -X

Then test to see if it works.  To restore the firewall afterwards, use:

# iptables-restore < iptables.bak
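
If you'd rather see what is rejecting the traffic before flushing anything,
listing the rules with packet counters can also help (just a suggestion,
assuming plain iptables is in use):

# iptables -L -n -v --line-numbers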

If the node runs firewalld instead, you may have to stop it for the test
and start it again afterwards:

# systemctl stop firewalld
# systemctl start firewalld
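
If flushing the rules (or stopping firewalld) makes the interactive job
start, a longer-term fix is to open the ports Slurm needs rather than
leaving the firewall down.  A sketch, assuming firewalld and that you pin
srun's ephemeral ports with SrunPortRange in slurm.conf (the 60001-63000
range is only an example; 6818 is the default SlurmdPort):

# firewall-cmd --permanent --add-port=6818/tcp
# firewall-cmd --permanent --add-port=60001-63000/tcp
# firewall-cmd --reload

Judging from the log below, the refused connections are slurmstepd trying
to reach ports on 10.141.21.202, so whatever range you pick would also
need to be open on the host where salloc/srun is running.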

---

Nicholas McCollum - HPC Systems Expert
Alabama Supercomputer Authority - CSRA

On 03/09/2018 02:45 PM, Andy Georges wrote:
> Hi all,
> 
> Cranked up the debug level a bit
> 
> Job was not started when using:
> 
> vsc40075 at test2802 (banette) ~> /bin/salloc -N1 -n1 /bin/srun --pty bash -i
> salloc: Granted job allocation 42
> salloc: Waiting for resource configuration
> salloc: Nodes node2801 are ready for job
> 
> For comparison purposes, running this on the master (head?) node:
> 
> vsc40075 at master23 () ~> /bin/salloc -N1 -n1 /bin/srun --pty bash -i
> salloc: Granted job allocation 43
> salloc: Waiting for resource configuration
> salloc: Nodes node2801 are ready for job
> vsc40075 at node2801 () ~>
> 
> 
> Below is some more debug output from the hanging job.
> 
> Kind regards,
> — Andy
> 
> [2018-03-09T21:27:52.251] [42.0] debug:  _oom_event_monitor: started.
> [2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
> [2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
> [2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
> [2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
> [2018-03-09T21:27:52.251] [42.0] debug2: Entering _setup_normal_io
> [2018-03-09T21:27:52.251] [42.0] debug:    stdin uses a pty object
> [2018-03-09T21:27:52.251] [42.0] debug:  init pty size 23:119
> [2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
> [2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
> [2018-03-09T21:27:52.252] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:33698: Connection refused
> [2018-03-09T21:27:52.252] [42.0] error: slurm_open_msg_conn(pty_conn) 10.141.21.202,33698: Connection refused
> [2018-03-09T21:27:52.252] [42.0] debug4: adding IO connection (logical node rank 0)
> [2018-03-09T21:27:52.252] [42.0] debug4: connecting IO back to 10.141.21.202:33759
> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
> [2018-03-09T21:27:52.252] [42.0] debug3: Error connecting, picking new stream port
> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
> [2018-03-09T21:27:52.252] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:33759: Connection refused
> [2018-03-09T21:27:52.252] [42.0] error: connect io: Connection refused
> [2018-03-09T21:27:52.252] [42.0] debug2: Leaving  _setup_normal_io
> [2018-03-09T21:27:52.253] [42.0] error: IO setup failed: Connection refused
> [2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_param: parameter 'freezer.state' set to 'THAWED' for '/sys/fs/cgroup/freezer/slurm/uid_2540075/job_42/step_0'
> [2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '6414' for '/sys/fs/cgroup/freezer'
> [2018-03-09T21:27:52.253] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_2540075/job_42): Device or resource busy
> [2018-03-09T21:27:52.253] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_2540075): Device or resource busy
> [2018-03-09T21:27:52.253] [42.0] debug:  step_terminate_monitor_stop signaling condition
> [2018-03-09T21:27:52.253] [42.0] debug4: eio: handling events for 1 objects
> [2018-03-09T21:27:52.253] [42.0] debug3: Called _msg_socket_readable
> [2018-03-09T21:27:52.253] [42.0] debug2: step_terminate_monitor will run for 60 secs
> [2018-03-09T21:27:52.253] [42.0] debug2: step_terminate_monitor is stopping
> [2018-03-09T21:27:52.253] [42.0] debug2: Sending SIGKILL to pgid 6414
> [2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '6414' for '/sys/fs/cgroup/cpuset'
> [2018-03-09T21:27:52.265] [42.0] debug3: Took 1038 checks before stepd pid was removed from the step cgroup.
> [2018-03-09T21:27:52.265] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuset/slurm/uid_2540075/job_42): Device or resource busy
> [2018-03-09T21:27:52.265] [42.0] debug2: task/cgroup: not removing job cpuset : Device or resource busy
> [2018-03-09T21:27:52.265] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuset/slurm/uid_2540075): Device or resource busy
> [2018-03-09T21:27:52.265] [42.0] debug2: task/cgroup: not removing user cpuset : Device or resource busy
> [2018-03-09T21:27:52.315] [42.0] debug3: _oom_event_monitor: res: 1
> [2018-03-09T21:27:52.315] [42.0] _oom_event_monitor: oom-kill event count: 1
> [2018-03-09T21:27:52.315] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_2540075/job_42): Device or resource busy
> [2018-03-09T21:27:52.315] [42.0] debug2: task/cgroup: not removing job memcg : Device or resource busy
> [2018-03-09T21:27:52.315] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_2540075): Device or resource busy
> [2018-03-09T21:27:52.315] [42.0] debug2: task/cgroup: not removing user memcg : Device or resource busy
> [2018-03-09T21:27:52.315] [42.0] debug2: Before call to spank_fini()
> [2018-03-09T21:27:52.315] [42.0] debug2: After call to spank_fini()
> [2018-03-09T21:27:52.315] [42.0] error: job_manager exiting abnormally, rc = 4021
> [2018-03-09T21:27:52.315] [42.0] debug:  Sending launch resp rc=4021
> [2018-03-09T21:27:52.315] [42.0] debug2: slurm_connect failed: Connection refused
> [2018-03-09T21:27:52.315] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:37053: Connection refused
> [2018-03-09T21:27:52.315] [42.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection refused
> [2018-03-09T21:27:52.315] [42.0] debug2: Rank 0 has no children slurmstepd
> [2018-03-09T21:27:52.315] [42.0] debug2: _one_step_complete_msg: first=0, last=0
> [2018-03-09T21:27:52.315] [42.0] debug3: Rank 0 sending complete to slurmctld, range 0 to 0
> [2018-03-09T21:27:52.317] [42.0] debug4: eio: handling events for 1 objects
> [2018-03-09T21:27:52.317] [42.0] debug3: Called _msg_socket_readable
> [2018-03-09T21:27:52.317] [42.0] debug2:   false, shutdown
> [2018-03-09T21:27:52.317] [42.0] debug:  Message thread exited
> [2018-03-09T21:27:52.317] [42.0] done with job
> 


