[slurm-users] Problem launching interactive jobs using srun

Andy Georges Andy.Georges at UGent.be
Fri Mar 9 13:45:57 MST 2018


Hi all,

I cranked up the debug level a bit.

The job was not started when using:

vsc40075@test2802 (banette) ~> /bin/salloc -N1 -n1 /bin/srun --pty bash -i
salloc: Granted job allocation 42
salloc: Waiting for resource configuration
salloc: Nodes node2801 are ready for job

For comparison, running the same command on the master (head?) node works:

vsc40075@master23 () ~> /bin/salloc -N1 -n1 /bin/srun --pty bash -i
salloc: Granted job allocation 43
salloc: Waiting for resource configuration
salloc: Nodes node2801 are ready for job
vsc40075@node2801 () ~>
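
While the job hangs on test2802, a quick sanity check is to see which ports srun is listening on there and whether node2801 can reach them at all; a sketch (the port number below is only an example taken from the log, the real ones change per job, and the check on node2801 would have to be done via ssh as an admin):

vsc40075@test2802 (banette) ~> ss -tlnp | grep srun
root@node2801 ~> nc -zv 10.141.21.202 33698

If those connections are refused from the compute node while srun is clearly listening on the submit host, a firewall between the two would be my first suspect.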


Below is some more debug output from the hanging job.

Kind regards,
— Andy

[2018-03-09T21:27:52.251] [42.0] debug:  _oom_event_monitor: started.
[2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
[2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
[2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
[2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
[2018-03-09T21:27:52.251] [42.0] debug2: Entering _setup_normal_io
[2018-03-09T21:27:52.251] [42.0] debug:    stdin uses a pty object
[2018-03-09T21:27:52.251] [42.0] debug:  init pty size 23:119
[2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
[2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
[2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:33698: Connection refused
[2018-03-09T21:27:52.252] [42.0] error: slurm_open_msg_conn(pty_conn) 10.141.21.202,33698: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug4: adding IO connection (logical node rank 0)
[2018-03-09T21:27:52.252] [42.0] debug4: connecting IO back to 10.141.21.202:33759
[2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug3: Error connecting, picking new stream port
[2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:33759: Connection refused
[2018-03-09T21:27:52.252] [42.0] error: connect io: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug2: Leaving  _setup_normal_io
[2018-03-09T21:27:52.253] [42.0] error: IO setup failed: Connection refused
[2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_param: parameter 'freezer.state' set to 'THAWED' for '/sys/fs/cgroup/freezer/slurm/uid_2540075/job_42/step_0'
[2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '6414' for '/sys/fs/cgroup/freezer'
[2018-03-09T21:27:52.253] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_2540075/job_42): Device or resource busy
[2018-03-09T21:27:52.253] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_2540075): Device or resource busy
[2018-03-09T21:27:52.253] [42.0] debug:  step_terminate_monitor_stop signaling condition
[2018-03-09T21:27:52.253] [42.0] debug4: eio: handling events for 1 objects
[2018-03-09T21:27:52.253] [42.0] debug3: Called _msg_socket_readable
[2018-03-09T21:27:52.253] [42.0] debug2: step_terminate_monitor will run for 60 secs
[2018-03-09T21:27:52.253] [42.0] debug2: step_terminate_monitor is stopping
[2018-03-09T21:27:52.253] [42.0] debug2: Sending SIGKILL to pgid 6414
[2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '6414' for '/sys/fs/cgroup/cpuset'
[2018-03-09T21:27:52.265] [42.0] debug3: Took 1038 checks before stepd pid was removed from the step cgroup.
[2018-03-09T21:27:52.265] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuset/slurm/uid_2540075/job_42): Device or resource busy
[2018-03-09T21:27:52.265] [42.0] debug2: task/cgroup: not removing job cpuset : Device or resource busy
[2018-03-09T21:27:52.265] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuset/slurm/uid_2540075): Device or resource busy
[2018-03-09T21:27:52.265] [42.0] debug2: task/cgroup: not removing user cpuset : Device or resource busy
[2018-03-09T21:27:52.315] [42.0] debug3: _oom_event_monitor: res: 1
[2018-03-09T21:27:52.315] [42.0] _oom_event_monitor: oom-kill event count: 1
[2018-03-09T21:27:52.315] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_2540075/job_42): Device or resource busy
[2018-03-09T21:27:52.315] [42.0] debug2: task/cgroup: not removing job memcg : Device or resource busy
[2018-03-09T21:27:52.315] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_2540075): Device or resource busy
[2018-03-09T21:27:52.315] [42.0] debug2: task/cgroup: not removing user memcg : Device or resource busy
[2018-03-09T21:27:52.315] [42.0] debug2: Before call to spank_fini()
[2018-03-09T21:27:52.315] [42.0] debug2: After call to spank_fini()
[2018-03-09T21:27:52.315] [42.0] error: job_manager exiting abnormally, rc = 4021
[2018-03-09T21:27:52.315] [42.0] debug:  Sending launch resp rc=4021
[2018-03-09T21:27:52.315] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.315] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:37053: Connection refused
[2018-03-09T21:27:52.315] [42.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection refused
[2018-03-09T21:27:52.315] [42.0] debug2: Rank 0 has no children slurmstepd
[2018-03-09T21:27:52.315] [42.0] debug2: _one_step_complete_msg: first=0, last=0
[2018-03-09T21:27:52.315] [42.0] debug3: Rank 0 sending complete to slurmctld, range 0 to 0
[2018-03-09T21:27:52.317] [42.0] debug4: eio: handling events for 1 objects
[2018-03-09T21:27:52.317] [42.0] debug3: Called _msg_socket_readable
[2018-03-09T21:27:52.317] [42.0] debug2:   false, shutdown
[2018-03-09T21:27:52.317] [42.0] debug:  Message thread exited
[2018-03-09T21:27:52.317] [42.0] done with job
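
The repeated "Connection refused" when slurmstepd tries to connect back to 10.141.21.202 (which I assume is test2802) for the pty and I/O channels makes me suspect that node2801 simply cannot reach the ephemeral ports srun listens on when it is launched from that workstation rather than from the master. If a host firewall on the submit side turns out to be the problem, one way to test it would be to pin srun to a fixed port range in slurm.conf and open only that range; a sketch, with an arbitrary example range:

# slurm.conf on all hosts that run srun (the range is just an example)
SrunPortRange=60001-63000

# on test2802, open that range towards the compute network (firewalld example)
firewall-cmd --permanent --add-port=60001-63000/tcp
firewall-cmd --reload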