[slurm-users] 'srun hostname' hangs on the command line

Buckley, Ronan Ronan.Buckley at Dell.com
Tue Jul 17 05:00:01 MDT 2018


Hi Carlos, Is there a way to test that? Are there certain ports that need to be open? Thanks.

From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Carlos Fenoy
Sent: Tuesday, July 17, 2018 11:55 AM
To: Slurm User Community List
Subject: Re: [slurm-users] 'srun hostname' hangs on the command line

The communication from the compute nodes to the login nodes may be blocked by the firewall. That would prevent srun from running properly.
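
One way to test this, as a minimal sketch: srun on the login node waits for connections back from the compute node, on ephemeral ports unless SrunPortRange is set in slurm.conf. Assuming you can start a temporary listener on the login node on a port of your choosing (for example with nc), the Python below, run from the compute node, reports whether a TCP connection gets through. The function names, host name and port numbers are illustrative and not part of Slurm.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable or unknown hosts.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to True (reachable) or False (blocked or closed)."""
    return {port: can_connect(host, port, timeout) for port in ports}


# Example use from a compute node (hypothetical host name and port range):
#   check_ports("login01.example.com", range(60001, 60004))
```

A False result only says the connection did not succeed. With no listener on that port it will also be False, so start the listener first. The same check can be pointed at the slurmctld and slurmd ports (6817 and 6818 by default, unless changed in slurm.conf) to rule those out.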

On 17 Jul 2018, at 10:16, John Hearns <hearnsj at googlemail.com> wrote:
Ronan, as far as I can see this means that you cannot launch a job.

What state are the compute nodes in when you run sinfo?


On 17 July 2018 at 10:08, Buckley, Ronan <Ronan.Buckley at dell.com> wrote:
Yes, srun just hangs. Commands like sinfo and squeue run fine.
I also have no Slurm logs in /var/log, which seems odd.

From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of John Hearns
Sent: Tuesday, July 17, 2018 8:57 AM
To: Slurm User Community List
Subject: Re: [slurm-users] 'srun hostname' hangs on the command line

Ronan, sorry to ask but this is a bit unclear.

Are you unable to launch ANY sessions with srun?
If so, you need to look at the logs to see why the job is not being scheduled.

Is it only the hostname command which fails?

I would guess you have already ssh'd into a node and run the hostname command manually.



On 17 July 2018 at 09:50, Buckley, Ronan <Ronan.Buckley at dell.com> wrote:
Yes I do.

From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Williams, Gareth (IM&T, Clayton)
Sent: Tuesday, July 17, 2018 12:33 AM
To: Slurm User Community List
Subject: Re: [slurm-users] 'srun hostname' hangs on the command line

Do you get the same problem as a non-root user?

From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Buckley, Ronan
Sent: Tuesday, 17 July 2018 12:53 AM
To: slurm-users at lists.schedmd.com
Subject: [slurm-users] 'srun hostname' hangs on the command line

Hi All,

Verbose mode doesn’t show much.
I have replaced the hostnames with hashes (#######).
Any ideas/suggestions?

# srun hostname
^Csrun: interrupt (one more within 1 sec to abort)
srun: task 0: unknown
^Z
[1]+  Stopped                 srun hostname
#

# srun -v hostname
srun: defined options for program `srun'
srun: --------------- ---------------------
srun: user           : `root'
srun: uid            : 0
srun: gid            : 0
srun: cwd            : /root
srun: ntasks         : 1 (default)
srun: nodes          : 1 (default)
srun: jobid          : 4294967294 (default)
srun: partition      : default
srun: profile        : `NotSet'
srun: job name       : `(null)'
srun: reservation    : `(null)'
srun: burst_buffer   : `(null)'
srun: wckey          : `(null)'
srun: cpu_freq_min   : 4294967294
srun: cpu_freq_max   : 4294967294
srun: cpu_freq_gov   : 4294967294
srun: switches       : -1
srun: wait-for-switches : -1
srun: distribution   : unknown
srun: cpu_bind       : default (0)
srun: mem_bind       : default (0)
srun: verbose        : 1
srun: slurmd_debug   : 0
srun: immediate      : false
srun: label output   : false
srun: unbuffered IO  : false
srun: overcommit     : false
srun: threads        : 60
srun: checkpoint_dir : /var/slurm/checkpoint
srun: wait           : 0
srun: nice           : -2
srun: account        : (null)
srun: comment        : (null)
srun: dependency     : (null)
srun: exclusive      : false
srun: bcast          : false
srun: qos            : (null)
srun: constraints    :
srun: geometry       : (null)
srun: reboot         : yes
srun: rotate         : no
srun: preserve_env   : false
srun: network        : (null)
srun: propagate      : NONE
srun: prolog         : (null)
srun: epilog         : (null)
srun: mail_type      : NONE
srun: mail_user      : (null)
srun: task_prolog    : (null)
srun: task_epilog    : (null)
srun: multi_prog     : no
srun: sockets-per-node  : -2
srun: cores-per-socket  : -2
srun: threads-per-core  : -2
srun: ntasks-per-node   : -2
srun: ntasks-per-socket : -2
srun: ntasks-per-core   : -2
srun: plane_size        : 4294967294
srun: core-spec         : NA
srun: power             :
srun: remote command    : `hostname'
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: Nodes ####### are ready for job
srun: jobid 50871: nodes(1):`#######', cpu counts: 64(x1)
srun: launching 50871.0 on host #######, 1 tasks: 0
srun: route default plugin loaded
srun: error: timeout waiting for task launch, started 0 of 1 tasks
srun: Job step 50871.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
#

Rgds


