[slurm-users] 'srun hostname' hangs on the command line

John Hearns hearnsj at googlemail.com
Tue Jul 17 02:16:21 MDT 2018


Ronan, as far as I can see this means that you cannot launch a job.

What state are the compute nodes in when you run sinfo?
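
For reference, here is a minimal sketch of one way to check this from the head node. It assumes sinfo is on your PATH; the helper name unresponsive_nodes is made up for illustration and is not part of Slurm. sinfo marks a node that is not responding with a trailing * on its state (for example down*), and the sketch simply lists those nodes.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): reads "nodename state" pairs on
# stdin and prints the names of nodes whose state ends in "*", the suffix
# sinfo uses for nodes that are not responding.
unresponsive_nodes() {
    awk '$2 ~ /\*$/ { print $1 }'
}

# Only query the cluster when sinfo is actually installed.
if command -v sinfo >/dev/null 2>&1; then
    # -N: one line per node, -h: no header, %N: node name, %t: compact state
    sinfo -N -h -o '%N %t' | unresponsive_nodes
fi
```

If the node from the failed job shows up here, slurmd on that node is probably not reachable; running scontrol show node with that node's name should show a Reason field when the controller has recorded one.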


On 17 July 2018 at 10:08, Buckley, Ronan <Ronan.Buckley at dell.com> wrote:

> Yes, srun just hangs. Commands like sinfo and squeue run fine.
>
> I also have no slurm logs in /var/log ??
>
>
>
> *From:* slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] *On
> Behalf Of *John Hearns
> *Sent:* Tuesday, July 17, 2018 8:57 AM
>
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] 'srun hostname' hangs on the command line
>
>
>
> Ronan, sorry to ask but this is a bit unclear.
>
>
>
> Are you unable to launch ANY sessions with srun?
>
> In which case you need to look at the logs to see why the job is not being
> scheduled.
>
>
>
> Is it only the hostname command which fails?
>
>
>
> I would guess that you have already ssh'd into a node and run the
> hostname command manually.
>
>
>
>
>
>
>
> On 17 July 2018 at 09:50, Buckley, Ronan <Ronan.Buckley at dell.com> wrote:
>
> Yes I do.
>
>
>
> *From:* slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] *On
> Behalf Of *Williams, Gareth (IM&T, Clayton)
> *Sent:* Tuesday, July 17, 2018 12:33 AM
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] 'srun hostname' hangs on the command line
>
>
>
> Do you get the same problem as a non-root user?
>
>
>
> *From:* slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] *On
> Behalf Of *Buckley, Ronan
> *Sent:* Tuesday, 17 July 2018 12:53 AM
> *To:* slurm-users at lists.schedmd.com
> *Subject:* [slurm-users] 'srun hostname' hangs on the command line
>
>
>
> Hi All,
>
>
>
> Verbose mode doesn’t show much.
>
> I have masked the hostnames with # characters.
>
> Any ideas/suggestions?
>
>
>
> # srun hostname
> ^Csrun: interrupt (one more within 1 sec to abort)
> srun: task 0: unknown
> ^Z
> [1]+  Stopped                 srun hostname
> #
>
> # srun -v hostname
> srun: defined options for program `srun'
> srun: --------------- ---------------------
> srun: user           : `root'
> srun: uid            : 0
> srun: gid            : 0
> srun: cwd            : /root
> srun: ntasks         : 1 (default)
> srun: nodes          : 1 (default)
> srun: jobid          : 4294967294 (default)
> srun: partition      : default
> srun: profile        : `NotSet'
> srun: job name       : `(null)'
> srun: reservation    : `(null)'
> srun: burst_buffer   : `(null)'
> srun: wckey          : `(null)'
> srun: cpu_freq_min   : 4294967294
> srun: cpu_freq_max   : 4294967294
> srun: cpu_freq_gov   : 4294967294
> srun: switches       : -1
> srun: wait-for-switches : -1
> srun: distribution   : unknown
> srun: cpu_bind       : default (0)
> srun: mem_bind       : default (0)
> srun: verbose        : 1
> srun: slurmd_debug   : 0
> srun: immediate      : false
> srun: label output   : false
> srun: unbuffered IO  : false
> srun: overcommit     : false
> srun: threads        : 60
> srun: checkpoint_dir : /var/slurm/checkpoint
> srun: wait           : 0
> srun: nice           : -2
> srun: account        : (null)
> srun: comment        : (null)
> srun: dependency     : (null)
> srun: exclusive      : false
> srun: bcast          : false
> srun: qos            : (null)
> srun: constraints    :
> srun: geometry       : (null)
> srun: reboot         : yes
> srun: rotate         : no
> srun: preserve_env   : false
> srun: network        : (null)
> srun: propagate      : NONE
> srun: prolog         : (null)
> srun: epilog         : (null)
> srun: mail_type      : NONE
> srun: mail_user      : (null)
> srun: task_prolog    : (null)
> srun: task_epilog    : (null)
> srun: multi_prog     : no
> srun: sockets-per-node  : -2
> srun: cores-per-socket  : -2
> srun: threads-per-core  : -2
> srun: ntasks-per-node   : -2
> srun: ntasks-per-socket : -2
> srun: ntasks-per-core   : -2
> srun: plane_size        : 4294967294
> srun: core-spec         : NA
> srun: power             :
> srun: remote command    : `hostname'
> srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
> srun: Nodes ####### are ready for job
> srun: jobid 50871: nodes(1):`#######', cpu counts: 64(x1)
> srun: launching 50871.0 on host #######, 1 tasks: 0
> srun: route default plugin loaded
> srun: error: timeout waiting for task launch, started 0 of 1 tasks
> srun: Job step 50871.0 aborted before step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
> #
>
>
>
> Rgds
>
>
>
>
>