[slurm-users] 'srun hostname' hangs on the command line
John Hearns
hearnsj at googlemail.com
Tue Jul 17 02:16:21 MDT 2018
Ronan, as far as I can see this means that you cannot launch a job.
What state are the compute nodes in when you run sinfo?
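
A minimal sketch of the checks that would answer this, assuming the standard Slurm client tools (sinfo, scontrol) are on the PATH of the machine you run it from. The function name slurm_triage and its node-name argument are made up for illustration. The last command prints where the daemons are configured to log, which may explain the empty /var/log; if no log file is configured, Slurm logs through syslog instead.

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption, not part of Slurm):
# print node states, the reasons for any down/drained nodes, and the
# configured daemon log locations.
slurm_triage() {
    node="$1"   # optional: name of the node the job was allocated

    # Bail out early if the Slurm client tools are not installed here.
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "sinfo not found on PATH" >&2
        return 1
    fi

    # Node-oriented, long listing: one line per node with its state.
    sinfo -N -l

    # Reasons recorded for nodes that are down, drained or failing.
    sinfo -R

    # Full detail for one node, if a name was given.
    if [ -n "$node" ]; then
        scontrol show node "$node"
    fi

    # Where slurmctld and slurmd are configured to write their logs.
    scontrol show config | grep -i -E 'SlurmctldLogFile|SlurmdLogFile'
}
```

Run it as "slurm_triage <nodename>", using the node name that srun -v reported for the job.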
On 17 July 2018 at 10:08, Buckley, Ronan <Ronan.Buckley at dell.com> wrote:
> Yes, srun just hangs. Commands like sinfo and squeue run fine.
>
> I also have no slurm logs in /var/log ??
>
>
>
> *From:* slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] *On
> Behalf Of *John Hearns
> *Sent:* Tuesday, July 17, 2018 8:57 AM
>
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] 'srun hostname' hangs on the command line
>
>
>
> Ronan, sorry to ask but this is a bit unclear.
>
>
>
> Are you unable to launch ANY sessions with srun?
>
> In which case you need to look at the logs to see why the job is not being
> scheduled.
>
>
>
> Is it only the hostname command which fails?
>
>
>
> I strongly suspect you have already logged into a node over ssh and run
> the hostname command manually.
>
>
>
>
>
>
>
> On 17 July 2018 at 09:50, Buckley, Ronan <Ronan.Buckley at dell.com> wrote:
>
> Yes I do.
>
>
>
> *From:* slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] *On
> Behalf Of *Williams, Gareth (IM&T, Clayton)
> *Sent:* Tuesday, July 17, 2018 12:33 AM
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] 'srun hostname' hangs on the command line
>
>
>
> Do you get the same problem as a non-root user?
>
>
>
> *From:* slurm-users [mailto:slurm-users-bounces at lists.schedmd.com
> <slurm-users-bounces at lists.schedmd.com>] *On Behalf Of *Buckley, Ronan
> *Sent:* Tuesday, 17 July 2018 12:53 AM
> *To:* slurm-users at lists.schedmd.com
> *Subject:* [slurm-users] 'srun hostname' hangs on the command line
>
>
>
> Hi All,
>
>
>
> Verbose mode doesn’t show much.
>
> I hashed out the hostnames.
>
> Any ideas/suggestions?
>
>
>
> # srun hostname
> ^Csrun: interrupt (one more within 1 sec to abort)
> srun: task 0: unknown
> ^Z
> [1]+ Stopped srun hostname
> #
>
> # srun -v hostname
> srun: defined options for program `srun'
> srun: --------------- ---------------------
> srun: user : `root'
> srun: uid : 0
> srun: gid : 0
> srun: cwd : /root
> srun: ntasks : 1 (default)
> srun: nodes : 1 (default)
> srun: jobid : 4294967294 (default)
> srun: partition : default
> srun: profile : `NotSet'
> srun: job name : `(null)'
> srun: reservation : `(null)'
> srun: burst_buffer : `(null)'
> srun: wckey : `(null)'
> srun: cpu_freq_min : 4294967294
> srun: cpu_freq_max : 4294967294
> srun: cpu_freq_gov : 4294967294
> srun: switches : -1
> srun: wait-for-switches : -1
> srun: distribution : unknown
> srun: cpu_bind : default (0)
> srun: mem_bind : default (0)
> srun: verbose : 1
> srun: slurmd_debug : 0
> srun: immediate : false
> srun: label output : false
> srun: unbuffered IO : false
> srun: overcommit : false
> srun: threads : 60
> srun: checkpoint_dir : /var/slurm/checkpoint
> srun: wait : 0
> srun: nice : -2
> srun: account : (null)
> srun: comment : (null)
> srun: dependency : (null)
> srun: exclusive : false
> srun: bcast : false
> srun: qos : (null)
> srun: constraints :
> srun: geometry : (null)
> srun: reboot : yes
> srun: rotate : no
> srun: preserve_env : false
> srun: network : (null)
> srun: propagate : NONE
> srun: prolog : (null)
> srun: epilog : (null)
> srun: mail_type : NONE
> srun: mail_user : (null)
> srun: task_prolog : (null)
> srun: task_epilog : (null)
> srun: multi_prog : no
> srun: sockets-per-node : -2
> srun: cores-per-socket : -2
> srun: threads-per-core : -2
> srun: ntasks-per-node : -2
> srun: ntasks-per-socket : -2
> srun: ntasks-per-core : -2
> srun: plane_size : 4294967294
> srun: core-spec : NA
> srun: power :
> srun: remote command : `hostname'
> srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
> srun: Nodes ####### are ready for job
> srun: jobid 50871: nodes(1):`#######', cpu counts: 64(x1)
> srun: launching 50871.0 on host #######, 1 tasks: 0
> srun: route default plugin loaded
> srun: error: timeout waiting for task launch, started 0 of 1 tasks
> srun: Job step 50871.0 aborted before step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
> #
>
>
>
> Rgds
>
>
>
>
>