[slurm-users] 'srun hostname' hangs on the command line
Carlos Fenoy
minibit at gmail.com
Tue Jul 17 04:54:46 MDT 2018
Communication from the compute nodes back to the login node may be blocked by a firewall. That will prevent srun from running properly.
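For context: srun listens on ephemeral ports for the task-launch callback from slurmd, so the login node's firewall must allow those inbound connections. One common fix (a sketch; the port range is a site-specific choice, not from this thread) is to pin the range in slurm.conf and open exactly that range on the login node:

```
# slurm.conf (identical copy on all nodes): confine the ports srun
# listens on, so the firewall can open a fixed inbound range on the
# login node. 60001-63000 is only an example range.
SrunPortRange=60001-63000
```

After changing slurm.conf it must be propagated to every node and the daemons told to re-read it (typically `scontrol reconfigure`, or a restart).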
Sent from my iPhone
> On 17 Jul 2018, at 10:16, John Hearns <hearnsj at googlemail.com> wrote:
>
> Ronan, as far as I can see this means that you cannot launch a job.
>
> What state are the compute nodes in when you run sinfo?
>
>
>> On 17 July 2018 at 10:08, Buckley, Ronan <Ronan.Buckley at dell.com> wrote:
>> Yes, srun just hangs. Commands like sinfo and squeue run fine.
>>
>> I also have no slurm logs in /var/log ??
>>
>>
>>
>> From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of John Hearns
>> Sent: Tuesday, July 17, 2018 8:57 AM
>>
>>
>> To: Slurm User Community List
>> Subject: Re: [slurm-users] 'srun hostname' hangs on the command line
>>
>>
>> Ronan, sorry to ask but this is a bit unclear.
>>
>>
>>
>> Are you unable to launch ANY sessions with srun?
>>
>> In which case you need to look at the logs to see why the job is not being scheduled.
>>
>>
>>
>> Is it only the hostname command which fails?
>>
>>
>>
>> I would very much guess you have already SSHed into a node and run the hostname command manually.
>>
>>
>> On 17 July 2018 at 09:50, Buckley, Ronan <Ronan.Buckley at dell.com> wrote:
>>
>> Yes I do.
>>
>>
>>
>> From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Williams, Gareth (IM&T, Clayton)
>> Sent: Tuesday, July 17, 2018 12:33 AM
>> To: Slurm User Community List
>> Subject: Re: [slurm-users] 'srun hostname' hangs on the command line
>>
>>
>>
>> Do you get the same problem as a non-root user?
>>
>>
>>
>> From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Buckley, Ronan
>> Sent: Tuesday, 17 July 2018 12:53 AM
>> To: slurm-users at lists.schedmd.com
>> Subject: [slurm-users] 'srun hostname' hangs on the command line
>>
>>
>>
>> Hi All,
>>
>>
>>
>> Verbose mode doesn’t show much.
>>
>> I hashed out the hostnames.
>>
>> Any ideas/suggestions?
>>
>>
>>
>> # srun hostname
>>
>> ^Csrun: interrupt (one more within 1 sec to abort)
>>
>> srun: task 0: unknown
>>
>> ^Z
>>
>> [1]+ Stopped srun hostname
>>
>> #
>>
>>
>>
>> # srun -v hostname
>>
>> srun: defined options for program `srun'
>>
>> srun: --------------- ---------------------
>>
>> srun: user : `root'
>>
>> srun: uid : 0
>>
>> srun: gid : 0
>>
>> srun: cwd : /root
>>
>> srun: ntasks : 1 (default)
>>
>> srun: nodes : 1 (default)
>>
>> srun: jobid : 4294967294 (default)
>>
>> srun: partition : default
>>
>> srun: profile : `NotSet'
>>
>> srun: job name : `(null)'
>>
>> srun: reservation : `(null)'
>>
>> srun: burst_buffer : `(null)'
>>
>> srun: wckey : `(null)'
>>
>> srun: cpu_freq_min : 4294967294
>>
>> srun: cpu_freq_max : 4294967294
>>
>> srun: cpu_freq_gov : 4294967294
>>
>> srun: switches : -1
>>
>> srun: wait-for-switches : -1
>>
>> srun: distribution : unknown
>>
>> srun: cpu_bind : default (0)
>>
>> srun: mem_bind : default (0)
>>
>> srun: verbose : 1
>>
>> srun: slurmd_debug : 0
>>
>> srun: immediate : false
>>
>> srun: label output : false
>>
>> srun: unbuffered IO : false
>>
>> srun: overcommit : false
>>
>> srun: threads : 60
>>
>> srun: checkpoint_dir : /var/slurm/checkpoint
>>
>> srun: wait : 0
>>
>> srun: nice : -2
>>
>> srun: account : (null)
>>
>> srun: comment : (null)
>>
>> srun: dependency : (null)
>>
>> srun: exclusive : false
>>
>> srun: bcast : false
>>
>> srun: qos : (null)
>>
>> srun: constraints :
>>
>> srun: geometry : (null)
>>
>> srun: reboot : yes
>>
>> srun: rotate : no
>>
>> srun: preserve_env : false
>>
>> srun: network : (null)
>>
>> srun: propagate : NONE
>>
>> srun: prolog : (null)
>>
>> srun: epilog : (null)
>>
>> srun: mail_type : NONE
>>
>> srun: mail_user : (null)
>>
>> srun: task_prolog : (null)
>>
>> srun: task_epilog : (null)
>>
>> srun: multi_prog : no
>>
>> srun: sockets-per-node : -2
>>
>> srun: cores-per-socket : -2
>>
>> srun: threads-per-core : -2
>>
>> srun: ntasks-per-node : -2
>>
>> srun: ntasks-per-socket : -2
>>
>> srun: ntasks-per-core : -2
>>
>> srun: plane_size : 4294967294
>>
>> srun: core-spec : NA
>>
>> srun: power :
>>
>> srun: remote command : `hostname'
>>
>> srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
>>
>> srun: Nodes ####### are ready for job
>>
>> srun: jobid 50871: nodes(1):`#######', cpu counts: 64(x1)
>>
>> srun: launching 50871.0 on host #######, 1 tasks: 0
>>
>> srun: route default plugin loaded
>>
>> srun: error: timeout waiting for task launch, started 0 of 1 tasks
>>
>> srun: Job step 50871.0 aborted before step completely launched.
>>
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>
>> srun: error: Timed out waiting for job step to complete
>>
>> #
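The "timeout waiting for task launch" above is the typical symptom when the compute node allocates the step but slurmd cannot open a TCP connection back to srun on the login node. A minimal reachability check, runnable on a compute node (a sketch; `login01` and port `60001` are placeholders -- srun's ports are ephemeral unless SrunPortRange is set in slurm.conf):

```python
import socket

def can_reach(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholders): run on a compute node to verify it can reach
# the login node where srun is waiting for the task-launch callback.
# print(can_reach("login01", 60001))
```

If this returns False from a compute node against the login node while sinfo and squeue work, a firewall on the return path is the likely cause.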
>>
>>
>>
>> Rgds
>>
>>
>>
>>
>>
>