[slurm-users] srun from slurmdb system

Brian Andrus toomuchit at gmail.com
Mon Dec 17 13:54:46 MST 2018


All,

I have several clusters that are all connected to a standalone slurmdb 
server. They are not federated.

From that system I can check the queues and run most commands by 
passing -M <cluster>. However, if I try to get a shell within a job 
(e.g., srun -M clustera --pty bash), the job queues up, but when it 
tries to run, I get an error:

$ srun -M clustera -n16 --pty bash
srun: job 6 has been allocated resources
srun: error: Error connecting, bad data: family = 2, port = 0
srun: error: Task launch for 6.0 failed on node ip-0A312014: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

And in the slurmctld log for the cluster master:
Dec 17 19:50:04 nastran-master slurmctld[54739]: error: slurm_receive_msg [10.49.32.20:44022]: Zero Bytes were transmitted or received
Dec 17 19:50:07 nastran-master slurmctld[54739]: error: slurm_receive_msg [10.49.32.20:44046]: Zero Bytes were transmitted or received
Dec 17 19:50:08 nastran-master slurmctld[54739]: update_node: node ip-0A312014 state set to DOWN
Dec 17 19:50:08 nastran-master slurmctld[54739]: Node ip-0A312014 now responding
Dec 17 19:50:08 nastran-master slurmctld[54739]: node ip-0A312014 returned to service
Dec 17 19:50:09 nastran-master slurmctld[54739]: sched: Allocate JobId=6 NodeList=ip-0A312014 #CPUs=16 Partition=debug
Dec 17 19:50:09 nastran-master slurmctld[54739]: job_step_signal JobId=6 StepId=0 not found
Dec 17 19:50:41 nastran-master slurmctld[54739]: job_step_signal JobId=6 StepId=0 not found
Dec 17 19:50:41 nastran-master slurmctld[54739]: _job_complete: JobId=6 WTERMSIG 105
Dec 17 19:50:41 nastran-master slurmctld[54739]: _job_complete: JobId=6 done

Is this something that cannot be done from a system that is outside a 
federated cluster?

Brian Andrus
