<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Dear Slurm user community,</div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
I have a Slurm cluster on CentOS 7, installed through yum, and I also have MPICH installed.</div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
I can ssh into one of the nodes and run an MPI job:</div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<span style="font-family: Consolas, Courier, monospace;"># /usr/lib64/mpich/bin/mpirun --hosts nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469 /scratch/mpi-helloworld
</span>
<div><span style="font-family: Consolas, Courier, monospace;">Warning: Permanently added 'nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469,10.233.88.25' (ECDSA) to the list of known hosts.</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">Hello world from processor nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469, rank 2 out of 3 processors</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">Hello world from processor nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469, rank 0 out of 3 processors</span></div>
<span style="font-family: Consolas, Courier, monospace;">Hello world from processor nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469, rank 1 out of 3 processors</span></div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
However, I can't make it work through Slurm. These are the logs from running the job:</div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<span style="font-family: Consolas, Courier, monospace;"># srun --mpi=pmi2 -N3 -vvv /usr/lib64/mpich/bin/mpirun /scratch/mpi-helloworld
</span>
<div><span style="font-family: Consolas, Courier, monospace;">srun: defined options</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: -------------------- --------------------</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: mpi : pmi2</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: nodes : 3</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: verbose : 3</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: -------------------- --------------------</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: end of defined options</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: propagating RLIMIT_CPU=18446744073709551615</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: propagating RLIMIT_FSIZE=18446744073709551615</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: propagating RLIMIT_DATA=18446744073709551615</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: propagating RLIMIT_STACK=8388608</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: propagating RLIMIT_CORE=18446744073709551615</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: propagating RLIMIT_RSS=18446744073709551615</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: propagating RLIMIT_NPROC=18446744073709551615</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: propagating RLIMIT_NOFILE=1048576</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: propagating RLIMIT_AS=18446744073709551615</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: propagating SLURM_PRIO_PROCESS=0</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: propagating UMASK=0022</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: srun PMI messages to port=33065</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: Entering slurm_allocation_msg_thr_create()</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: port from net_stream_listen is 44387</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: Entering _msg_thr_internal</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: auth/munge: init: Munge authentication plugin loaded</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: jobid 8: nodes(3):`nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469', cpu counts: 1(x3)</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: creating job with 3 tasks</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: requesting job 8, user 0, nodes 3 including ((null))</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: cpus 3, tasks 3, name mpirun, relative 65534</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: Entering slurm_step_launch</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: mpi type = (null)</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: mpi/pmi2: p_mpi_hook_client_prelaunch: mpi/pmi2: client_prelaunch</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: mpi/pmi2: _get_proc_mapping: mpi/pmi2: processor mapping: (vector,(0,3,1))</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: mpi/pmi2: _setup_srun_socket: mpi/pmi2: srun pmi port: 37029</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: mpi/pmi2: _tree_listen_readable: mpi/pmi2: _tree_listen_readable</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: mpi/pmi2: pmi2_start_agent: mpi/pmi2: started agent thread</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: Entering _msg_thr_create()</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: initialized stdio listening socket, port 41275</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: Started IO server thread (140538792195840)</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: Entering _launch_tasks</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: launching StepId=8.0 on host nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks: 0</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Called _file_readable</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Called _file_writable</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Called _file_writable</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: launching StepId=8.0 on host nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks: 1</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: launching StepId=8.0 on host nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks: 2</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: route/default: init: route default plugin loaded</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Tree head got back 0 looking for 3</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Tree head got back 1</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Tree head got back 2</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Tree head got back 3</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: launch returned msg_rc=0 err=0 type=8001</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: launch returned msg_rc=0 err=0 type=8001</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug: launch returned msg_rc=0 err=0 type=8001</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Activity on IO listening socket 17</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Entering io_init_msg_read_from_fd</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Leaving io_init_msg_read_from_fd</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Entering io_init_msg_validate</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Leaving io_init_msg_validate</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Validated IO connection from 10.233.88.26:33470, node rank 0, sd=18</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: eio_message_socket_accept: got message connection from 10.233.88.26:53410 19</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: received task launch</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: launch/slurm: _task_start: Node nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks started</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Called _file_readable</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Called _file_writable</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Called _file_writable</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Activity on IO listening socket 17</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Entering io_init_msg_read_from_fd</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Leaving io_init_msg_read_from_fd</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Entering io_init_msg_validate</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Leaving io_init_msg_validate</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Validated IO connection from 10.233.88.25:52764, node rank 2, sd=19</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Entering io_init_msg_read_from_fd</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Leaving io_init_msg_read_from_fd</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Entering io_init_msg_validate</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Leaving io_init_msg_validate</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Validated IO connection from 10.233.88.27:52768, node rank 1, sd=20</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: eio_message_socket_accept: got message connection from 10.233.88.25:47948 21</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: received task launch</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: launch/slurm: _task_start: Node nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks started</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: eio_message_socket_accept: got message connection from 10.233.88.27:41996 21</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: received task launch</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: launch/slurm: _task_start: Node nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks started</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Called _file_readable</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Called _file_writable</span></div>
<span style="font-family: Consolas, Courier, monospace;">srun: debug2: Called _file_writable</span></div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<span style="font-family: Consolas, Courier, monospace;">srun: debug2: Called _file_readable</span>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Called _file_writable</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Called _file_writable</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">srun: debug2: Entering _file_write</span></div>
<span style="font-family: Consolas, Courier, monospace;">srun: Job 8 step creation temporarily disabled, retrying (Requested nodes are busy)</span></div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
The output says the nodes are busy, but they are not: squeue shows no jobs, sinfo reports all three nodes as idle, and I can run other jobs:</div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<span style="font-family: Consolas, Courier, monospace;"># squeue</span>
<div><span style="font-family: Consolas, Courier, monospace;"> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)</span></div>
<div><span style="font-family: Consolas, Courier, monospace;"># sinfo</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">PARTITION AVAIL TIMELIMIT NODES STATE NODELIST</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">workq* up infinite 3 idle nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">[root@nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469 /]# srun -N3 hostname</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469</span></div>
<div><span style="font-family: Consolas, Courier, monospace;">nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469</span></div>
<span style="font-family: Consolas, Courier, monospace;">nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469</span></div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Any idea what is stopping the MPI job from starting?</div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Thank you very much.<br>
</div>
</body>
</html>