[slurm-users] step creation temporarily disabled, retrying (Requested nodes are busy)

Ozeryan, Vladimir Vladimir.Ozeryan at jhuapl.edu
Fri Mar 4 21:08:00 UTC 2022


Try it with an sbatch script instead, and call the "mpirun" executable directly, without "--mpi=pmi2".
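A minimal sketch of what such a batch script could look like, assuming one rank per node and the same mpirun path and hello-world binary as in your example (the job name and script file name below are just placeholders):

    #!/bin/bash
    #SBATCH --job-name=mpi-hello
    #SBATCH --nodes=3
    #SBATCH --ntasks-per-node=1

    # Let mpirun launch the ranks itself; do not wrap it in srun --mpi=pmi2.
    /usr/lib64/mpich/bin/mpirun -np $SLURM_NTASKS /scratch/mpi-helloworld

Submit it with sbatch and check the slurm-<jobid>.out file. MPICH's Hydra launcher should detect the Slurm allocation on its own; if it does not, you may need to pass the allocated node list to mpirun explicitly.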

From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of masber masber
Sent: Tuesday, March 1, 2022 12:54 PM
To: slurm-users at lists.schedmd.com
Subject: [EXT] [slurm-users] step creation temporarily disabled, retrying (Requested nodes are busy)

Dear slurm user community,

I have a Slurm cluster on CentOS 7 installed through yum, and I also have MPICH installed.

I can ssh into one of the nodes and run an MPI job:

# /usr/lib64/mpich/bin/mpirun --hosts nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469 /scratch/mpi-helloworld
Warning: Permanently added 'nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469,10.233.88.25' (ECDSA) to the list of known hosts.
Hello world from processor nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469, rank 2 out of 3 processors
Hello world from processor nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469, rank 0 out of 3 processors
Hello world from processor nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469, rank 1 out of 3 processors

However, I can't make it work through Slurm; these are the logs from running the job:

# srun --mpi=pmi2 -N3 -vvv /usr/lib64/mpich/bin/mpirun /scratch/mpi-helloworld
srun: defined options
srun: -------------------- --------------------
srun: mpi                 : pmi2
srun: nodes               : 3
srun: verbose             : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=8388608
srun: debug:  propagating RLIMIT_CORE=18446744073709551615
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=18446744073709551615
srun: debug:  propagating RLIMIT_NOFILE=1048576
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug2: srun PMI messages to port=33065
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  port from net_stream_listen is 44387
srun: debug:  Entering _msg_thr_internal
srun: debug:  auth/munge: init: Munge authentication plugin loaded
srun: jobid 8: nodes(3):`nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469', cpu counts: 1(x3)
srun: debug2: creating job with 3 tasks
srun: debug:  requesting job 8, user 0, nodes 3 including ((null))
srun: debug:  cpus 3, tasks 3, name mpirun, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug:  Entering slurm_step_launch
srun: debug:  mpi type = (null)
srun: debug:  mpi/pmi2: p_mpi_hook_client_prelaunch: mpi/pmi2: client_prelaunch
srun: debug:  mpi/pmi2: _get_proc_mapping: mpi/pmi2: processor mapping: (vector,(0,3,1))
srun: debug:  mpi/pmi2: _setup_srun_socket: mpi/pmi2: srun pmi port: 37029
srun: debug2: mpi/pmi2: _tree_listen_readable: mpi/pmi2: _tree_listen_readable
srun: debug:  mpi/pmi2: pmi2_start_agent: mpi/pmi2: started agent thread
srun: debug:  Entering _msg_thr_create()
srun: debug:  initialized stdio listening socket, port 41275
srun: debug:  Started IO server thread (140538792195840)
srun: debug:  Entering _launch_tasks
srun: launching StepId=8.0 on host nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks: 0
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: launching StepId=8.0 on host nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks: 1
srun: launching StepId=8.0 on host nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks: 2
srun: route/default: init: route default plugin loaded
srun: debug2: Tree head got back 0 looking for 3
srun: debug2: Tree head got back 1
srun: debug2: Tree head got back 2
srun: debug2: Tree head got back 3
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: debug2: Activity on IO listening socket 17
srun: debug2: Entering io_init_msg_read_from_fd
srun: debug2: Leaving  io_init_msg_read_from_fd
srun: debug2: Entering io_init_msg_validate
srun: debug2: Leaving  io_init_msg_validate
srun: debug2: Validated IO connection from 10.233.88.26:33470, node rank 0, sd=18
srun: debug2: eio_message_socket_accept: got message connection from 10.233.88.26:53410 19
srun: debug2: received task launch
srun: launch/slurm: _task_start: Node nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks started
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Activity on IO listening socket 17
srun: debug2: Entering io_init_msg_read_from_fd
srun: debug2: Leaving  io_init_msg_read_from_fd
srun: debug2: Entering io_init_msg_validate
srun: debug2: Leaving  io_init_msg_validate
srun: debug2: Validated IO connection from 10.233.88.25:52764, node rank 2, sd=19
srun: debug2: Entering io_init_msg_read_from_fd
srun: debug2: Leaving  io_init_msg_read_from_fd
srun: debug2: Entering io_init_msg_validate
srun: debug2: Leaving  io_init_msg_validate
srun: debug2: Validated IO connection from 10.233.88.27:52768, node rank 1, sd=20
srun: debug2: eio_message_socket_accept: got message connection from 10.233.88.25:47948 21
srun: debug2: received task launch
srun: launch/slurm: _task_start: Node nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks started
srun: debug2: eio_message_socket_accept: got message connection from 10.233.88.27:41996 21
srun: debug2: received task launch
srun: launch/slurm: _task_start: Node nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks started
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
srun: Job 8 step creation temporarily disabled, retrying (Requested nodes are busy)

The output clearly says the nodes are busy, but they are not; in fact, I can run other jobs:

# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
workq*       up   infinite      3   idle nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469
[root at nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469 /]# srun -N3 hostname
nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469
nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469
nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469

Any idea what is stopping the mpi job from starting?

thank you very much