[slurm-users] Job cannot start on slurm v18.08.0pre2

zhangtao102019 at 126.com zhangtao102019 at 126.com
Thu Aug 23 03:23:36 MDT 2018


Hi Artem Polyakov,
I submitted the same job for testing on the latest release, 18.08.0rc1, and the problem I saw on 18.08.0pre2 did not occur.
So the pre2 issue mentioned earlier does not need further analysis.
Thank you for your help.
Best regards



zhangtao102019 at 126.com
 
From: zhangtao102019 at 126.com
Date: 2018-08-23 10:14
To: Slurm User Community List
CC: slurm-users
Subject: Re: Re: [slurm-users] Job cannot start on slurm v18.08.0pre2
Hi, 
My test script is like this:
=========================
#!/bin/bash
#SBATCH -J LOOP
#SBATCH -p low
#SBATCH --comment test
#SBATCH -N 1
#SBATCH -n 5
#SBATCH -o log/%j.loop
#SBATCH -e log/%j.loop

date
echo "SLURM_JOB_NODELIST=${SLURM_JOB_NODELIST}"
echo "SLURM_NODELIST=${SLURM_NODELIST}"
sleep 2100
echo "step 3 over"
date
=========================
The behavior is the same whether I launch sleep through srun or, as here, run it directly.
In addition, I have not set either MpiDefault or MpiParams in the configuration file slurm.conf.
So what could be causing this problem?




zhangtao102019 at 126.com
 
From: Artem Polyakov
Date: 2018-08-22 06:02
To: Slurm User Community List
CC: slurm-users
Subject: Re: [slurm-users] Job cannot start on slurm v18.08.0pre2
Hello,

I can comment from the PMIx/UCX perspective.
Do you have "MpiDefault=pmix" in your slurm.conf, or did you specify "--mpi=pmix" in your srun command? If not, you are not running PMIx and therefore not UCX either (UCX support exists only in the PMIx plugin).
I think the log output you provided confirms this: I don't see any traces of the PMIx plugin.
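
One way to check whether slurm.conf selects PMIx by default is sketched below. The helper name mpi_default_from_conf is hypothetical and not part of Slurm. It assumes a plain "Key=Value" slurm.conf layout and does not follow Include directives. It prints the last MpiDefault value found, or "none" when the key is absent or commented out.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print the MpiDefault value from a
# slurm.conf-style file, or "none" if the key is absent or commented out.
# Assumes simple "Key=Value" lines; Include directives are not followed.
mpi_default_from_conf() {
    awk -F= '
        # Ignore whole-line comments.
        /^[[:space:]]*#/ { next }
        # Match the key case-insensitively; the last occurrence wins.
        tolower($1) ~ /^[[:space:]]*mpidefault[[:space:]]*$/ {
            value = $2
            sub(/#.*/, "", value)            # drop a trailing comment
            gsub(/[[:space:]]/, "", value)   # drop surrounding whitespace
        }
        END { print (value == "" ? "none" : value) }
    ' "$1"
}

# Example call; the path is an assumption based on --prefix=/opt/slurm17:
#   mpi_default_from_conf /opt/slurm17/etc/slurm.conf
```

On a live cluster, "scontrol show config | grep -i MpiDefault" reports the value the controller is actually using, and "srun --mpi=list" lists the MPI plugin types this build makes available.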

Fri, 17 Aug 2018 at 20:43, zhangtao102019 at 126.com <zhangtao102019 at 126.com> wrote:
Hi,
I have installed SLURM 18.08.0-0pre2 on my cluster, which runs RHEL 7.4 (x86_64).
My configure parameters are as follows:
./configure --prefix=/opt/slurm17 --with-munge=/opt/munge --with-pmix=/opt/pmix --with-ucx=/opt/openucx --with-hwloc=/usr 
(openucx version is 1.5.0, pmix version is 3.0.0, hwloc version is 1.11.8)

After installation and configuration, Slurm appeared to work normally. But when I submitted a simple test job with sbatch sleep.sh (which just calls srun sleep 30 on a single compute node), the job (ID=1032) showed state R, yet it never actually started on the compute node (no process was found).

Attached are the output logs of the compute node and the management node.
I can't tell whether the cause is related to the configure options I specified (such as pmix and ucx); I never saw anything similar in earlier versions.
Has anyone run into a similar problem? How can it be solved?

Best regards



zhangtao102019 at 126.com


-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov