[slurm-users] [EXTERNAL] --no-alloc breaks mpi?

O'Grady, Paul Christopher cpo at slac.stanford.edu
Tue Mar 9 02:36:10 UTC 2021



On Mar 8, 2021, at 1:35 PM, slurm-users-request at lists.schedmd.com wrote:

What's happening is that no SLURM_JOBID is set (my speculation, since I don't have perms to use --no-alloc), but SLURM_NODELIST may be set, so it's confusing ORTE.
Could you list which SLURM env variables are set in the shell in which you're running the srun command?

Howard,

I believe you are correct.  Once I set SLURM_JOBID then ORTE starts functioning again with the --no-alloc option.  Since you asked (and for completeness) I include the list of environment variables that were different with/without --no-alloc below, but my tests show that jobid seems to be the magic one, as you predicted.

I guess I will manufacture an artificial job id for our “--no-alloc” runs, but if anyone is aware of any dangers lurking in the shadows from that approach I would be interested.
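
For illustration, a minimal Python sketch of that workaround, assuming the launch is wrapped in a small script. The helper names (env_with_job_id, run_no_alloc) and the placeholder id "1" are hypothetical, and whether a made-up id can clash with a real job is exactly the open question above. Only SLURM_JOBID is touched, since that is the variable the tests pointed to.

```python
import os
import subprocess


def env_with_job_id(base_env, fake_job_id="1"):
    """Return a copy of base_env with SLURM_JOBID set if it is missing.

    fake_job_id is an arbitrary placeholder (an assumption, not a value
    Slurm assigned); an existing SLURM_JOBID is left untouched.
    """
    env = dict(base_env)
    env.setdefault("SLURM_JOBID", str(fake_job_id))
    return env


def run_no_alloc(command, fake_job_id="1"):
    """Run command (the caller's own srun --no-alloc invocation, given as
    an argument list) with the patched copy of the current environment."""
    patched = env_with_job_id(os.environ, fake_job_id)
    return subprocess.run(command, env=patched, check=True)
```

Because setdefault is used, a run inside a normal allocation keeps its real job id and only the --no-alloc case picks up the artificial one.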

Thanks for the guidance ... impressive that you could identify the issue so quickly!

chris

----------------------------------------------------------

SLURM_JOB_CPUS_PER_NODE=1
SLURM_JOB_ID=25300
SLURM_JOBID=25300
SLURM_JOB_NUM_NODES=1
SLURM_JOB_PARTITION=psfehq
SLURM_JOB_QOS=normal
SLURM_CPUS_ON_NODE=1
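
A comparison like the one above can be reproduced with a short Python sketch, assuming the output of `env` has been captured as text from each kind of run. The function names here are hypothetical, and the sketch only compares variables whose names start with SLURM_.

```python
def parse_env_output(text):
    """Parse the output of `env` (NAME=value per line) into a dict."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines that do not contain '='
            env[name] = value
    return env


def diff_slurm_vars(env_a, env_b):
    """Return {name: (value_in_a, value_in_b)} for every SLURM_* variable
    that differs between the two environments; None means 'not set'."""
    names = {k for k in list(env_a) + list(env_b) if k.startswith("SLURM_")}
    return {
        name: (env_a.get(name), env_b.get(name))
        for name in sorted(names)
        if env_a.get(name) != env_b.get(name)
    }
```

For example, passing the parsed environment of a normal run as env_a and of a --no-alloc run as env_b would list each differing variable with its value on either side.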
