[slurm-users] pam_slurm_adopt always claims no active jobs even when they do

Paul Raines raines at nmr.mgh.harvard.edu
Mon Oct 26 13:35:47 UTC 2020


I have ConstrainRAMSpace=yes in cgroup.conf and PrologFlags=Contain,X11
in slurm.conf.
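
Spelled out per file, the two settings named above would look roughly like
this. It is only a sketch: these are the only lines taken from this message,
and everything else in both files is omitted.

```text
# cgroup.conf (only the line mentioned above)
ConstrainRAMSpace=yes

# slurm.conf (only the line mentioned above)
PrologFlags=Contain,X11
```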

I just tried:
$ squeue
              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                808    lcnrtx     tcsh   raines  R 1-22:39:17      1 rtx-03

$ srun --jobid 808 --pty /bin/tcsh
^Csrun: Cancelled pending job step with signal 2
srun: error: Unable to create step for job 808: Job/step already completing or completed

But it just hung until I pressed ^C.
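
For anyone repeating this test on other jobs or nodes, the attach command can
be assembled by a small helper. This is a minimal sketch, not part of Slurm:
the name attach_cmd is hypothetical, and it only prints the srun line rather
than running it. It uses nothing beyond the --jobid, -w and --pty options
that already appear in this thread.

```shell
# attach_cmd JOBID [NODE] [SHELL]
# Hypothetical helper: prints (does not run) the srun command that would
# attach an interactive shell to an already running job.
attach_cmd() {
    jobid=${1:-}
    node=${2:-}
    shell=${3:-/bin/bash}

    if [ -z "$jobid" ]; then
        echo "usage: attach_cmd JOBID [NODE] [SHELL]" >&2
        return 1
    fi

    if [ -n "$node" ]; then
        # Multinode job: pick the node explicitly with -w
        echo "srun --jobid $jobid -w $node --pty $shell"
    else
        # Single-node job: the node does not need to be named
        echo "srun --jobid $jobid --pty $shell"
    fi
}
```

Calling attach_cmd 808 "" /bin/tcsh prints the same command used above;
passing a node name as the second argument adds the -w option.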

Thanks

-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Sat, 24 Oct 2020 3:43am, Juergen Salk wrote:

> Hi Paul,
>
> maybe this is totally unrelated, but we also have a similar issue with
> pam_slurm_adopt when ConstrainRAMSpace=no is set in
> cgroup.conf and more than one job is running on that node. There is a
> bug report open at:
>
>  https://bugs.schedmd.com/show_bug.cgi?id=9355
>
> As a workaround we currently advise users not to use ssh but instead to
> attach an interactive shell to an already allocated job by running the
> following command:
>
>  srun --jobid <job> --pty /bin/bash
>
> For a single-node job the user does not even need to know the node
> that the job is running on. For a multinode job, the user can still
> use the '-w <node>' option to select a specific node.
>
> Best regards
> Jürgen
>
> -- 
> Jürgen Salk
> Scientific Software & Compute Services (SSCS)
> Kommunikations- und Informationszentrum (kiz)
> Universität Ulm
> Telefon: +49 (0)731 50-22478
> Telefax: +49 (0)731 50-22471
>
>
>
> * Paul Raines <raines at nmr.mgh.harvard.edu> [201023 13:13]:
>>
>> I am running Slurm 20.02.3 on CentOS 7 systems.  I have pam_slurm_adopt
>> set up in /etc/pam.d/system-auth, and slurm.conf has
>> PrologFlags=Contain,X11.  I have also masked systemd-logind.
>>
>> But pam_slurm_adopt always denies login with "Access denied by
>> pam_slurm_adopt: you have no active jobs on this node", even when the
>> user most definitely has a job running on the node via srun.
>>
>> Any clues as to why pam_slurm_adopt thinks there is no job?
>>
>> serena [raines] squeue
>>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>                785    lcnrtx     tcsh   raines  R   19:44:51      1 rtx-03
>> serena [raines] ssh rtx-03
>> Access denied by pam_slurm_adopt: you have no active jobs on this node
>> Authentication failed.
>>
>>
>
> -- 
> GPG A997BA7A | 87FC DA31 5F00 C885 0DC3  E28F BD0D 4B33 A997 BA7A
>
>
>

