[slurm-users] SSH Freeze on extern step with pam_slurm_adopt

Lucio Delelis lucio.delelis at sixninesit.com
Tue Nov 15 19:07:53 UTC 2022


We’re seeing a recurring issue where long-running node allocations eventually stop accepting SSH connections.


Our cluster uses the pam_slurm_adopt module so that users can SSH into nodes on which they hold an allocation. However, even if the allocated node is idle, after around 24 hours (we haven’t been able to pinpoint a more precise time frame yet), SSH logins that go through this module simply hang until they time out.


Our admin users can still access the same node without problems, via a pam_listfile exception. Other users with allocations can also log in, until this limit is hit.


Something I noticed recently is that, while this is happening, the extern step for the affected allocation (created by PrologFlags=Contain) is stuck at 100% CPU usage, maxing out a single core.
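
For anyone who wants to check their own nodes for the same symptom, here is a minimal sketch of how the busy extern step could be picked out of a process listing. It assumes the text comes from `ps -eo pid,pcpu,args` and that the step daemon shows up with a process title of the form `slurmstepd: [<jobid>.extern]`; the function name, the threshold, and the sample listing are all hypothetical and only for illustration.

```python
import re

# Assumption: the extern step daemon's process title looks like
# "slurmstepd: [<jobid>.extern]".
EXTERN_RE = re.compile(r"slurmstepd: \[(\d+)\.extern\]")


def find_busy_extern_steps(ps_output, cpu_threshold=90.0):
    """Return (pid, job_id, cpu_percent) for extern steps at or above the threshold.

    ps_output is expected to be the text printed by `ps -eo pid,pcpu,args`.
    """
    busy = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # pid, %cpu, full command line
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # skips the header row and malformed lines
        pid, pcpu, args = parts
        match = EXTERN_RE.search(args)
        if match is None:
            continue  # not an extern step
        try:
            cpu = float(pcpu)
        except ValueError:
            continue
        if cpu >= cpu_threshold:
            busy.append((int(pid), int(match.group(1)), cpu))
    return busy


# Hypothetical listing, made up for illustration (not taken from our nodes).
SAMPLE_PS_OUTPUT = """\
  PID %CPU COMMAND
 1234 99.8 slurmstepd: [4567.extern]
 1300  0.0 slurmstepd: [4568.extern]
 2000  100 python spin.py
"""

busy_steps = find_busy_extern_steps(SAMPLE_PS_OUTPUT)
```

Note that `ps` reports `%CPU` as an average over the lifetime of the process, so a step that only started spinning recently may need a lower threshold, or a check with `top` instead.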


Please let us know which logs and/or command outputs we should provide to help debug this further.


Regards,
Lucio Delelis

Cloud Engineer | lucio.delelis at sixninesit.com