[slurm-users] Users logged out when job dies or completes

Andrea Carotti andcar at chimfarm.unipg.it
Fri Jul 9 10:50:38 UTC 2021


Dear all,

I've installed an OpenHPC 2.3 cluster on CentOS 8.4, running Slurm
20.11.7 (mostly following this guide:
https://github.com/openhpc/ohpc/releases/download/v2.3.GA/Install_guide-CentOS8-Warewulf-SLURM-2.3-aarch64.pdf).

I have a master node and hybrid nodes that serve both as GPU/CPU execution
hosts and as login nodes running X11 (they are the workstations used by the
users). I left users the ability to ssh to other compute nodes even when they
are not running jobs there: I created an ssh-allowed group following page 51 of
https://software.intel.com/content/dam/www/public/us/en/documents/guides/installguide-openhpc2-centos82-6feb21.pdf,
and did not run the command
'echo "account required pam_slurm.so" >> $CHROOT/etc/pam.d/sshd'.
Only a few of us use the cluster, so that is not a big deal.
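
(For clarity: that skipped command would only have appended one line to the
compute image's sshd PAM stack, so, unless modified elsewhere,
$CHROOT/etc/pam.d/sshd does not contain

    account required pam_slurm.so

i.e. sshd never asks Slurm whether the user currently has a job on the node.)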

The GPUs are in Persistence Mode OFF and "Default" compute mode. SELinux
is disabled. No firewall.
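
For reference, those two GPU settings can be verified on each node with
nvidia-smi's standard query fields, roughly:

    nvidia-smi --query-gpu=name,persistence_mode,compute_mode --format=csv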

I'm having a strange "connection closed by remote host" problem:

1) When user1 runs a job under Slurm locally (say on hybrid-0-1, where user1
is logged in and working in X11) and the job finishes (or dies, or is
cancelled), the user is logged out and the GDM login window appears.

2) When user1 runs a job under Slurm on a remote host, e.g. hybrid-0-1 (while
user1 is logged in and working in X11 on hybrid-0-2), and the job finishes (or
dies, or is cancelled), the user is logged out from hybrid-0-1. I can verify
this by ssh-ing from hybrid-0-2 to hybrid-0-1 and seeing that the terminal is
disconnected at the end of the job. It happens with both srun and sbatch.
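
A minimal way to reproduce it, using the allcpu partition and the hybrid-0-1
node from the files below, is to keep an ssh or X11 session open on hybrid-0-1
and then run something like

    srun --partition=allcpu --nodelist=hybrid-0-1 sleep 10

from any node: as soon as the step ends, the open session on hybrid-0-1 is
closed.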

I think the problem is related to the Slurm configuration rather than to the
GPU configuration, because both CPU-only and GPU jobs lead to the logout.

Here are the sbatch test script, the slurm.conf and the gres.conf:


############## sbatch.test #####

#!/bin/bash
#SBATCH --job-name=test   # Job name
#SBATCH --ntasks=1                    # Run on a single CPU
#SBATCH --cpus-per-task=1
#SBATCH --partition=allcpu
#SBATCH --nodelist=hybrid-0-1
#SBATCH --output=serial_test_%j.log   # Standard output and error log
# Usage of this script:
#sbatch job-test.sbatch

# Jobname below is set automatically when using "qsub job-orca.sh -N jobname".
# Can alternatively be set manually here. Should be the name of the input file
# without extension (.inp or whatever).
export job=$SLURM_JOB_NAME
JOB_NAME="$SLURM_JOB_NAME"
JOB_ID="$SLURM_JOB_ID"

# Here  giving communication protocol

export RSH_COMMAND="/usr/bin/ssh -x"

#######SERIAL COMMANDS HERE

echo "HELLO WORLD"
sleep 10
echo "done"
#########################################
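
(For the record: I submit this with a plain "sbatch" of the file, check it with
"squeue -u $USER", and the output goes to serial_test_<jobid>.log as set above;
the logout described above happens as soon as this trivial job completes.)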

########## slurm.conf ##################

#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=linux
ControlMachine=orthrus
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
# OpenHPC default configuration
TaskPlugin=task/affinity
PropagateResourceLimitsExcept=MEMLOCK
JobCompType=jobcomp/filetxt
Epilog=/etc/slurm/slurm.epilog.clean
GresTypes=gpu
NodeName=hybrid-0-1 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
NodeName=hybrid-0-2 Sockets=1 Gres=gpu:titanxp:1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
NodeName=hybrid-0-3 Sockets=1 Gres=gpu:titanxp:1,gpu:gtx1080:1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=hybrid-0-4 Sockets=1 Gres=gpu:gtx980:1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=hybrid-0-5 Sockets=1 Gres=gpu:gtx980:1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=hybrid-0-7 Sockets=1 Gres=gpu:titanxp:1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
PartitionName=gpu Nodes=hybrid-0-[2-5,7] Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
PartitionName=allcpu Nodes=hybrid-0-[1-5,7] Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
PartitionName=fastcpu Nodes=hybrid-0-[3-5,7] Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
PartitionName=fastqm Nodes=hybrid-0-5 Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
SlurmctldParameters=enable_configless
ReturnToService=1

#################################################
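
(Since the cluster runs in configless mode, the configuration the nodes
actually picked up can be cross-checked on a compute node with something
along the lines of

    scontrol show config | grep -i -E 'proctracktype|epilog|taskplugin'

which should show the same ProctrackType, Epilog and TaskPlugin values as in
the file above.)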

########### gres.conf ####################

NodeName=hybrid-0-[2,3,7] Name=gpu Type=titanxp File=/dev/nvidia0 COREs=0
NodeName=hybrid-0-3 Name=gpu Type=gtx1080 File=/dev/nvidia1 COREs=1
NodeName=hybrid-0-[4-5] Name=gpu Type=gtx980 File=/dev/nvidia0 COREs=0


###############
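
(For what it's worth, the GRES definitions can be cross-checked against what
the controller sees with, e.g.,

    scontrol show node hybrid-0-3 | grep -i -E 'gres|tres'

which for that node should list the gpu:titanxp:1 and gpu:gtx1080:1 GRES
configured above.)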


Thanks, and sorry for the long message.

Andrea



-- 




¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
Andrea Carotti
Dipartimento di Scienze Farmaceutiche
Università di Perugia
Via del Liceo, 1
06123 Perugia, Italy
phone: +39 075 585 5121
fax: +39 075 585 5161
mail: andrea.carotti at unipg.it



