[slurm-users] Slurm 22.05.8 - salloc not starting shell on remote host

Brian Andrus toomuchit at gmail.com
Fri May 19 15:28:16 UTC 2023


Having salloc default to a shell on the allocated node is a newer feature.

For your version, you should:

     srun -n 1 -t 00:10:00 --mem=1G --pty bash
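
If you do want salloc itself to put you on the first allocated node, that 
behavior is controlled in slurm.conf -- if I'm reading the 22.05 man page 
right, the option is spelled use_interactive_step. A minimal, untested 
sketch:

     # enable salloc's interactive step (shell on the allocated node)
     LaunchParameters=use_interactive_step
     # optional; if I remember right, this is the documented default
     #InteractiveStepOptions="--interactive --preserve-env --pty $SHELL"

You can check which value the running daemons actually picked up with:

     scontrol show config | grep -i LaunchParameters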

Brian Andrus

On 5/19/2023 8:24 AM, Ryan Novosielski wrote:
> I’m not at a computer, and we still run an older version of Slurm, so I 
> can’t say with 100% confidence whether this has changed and I can’t 
> be too specific, but I know that this is the behavior you should 
> expect from that command. I believe that there are configuration 
> options to make it behave differently.
>
> Otherwise, you can use srun to run commands on the assigned node.
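>
> For example (an untested sketch): the shell that salloc starts on the 
> login node still carries the job allocation, so from that prompt
>
>     srun hostname
>
> should run on the assigned compute node rather than on the login node.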
>
> I think if you search this list for “interactive,” or search the Slurm 
> bugs database, you will see some other conversations about this.
>
> Sent from my iPhone
>
>> On May 19, 2023, at 10:35, Prentice Bisbal <pbisbal at pppl.gov> wrote:
>>
>> 
>>
>> I'm setting up Slurm from scratch for the first time ever. I'm using 
>> 22.05.8 since I haven't had a chance to upgrade our DB server to 
>> 23.02 yet. When I try to use salloc to get a shell on a compute node 
>> (ranger-s22-07), I end up with a shell on the login node (ranger):
>>
>> [pbisbal at ranger ~]$ salloc -n 1 -t 00:10:00 --mem=1G
>> salloc: Granted job allocation 23
>> salloc: Waiting for resource configuration
>> salloc: Nodes ranger-s22-07 are ready for job
>> [pbisbal at ranger ~]$
>>
>> Any ideas what's going wrong here? I have the following line in my 
>> slurm.conf:
>>
>> LaunchParameters=user_interactive_step
>>
>> When I run salloc with -vvvvv, here's what I see:
>>
>> [pbisbal at ranger ~]$ salloc -vvvvv -n 1 -t 00:10:00  --mem=1G
>> salloc: defined options
>> salloc: -------------------- --------------------
>> salloc: mem                 : 1G
>> salloc: ntasks              : 1
>> salloc: time                : 00:10:00
>> salloc: verbose             : 5
>> salloc: -------------------- --------------------
>> salloc: end of defined options
>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_res.so
>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Consumable Resources (CR) Node Selection plugin type:select/cons_res version:0x160508
>> salloc: select/cons_res: common_init: select/cons_res loaded
>> salloc: debug3: Success.
>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_tres.so
>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Trackable RESources (TRES) Selection plugin type:select/cons_tres version:0x160508
>> salloc: select/cons_tres: common_init: select/cons_tres loaded
>> salloc: debug3: Success.
>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cray_aries.so
>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Cray/Aries node selection plugin type:select/cray_aries version:0x160508
>> salloc: select/cray_aries: init: Cray/Aries node selection plugin loaded
>> salloc: debug3: Success.
>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_linear.so
>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Linear node selection plugin type:select/linear version:0x160508
>> salloc: select/linear: init: Linear node selection plugin loaded with argument 20
>> salloc: debug3: Success.
>> salloc: debug:  Entering slurm_allocation_msg_thr_create()
>> salloc: debug:  port from net_stream_listen is 43881
>> salloc: debug:  Entering _msg_thr_internal
>> salloc: debug4: eio: handling events for 1 objects
>> salloc: debug3: eio_message_socket_readable: shutdown 0 fd 6
>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge authentication plugin type:auth/munge version:0x160508
>> salloc: debug:  auth/munge: init: Munge authentication plugin loaded
>> salloc: debug3: Success.
>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/hash_k12.so
>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:KangarooTwelve hash plugin type:hash/k12 version:0x160508
>> salloc: debug:  hash/k12: init: init: KangarooTwelve hash plugin loaded
>> salloc: debug3: Success.
>> salloc: Granted job allocation 24
>> salloc: Waiting for resource configuration
>> salloc: Nodes ranger-s22-07 are ready for job
>> salloc: debug:  laying out the 1 tasks on 1 hosts ranger-s22-07 dist 8192
>> [pbisbal at ranger ~]$
>>
>> This is all I see in /var/log/slurm/slurmd.log on the compute node:
>>
>> [2023-05-19T10:21:36.898] [24.extern] task/cgroup: _memcg_initialize: job: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>> [2023-05-19T10:21:36.899] [24.extern] task/cgroup: _memcg_initialize: step: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>
>> And this is all I see in /var/log/slurm/slurmctld.log on the controller:
>>
>> [2023-05-19T10:18:16.815] sched: _slurm_rpc_allocate_resources JobId=23 NodeList=ranger-s22-07 usec=1136
>> [2023-05-19T10:18:22.423] Time limit exhausted for JobId=22
>> [2023-05-19T10:21:36.861] sched: _slurm_rpc_allocate_resources JobId=24 NodeList=ranger-s22-07 usec=1039
>>
>> Here's my slurm.conf file:
>>
>> # grep -v ^# /etc/slurm/slurm.conf  | grep -v ^$
>> ClusterName=ranger
>> SlurmctldHost=ranger-master
>> EnforcePartLimits=ALL
>> JobSubmitPlugins=lua,require_timelimit
>> LaunchParameters=user_interactive_step
>> MaxStepCount=2500
>> MaxTasksPerNode=32
>> MpiDefault=none
>> ProctrackType=proctrack/cgroup
>> PrologFlags=contain
>> ReturnToService=0
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmdPidFile=/var/run/slurmd.pid
>> SlurmdPort=6818
>> SlurmdSpoolDir=/var/spool/slurmd
>> SlurmUser=slurm
>> StateSaveLocation=/var/spool/slurmctld
>> SwitchType=switch/none
>> TaskPlugin=task/affinity,task/cgroup
>> TopologyPlugin=topology/tree
>> CompleteWait=32
>> InactiveLimit=0
>> KillWait=30
>> MinJobAge=300
>> SlurmctldTimeout=120
>> SlurmdTimeout=300
>> Waittime=0
>> DefMemPerCPU=5000
>> SchedulerType=sched/backfill
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core_Memory
>> PriorityType=priority/multifactor
>> PriorityDecayHalfLife=15-0
>> PriorityCalcPeriod=15
>> PriorityFavorSmall=NO
>> PriorityMaxAge=180-0
>> PriorityWeightAge=5000
>> PriorityWeightFairshare=5000
>> PriorityWeightJobSize=5000
>> AccountingStorageEnforce=all
>> AccountingStorageHost=slurm.pppl.gov
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStoreFlags=job_script
>> JobCompType=jobcomp/none
>> JobAcctGatherFrequency=30
>> JobAcctGatherParams=UsePss
>> JobAcctGatherType=jobacct_gather/cgroup
>> SlurmctldDebug=info
>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>> SlurmdDebug=info
>> SlurmdLogFile=/var/log/slurm/slurmd.log
>> NodeName=ranger-s22-07 CPUs=72 Boards=1 SocketsPerBoard=4 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=384880 State=UNKNOWN
>> PartitionName=all Nodes=ALL Default=YES GraceTime=300 MaxTime=24:00:00 State=UP
>> -- 
>> Prentice