[slurm-users] [External] Re: Slurm 22.05.8 - salloc not starting shell on remote host

Prentice Bisbal pbisbal at pppl.gov
Fri May 19 17:37:05 UTC 2023


This is fixed. I was a little overzealous with my iptables rules on the 
login host and was blocking traffic from the compute node back to the 
login node.
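
For the record, the rules were dropping the connections the interactive 
step makes from the compute node back to salloc/srun on the login node. 
A sketch of what allows that traffic (the compute subnet below is 
illustrative, not our actual addressing):

    # login node: accept everything from the compute subnet
    iptables -A INPUT -s 192.168.22.0/24 -j ACCEPT

A narrower option is to pin srun's listening ports with SrunPortRange 
in slurm.conf and open only that range:

    # slurm.conf
    SrunPortRange=60001-63000

    # login node: open just that range to the compute subnet
    iptables -A INPUT -p tcp -s 192.168.22.0/24 --dport 60001:63000 -j ACCEPT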

Thanks to Ryan and Brian for the quick replies offering suggestions.

Prentice

On 5/19/23 1:11 PM, Prentice Bisbal wrote:
>
> Brian,
>
> Thanks for the reply. I was hoping that would be the fix, but that 
> doesn't seem to be the case. I'm using 22.05.8, which isn't that old. 
> I double-checked the documentation archive for version 22.05.8, and 
> setting
>
> LaunchParameters=use_interactive_step
>
> should be valid here. From 
> https://slurm.schedmd.com/archive/slurm-22.05.8/slurm.conf.html:
>
>> *use_interactive_step*
>>     Have salloc use the Interactive Step to launch a shell on an
>>     allocated compute node rather than locally to wherever salloc was
>>     invoked. This is accomplished by launching the srun command with
>>     InteractiveStepOptions as options.
>>
>>     This does not affect salloc called with a command as an argument.
>>     These jobs will continue to be executed as the calling user on
>>     the calling host.
>>
> and
>
>> *InteractiveStepOptions*
>>     When LaunchParameters=use_interactive_step is enabled, launching
>>     salloc will automatically start an srun process with
>>     InteractiveStepOptions to launch a terminal on a node in the job
>>     allocation. The default value is "--interactive --preserve-env
>>     --pty $SHELL". The "--interactive" option is intentionally not
>>     documented in the srun man page. It is meant only to be used in
>>     *InteractiveStepOptions* in order to create an "interactive step"
>>     that will not consume resources so that other steps may run in
>>     parallel with the interactive step. 
>>
> According to that, setting LaunchParameters=use_interactive_step 
> should be enough, since "--interactive --preserve-env --pty $SHELL" is 
> the default.
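>
> For completeness, if I ever wanted to override that default (say, to 
> force /bin/bash regardless of the user's $SHELL), the docs suggest it 
> would look something like this (untested on my end):
>
>     LaunchParameters=use_interactive_step
>     InteractiveStepOptions="--interactive --preserve-env --pty /bin/bash"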
>
> A colleague pointed out that my slurm.conf was setting 
> LaunchParameters to "user_interactive_step" when it should be 
> "use_interactive_step". Correcting that didn't fix my problem, just 
> changed the symptom: now when I try to start an interactive shell, it 
> hangs for a while and eventually returns an error:
>
> [pbisbal at ranger ~]$ salloc -n 1 -t 00:10:00 --mem=1G
> salloc: Granted job allocation 29
> salloc: Waiting for resource configuration
> salloc: Nodes ranger-s22-07 are ready for job
> srun: error: timeout waiting for task launch, started 0 of 1 tasks
> srun: launch/slurm: launch_p_step_launch: StepId=29.interactive 
> aborted before step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
> salloc: Relinquishing job allocation 29
> [pbisbal at ranger ~]$
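>
> (The "timeout waiting for task launch" makes me suspect the step 
> starts on the compute node but can't connect back to the port srun is 
> listening on. A quick way to test that path from the compute node, 
> using the ephemeral port salloc reports under -vvvvv ("port from 
> net_stream_listen is 43881" in the trace quoted below), would be 
> something like:
>
>     nc -zv ranger 43881
>
> though that port changes with every invocation.)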
>
>
>
> On 5/19/23 11:28 AM, Brian Andrus wrote:
>>
>> Having salloc drop you into a shell on the allocated node is a newer 
>> feature.
>>
>> For your version, you should:
>>
>>     srun -n 1 -t 00:10:00 --mem=1G --pty bash
>>
>> Brian Andrus
>>
>> On 5/19/2023 8:24 AM, Ryan Novosielski wrote:
>>> I’m not at a computer, and we run an older version of Slurm, so I 
>>> can’t say with 100% confidence whether this has changed, and I can’t 
>>> be too specific, but I know that this is the behavior you should 
>>> expect from that command. I believe that there are configuration 
>>> options to make it behave differently.
>>>
>>> Otherwise, you can use srun to run commands on the assigned node.
>>>
>>> I think if you search this list for “interactive,” or search the 
>>> Slurm bugs database, you will see some other conversations about this.
>>>
>>> Sent from my iPhone
>>>
>>>> On May 19, 2023, at 10:35, Prentice Bisbal <pbisbal at pppl.gov> wrote:
>>>>
>>>> I'm setting up Slurm from scratch for the first time ever. I'm using 
>>>> 22.05.8 since I haven't had a chance to upgrade our DB server to 
>>>> 23.02 yet. When I try to use salloc to get a shell on a compute 
>>>> node (ranger-s22-07), I end up with a shell on the login node 
>>>> (ranger):
>>>>
>>>> [pbisbal at ranger ~]$ salloc -n 1 -t 00:10:00 --mem=1G
>>>> salloc: Granted job allocation 23
>>>> salloc: Waiting for resource configuration
>>>> salloc: Nodes ranger-s22-07 are ready for job
>>>> [pbisbal at ranger ~]$
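>>>>
>>>> (The prompt after allocation still shows ranger, the login node; a 
>>>> quick check like
>>>>
>>>>     srun hostname
>>>>
>>>> inside the allocation would print ranger-s22-07, showing the job 
>>>> itself lands on the compute node even though the shell stays local.)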
>>>>
>>>> Any ideas what's going wrong here? I have the following line in my 
>>>> slurm.conf:
>>>>
>>>> LaunchParameters=user_interactive_step
>>>>
>>>> When I run salloc with -vvvvv, here's what I see:
>>>>
>>>> [pbisbal at ranger ~]$ salloc -vvvvv -n 1 -t 00:10:00  --mem=1G
>>>> salloc: defined options
>>>> salloc: -------------------- --------------------
>>>> salloc: mem                 : 1G
>>>> salloc: ntasks              : 1
>>>> salloc: time                : 00:10:00
>>>> salloc: verbose             : 5
>>>> salloc: -------------------- --------------------
>>>> salloc: end of defined options
>>>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_res.so
>>>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Consumable Resources (CR) Node Selection plugin type:select/cons_res version:0x160508
>>>> salloc: select/cons_res: common_init: select/cons_res loaded
>>>> salloc: debug3: Success.
>>>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_tres.so
>>>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Trackable RESources (TRES) Selection plugin type:select/cons_tres version:0x160508
>>>> salloc: select/cons_tres: common_init: select/cons_tres loaded
>>>> salloc: debug3: Success.
>>>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cray_aries.so
>>>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Cray/Aries node selection plugin type:select/cray_aries version:0x160508
>>>> salloc: select/cray_aries: init: Cray/Aries node selection plugin loaded
>>>> salloc: debug3: Success.
>>>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_linear.so
>>>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Linear node selection plugin type:select/linear version:0x160508
>>>> salloc: select/linear: init: Linear node selection plugin loaded with argument 20
>>>> salloc: debug3: Success.
>>>> salloc: debug:  Entering slurm_allocation_msg_thr_create()
>>>> salloc: debug:  port from net_stream_listen is 43881
>>>> salloc: debug:  Entering _msg_thr_internal
>>>> salloc: debug4: eio: handling events for 1 objects
>>>> salloc: debug3: eio_message_socket_readable: shutdown 0 fd 6
>>>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
>>>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge authentication plugin type:auth/munge version:0x160508
>>>> salloc: debug:  auth/munge: init: Munge authentication plugin loaded
>>>> salloc: debug3: Success.
>>>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/hash_k12.so
>>>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:KangarooTwelve hash plugin type:hash/k12 version:0x160508
>>>> salloc: debug:  hash/k12: init: init: KangarooTwelve hash plugin loaded
>>>> salloc: debug3: Success.
>>>> salloc: Granted job allocation 24
>>>> salloc: Waiting for resource configuration
>>>> salloc: Nodes ranger-s22-07 are ready for job
>>>> salloc: debug:  laying out the 1 tasks on 1 hosts ranger-s22-07 dist 8192
>>>> [pbisbal at ranger ~]$
>>>>
>>>> This is all I see in /var/log/slurm/slurmd.log on the compute node:
>>>>
>>>> [2023-05-19T10:21:36.898] [24.extern] task/cgroup: _memcg_initialize: job: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>>> [2023-05-19T10:21:36.899] [24.extern] task/cgroup: _memcg_initialize: step: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>>>
>>>> And this is all I see in /var/log/slurm/slurmctld.log on the 
>>>> controller:
>>>>
>>>> [2023-05-19T10:18:16.815] sched: _slurm_rpc_allocate_resources JobId=23 NodeList=ranger-s22-07 usec=1136
>>>> [2023-05-19T10:18:22.423] Time limit exhausted for JobId=22
>>>> [2023-05-19T10:21:36.861] sched: _slurm_rpc_allocate_resources JobId=24 NodeList=ranger-s22-07 usec=1039
>>>>
>>>> Here's my slurm.conf file:
>>>>
>>>> # grep -v ^# /etc/slurm/slurm.conf  | grep -v ^$
>>>> ClusterName=ranger
>>>> SlurmctldHost=ranger-master
>>>> EnforcePartLimits=ALL
>>>> JobSubmitPlugins=lua,require_timelimit
>>>> LaunchParameters=user_interactive_step
>>>> MaxStepCount=2500
>>>> MaxTasksPerNode=32
>>>> MpiDefault=none
>>>> ProctrackType=proctrack/cgroup
>>>> PrologFlags=contain
>>>> ReturnToService=0
>>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>>> SlurmctldPort=6817
>>>> SlurmdPidFile=/var/run/slurmd.pid
>>>> SlurmdPort=6818
>>>> SlurmdSpoolDir=/var/spool/slurmd
>>>> SlurmUser=slurm
>>>> StateSaveLocation=/var/spool/slurmctld
>>>> SwitchType=switch/none
>>>> TaskPlugin=task/affinity,task/cgroup
>>>> TopologyPlugin=topology/tree
>>>> CompleteWait=32
>>>> InactiveLimit=0
>>>> KillWait=30
>>>> MinJobAge=300
>>>> SlurmctldTimeout=120
>>>> SlurmdTimeout=300
>>>> Waittime=0
>>>> DefMemPerCPU=5000
>>>> SchedulerType=sched/backfill
>>>> SelectType=select/cons_tres
>>>> SelectTypeParameters=CR_Core_Memory
>>>> PriorityType=priority/multifactor
>>>> PriorityDecayHalfLife=15-0
>>>> PriorityCalcPeriod=15
>>>> PriorityFavorSmall=NO
>>>> PriorityMaxAge=180-0
>>>> PriorityWeightAge=5000
>>>> PriorityWeightFairshare=5000
>>>> PriorityWeightJobSize=5000
>>>> AccountingStorageEnforce=all
>>>> AccountingStorageHost=slurm.pppl.gov
>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>> AccountingStoreFlags=job_script
>>>> JobCompType=jobcomp/none
>>>> JobAcctGatherFrequency=30
>>>> JobAcctGatherParams=UsePss
>>>> JobAcctGatherType=jobacct_gather/cgroup
>>>> SlurmctldDebug=info
>>>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>>> SlurmdDebug=info
>>>> SlurmdLogFile=/var/log/slurm/slurmd.log
>>>> NodeName=ranger-s22-07 CPUs=72 Boards=1 SocketsPerBoard=4 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=384880 State=UNKNOWN
>>>> PartitionName=all Nodes=ALL Default=YES GraceTime=300 MaxTime=24:00:00 State=UP
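>>>>
>>>> (To double-check what the running daemons actually parsed from this 
>>>> file, something like
>>>>
>>>>     scontrol show config | grep -iE 'launchparameters|interactivestep'
>>>>
>>>> should echo the LaunchParameters and InteractiveStepOptions values 
>>>> back.)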
>>>> -- 
>>>> Prentice