[slurm-users] [External] Re: Slurm 22.05.8 - salloc not starting shell on remote host
Prentice Bisbal
pbisbal at pppl.gov
Fri May 19 17:11:07 UTC 2023
Brian,
Thanks for the reply. I was hoping that would be the fix, but that
doesn't seem to be the case. I'm using 22.05.8, which isn't that old. I
double-checked the archived documentation for version 22.05.8, and
setting
LaunchParameters=use_interactive_step
should be valid here. From
https://slurm.schedmd.com/archive/slurm-22.05.8/slurm.conf.html:
> *use_interactive_step*
> Have salloc use the Interactive Step to launch a shell on an
> allocated compute node rather than locally to wherever salloc was
> invoked. This is accomplished by launching the srun command with
> InteractiveStepOptions as options.
>
> This does not affect salloc called with a command as an argument.
> These jobs will continue to be executed as the calling user on the
> calling host.
>
and
> *InteractiveStepOptions*
> When LaunchParameters=use_interactive_step is enabled, launching
> salloc will automatically start an srun process with
> InteractiveStepOptions to launch a terminal on a node in the job
> allocation. The default value is "--interactive --preserve-env
> --pty $SHELL". The "--interactive" option is intentionally not
> documented in the srun man page. It is meant only to be used in
> *InteractiveStepOptions* in order to create an "interactive step"
> that will not consume resources so that other steps may run in
> parallel with the interactive step.
>
According to that, setting LaunchParameters=use_interactive_step should
be enough, since "--interactive --preserve-env --pty $SHELL" is the
default.
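
For reference, a minimal sketch of the slurm.conf lines this reading of
the documentation implies. The second setting only spells out the
documented default quoted above, so it is assumed to be optional:

```
# Have salloc start a shell on the allocated node via an interactive step
LaunchParameters=use_interactive_step
# Assumed optional: the default value quoted from the 22.05.8 slurm.conf page
InteractiveStepOptions="--interactive --preserve-env --pty $SHELL"
```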
A colleague pointed out that my slurm.conf was setting LaunchParameters
to "user_interactive_step" when it should be "use_interactive_step", but
changing that didn't fix my problem, just changed it. Now when I try to
start an interactive shell, it just hangs and eventually returns an error:
[pbisbal at ranger ~]$ salloc -n 1 -t 00:10:00 --mem=1G
salloc: Granted job allocation 29
salloc: Waiting for resource configuration
salloc: Nodes ranger-s22-07 are ready for job
srun: error: timeout waiting for task launch, started 0 of 1 tasks
srun: launch/slurm: launch_p_step_launch: StepId=29.interactive aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
salloc: Relinquishing job allocation 29
[pbisbal at ranger ~]$
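
Since the earlier misspelling was accepted without any complaint, a
quick check of the option's spelling can rule it out. The following is
only a sketch: the helper names launch_params and has_interactive_step
are made up for illustration, and it assumes a plain KEY=VALUE line with
no trailing comment.

```shell
# Hypothetical helper: print the value of the last uncommented
# LaunchParameters= line in the slurm.conf-style file given as $1.
launch_params() {
    grep -i '^[[:space:]]*LaunchParameters=' "$1" | tail -n 1 | cut -d= -f2-
}

# Hypothetical helper: succeed only if use_interactive_step appears as a
# whole comma-separated token, so "user_interactive_step" does not match.
has_interactive_step() {
    launch_params "$1" | tr ',' '\n' | grep -qx 'use_interactive_step'
}
```

Called as has_interactive_step /etc/slurm/slurm.conf, this only inspects
the file on disk. On a live cluster, scontrol show config reports the
value the controller actually loaded, which is the more telling check
after an edit.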
On 5/19/23 11:28 AM, Brian Andrus wrote:
>
> Defaulting to a shell for salloc is a newer feature.
>
> For your version, you should:
>
> srun -n 1 -t 00:10:00 --mem=1G --pty bash
>
> Brian Andrus
>
> On 5/19/2023 8:24 AM, Ryan Novosielski wrote:
>> I’m not at a computer, and we still run an older version of Slurm,
>> so I can’t say with 100% confidence whether this has changed and I
>> can’t be too specific, but I know that this is the behavior you
>> should expect from that command. I believe that there are
>> configuration options to make it behave differently.
>>
>> Otherwise, you can use srun to run commands on the assigned node.
>>
>> I think if you search this list for “interactive,” or search the
>> Slurm bugs database, you will see some other conversations about this.
>>
>> Sent from my iPhone
>>
>>> On May 19, 2023, at 10:35, Prentice Bisbal <pbisbal at pppl.gov> wrote:
>>>
>>>
>>>
>>> I'm setting up Slurm from scratch for the first time ever. Using
>>> 22.05.8 since I haven't had a chance to upgrade our DB server to
>>> 23.02 yet. When I try to use salloc to get a shell on a compute node
>>> (ranger-s22-07), I end up with a shell on the login node (ranger):
>>>
>>> [pbisbal at ranger ~]$ salloc -n 1 -t 00:10:00 --mem=1G
>>> salloc: Granted job allocation 23
>>> salloc: Waiting for resource configuration
>>> salloc: Nodes ranger-s22-07 are ready for job
>>> [pbisbal at ranger ~]$
>>>
>>> Any ideas what's going wrong here? I have the following line in my
>>> slurm.conf:
>>>
>>> LaunchParameters=user_interactive_step
>>>
>>> When I run salloc with -vvvvv, here's what I see:
>>>
>>> [pbisbal at ranger ~]$ salloc -vvvvv -n 1 -t 00:10:00 --mem=1G
>>> salloc: defined options
>>> salloc: -------------------- --------------------
>>> salloc: mem : 1G
>>> salloc: ntasks : 1
>>> salloc: time : 00:10:00
>>> salloc: verbose : 5
>>> salloc: -------------------- --------------------
>>> salloc: end of defined options
>>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_res.so
>>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Consumable Resources (CR) Node Selection plugin type:select/cons_res version:0x160508
>>> salloc: select/cons_res: common_init: select/cons_res loaded
>>> salloc: debug3: Success.
>>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_tres.so
>>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Trackable RESources (TRES) Selection plugin type:select/cons_tres version:0x160508
>>> salloc: select/cons_tres: common_init: select/cons_tres loaded
>>> salloc: debug3: Success.
>>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cray_aries.so
>>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Cray/Aries node selection plugin type:select/cray_aries version:0x160508
>>> salloc: select/cray_aries: init: Cray/Aries node selection plugin loaded
>>> salloc: debug3: Success.
>>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_linear.so
>>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Linear node selection plugin type:select/linear version:0x160508
>>> salloc: select/linear: init: Linear node selection plugin loaded with argument 20
>>> salloc: debug3: Success.
>>> salloc: debug: Entering slurm_allocation_msg_thr_create()
>>> salloc: debug: port from net_stream_listen is 43881
>>> salloc: debug: Entering _msg_thr_internal
>>> salloc: debug4: eio: handling events for 1 objects
>>> salloc: debug3: eio_message_socket_readable: shutdown 0 fd 6
>>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
>>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge authentication plugin type:auth/munge version:0x160508
>>> salloc: debug: auth/munge: init: Munge authentication plugin loaded
>>> salloc: debug3: Success.
>>> salloc: debug3: Trying to load plugin /usr/lib64/slurm/hash_k12.so
>>> salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:KangarooTwelve hash plugin type:hash/k12 version:0x160508
>>> salloc: debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
>>> salloc: debug3: Success.
>>> salloc: Granted job allocation 24
>>> salloc: Waiting for resource configuration
>>> salloc: Nodes ranger-s22-07 are ready for job
>>> salloc: debug: laying out the 1 tasks on 1 hosts ranger-s22-07 dist 8192
>>> [pbisbal at ranger ~]$
>>>
>>> This is all I see in /var/log/slurm/slurmd.log on the compute node:
>>>
>>> [2023-05-19T10:21:36.898] [24.extern] task/cgroup: _memcg_initialize: job: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>> [2023-05-19T10:21:36.899] [24.extern] task/cgroup: _memcg_initialize: step: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>>
>>> And this is all I see in /var/log/slurm/slurmctld.log on the
>>> controller:
>>>
>>> [2023-05-19T10:18:16.815] sched: _slurm_rpc_allocate_resources JobId=23 NodeList=ranger-s22-07 usec=1136
>>> [2023-05-19T10:18:22.423] Time limit exhausted for JobId=22
>>> [2023-05-19T10:21:36.861] sched: _slurm_rpc_allocate_resources JobId=24 NodeList=ranger-s22-07 usec=1039
>>>
>>> Here's my slurm.conf file:
>>>
>>> # grep -v ^# /etc/slurm/slurm.conf | grep -v ^$
>>> ClusterName=ranger
>>> SlurmctldHost=ranger-master
>>> EnforcePartLimits=ALL
>>> JobSubmitPlugins=lua,require_timelimit
>>> LaunchParameters=user_interactive_step
>>> MaxStepCount=2500
>>> MaxTasksPerNode=32
>>> MpiDefault=none
>>> ProctrackType=proctrack/cgroup
>>> PrologFlags=contain
>>> ReturnToService=0
>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>> SlurmctldPort=6817
>>> SlurmdPidFile=/var/run/slurmd.pid
>>> SlurmdPort=6818
>>> SlurmdSpoolDir=/var/spool/slurmd
>>> SlurmUser=slurm
>>> StateSaveLocation=/var/spool/slurmctld
>>> SwitchType=switch/none
>>> TaskPlugin=task/affinity,task/cgroup
>>> TopologyPlugin=topology/tree
>>> CompleteWait=32
>>> InactiveLimit=0
>>> KillWait=30
>>> MinJobAge=300
>>> SlurmctldTimeout=120
>>> SlurmdTimeout=300
>>> Waittime=0
>>> DefMemPerCPU=5000
>>> SchedulerType=sched/backfill
>>> SelectType=select/cons_tres
>>> SelectTypeParameters=CR_Core_Memory
>>> PriorityType=priority/multifactor
>>> PriorityDecayHalfLife=15-0
>>> PriorityCalcPeriod=15
>>> PriorityFavorSmall=NO
>>> PriorityMaxAge=180-0
>>> PriorityWeightAge=5000
>>> PriorityWeightFairshare=5000
>>> PriorityWeightJobSize=5000
>>> AccountingStorageEnforce=all
>>> AccountingStorageHost=slurm.pppl.gov
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> AccountingStoreFlags=job_script
>>> JobCompType=jobcomp/none
>>> JobAcctGatherFrequency=30
>>> JobAcctGatherParams=UsePss
>>> JobAcctGatherType=jobacct_gather/cgroup
>>> SlurmctldDebug=info
>>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>> SlurmdDebug=info
>>> SlurmdLogFile=/var/log/slurm/slurmd.log
>>> NodeName=ranger-s22-07 CPUs=72 Boards=1 SocketsPerBoard=4 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=384880 State=UNKNOWN
>>> PartitionName=all Nodes=ALL Default=YES GraceTime=300 MaxTime=24:00:00 State=UP
>>> --
>>> Prentice