[slurm-users] [External] Re: Down nodes

Soichi Hayashi hayashis at iu.edu
Sat Jul 31 00:05:35 UTC 2021


Brian,

Yes, slurmd is not running on that node because the node itself is not
there anymore (the whole VM is gone!). When a node is no longer in use,
slurm automatically runs the slurm_suspend.sh script, which removes the whole
node (VM) by running "openstack server delete $host". There is no server/VM,
no IP address, no DNS name, nothing. "slurm4-compute9" only exists as a
hypothetical node that can be launched in the future in case there are more
jobs to run. That's how the "cloud" partition works, right?

[slurm.conf]
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
SuspendTime=600 #time in seconds before an idle node is suspended
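
In case it helps, the suspend script is roughly along these lines (a
simplified sketch, not the exact script; the real one has more logging and
error handling):

#!/bin/bash
# slurm_suspend.sh - simplified sketch
# Slurm passes the hostlist of nodes to suspend as $1, e.g. "slurm4-compute[9,15]"
for host in $(scontrol show hostnames "$1"); do
    # tear down the OpenStack VM backing this node
    openstack server delete "$host"
    # clear the node's address so slurmctld no longer has an IP for it
    scontrol update nodename="$host" nodeaddr="(null)"
done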

I am wondering... maybe something went wrong when slurm ran slurm_suspend.sh,
so slurm *thinks* the node is still there, tries to ping it, fails to ping it
(obviously...), and marks it as DOWN?

I don't know if my theory is right, but just to get our cluster going again,
is there a way to force slurm to forget about a node that it "suspended"
earlier? Is there a command like "scontrol forcesuspend node=$id"?

Thank you for your help!

-soichi

On Fri, Jul 30, 2021 at 7:56 PM Brian Andrus <toomuchit at gmail.com> wrote:

>
> That 'not responding' is the issue and usually means 1 of 2 things:
>
> 1) slurmd is not running on the node
> 2) something on the network is stopping the communication between the node
> and the master (firewall, selinux, congestion, bad nic, routes, etc)
>
> Brian Andrus
> On 7/30/2021 3:51 PM, Soichi Hayashi wrote:
>
> Brian,
>
> Thank you for your reply, and thanks for setting the email title. I forgot
> to edit it before I sent it!
>
> I am not sure how I can reply to your reply... but I hope this makes it
> to the right place.
>
> I've updated slurm.conf to increase the controller debug level
> > SlurmctldDebug=5
>
> I now see additional log output (debug).
>
> [2021-07-30T22:42:05.255] debug:  Spawning ping agent for slurm4-compute[2-6,10,12-14]
> [2021-07-30T22:42:05.256] error: Nodes slurm4-compute[9,15,19-22,30] not responding, setting DOWN
>
> It's still very sparse, but it looks like slurm is trying to ping nodes
> that have already been removed (they don't exist anymore, as they were
> removed by the slurm_suspend.sh script).
>
> I tried sinfo -R but it doesn't really give much info..
>
> $ sinfo -R
> REASON               USER      TIMESTAMP           NODELIST
> Not responding       slurm     2021-07-30T22:42:05 slurm4-compute[9,15,19-22,30]
>
> These machines are gone, so they should not respond.
>
> $ ping slurm4-compute9
> ping: slurm4-compute9: Name or service not known
>
> This is expected.
>
> Why does slurm keep trying to contact nodes that have already been removed?
> slurm_suspend.sh does the following to "remove" a node from the partition:
> > scontrol update nodename=${host} nodeaddr="(null)"
> Maybe this isn't the correct way to do it? Is there a way to force slurm
> to forget about the node? I tried "scontrol update node=$node state=idle",
> but this only works for a few minutes, until slurm's ping agent kicks in and
> marks them down again.
>
> Thanks!!
> Soichi
>
>
>
>
> On Fri, Jul 30, 2021 at 2:21 PM Soichi Hayashi <hayashis at iu.edu> wrote:
>
>> Hello. I need help with troubleshooting our slurm cluster.
>>
>> I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public cloud
>> infrastructure (Jetstream) using the elastic computing mechanism (
>> https://slurm.schedmd.com/elastic_computing.html). Our cluster works for
>> the most part, but for some reason a few of our nodes constantly go into
>> the "down" state.
>>
>> PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES  STATE NODELIST
>> cloud*       up 2-00:00:00 1-infinite   no    YES:4        all     10  idle~ slurm9-compute[1-5,10,12-15]
>> cloud*       up 2-00:00:00 1-infinite   no    YES:4        all      5   down slurm9-compute[6-9,11]
>>
>> The only log I see in the slurm log is this..
>>
>> [2021-07-30T15:10:55.889] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING to=RESUME
>> [2021-07-30T15:21:37.339] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING* to=RESUME
>> [2021-07-30T15:27:30.039] update_node: node slurm9-compute6 reason set to: completing
>> [2021-07-30T15:27:30.040] update_node: node slurm9-compute6 state set to DOWN
>> [2021-07-30T15:27:40.830] update_node: node slurm9-compute6 state set to IDLE
>> ..
>> [2021-07-30T15:34:20.628] error: Nodes slurm9-compute[6-9,11] not responding, setting DOWN
>>
>> With elastic computing, any unused nodes are automatically removed
>> (by SuspendProgram=/usr/local/sbin/slurm_suspend.sh). So nodes are
>> *expected* to stop responding once they are removed, but they should not be
>> marked as DOWN. They should simply be set to "idle".
>>
>> To work around this issue, I am running the following cron job.
>>
>> 0 0 * * * scontrol update node=slurm9-compute[1-30] state=resume
>>
>> This "works" somewhat.. but our nodes go to "DOWN" state so often that
>> running this every hour is not enough.
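>>
>> Just as an illustration (a sketch, not what the cron above actually runs),
>> a more targeted variant would reset only the nodes currently reported down:
>>
>> # sketch: resume only the nodes sinfo currently reports as down
>> for node in $(sinfo -h -N --states=down -o "%N" | sort -u); do
>>     scontrol update nodename="$node" state=resume
>> done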
>>
>> Here is the full content of our slurm.conf
>>
>> root at slurm9:~# cat /etc/slurm-llnl/slurm.conf
>> ClusterName=slurm9
>> ControlMachine=slurm9
>>
>> SlurmUser=slurm
>> SlurmdUser=root
>> SlurmctldPort=6817
>> SlurmdPort=6818
>> AuthType=auth/munge
>> StateSaveLocation=/tmp
>> SlurmdSpoolDir=/tmp/slurmd
>> SwitchType=switch/none
>> MpiDefault=none
>> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
>> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
>> ProctrackType=proctrack/pgid
>> ReturnToService=1
>> Prolog=/usr/local/sbin/slurm_prolog.sh
>>
>> #
>> # TIMERS
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>> #make slurm a little more tolerant here
>> MessageTimeout=30
>> TCPTimeout=15
>> BatchStartTimeout=20
>> GetEnvTimeout=20
>> InactiveLimit=0
>> MinJobAge=604800
>> KillWait=30
>> Waittime=0
>> #
>> # SCHEDULING
>> SchedulerType=sched/backfill
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU_Memory
>> #FastSchedule=0
>>
>> # LOGGING
>> SlurmctldDebug=3
>> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
>> SlurmdDebug=3
>> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
>> JobCompType=jobcomp/none
>>
>> # ACCOUNTING
>> JobAcctGatherType=jobacct_gather/linux
>> JobAcctGatherFrequency=30
>>
>> AccountingStorageType=accounting_storage/filetxt
>> AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log
>>
>> #CLOUD CONFIGURATION
>> PrivateData=cloud
>> ResumeProgram=/usr/local/sbin/slurm_resume.sh
>> SuspendProgram=/usr/local/sbin/slurm_suspend.sh
>> ResumeRate=1 #number of nodes per minute that can be created; 0 means no limit
>> ResumeTimeout=900 #max time in seconds between ResumeProgram running and when the node is ready for use
>> SuspendRate=1 #number of nodes per minute that can be suspended/destroyed
>> SuspendTime=600 #time in seconds before an idle node is suspended
>> SuspendTimeout=300 #time between running SuspendProgram and the node being completely down
>> TreeWidth=30
>>
>> NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24 RealMemory=60388
>> PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15] Default=YES MaxTime=48:00:00 State=UP Shared=YES
>>
>> I appreciate your assistance!
>>
>> Soichi Hayashi
>> Indiana University
>>
>>
>>