[slurm-users] [External] Re: Down nodes
Brian Andrus
toomuchit at gmail.com
Sat Jul 31 14:16:16 UTC 2021
You should definitely upgrade, because there have been significant
improvements in that area.
You can label nodes as cloud nodes, and then simply updating a node's
state to 'power_down' will run your suspend script.
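
For example, once you are on a newer release, something along these
lines should exercise that path by hand (the node name is just one from
your own list; exact behavior depends on the version you upgrade to):

    # mark the node for power-down; slurmctld then runs your SuspendProgram
    scontrol update nodename=slurm4-compute9 state=power_down

Your nodes already carry State=CLOUD in slurm.conf, so the main change
is letting the power_down/power_up states drive the scripts.
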
Brian Andrus
On 7/30/2021 5:05 PM, Soichi Hayashi wrote:
> Brian,
>
> Yes, slurmd is not running on that node because the node itself is not
> there anymore (the whole VM is gone!). When the node is no longer in
> use, slurm automatically runs the slurm_suspend.sh script, which removes
> the whole node (VM) by running "openstack server delete $host". There
> is no server/VM, no IP address, no DNS name, nothing.
> "slurm4-compute9" only exists as a hypothetical node that can be
> launched in the future in case there are more jobs to run. That's how
> the "cloud" partition works, right?
>
> [slurm.conf]
> SuspendProgram=/usr/local/sbin/slurm_suspend.sh
> SuspendTime=600 #time in seconds before an idle node is suspended
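>
> (For reference, the suspend script boils down to roughly the
> following -- a simplified sketch, not the exact file:)
>
>     #!/bin/bash
>     # slurm_suspend.sh -- slurmctld passes the node list as $1
>     for host in $(scontrol show hostnames "$1"); do
>         # tear down the VM behind this node
>         openstack server delete "$host"
>         # clear the address slurm has cached for the node
>         scontrol update nodename="$host" nodeaddr="(null)"
>     done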
>
> I am wondering.. maybe something went wrong when slurm ran
> slurm_suspend.sh, so that slurm *thinks* the node is still there..
> so it tries to ping it, the ping fails (obviously...), and it
> marks the node as DOWN?
>
> I don't know if my theory is right or not.. but just to get our
> cluster going again, is there a way to force slurm to forget about the
> node that it "suspended" earlier? Is there a command like "scontrol
> forcesuspend node=$id"?
>
> Thank you for your help!
>
> -soichi
>
> On Fri, Jul 30, 2021 at 7:56 PM Brian Andrus <toomuchit at gmail.com> wrote:
>
> That 'not responding' is the issue and usually means one of two things
> (a couple of quick checks are sketched below):
>
> 1) slurmd is not running on the node
> 2) something on the network is stopping the communication between
> the node and the master (firewall, selinux, congestion, bad nic,
> routes, etc)
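>
> If you want to narrow it down on a node that is actually still up, a
> couple of quick checks along these lines usually help (host names are
> placeholders; the ports are the ones from your slurm.conf):
>
>     # on the compute node: is slurmd running?
>     systemctl status slurmd
>
>     # from the compute node: can it reach slurmctld?
>     nc -zv <controller-host> 6817
>
>     # from the controller: can it reach slurmd on the node?
>     nc -zv <compute-node> 6818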
>
> Brian Andrus
>
> On 7/30/2021 3:51 PM, Soichi Hayashi wrote:
>> Brian,
>>
>> Thank you for your reply and thanks for setting the email title.
>> I forgot to edit it before I sent it!
>>
>> I am not sure how to reply to your reply.. but I hope
>> this makes it to the right place..
>>
>> I've updated slurm.conf to increase the controller debug level
>> > SlurmctldDebug=5
>>
>> I now see additional log output (debug).
>>
>> [2021-07-30T22:42:05.255] debug: Spawning ping agent for slurm4-compute[2-6,10,12-14]
>> [2021-07-30T22:42:05.256] error: Nodes slurm4-compute[9,15,19-22,30] not responding, setting DOWN
>>
>> It's still very sparse, but it looks like slurm is trying to ping
>> nodes that have already been removed (they don't exist anymore, as
>> they were removed by the slurm_suspend.sh script).
>>
>> I tried sinfo -R but it doesn't really give much info..
>>
>> $ sinfo -R
>> REASON           USER   TIMESTAMP            NODELIST
>> Not responding   slurm  2021-07-30T22:42:05  slurm4-compute[9,15,19-22,30]
>>
>> These machines are gone, so they should not respond.
>>
>> $ ping slurm4-compute9
>> ping: slurm4-compute9: Name or service not known
>>
>> This is expected.
>>
>> Why does slurm keep trying to contact a node that has already been
>> removed? slurm_suspend.sh does the following to "remove" a node
>> from the partition:
>> > scontrol update nodename=${host} nodeaddr="(null)"
>> Maybe this isn't the correct way to do it? Is there a way to
>> force slurm to forget about the node? I tried "scontrol update
>> node=$node state=idle", but this only works for a few minutes,
>> until slurm's ping agent kicks in and marks them down again.
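>>
>> (Side thought while writing this: that nodeaddr="(null)" sets the
>> address to the literal string "(null)", which is presumably what the
>> ping agent then fails to resolve. I wonder if resetting the address
>> back to the node name on suspend would behave better -- just a guess,
>> untested:)
>>
>>     # hypothetical alternative for the suspend script
>>     scontrol update nodename=${host} nodeaddr=${host} nodehostname=${host}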
>>
>> Thanks!!
>> Soichi
>>
>>
>>
>>
>> On Fri, Jul 30, 2021 at 2:21 PM Soichi Hayashi <hayashis at iu.edu> wrote:
>>
>> Hello. I need help with troubleshooting our slurm cluster.
>>
>> I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public cloud
>> infrastructure (Jetstream) using the elastic computing
>> mechanism (https://slurm.schedmd.com/elastic_computing.html). Our
>> cluster works for the most part, but for some reason a few
>> of our nodes constantly go into the "down" state.
>>
>> PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
>> cloud* up 2-00:00:00 1-infinite no YES:4 all 10 idle~ slurm9-compute[1-5,10,12-15]
>> cloud* up 2-00:00:00 1-infinite no YES:4 all 5 down slurm9-compute[6-9,11]
>>
>> The only log I see in the slurm log is this..
>>
>> [2021-07-30T15:10:55.889] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING to=RESUME
>> [2021-07-30T15:21:37.339] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING* to=RESUME
>> [2021-07-30T15:27:30.039] update_node: node slurm9-compute6 reason set to: completing
>> [2021-07-30T15:27:30.040] update_node: node slurm9-compute6 state set to DOWN
>> [2021-07-30T15:27:40.830] update_node: node slurm9-compute6 state set to IDLE
>> ..
>> [2021-07-30T15:34:20.628] error: Nodes slurm9-compute[6-9,11] not responding, setting DOWN
>>
>> With elastic computing, any unused nodes are automatically
>> removed (by SuspendProgram=/usr/local/sbin/slurm_suspend.sh).
>> So nodes are *expected* to not respond once they are removed,
>> but they should not be marked as DOWN. They should simply be
>> set to "idle".
>>
>> To work around this issue, I am running the following cron job.
>>
>> 0 0 * * * scontrol update node=slurm9-compute[1-30] state=resume
>>
>> This "works" somewhat.. but our nodes go to "DOWN" state so
>> often that running this every hour is not enough.
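>>
>> (One thing I have been considering is running it more often and
>> limiting it to nodes that sinfo actually reports as down -- an
>> untested sketch, with a made-up script name:)
>>
>>     #!/bin/bash
>>     # /usr/local/sbin/resume_down_nodes.sh (hypothetical helper)
>>     # collect the nodes currently reported as down, if any
>>     down=$(sinfo -h -t down -o '%N' | paste -sd, -)
>>     # clear them back to service; no-op when nothing is down
>>     [ -n "$down" ] && scontrol update nodename="$down" state=resume
>>
>> and from cron, e.g. hourly:
>>
>>     0 * * * * /usr/local/sbin/resume_down_nodes.sh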
>>
>> Here is the full content of our slurm.conf
>>
>> root at slurm9:~# cat /etc/slurm-llnl/slurm.conf
>> ClusterName=slurm9
>> ControlMachine=slurm9
>>
>> SlurmUser=slurm
>> SlurmdUser=root
>> SlurmctldPort=6817
>> SlurmdPort=6818
>> AuthType=auth/munge
>> StateSaveLocation=/tmp
>> SlurmdSpoolDir=/tmp/slurmd
>> SwitchType=switch/none
>> MpiDefault=none
>> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
>> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
>> ProctrackType=proctrack/pgid
>> ReturnToService=1
>> Prolog=/usr/local/sbin/slurm_prolog.sh
>>
>> #
>> # TIMERS
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>> #make slurm a little more tolerant here
>> MessageTimeout=30
>> TCPTimeout=15
>> BatchStartTimeout=20
>> GetEnvTimeout=20
>> InactiveLimit=0
>> MinJobAge=604800
>> KillWait=30
>> Waittime=0
>> #
>> # SCHEDULING
>> SchedulerType=sched/backfill
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU_Memory
>> #FastSchedule=0
>>
>> # LOGGING
>> SlurmctldDebug=3
>> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
>> SlurmdDebug=3
>> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
>> JobCompType=jobcomp/none
>>
>> # ACCOUNTING
>> JobAcctGatherType=jobacct_gather/linux
>> JobAcctGatherFrequency=30
>>
>> AccountingStorageType=accounting_storage/filetxt
>> AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log
>>
>> #CLOUD CONFIGURATION
>> PrivateData=cloud
>> ResumeProgram=/usr/local/sbin/slurm_resume.sh
>> SuspendProgram=/usr/local/sbin/slurm_suspend.sh
>> ResumeRate=1 #number of nodes per minute that can be created; 0 means no limit
>> ResumeTimeout=900 #max time in seconds between ResumeProgram running and when the node is ready for use
>> SuspendRate=1 #number of nodes per minute that can be suspended/destroyed
>> SuspendTime=600 #time in seconds before an idle node is suspended
>> SuspendTimeout=300 #time between running SuspendProgram and the node being completely down
>> TreeWidth=30
>>
>> NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24 RealMemory=60388
>> PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15] Default=YES MaxTime=48:00:00 State=UP Shared=YES
>>
>> I appreciate your assistance!
>>
>> Soichi Hayashi
>> Indiana University
>>
>>