[slurm-users] [External] Re: Down nodes

Brian Andrus toomuchit at gmail.com
Sat Jul 31 14:16:16 UTC 2021


You should definitely upgrade because there have been significant 
improvements in that area.

You can label nodes as cloud nodes; then merely updating a node's state
to 'power_down' will run your suspend script.
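
For example, with a config like the one later in this thread (where the
nodes already have State=CLOUD), manually powering a node down on a recent
Slurm release would look roughly like this; exact behavior varies by
version, so treat it as an illustration rather than a recipe:

NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24 RealMemory=60388
scontrol update nodename=slurm9-compute6 state=power_down

The state change tells slurmctld to run your SuspendProgram for that node
and treat it as powered down, the idea being that a powered-down cloud
node is not expected to answer pings.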

Brian Andrus

On 7/30/2021 5:05 PM, Soichi Hayashi wrote:
> Brian,
>
> Yes, slurmd is not running on that node because the node itself is not 
> there anymore (the whole VM is gone!). When the node is no longer in 
> use, slurm automatically runs the slurm_suspend.sh script, which removes 
> the whole node (VM) by running "openstack server delete $host". There 
> is no server/VM, no IP address, no DNS name, nothing. 
> "slurm4-compute9" only exists as a hypothetical node that can be 
> launched in the future in case there are more jobs to run. That's how 
> a "cloud" partition works, right?
>
> [slurm.conf]
> SuspendProgram=/usr/local/sbin/slurm_suspend.sh
> SuspendTime=600 #time in seconds before an idle node is suspended
>
> I am wondering.. maybe something went wrong when slurm ran 
> slurm_suspend.sh, so that slurm *thinks* the node is still there.. 
> so it tries to ping it, the ping fails (obviously...), and it 
> marks the node as DOWN?
>
> I don't know if my theory is right or not.. but just to get our 
> cluster going again, is there a way to force slurm to forget about the 
> node that it "suspended" earlier? Is there a command like "scontrol 
> forcesuspend node=$id"?
>
> Thank you for your help!
>
> -soichi
>
> On Fri, Jul 30, 2021 at 7:56 PM Brian Andrus <toomuchit at gmail.com> wrote:
>
>     That 'not responding' is the issue and usually means one of two things:
>
>     1) slurmd is not running on the node
>     2) something on the network is stopping the communication between
>     the node and the master (firewall, SELinux, congestion, bad NIC,
>     routes, etc.)
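>
>     A few generic checks to tell those two apart (nothing specific to
>     this setup is assumed):
>
>     systemctl status slurmd   # on the compute node: is slurmd running?
>     scontrol ping             # from the node: can it reach slurmctld?
>     sinfo -R                  # on the controller: why were nodes downed?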
>
>     Brian Andrus
>
>     On 7/30/2021 3:51 PM, Soichi Hayashi wrote:
>>     Brian,
>>
>>     Thank you for your reply and thanks for setting the email title.
>>     I forgot to edit it before I sent it!
>>
>>     I am not sure how I can reply to your reply.. but I hope
>>     this makes it to the right place..
>>
>>     I've updated slurm.conf to increase the controller debug level
>>     > SlurmctldDebug=5
>>
>>     I now see additional log output (debug).
>>
>>     [2021-07-30T22:42:05.255] debug:  Spawning ping agent for
>>     slurm4-compute[2-6,10,12-14]
>>     [2021-07-30T22:42:05.256] error: Nodes
>>     slurm4-compute[9,15,19-22,30] not responding, setting DOWN
>>
>>     It's still very sparse, but it looks like slurm is trying to ping
>>     nodes that have already been removed (they don't exist anymore, as
>>     they were removed by the slurm_suspend.sh script).
>>
>>     I tried sinfo -R but it doesn't really give much info..
>>
>>     $ sinfo -R
>>     REASON               USER      TIMESTAMP NODELIST
>>     Not responding       slurm     2021-07-30T22:42:05
>>     slurm4-compute[9,15,19-22,30]
>>
>>     These machines are gone, so they should not respond.
>>
>>     $ ping slurm4-compute9
>>     ping: slurm4-compute9: Name or service not known
>>
>>     This is expected.
>>
>>     Why does slurm keep trying to contact a node that has already
>>     been removed? slurm_suspend.sh does the following to "remove" the
>>     node from the partition:
>>     > scontrol update nodename=${host} nodeaddr="(null)"
>>     Maybe this isn't the correct way to do it? Is there a way to
>>     force slurm to forget about the node? I tried "scontrol update
>>     node=$node state=idle", but this only works for a few minutes,
>>     until slurm's ping agent kicks in and marks them down again.
>>
>>     Thanks!!
>>     Soichi
>>
>>
>>
>>
>>     On Fri, Jul 30, 2021 at 2:21 PM Soichi Hayashi <hayashis at iu.edu> wrote:
>>
>>         Hello. I need help troubleshooting our slurm cluster.
>>
>>         I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public cloud
>>         infrastructure (Jetstream) using an elastic computing
>>         mechanism (https://slurm.schedmd.com/elastic_computing.html).
>>         Our cluster works for the most part, but for some reason, a
>>         few of our nodes constantly go into the "down" state.
>>
>>         PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS GROUPS
>>          NODES       STATE NODELIST
>>         cloud*       up 2-00:00:00 1-infinite   no  YES:4        all
>>             10       idle~ slurm9-compute[1-5,10,12-15]
>>         cloud*       up 2-00:00:00 1-infinite   no  YES:4        all
>>              5        down slurm9-compute[6-9,11]
>>
>>         The only thing I see in the slurm log is this..
>>
>>         [2021-07-30T15:10:55.889] Invalid node state transition
>>         requested for node slurm9-compute6 from=COMPLETING to=RESUME
>>         [2021-07-30T15:21:37.339] Invalid node state transition
>>         requested for node slurm9-compute6 from=COMPLETING* to=RESUME
>>         [2021-07-30T15:27:30.039] update_node: node slurm9-compute6
>>         reason set to: completing
>>         [2021-07-30T15:27:30.040] update_node: node slurm9-compute6
>>         state set to DOWN
>>         [2021-07-30T15:27:40.830] update_node: node slurm9-compute6
>>         state set to IDLE
>>         ..
>>         [2021-07-30T15:34:20.628] error: Nodes slurm9-compute[6-9,11]
>>         not responding, setting DOWN
>>
>>         With elastic computing, any unused nodes are automatically
>>         removed (by SuspendProgram=/usr/local/sbin/slurm_suspend.sh).
>>         So nodes are *expected* to not respond once they are removed,
>>         but they should not be marked as DOWN. They should simply be
>>         set to "idle".
>>
>>         To work around this issue, I am running the following cron job.
>>
>>         0 0 * * * scontrol update node=slurm9-compute[1-30] state=resume
>>
>>         This "works" somewhat.. but our nodes go to "DOWN" state so
>>         often that running this every hour is not enough.
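>>
>>         (Note: as written, that crontab entry runs once a day at
>>         midnight; an hourly version would look like this:)
>>
>>         0 * * * * scontrol update node=slurm9-compute[1-30] state=resume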
>>
>>         Here is the full content of our slurm.conf
>>
>>         root at slurm9:~# cat /etc/slurm-llnl/slurm.conf
>>         ClusterName=slurm9
>>         ControlMachine=slurm9
>>
>>         SlurmUser=slurm
>>         SlurmdUser=root
>>         SlurmctldPort=6817
>>         SlurmdPort=6818
>>         AuthType=auth/munge
>>         StateSaveLocation=/tmp
>>         SlurmdSpoolDir=/tmp/slurmd
>>         SwitchType=switch/none
>>         MpiDefault=none
>>         SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
>>         SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
>>         ProctrackType=proctrack/pgid
>>         ReturnToService=1
>>         Prolog=/usr/local/sbin/slurm_prolog.sh
>>
>>         #
>>         # TIMERS
>>         SlurmctldTimeout=300
>>         SlurmdTimeout=300
>>         #make slurm a little more tolerant here
>>         MessageTimeout=30
>>         TCPTimeout=15
>>         BatchStartTimeout=20
>>         GetEnvTimeout=20
>>         InactiveLimit=0
>>         MinJobAge=604800
>>         KillWait=30
>>         Waittime=0
>>         #
>>         # SCHEDULING
>>         SchedulerType=sched/backfill
>>         SelectType=select/cons_res
>>         SelectTypeParameters=CR_CPU_Memory
>>         #FastSchedule=0
>>
>>         # LOGGING
>>         SlurmctldDebug=3
>>         SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
>>         SlurmdDebug=3
>>         SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
>>         JobCompType=jobcomp/none
>>
>>         # ACCOUNTING
>>         JobAcctGatherType=jobacct_gather/linux
>>         JobAcctGatherFrequency=30
>>
>>         AccountingStorageType=accounting_storage/filetxt
>>         AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log
>>
>>         #CLOUD CONFIGURATION
>>         PrivateData=cloud
>>         ResumeProgram=/usr/local/sbin/slurm_resume.sh
>>         SuspendProgram=/usr/local/sbin/slurm_suspend.sh
>>         ResumeRate=1 #number of nodes per minute that can be created;
>>         0 means no limit
>>         ResumeTimeout=900 #max time in seconds between ResumeProgram
>>         running and when the node is ready for use
>>         SuspendRate=1 #number of nodes per minute that can be
>>         suspended/destroyed
>>         SuspendTime=600 #time in seconds before an idle node is suspended
>>         SuspendTimeout=300 #time between running SuspendProgram and
>>         the node being completely down
>>         TreeWidth=30
>>
>>         NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24
>>         RealMemory=60388
>>         PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15]
>>         Default=YES MaxTime=48:00:00 State=UP Shared=YES
>>
>>         I appreciate your assistance!
>>
>>         Soichi Hayashi
>>         Indiana University
>>
>>