[slurm-users] SLURM nodes flap in "Not responding" status when iptables firewall enabled
Sean Caron
scaron at umich.edu
Thu May 17 14:21:54 MDT 2018
Sorry, how do you mean? The environment is very basic. Compute nodes and
SLURM controller are on an RFC1918 subnet. Gateways are dual homed with one
leg on a public IP and one leg on the RFC1918 cluster network. It used to
be that nodes that only had a leg on the RFC1918 network (compute nodes and
the SLURM controller) had no firewall at all and nodes that were dual homed
basically were set to just permit all traffic from the cluster side NIC
(i.e. iptables rule like -A INPUT -i ethX -j ACCEPT).
Now we're going back to the gateways and compute nodes to actually codify
what ports and protocols are in use, or at least what server-to-server
communication is expected and normative, instead of just passing all
traffic from the cluster-side NIC, and then define a rule set that permits
those flows while dropping any traffic not explicitly whitelisted.
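A port-based rule set along those lines might look like the sketch below.
The ports are Slurm's defaults (slurmctld listens on 6817/TCP, slurmd on
6818/TCP, slurmdbd on 6819/TCP); srun also opens ephemeral return ports on
the submit host, which can be pinned to a fixed range with SrunPortRange in
slurm.conf. The subnet, controller address, and port range below are
placeholders, not values from this cluster:

```
# Sketch only -- Slurm default ports assumed, addresses are placeholders
-A INPUT -p tcp -s IP.of.SLURM.controller/32 --dport 6818 -j ACCEPT  # slurmctld -> slurmd
-A INPUT -p tcp -s 10.0.0.0/24 --dport 6818 -j ACCEPT                # slurmd -> slurmd forwarding
-A INPUT -p tcp -s 10.0.0.0/24 --dport 60001:63000 -j ACCEPT         # srun return ports (SrunPortRange)
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -j DROP
```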
The compute and gateway nodes work fine with SLURM even when iptables is
enabled and the policy is "permit all traffic from that NIC" but once we
tighten it down just a little bit to "permit all traffic to and from the
SLURM controller" we see these weird instances of node state flapping. It's
not clear to me why this is the case since from the standpoint of node to
controller communications, these policies are logically very similar, but
there it is. The nodes shouldn't have to talk to anything else besides the
SLURM controller for SLURM to work, so long as time is synched up between
them and there are no issues with the nodes getting to slurm.conf.
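One caveat to that assumption, for what it's worth: by default slurmctld
fans pings and other messages out through a tree of slurmd daemons (the
TreeWidth parameter in slurm.conf), so compute nodes also receive traffic
on 6818/TCP from *other compute nodes*, not just from the controller.
Whether a given node looks "responding" can then depend on which forwarding
path a particular ping took, which would produce exactly this kind of
flapping. A rule admitting node-to-node slurmd traffic alongside the
controller rule would test that theory (10.0.0.0/24 is a placeholder for
the RFC1918 cluster subnet):

```
# Controller traffic, as already permitted
-A INPUT -s IP.of.SLURM.controller/32 -j ACCEPT
# Hypothetical addition: slurmd-to-slurmd message forwarding from the
# cluster subnet (placeholder range; default slurmd port is 6818/TCP)
-A INPUT -p tcp -s 10.0.0.0/24 --dport 6818 -j ACCEPT
```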
Best,
Sean
On Thu, May 17, 2018 at 1:21 PM, Patrick Goetz <pgoetz at math.utexas.edu>
wrote:
> Does your SMS have a dedicated interface for node traffic?
>
> On 05/16/2018 04:00 PM, Sean Caron wrote:
>
>> I see some chatter on 6818/TCP from the compute node to the SLURM
>> controller, and from the SLURM controller to the compute node.
>>
>> The policy is to permit all packets inbound from SLURM controller
>> regardless of port and protocol, and perform no filtering whatsoever on any
>> output packets to anywhere. I wouldn't expect this to interfere.
>>
>> Anyway, it's not that it NEVER works once the firewall is switched on.
>> It's that it flaps. The firewall is clearly passing enough traffic to have
>> the node marked as up some of the time. But why the periodic "not
>> responding" ... "responding" cycles? Once it says "not responding" I can
>> still scontrol ping from the compute node in question, and standard ICMP
>> ping from one to the other works as well.
>>
>> Best,
>>
>> Sean
>>
>>
>> On Wed, May 16, 2018 at 2:13 PM, Alex Chekholko <alex at calicolabs.com
>> <mailto:alex at calicolabs.com>> wrote:
>>
>> Add a logging rule to your iptables and look at what traffic is
>> actually being blocked?
>>
>> On Wed, May 16, 2018 at 11:11 AM Sean Caron <scaron at umich.edu
>> <mailto:scaron at umich.edu>> wrote:
>>
>> Hi all,
>>
>> Does anyone use SLURM in a scenario where there is an iptables
>> firewall on the compute nodes on the same network it uses to
>> communicate with the SLURM controller and DBD machine?
>>
>> I have the very basic situation where ...
>>
>> 1. There is no iptables firewall enabled at all on the SLURM
>> controller/DBD machine.
>>
>> 2. Compute nodes are set to permit all ports and protocols from
>> the SLURM controller with a rule like:
>>
>> -A INPUT -s IP.of.SLURM.controller/32 -j ACCEPT
>>
>> If I enable this on the compute nodes, they flap up and down in
>> "Not responding" state. If I switch off the firewall on the
>> compute nodes, they work fine.
>>
>> When firewall is up on the compute nodes, SLURM controller can
>> ping compute nodes, no problem. I have no reason to believe all
>> ports and protocols are not being passed. Time is synched. No
>> trouble accessing slurm.conf on any of the clients.
>>
>> Has anyone seen this before? There seems to be very little
>> information about SLURM's interactions with iptables. I know
>> this is kind of a funky scenario but regulatory requirements
>> have me needing to tighten down our cluster network a little
>> bit. Is this like a latency issue, or ...?
>>
>> Thanks,
>>
>> Sean
>>
>>
>>
>