
New Node (SRX300) In Cluster Not Forwarding Traffic


I've got a problem that I can't wrap my head around, and I could use any help I can get.

 

At a remote customer site we have two SRX300s set up in an active/passive cluster. Recently one of the SRX300s (node0) died and had to be replaced. I followed the steps in the Juniper KB article to install the new firewall as node0 (https://kb.juniper.net/InfoCenter/index?page=content&id=KB21134&actp=METADATA). First I downgraded the OS on the new node0 to match what was installed on node1; node1 has been the primary since node0 failed. ge-0/0/2 and ge-1/0/2 are set up as reth1 (5.6.7.8), connected to the customer switch (1.2.3.4). We connect to the customer site via VPN, and our "internal" network is on reth2 (ge-0/0/3 and ge-1/0/3).
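
For reference, the reth side of the config looks roughly like this (a sketch reconstructed from the details above; the reth-count, redundancy-group priorities, and the /24 mask are my assumptions, not copied from the actual config):

  set chassis cluster reth-count 3
  set chassis cluster redundancy-group 1 node 0 priority 100
  set chassis cluster redundancy-group 1 node 1 priority 1
  set interfaces ge-0/0/2 gigether-options redundant-parent reth1
  set interfaces ge-1/0/2 gigether-options redundant-parent reth1
  set interfaces reth1 redundant-ether-options redundancy-group 1
  set interfaces reth1 unit 0 family inet address 5.6.7.8/24

The reth2 config (on ge-0/0/3 and ge-1/0/3) mirrors this.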

 

Everything seems fine. "show chassis cluster information" reports everything is good, and "show system alarms" reports no problems. I've done a side-by-side comparison of the configs on node0 and node1 to verify that they are identical, and they are.
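
For completeness, these are the sorts of checks I ran from the CLI (all standard Junos commands; the outputs looked clean, so I'm not pasting them here):

  show chassis cluster status
  show chassis cluster information
  show chassis cluster interfaces
  show system alarms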

I wanted to be 100% sure that the new node0 worked properly, so I rebooted node1 so that node0 would become the primary, and I lost all remote connectivity. I had the customer power cycle node0 so that node1 would become the primary again, and connectivity came back. So last night I scheduled a reboot of node1 (outside of production hours) and another reboot of node0 30 minutes later. On node0 I set up a packet capture, plus a cron entry that would ping the customer default gateway (ping -c 2 1.2.3.4) every 5 minutes and also dump the ARP table, both into a text file in /var/tmp/. I also set up ongoing pings from the internal network to the customer default gateway, as well as ongoing pings from my computer through the VPN to the customer router at 1.2.3.4 and to node0 at 5.6.7.8. I still can't figure out what was going on for the 30 minutes that node0 was the primary.
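
The monitoring setup on node0 was roughly this (a sketch from memory; the script name/path and exact redirection are approximations):

  #!/bin/sh
  # /var/tmp/gw-check.sh - log gateway reachability and the ARP table each run
  date >> /var/tmp/gw-check.txt
  ping -c 2 1.2.3.4 >> /var/tmp/gw-check.txt 2>&1
  arp -a >> /var/tmp/gw-check.txt 2>&1

with a crontab entry (added from the shell via crontab -e) of:

  */5 * * * * /bin/sh /var/tmp/gw-check.sh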

 

In that 30 minutes node0 received exactly one ping reply (from the cron entry) from the customer default gateway, immediately after node1 started rebooting; every ping after that failed for the remainder of the 30 minutes. Why would/could that happen? The ARP table never changed, other than the fxp entry for node1 disappearing when it rebooted. The packet captures and pings are the strangest part to me. Pings from my computer over the VPN to the customer router at 1.2.3.4 were good; I never stopped receiving replies from 1.2.3.4 in that 30 minutes. But I received no replies from node0 at 5.6.7.8 in that window, even though the packet captures on node0 show that it was receiving those ICMP requests; the captures just don't show node0 replying. Same thing with the pings from "internal"/reth2 to the customer gateway at 1.2.3.4: there were no replies in that 30 minutes, and the packet captures show that node0 received the ICMP requests from the internal computer but did not forward them on to 1.2.3.4. Why would this be happening? After node0 rebooted 30 minutes later, node1 became the primary again and everything was fine.
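
Next time node0 is primary I'm planning to look at the flow table and run an in-CLI capture while the pings are failing, along these lines (standard Junos commands; the match expression and child interface are just examples):

  show security flow session destination-prefix 1.2.3.4
  show security flow statistics
  monitor traffic interface ge-0/0/2 no-resolve matching "icmp"

If sessions show up but no reply packets appear, that would at least narrow it down to the forwarding path on node0 rather than the VPN or the customer side.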

 

I'm not sure if the problem is with the new SRX300 node0 or with something on the customer network. I verified with the customer that the cables are in the correct ports, and I verified from the packet captures on node0 that the correct IP addresses were going in/out on the correct interfaces of node0 in those 30 minutes. I thought that maybe the customer's switch/router (1.2.3.4) that is our default gateway was set up to filter MAC addresses, and that since the new node0 obviously has a different MAC address it wasn't being permitted, but the customer assures me that no MAC filtering is in place. Then I thought it was somehow a Juniper licensing problem, but that can't be it either, right?
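
One thing I can at least verify from the CLI is which MAC the reth is actually presenting (my understanding is that reth interfaces use a virtual MAC derived from the cluster ID, so it shouldn't have changed with the new chassis, but it's worth confirming):

  show interfaces reth1 | match "Current address"
  show interfaces ge-0/0/2 | match "Current address"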

 

At this point I'm at a loss; any guidance is very much appreciated. Thanks!

