Saturday, October 4, 2014

Always Know Where the Default Gateway Resides

The other day I was assigned the task of replacing two switches on our wireless network. Our wireless network is completely separate from our wired network and we typically set up our network in this style of topology:


We use a backbone area that is a flat Layer 2 topology to the wireless controller.  We use Layer 3 switches at the access layer to create a separate VLAN and IP address space for the wireless access points.  The access layer switches run OSPF and include a default route.  The route to the access points is advertised via OSPF.

However, in the building where I was replacing switches, the topology was closer to this:


VLAN 44 used Layer 2 back to the core Layer 3 switch while VLAN 55 used a Layer 3 switch like our usual design.  At least, that's what I thought.  As we'll see, the two assumptions I made carried some consequences.

-

I've duplicated the behavior I witnessed on my home lab equipment.  The actual topology where I work was larger, but this setup will show what I ran into.  I used two 3550s, two 2950s and two 1721s.

The second 1721 router in the lower left corner is being used as an end-user device in order to keep the VLAN in an 'up' state.


-

VLAN 11 is used as a backbone VLAN for the switches to connect back to the R1 router.

VLAN 22 is used for the access point VLAN, but I made the assumption that R1 was the default gateway.  I was tasked with replacing SW1 and SW3 with newer models.  I configured SW1 to behave per our usual design (only the backbone VLAN is allowed out of the switch, the access point VLAN stays local to the switch, and OSPF is used to advertise the access point network).  SW3 was unique in that it used two VLANs for the access points [I've only set up one on this topology].  I never looked at the config on SW2 before replacing the switches.

VLAN 33 is a different backbone VLAN that is used only between R1 and Dist-Switch.

I replaced SW3 and all of the connected access points came back online and everything was working as expected.

I replaced SW1 and those access points came online.  However, the access points connected on SW2 went offline!  But the access points on SW3 were still online.

So, I logged into SW2 and tried pinging the VLAN 22 gateway of 22.22.22.1:


And the pings failed, which surprised me since I assumed that R1 was the gateway.

I logged into SW3 and tried the same thing:


And the pings succeeded!  At this point I was really confused.  How could pings fail from one switch but succeed from another?

After much digging around (nearly an hour!), I was able to get a hold of the engineer and he found some documentation that said that SW1 was actually the gateway for VLAN 22!

When I installed SW1, I configured its trunk to only allow VLAN 11 (#switchport trunk allowed vlan 11).  This configuration cut off Layer 2 access to the VLAN 22 gateway for every other switch in VLAN 22.



I added VLAN 22 to the trunk and the rest of the access points immediately came back online and service was restored.
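The fix can be sketched as follows. This is a minimal example, not the exact commands from the production switch: the interface name FastEthernet0/1 is an assumption standing in for SW1's uplink to Dist-Switch. The detail worth noting is the "add" keyword, which appends to the existing allowed list rather than replacing it.

```
configure terminal
 interface FastEthernet0/1
  ! FastEthernet0/1 is an assumed name for the trunk toward Dist-Switch
  ! "add" appends VLAN 22 to the allowed list; without it, the list
  ! would be replaced and VLAN 11 would be dropped from the trunk
  switchport trunk allowed vlan add 22
 end
! Confirm that VLANs 11 and 22 are both allowed and active on the trunk
show interfaces trunk
```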

---

But how were the access points on SW3 able to reach the VLAN 22 gateway?

I was able to replicate the configurations and recreate the problem here:

On R1, we have a router-on-a-stick configuration with the three VLANs.


Note that fa0.22 is not .1 (the usual default gateway).

-

Dist-Switch is a pure Layer 2 switch (no routing).  It has an SVI on VLAN 33 and uses R1's fa 0.3 subinterface as its default gateway.


-

SW1 is still configured incorrectly (only VLAN 11 is allowed on the trunk link to Dist-Switch).  SW1 is running OSPF with R1 and uses VLAN 11 to reach R1.  Access point VLAN 22 is advertised to R1 and, as we'll see, that plays a big role in the weird behavior.
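For readers without the lab in front of them, a configuration along these lines would produce the behavior described. It is a sketch under assumptions: the addresses 11.11.11.11 and 22.22.22.1 come from this post, while the /24 masks, the OSPF process ID, area 0 and the uplink name FastEthernet0/1 are illustrative guesses.

```
! SW1 (3550 acting as a Layer 3 switch)
ip routing
!
interface Vlan11
 ! backbone VLAN, used to reach R1
 ip address 11.11.11.11 255.255.255.0
!
interface Vlan22
 ! gateway address for the access point VLAN
 ip address 22.22.22.1 255.255.255.0
!
interface FastEthernet0/1
 ! assumed uplink to Dist-Switch
 switchport trunk encapsulation dot1q
 switchport mode trunk
 ! the mistake: VLAN 22 is not allowed off the switch
 switchport trunk allowed vlan 11
!
router ospf 1
 ! process ID and area are assumptions
 network 11.11.11.0 0.0.0.255 area 0
 network 22.22.22.0 0.0.0.255 area 0
```

With this in place, 22.22.22.1 is reachable through routing via VLAN 11 but not by any device that sits in VLAN 22 on another switch.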


-

SW2 is configured as a Layer 2 switch (even though it is a 3550).  This switch uses an address in VLAN 22 (the access point VLAN) as its default gateway.  At this point, 22.22.22.1 (located on SW1) is not accessible at Layer 2 since VLAN 22 is not allowed on the trunk link between SW1 and Dist-Switch.


-

SW3 is a Layer 2 only switch as well.  However, it was configured to use VLAN 11 (the backbone VLAN) and 11.11.11.1 (R1's fa 0.11 subinterface) as its default gateway.


The difference in the default-gateway configuration was the reason why SW2's access points lost connectivity, but SW3's access points remained online.
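To make the difference concrete, here is a minimal sketch of the relevant lines on each switch. The two gateway addresses come from this post; the management SVI addresses (22.22.22.22 and 11.11.11.33) and the /24 masks are hypothetical values chosen only for illustration.

```
! SW2 (3550 running as Layer 2 only)
no ip routing
interface Vlan22
 ! hypothetical management address
 ip address 22.22.22.22 255.255.255.0
 no shutdown
!
! gateway lives on SW1 and is only reachable at Layer 2 over VLAN 22
ip default-gateway 22.22.22.1

! SW3 (Layer 2 only)
interface Vlan11
 ! hypothetical management address
 ip address 11.11.11.33 255.255.255.0
 no shutdown
!
! gateway is R1, reachable over the backbone VLAN
ip default-gateway 11.11.11.1
```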

-

On SW2, VLAN 22 and 22.22.22.1 were the exit point out of the VLAN.  Since I configured SW1 to not allow VLAN 22 on the trunk, 22.22.22.1 was not available from a Layer 2 perspective.



SW3, on the other hand, was using VLAN 11 (the backbone VLAN) for its default gateway and was able to reach R1.  When I sent pings from SW3 to 22.22.22.1 (SW1), the traffic went from SW3 to R1 to SW1.




Since SW1 advertised the 22.22.22.0 network to R1 via OSPF on VLAN 11, R1 is able to forward the packets and receive a response from SW1.


Note that 22.22.22.0 is reached via 11.11.11.11 (SW1) on fa 0.11.

That explained why I was able to ping SW1 from SW3: Layer 2 worked from SW3 to R1, and R1 then routed the packets to SW1 using the route it learned via OSPF.  It also explained why I was not able to ping SW1 from SW2: SW2's default gateway was on VLAN 22 and only reachable at Layer 2, and the gateway was cut off from the rest of the VLAN because VLAN 22 was not allowed on the trunk at the SW1 end.

--- 

The lessons learned:

Make sure that you know which device is the default gateway for each VLAN before replacing hardware or changing configurations.

If a VLAN is used by multiple devices via trunking, make sure the VLAN is allowed on the trunk at both ends of the link.
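A pre-change checklist for these two lessons might look like the commands below. They are standard IOS show commands, though availability and output can vary by platform and IOS version, and 22.22.22.1 is simply the gateway address from this post.

```
! On each Layer 2 switch: which default gateway is configured?
show running-config | include default-gateway
! Does the gateway address resolve, and on which VLAN interface?
show ip arp 22.22.22.1
! Which VLANs are allowed and active on each trunk? Check both ends.
show interfaces trunk
! On Layer 3 devices: which SVIs or subinterfaces hold addresses?
show ip interface brief
```

Running these on the old hardware before the swap, and again on the new hardware afterward, gives a simple before-and-after comparison.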

Ultimately, we'll have to re-engineer these access point VLANs so that the same access point VLAN is not used on multiple access switches (per our usual design).