Introduction
This post is inspired by a post at IEOC about Uplinkfast and TCN which
can be found here.
Before we get to those parts, let’s recap how Ethernet and STP work together.
Spanning Tree
The Spanning Tree Algorithm builds a loop free tree by comparing Bridge ID(BID) and
least cost paths to the root bridge. By doing this it blocks all links not leading
to the root.
MAC Learning
Switches learn where to forward frames by looking at the source MAC address of the frame
on the port that the frame was received on. This learning is done in the data plane
as opposed to routing where the routes are learned in control plane. I will come back
to this later in the post.
S4 learns that A is located on port 1 after A has sent a frame. This is stored in
the MAC address table located in Content Addressable Memory (CAM). The CAM is a
fast memory optimized for quick lookups in the table. By default there is a 300
second aging timeout for learned MAC addressesm, meaning that if the switch
does not see any traffic from a source MAC within five minutes the entry will
age out of the table. This is used to remove stale entries and to keep the
MAC address table from becoming too large.
Potential Issues
As I mentioned briefly earlier in the post, MAC learning is done in the data plane.
When we exchange routes through protocols such as OSPF, EIGRP and BGP, this is
done in the control plane. If there is a /24 route in the routing table pointing
at a router, then those up to 254 hosts are behind that router. With MAC learning
every source MAC has its own entry, which would be the same as if we had /32 routes
for every host in the network. Not very effecient! This can also become a scalibility
issue in large networks if there are more hosts than the CAM can hold.
There are also other issues such as not being able to use all the links in the
network. Spanning tree will block the redundant links so we don’t get more bandwidth
if we add more links unless we put them into an Etherchannel or use technologies
such as vPC. In datacenter designs, using STP will lead to low bisectional bandwidth,
meaning that even if there are lots of links between a section in the network, most of
them will actually be blocked.
Another issue is that broadcast and unknown unicast traffic is flooded in the network.
Imagine a scenario as below where A is sending unicast traffic to B and it’s
an unidirectional flow. B rarely sends any traffic so its entry has been aged out
of the MAC address table.
In this scenario the unknown unicast will be flooded to all the switches and
all servers will have to receive the 300 Mbit/s stream and then discard the
traffic until the switches have learned the MAC of B again!
There is also a potential for black holing of traffic. In the topology below there
are four switches connected together and the primary path is through S4-S1-S2-S3.
Then the link between S1 and S2 fails.
When using 802.1D, there is no synchronization of the topology. It will take up to
50 seconds for the link between S3 and S4 to come up unless Backbonefast has been
deployed. When traffic is going from A to B, it will be blackholed. S4 still has an
entry for B towards S1. When the traffic reaches S1 it has nowhere to go.
Without aging of stale entries, this would take up to five minutes. This is
the purpose of topology change in STP, to faster age out stale entries.
Topology Change
Like I described above, without a mechanism for topology change, traffic could
potentially be black holed for quite a while. In 802.1D, when a link goes up
or down, the switch will generate a TCN BPDU which is a special BPDU sent out
the root port. Normally switches only relay BPDUs from the root on their designated
ports but this is a special case. A switch that receives a TCN BPDU will reply
to it with a configuration BPDU with the TC Acknowledge bit set.
The TCN BPDU will eventually reach the root which will then send out a configuration
BPDU with the TC bit set. This is done for a duration of MaxAge + FwDelay
seconds which is 20 + 15 seconds by default.
When switches receive this BPDU from the root with the TC bit set, they will age out
entries in the CAM at a faster pace. The aging timeout will be set to 15 seconds.
This will age out any stale entries in the CAM. If there are active flows they will
not be aged out because the age will be reset as the switch sees frames coming in
with the source MAC in question. As I described earlier there could be unidirectional
flows leading to flooding. Also flows that are inactive for a while and then resume
can get flooded if their entries time out during the period that the root bridge is
sending out these configuration BPDUs with TC set.
Uplinkfast
Uplinkfast is a feature deployed on access switches which have dual links to
the distribution layer. Because the switches are located at the edge of the network
it is safe to bring up an alternate port immediately without going through the regular
listening and learning phase, saving up to 30 seconds.
After a switch has failed over to the alternate link it will start to send out
dummy multicast frames. This is to speed up convergence. Even if a configuration
BPDU with TC set is sent by the root, it can still take up to 15 seconds before
stale entries age out.
So based on the thread at IEOC, what is the consequence of Uplinkfast and TC together?
The configuration BPDU with TC is sent for 35 seconds by default. Dummy multicast frames
will be sent out for a duration that is unknown. It depends on how many entries there are
in the CAM and the rate that the packets are sent at. So depending on when the multicast
frame is sent and if you have an unidirectional flow or a host gone silent, then yes
the configuration BPDU with TC could be counter productive. Traffic would reach its
destination though but it would be through flooding of the traffic.
In reality I doubt this would be much of an issue and most networks would be running
RSTP today. RSTP works differently by synchronizing the topology and when the TC bit
is set in BPDUs the entire CAM is flushed on all ports except where the BPDU was
received.
Good post!
Well written post of the facets of STP aka 802.1D.
Thank you for the article Daniel 🙂
Hi this post is very interesting i want to know The configuration BPDU with TC is sent for 35 seconds by default is understandable thing but why dummy multicast frames will be sent for unknown duration of time?
can we combine ring and mesh topology??
Explain Please!!!
can we combine ring and mesh topology??