These are my study notes for CCDE based on “CCIE Routing and Switching v5.0 Official Cert Guide, Volume 1, Fifth Edition”, “Designing Cisco Network Service Architectures (ARCH) Foundation Learning Guide: (CCDP ARCH 642-874), Third Edition”, “INE – Understanding MSTP” and “Spanning Tree Design Guidelines for Cisco NX-OS Software and Virtual PortChannels”. This post is not meant to cover STP and all its aspects; it is a summary of key concepts and design aspects of running STP.
STP was originally defined in IEEE 802.1D and improvements were defined in amendments to the standard. RSTP was defined in amendment 802.1w and MSTP was defined in 802.1s. The latest 802.1D-2004 standard does not include “legacy STP”; it covers RSTP. MSTP was integrated into 802.1Q-2005 and later revisions.
STP has two types of BPDUs: Configuration BPDUs and Topology Change Notification BPDUs. To handle topology change, there are two flags in the Configuration BPDU: Topology Change Acknowledgment flag and Topology Change flag.
MessageAge is an estimate of the age of a BPDU since it was generated by the root. The root sends it with an age of 0 and every other switch increments the value by 1. The lifetime of a BPDU is MaxAge – MessageAge. MaxAge, HelloTime and ForwardDelay are set by the root; locally configured values are only used if that switch becomes the root.
STP works by comparing which Configuration BPDU is superior according to the following ordered list where lower values are better:
- Root Bridge ID (RBID)
- Root Path Cost (RPC)
- Sender Bridge ID (SBID)
- Sender Port ID (SPID)
- Receiver Port ID (RPID; not included in the BPDU, evaluated locally)
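The ordered comparison above maps naturally onto a field-by-field tuple comparison. The following is a minimal Python sketch of that idea, not an implementation of the standard; the class name `BpduVector`, the function `is_superior` and the sample values are made up for illustration.

```python
from typing import NamedTuple


class BpduVector(NamedTuple):
    """Fields in comparison order; a lower value is better at every step."""
    root_bridge_id: int
    root_path_cost: int
    sender_bridge_id: int
    sender_port_id: int
    receiver_port_id: int  # evaluated locally, not carried in the BPDU


def is_superior(a: BpduVector, b: BpduVector) -> bool:
    """Return True if a beats b.

    Tuples compare left to right, so the first field that differs decides.
    """
    return a < b


# Hypothetical values: both BPDUs agree on the root, a has the lower cost.
a = BpduVector(0x8001000000000001, 4, 0x8001000000000002, 0x8001, 0x8002)
b = BpduVector(0x8001000000000001, 19, 0x8001000000000003, 0x8001, 0x8001)
print(is_superior(a, b))
```

Because the comparison stops at the first differing field, the lower Sender Port ID of `b` never comes into play here.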
Each port stores the superior BPDU that has been sent or received, depending on the port role. Root and blocking ports store the received BPDU, designated ports store the sent BPDU.
To determine port roles and which ports forward and block, the following three-step process is used:
- Elect the root switch
- Determine each switch’s Root port
- Determine the Designated port for each segment
The root bridge is elected based on the lowest bridge ID, which consists of a 4-bit Priority, a 12-bit System ID Extension and a 6-byte System ID (MAC address). Before 802.1t, a lot of MAC addresses were consumed to make the BID unique when using PVST+ or MST.
BPDUs are only sent on designated ports; root ports and blocking ports do not send them since they would be inferior on the segment. A designated port is the port with the superior BPDU on a segment.
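To make the bridge ID layout concrete, here is a small Python sketch that packs the three fields into one 64-bit integer so that bridge IDs can be compared numerically. The function name `make_bridge_id` and the sample MAC addresses are assumptions for illustration only.

```python
def make_bridge_id(priority: int, sys_id_ext: int, mac: str) -> int:
    """Pack priority (4 bits), System ID Extension (12 bits) and MAC (48 bits).

    The priority is given as the configured value (0, 4096, ..., 61440),
    which already sits in the top 4 bits of the 16-bit priority field.
    """
    if priority % 4096 != 0 or not 0 <= priority <= 61440:
        raise ValueError("priority must be a multiple of 4096 between 0 and 61440")
    if not 0 <= sys_id_ext <= 4095:
        raise ValueError("System ID Extension must fit in 12 bits")
    mac_int = int(mac.replace(":", ""), 16)
    if mac_int >= 1 << 48:
        raise ValueError("MAC address must fit in 48 bits")
    return ((priority + sys_id_ext) << 48) | mac_int


# Default priority 32768 with VLAN 10 as the System ID Extension.
bid = make_bridge_id(32768, 10, "00:11:22:33:44:55")
print(hex(bid))  # 0x800a001122334455
```

Since the priority occupies the most significant bits, a switch with a lower priority always wins the election regardless of its MAC address.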
A topology change event occurs when:
- A TCN BPDU is received by a Designated Port of a switch
- A port moves to the Forwarding state and the switch has at least one Designated Port
- A port moves from Learning or Forwarding to Blocking
- A switch becomes the root switch
STP is slow to converge, especially with indirect failures where a link fails between a root switch and an intermediary switch. When inferior BPDUs are received, the switch ignores them until the stored superior BPDU has aged out (MaxAge), and only then acts on them.
When the topology has changed, the CAM table needs to be updated on all switches. A timer equal to ForwardDelay is used to time out unused entries.
A topology change starts at a switch, which sends a TCN BPDU out its root port. The designated switch on that segment sets the TCA bit in the flags field of its configuration BPDU to acknowledge the TCN. The TCN then travels upstream until it reaches the root. The root then sends configuration BPDUs with the TC bit set for MaxAge + ForwardDelay seconds, and all switches shorten the aging time for the CAM table to ForwardDelay seconds.
PVST+ runs one spanning tree instance per VLAN. This does not scale well for a large number of VLANs and normally there will only be a few logical topologies anyway.
Switches that do not support PVST+ run Common Spanning Tree (CST) which has one instance of STP for all VLANs. Cisco switches can interact with CST through VLAN 1 by sending untagged BPDUs. All other VLANs in the PVST+ region will tag their BPDUs and tunnel the BPDUs through the CST region by using a special destination MAC address. The CST region is treated as a loop-free shared segment from the viewpoint of the PVST+ region. The destination MAC address is a multicast address that will get flooded by the CST switches.
RSTP has four different port roles:
- Root Port
- Designated Port
- Alternate Port
- Backup port
The first two are the same as in legacy STP and the last two are new. An alternate port is a port that is a potential backup for the Root Port. A backup port is a replacement for a Designated Port. You would rarely, if ever, see a Backup Port because it is only used on shared segments.
RSTP uses a synchronization process to achieve fast convergence. This only works on point-to-point links; the link type is inferred from the duplex mode of the interface. The link type can be hard coded in the rare case where a port is half duplex but still not on a shared segment.
RSTP uses more bits in the Configuration BPDU to encode additional information. These are the Proposal bit, Port Role bits, Learning bit, Forwarding bit and Agreement bit.
RSTP switches send their own BPDUs as opposed to only relaying the root's BPDUs as in legacy STP. If no BPDU is heard for 3x the hello interval, the stored BPDU is expired. RSTP does not rely on the MaxAge timer to expire BPDUs. RSTP can also act on inferior BPDUs directly instead of waiting for MaxAge to expire. This speeds up convergence in indirect link failure scenarios.
RSTP uses a proposal/agreement process in which switches negotiate which port will become Designated. If the Proposal bit is set, the sending switch is proposing that its port should become Designated, and the other switch replies with an Agreement to allow this immediately. When ports first come up they are in the Designated Discarding state. To avoid creating a temporary loop during the synchronization process, all Non-Edge Designated ports are put into the Discarding state. I have described this process in detail in an earlier post.
With RSTP, only ports moving to a Forwarding state will cause a topology change. RSTP sets the TC bit in the BPDU to notify of a topology change and sends it out its Root Port and Designated Ports that are Non-Edge. MAC addresses are immediately flushed on these ports.
MST uses the same underlying mechanisms as RSTP with regard to BPDU parameters, but it decouples VLANs from spanning tree instances: multiple VLANs can be mapped to a single instance. MST is more efficient because the operator can define the number of instances needed and map the VLANs to these instances. MST is the only standards-based STP variant that is VLAN aware, which makes it suitable for a multi-vendor environment.
MST switches organize the network into regions. Switches within a region use MST in a consistent way. For switches to be in the same region, the name, revision and instance to VLAN mapping must match.
The System ID Extension in MST carries the Instance ID instead of the VLAN ID to create the BID used in BPDUs. MST sends a single BPDU containing information about all instances. In MST, a port sends BPDUs if it is Designated for at least one MST instance.
MST instance 0 is special and contains all VLANs by default. It is called the Internal Spanning Tree (IST). The IST interacts with STP switches that are outside the region. The port role and state determined by the interaction of the IST with a neighboring switch will be inherited by all VLANs on that port, not just the VLANs mapped to the IST. This behavior makes the region appear as a single switch to the outside of the region. If running multiple regions, each region can be seen as a single switch from the outside. The resulting network can still contain loops if there are multiple inter-region links. MST blocks these loops by building a Common Spanning Tree (CST) running between the regions. The CST is also used to interact with non-MST switches. The tree built by the CST will be used for all VLANs. The IST and CST are then merged together and called the Common and Internal Spanning Tree (CIST).
The CIST Root switch is elected based on the lowest BID among all switches in any region. This switch will also become the root for the IST (instance 0) within its own region; this role is called the CIST Regional Root.
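The region membership rule can be sketched as a comparison of the three configuration elements. This Python sketch compares the mapping directly, which is a simplification for illustration; the class `MstRegionConfig`, the function `same_region` and the region name `DC1` are hypothetical, and VLANs that are not listed are assumed to map to instance 0.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class MstRegionConfig:
    name: str
    revision: int
    # Explicit VLAN -> instance entries; anything not listed stays in instance 0.
    vlan_to_instance: Dict[int, int] = field(default_factory=dict)

    def instance_of(self, vlan: int) -> int:
        return self.vlan_to_instance.get(vlan, 0)


def same_region(a: MstRegionConfig, b: MstRegionConfig) -> bool:
    """Two switches share a region only if name, revision and mapping all match."""
    if (a.name, a.revision) != (b.name, b.revision):
        return False
    return all(a.instance_of(v) == b.instance_of(v) for v in range(1, 4095))


sw1 = MstRegionConfig("DC1", 1, {10: 1, 20: 2})
sw2 = MstRegionConfig("DC1", 1, {10: 1, 20: 2, 30: 0})  # VLAN 30 is in instance 0 either way
sw3 = MstRegionConfig("DC1", 2, {10: 1, 20: 2})         # revision differs

print(same_region(sw1, sw2))  # True
print(same_region(sw1, sw3))  # False
```

A mismatch in any one element is enough to place the two switches in different regions, which is why preprovisioning the full mapping on every switch matters.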
In regions that do not contain the CIST Root, only boundary switches are allowed to become the IST Root. A boundary switch is a switch that has a link (or several) to other MST regions. The IST Root is elected based on external root path cost, which is the cost of using the inter region links between MST regions. If there is a tie in cost, the lowest BID is used as a tiebreaker to elect the CIST Regional Root. Cost inside a region is not taken into account.
The CIST Regional Root switch will have its Root Port towards the CIST Root. This port is called the Master port and is used by all MST instances to reach the CIST Root.
The following pictures show the different concepts of MST, starting with a physical topology:
The IST runs within the region to block ports, to break up the physical loop. One switch will be the CIST root and one switch will be the CIST Regional root.
In reality, all these things tie in together and happen simultaneously but to solidify the understanding, we divide them into steps. The IST has run internally and blocked ports. This is what the CST looks like:
The CST runs between regions and/or non MST devices and makes sure there is no loop between regions or to non MST domains. If we combine the CST and the IST, we get the CIST which is the final topology:
Interoperability Between MST and Other STP Versions
When communicating with an IEEE STP or RSTP switch, the MST switch must share the role and state on the port towards the non-MST switch for all VLANs. STP or RSTP can’t see into the MST region so it is treated as a single logical switch. The MST switch will speak by using the IST (instance 0) on boundary ports and format the BPDU to be STP or RSTP. The IST will also process inbound BPDUs from the non-MST switch.
When communicating with a PVST+ or RPVST+ region, things get a bit more complex. One STP instance is run for each VLAN, and port role and state are calculated individually per VLAN. The IST will communicate with the non-MST switch and must make sure that every PVST+/RPVST+ instance receives the same information so that all instances make a consistent choice. MST and PVST+ must arrive at the same port role and state for all instances even though only a single MST instance (the IST) and a single PVST+ instance directly interact with each other. This is also known as the PVST Simulation mechanism.
The IST will replicate BPDUs for all active VLANs towards the PVST+ switch, meaning that the PVST+ switch will make a consistent choice for port role and state for all VLANs. The IST does this by formatting the BPDUs as PVST+ BPDUs.
In the opposite direction, the IST takes the BPDU from VLAN 1 as a representative for the entire PVST+ region and processes this in the IST. The boundary port's role and state will be binding for all active VLANs on that port. The MST switch must make certain that the result of the IST interaction with the VLAN 1 STP instance is consistent with the state of the STP instances run in other VLANs.
An MST boundary port will become a Designated Port if the BPDUs it sends out are superior to incoming VLAN 1 PVST+ BPDUs. The port will then be forwarding for all VLANs. To make sure that other PVST+ instances make a consistent decision, the MST switch must check that all incoming PVST+ BPDUs are inferior to its own outgoing BPDUs. If not, the PVST Simulation mechanism will fail.
The CIST Root can be located in the PVST+ region, and the boundary port can have a port role of Root if the incoming VLAN 1 PVST+ BPDUs are not only superior to those of the MST switch but also better than any other VLAN 1 PVST+ BPDUs received on any other boundary port. Once again, to keep the port role consistent, the root bridges for all VLANs must be located in the PVST+ region and be reached through the same boundary port. The PVST Simulation mechanism will check that incoming PVST+ BPDUs for VLANs other than VLAN 1 are identical or superior to the VLAN 1 PVST+ BPDUs.
An MST boundary port will become Non-Designated if it receives VLAN 1 PVST+ BPDUs that are superior to its own but not superior enough to make it a Root Port.
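The three boundary port cases and their consistency checks can be summarized in a short Python sketch. It is a simplified model of the rules described above, not the actual implementation: BPDUs are reduced to comparable tuples where lower is better, the function name `boundary_port_role` and the label `pvst-sim-inconsistent` are chosen for this sketch, and no consistency check is applied in the Non-Designated case because the text does not describe one.

```python
from typing import Dict, Optional, Tuple

Bpdu = Tuple[int, ...]  # simplified priority vector; lower compares as better


def boundary_port_role(
    own: Bpdu,
    received: Dict[int, Bpdu],
    best_vlan1_elsewhere: Optional[Bpdu] = None,
) -> str:
    """Decide the role of an MST boundary port facing a PVST+ region.

    own: the BPDU the IST sends on this port.
    received: incoming PVST+ BPDUs keyed by VLAN; VLAN 1 must be present.
    best_vlan1_elsewhere: best VLAN 1 BPDU seen on any other boundary port.
    """
    vlan1 = received[1]
    others = [bpdu for vlan, bpdu in received.items() if vlan != 1]

    if own < vlan1:
        # Designated: every other incoming PVST+ BPDU must also be inferior.
        consistent = all(own < bpdu for bpdu in others)
        return "designated" if consistent else "pvst-sim-inconsistent"

    if best_vlan1_elsewhere is None or vlan1 < best_vlan1_elsewhere:
        # Root: other VLANs must be identical or superior to the VLAN 1 BPDU.
        consistent = all(bpdu <= vlan1 for bpdu in others)
        return "root" if consistent else "pvst-sim-inconsistent"

    return "non-designated"


# MST region wins in every VLAN, so the port is Designated.
print(boundary_port_role((4096,), {1: (32768,), 10: (32768,)}))
# VLAN 10 in the PVST+ region beats the MST switch, so the simulation fails.
print(boundary_port_role((4096,), {1: (32768,), 10: (0,)}))
```

The second call shows why lowering the IST root priority below all PVST+ priorities in every VLAN avoids simulation failures.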
It is recommended to have the MST region appear as a Root switch to all PVST+ instances by lowering the IST root’s priority below the priorities of all PVST+ switches in all VLANs.
When an MST switch is communicating to a PVST+ or RPVST+ switch it will always revert back to PVST+. There is less state involved with PVST+ due to not having a Proposal/Agreement process which simplifies the interworking of MST and PVST+.
A Portfast enabled port has the following characteristics:
- Transitions directly to Forwarding state, saving 2x ForwardDelay
- Does not generate topology change events
- Does not flush CAM due to topology change
- DOES send BPDUs
- Does not expect to receive BPDUs
- Not influenced by the Sync step in the Proposal/Agreement procedure (RSTP)
Portfast enabled ports may also be referred to as Edge ports. If a Portfast enabled port receives BPDUs it will lose its Portfast status until the port has gone down and come back up. RSTP uses the Proposal/Agreement process and, when going through Sync, it will put all Non-Edge Designated ports into a Discarding state. Unless end-user ports are configured as Edge ports they will be affected and lose connectivity briefly during the Sync process. Portfast is also important so that when a PC boots up and requests an IP address via DHCP, it gets one assigned before the request times out while waiting for the port to go into a Forwarding state. Portfast can be enabled per port or globally for all access ports.
- BPDU Guard: Enabled per port or globally for all Portfast enabled ports, will error-disable the port upon receiving ANY BPDU
- Root Guard: Only enabled per port, ignores any superior BPDUs received to prevent the port from becoming a Root Port. If a superior BPDU is received, the port is put into a root-inconsistent blocking state and ceases forwarding and receiving data frames until the superior BPDUs cease
After BPDU Guard has error-disabled a port, the port must be recovered manually or through the error-disable recovery feature.
Root Guard will block the port if a superior BPDU comes in. This does not have to be the best BPDU, simply one that is better than what the local switch is originating. Root Guard will recover the port after the superior BPDU has expired, which takes MaxAge – MessageAge for STP and 3x Hello for RSTP.
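As a worked example of those two expiry rules, the following Python sketch computes how long a stored BPDU lives after the last one was received. The function name `stored_bpdu_lifetime` is hypothetical, and the defaults assume the standard timers of a 2 second Hello and a 20 second MaxAge.

```python
def stored_bpdu_lifetime(
    protocol: str,
    hello: float = 2.0,
    max_age: float = 20.0,
    message_age: float = 0.0,
) -> float:
    """Seconds until a stored BPDU expires if no further BPDUs arrive."""
    if protocol == "stp":
        # Legacy STP ages the BPDU out based on how far it travelled from the root.
        return max_age - message_age
    if protocol == "rstp":
        # RSTP expires the BPDU after three missed hellos.
        return 3 * hello
    raise ValueError("protocol must be 'stp' or 'rstp'")


# A superior BPDU received two hops from the root, default timers.
print(stored_bpdu_lifetime("stp", message_age=2))  # 18.0
print(stored_bpdu_lifetime("rstp"))                # 6.0
```

With default timers, a Root Guard blocked port therefore recovers roughly three times faster under RSTP than under legacy STP.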
- BPDU Filter, if enabled on a port, will unconditionally stop sending and receiving BPDUs on that port
- BPDU Filter, if enabled globally for Edge ports, will send 11 BPDUs after enabling the feature and then stop sending BPDUs. If a BPDU is received at any point in time, BPDU Filter is operationally disabled on the port and the port reverts to normal STP rules, sending and receiving BPDUs.
Protecting Against Unidirectional Link Issues
Several mechanisms are available to protect against unidirectional links, such as Loop Guard, UDLD, the RSTP Dispute mechanism and Bridge Assurance.
UDLD is a Cisco-proprietary layer 2 protocol that serves as an echo mechanism between a pair of devices. It sends UDLD messages advertising its identity and port identifier pair as well as a list of all neighboring switch/port pairs heard on the same segment. The following explicit conditions are used by UDLD to detect a unidirectional link:
- UDLD messages arriving from a neighbor that do not contain the exact switch/port pair matching the receiving switch and its port in the list of detected neighbors. This would suggest that either the neighbor does not hear this switch at all (fiber cut) or that neighbor’s port sending these UDLD messages is different from the neighbor’s port receiving the UDLD messages. This could be the case if the TX fiber is plugged into a different port than the RX fiber.
- If the incoming UDLD messages contain the same switch/port originator pair as the receiving switch, which would indicate that the port is self-looped.
- A switch has detected only a single neighbor but the neighbor’s UDLD messages contain several switch/port pairs in the list of neighbors. This would indicate shared media and lack of visibility between all connected devices.
The above are explicit conditions which will error-disable a port due to it being unidirectional. UDLD runs either in normal or aggressive mode. If there is a loss of incoming UDLD messages, UDLD tries to reconnect with its neighbor(s) up to 8 times. Normal mode does not react to this implicit condition if the reconnect attempts are unsuccessful, whereas aggressive mode will error-disable the port if it stops receiving UDLD messages and the reconnect attempts fail. UDLD can be enabled globally or per port. Enabling it globally will only enable UDLD on fiber ports.
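The explicit detection conditions can be sketched as a classification of each incoming UDLD message. This Python sketch models only the logic described above, not the protocol's packet format; the function name `check_udld_message`, the returned labels and the switch/port names are hypothetical.

```python
from typing import List, Tuple

Endpoint = Tuple[str, str]  # (switch id, port id)


def check_udld_message(
    local: Endpoint,
    sender: Endpoint,
    echoed: List[Endpoint],
    neighbors_seen: int,
) -> str:
    """Classify one incoming UDLD message on the local port.

    local: this switch and the receiving port.
    sender: originator pair in the message.
    echoed: neighbor list the sender claims to hear on the segment.
    neighbors_seen: number of distinct neighbors detected locally.
    """
    if sender == local:
        return "self-looped"
    if local not in echoed:
        # The neighbor does not hear us, or TX and RX go to different ports.
        return "unidirectional"
    if neighbors_seen == 1 and len(echoed) > 1:
        # The neighbor hears devices that we cannot see.
        return "shared-media-mismatch"
    return "bidirectional"


me = ("SW1", "Gi1/0/1")
peer = ("SW2", "Gi1/0/1")
print(check_udld_message(me, peer, [me], neighbors_seen=1))  # bidirectional
print(check_udld_message(me, peer, [], neighbors_seen=1))    # unidirectional
```

Every result other than `bidirectional` corresponds to one of the explicit conditions that would error-disable the port.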
Loop Guard prevents Root and Alternate ports from becoming Designated in the case of loss of incoming BPDUs. When the stored BPDU on a port expires, Loop Guard will put the port into a loop-inconsistent state. Loop Guard can be configured globally or per port.
Bridge Assurance is another mechanism that is available on select platforms and works with RPVST+ and MST on point-to-point links. A port will send BPDUs regardless of state if Bridge Assurance is enabled. If BPDUs are not received, the port will be put into a BA-inconsistent state. This protects from unidirectional links as well as malfunctioning switches that stop participating in RPVST+/MST.
Finally, the Dispute mechanism available in RPVST+/MST works by checking the incoming BPDU flags. If an inferior BPDU is received but the flags are Designated Learning or Forwarding, the local port will move into a Discarding state.
Interfaces can be bundled into a Port Channel which increases the available bandwidth by carrying multiple frames over multiple links. A hashing mechanism run over selected address fields of the frame will determine which physical link to send the frame over. The hashing is deterministic, meaning that frames of the same flow will travel the same physical link.
Load sharing can be based on MAC address, IP address or on some platforms even port numbers. A choice needs to be made, depending on the type of flows, as to which load sharing mechanism will be most beneficial. Normally only one type of load sharing can be used for all flows on a switch. Normally load sharing will be more balanced if the number of links is a power of 2. This varies by platform and the number of hash buckets.
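To illustrate deterministic link selection and the effect of hash buckets, here is a hedged Python sketch. CRC32 stands in for the platform-specific hash, which it is not, and the default of 8 buckets is an assumption made for the example; `pick_link` and `bucket_share` are hypothetical names.

```python
import zlib
from typing import List


def pick_link(src_mac: str, dst_mac: str, num_links: int) -> int:
    """Map a flow to a member link; the same flow always gets the same link."""
    key = f"{src_mac}-{dst_mac}".encode()
    return zlib.crc32(key) % num_links


def bucket_share(num_links: int, buckets: int = 8) -> List[int]:
    """Number of hash buckets each link receives when buckets are dealt out evenly."""
    base, extra = divmod(buckets, num_links)
    return [base + 1 if i < extra else base for i in range(num_links)]


flow = ("00:11:22:33:44:55", "66:77:88:99:aa:bb")
print(pick_link(*flow, num_links=4) == pick_link(*flow, num_links=4))  # True

print(bucket_share(4))  # [2, 2, 2, 2]
print(bucket_share(3))  # [3, 3, 2]
print(bucket_share(6))  # [2, 2, 1, 1, 1, 1]
```

With 8 buckets, only 2, 4 or 8 links receive equal shares, which is one way to see why the balance depends on the link count and the number of buckets.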
To bring interfaces into a bundle, several parameters must match, such as speed, duplex, trunk/access, allowed VLANs, STP cost and so on.
It is recommended to run a dynamic protocol such as LACP to set up the bundle. This prevents failure modes where a switching loop is created because one side is unconditionally bundling links while the other side has not yet formed the bundle. Port Channels are treated as a single logical interface by STP, and a single physical interface will be responsible for transmitting BPDUs for the bundle. EtherChannel misconfig guard can protect against failures where BPDUs with different source MAC addresses arrive on ports in the bundle.
STP Scalability and vPC
MST offers greater scalability than RPVST+ due to sending only one BPDU and the decoupling of VLANs from instances. Normally two instances are enough with MST. With MST, VLANs can be created without affecting the STP instances. MST can also better support stretched layer 2 domains through the use of regions.
To achieve load balancing with MST, at least two STP instances need to be defined and different switches will be the root for each of these instances.
Recommendations for MST:
- Define a region configuration to be copied to all the switches that are part of the Layer 2 topology
- As part of the region configuration, define to which instances all the VLANs belong. Normally two instances would be enough
- Define primary and secondary root switches for all the instances that you have defined, also for instance 0. Typically one switch would be the root for instance 0 and instance 1 and a redundant aggregation switch for instance 2
- Preprovision all VLAN mappings and topologies and later create VLANs as needed
Special Considerations for Spanning Tree with vPCs
Virtual Port Channel (vPC) is a technology used on Nexus switches where two switches act as if they were one by having the primary switch generate BPDUs, LACP messages and so on. The two switches use a link between them to synchronize state and to pass traffic over. This link is called the vPC peer link. Ports that are not configured for vPC behave as normal ports, meaning that BPDUs get generated by the local switch.
Some modifications have been made to STP for use in combination with vPC. They are the following:
- The peer link should never be blocking because it carries important traffic such as Cisco Fabric Services over Ethernet (CFSoE) Protocol. The peer link is always forwarding
- On vPC ports, only the primary switch generates BPDUs. The secondary switch will relay incoming BPDUs to the primary switch
The following picture shows the behavior of Spanning Tree on Nexus switches:
The operational primary switch sends BPDUs towards Access1 even though it is not the STP Root. BPDUs that come from Access1 are relayed by Agg2. On ports that are not members of a vPC, normal rules apply, meaning that both Agg switches will send BPDUs towards Access2.
It is recommended to align the operational primary role with the STP Root role. If the peer-link fails, the vPC ports on the secondary switch will be shut down. To keep SVIs up for non-vPC VLANs if the peer-link fails, use a backup link between the switches that is independent from the peer-link or the dual-active exclude command. If using an extra link, remove all the non-vPC VLANs from the vPC peer-link.
MST and vPC Best Practices
- Associate the root and secondary root role at the aggregation layer and match the vPC primary and secondary roles with the STP root role.
- One MST instance is enough
- Configure regions during the deployment phase
- If changing the VLAN to instance mapping, change both the primary and secondary vPC to avoid global inconsistency
- Use dual-active exclude command to not isolate non vPC VLANs when the peer-link is lost
If using RPVST+, use the long pathcost method so that lower speed interfaces do not get the same metric as higher speed interfaces. This should be the default for MST but may vary by platform.
Scaling may be affected by the following parameters:
- The number of PortChannels
- The number of VLANs supported by the switch
- Logical interface count
- Oversubscription rate
The logical port count is the sum, over all physical ports, of the number of VLANs on each port. When vPC is used, the secondary device passes BPDUs to the primary device, which increases the scale of logical interfaces. A PortChannel is a logical interface so it counts as a single logical port per VLAN regardless of the number of links it contains. To calculate the logical ports, multiply the number of vPCs by the number of VLANs on each vPC. For non-vPC switches, the logical port count is the number of trunks * number of VLANs + number of access ports. For a switch with 10 trunks with 100 VLANs and 10 access ports that is 1010 logical ports.
Virtual ports are a line card limitation: each line card can support a maximum number of logical ports. Virtual ports are calculated the same way but for a PortChannel, all physical interfaces count individually.
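The logical port arithmetic above is simple enough to capture in two helper functions. This Python sketch just restates the formulas from the text; the function names and the vPC sample numbers are made up for illustration.

```python
def logical_ports(trunks: int, vlans_per_trunk: int, access_ports: int) -> int:
    """Non-vPC switch: every trunk counts once per VLAN, every access port once."""
    return trunks * vlans_per_trunk + access_ports


def vpc_logical_ports(vpcs: int, vlans_per_vpc: int) -> int:
    """vPC case: each vPC counts once per VLAN, regardless of member links."""
    return vpcs * vlans_per_vpc


# The example from the text: 10 trunks with 100 VLANs each plus 10 access ports.
print(logical_ports(10, 100, 10))  # 1010

# Hypothetical vPC example: 20 vPCs carrying 50 VLANs each.
print(vpc_logical_ports(20, 50))   # 1000
```

The multiplication by VLAN count is what makes manual pruning on trunks such an effective way to bring the number down.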
To reduce the number of logical ports, the following concepts are important:
- Implement multiple aggregation modules
- Perform manual pruning on trunks
- Use MST instead of (R)PVST+
- Distribute trunks and access ports across line cards
- Remove unused VLANs going to Content Switching Modules (CSM) – The CSM automatically has all VLANs defined in the system configuration
This post describes key concepts of STP, different STP optimizations and which scaling factors are important in designing a layer 2 network.