I’m writing a short summary of REP as part of my CCDE studies. REP is an alternative to STP and is most often run in ring-based topologies. It is not limited to these topologies, however, and it can also interact with STP if desired. REP is Cisco proprietary; other vendors have similar protocols, such as EAPS from Extreme Networks.
REP uses the concept of segments. A segment ID is configured on all switches
belonging to the same segment. Two edge ports are selected where the REP
segment ends. These edge ports must not have connectivity with each other.
One port is blocking; this port is called the Alternate port. All other
ports are transit ports. Traffic flows towards the edge ports.
REP port roles
REP ports are either Failed, Open or Alternate.
- All regular segment ports start out as Failed ports
- After adjacencies have been determined, ports move to the Alternate state. Once negotiation of the Alternate port is done, the remaining ports move to the Open state while one port stays in the Alternate state
- When a failure occurs on a link, all ports move to the Failed state. When the Alternate port receives the failure notification, it moves to the Open state
REP does not work the same way that EAPS does. EAPS sends out a poll on one port
and expects to see it back on the other port facing the ring. It has a master node
that is responsible for this action.
REP works by detecting link failure (Loss of Signal). REP also forms adjacencies
with directly connected switches. Because the main convergence mechanism is detecting LoS,
the network should be designed without media converters or shared segments that
could prevent a failure from being detected. The REP Link Status Layer (LSL) is responsible for
detecting REP-aware neighbors and establishing connectivity within a segment. After
connectivity has been set up, REP chooses which port will be Alternate and the other
ports will be forwarding. The Alternate port can also be selected manually if desired.
As mentioned earlier, the main mechanism is detecting Loss of Signal. In the rare case
that the interface does not go down but connectivity is lost, REP must rely on timers.
By default the interface stays up for five seconds when LSL hellos have
not been received from a neighbor.
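The timer behavior can be sketched as a simple hold-time check. The class and names below are made up for illustration; only the five-second default comes from REP:

```python
import time

LSL_HOLD_TIME = 5.0  # seconds an interface stays up without LSL hellos (REP default)

class LslNeighbor:
    """Tracks LSL hellos from one directly connected REP neighbor (illustrative)."""

    def __init__(self):
        self.last_hello = time.monotonic()

    def hello_received(self):
        # Each hello from the neighbor resets the hold timer
        self.last_hello = time.monotonic()

    def is_alive(self, now=None):
        # The adjacency survives until the hold time expires
        now = time.monotonic() if now is None else now
        return (now - self.last_hello) < LSL_HOLD_TIME

n = LslNeighbor()
print(n.is_alive())  # True right after a hello
```

This is why LoS detection is preferred: without it, a dead link is only declared after the full hold time.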
When a link fails, a notification is sent to a multicast destination address. This notification
is flooded in hardware, which speeds up convergence. When a switch receives the notification,
it must flush its L2 MAC table.
Interaction with STP
REP can interact with STP by generating TCN BPDUs. This can be desirable if you run REP
in a metro network with STP running in the network above it. Generally, though,
it is best to avoid such a large L2 segment, so the REP segment should be
connected to a PE that runs MPLS/IP toward the core.
End Port Advertisements
Starting from the edge ports, End Port Advertisements (EPA) are sent out every four seconds.
These messages are used to discover the REP topology. The messages are relayed by all
intermediate ports, which means that all switches in the same segment know what the
topology looks like and the state of every port in the segment. This can also be used
to see what the topology looked like before a failure, because REP has an archive feature.
Other features of REP
REP supports preemption, meaning that when a failed link comes back the network can return
to what it looked like before the failure. Manual preemption can also be used, but
it will cause a temporary loss of traffic.
REP also supports VLAN load balancing, meaning that the topology can look different
depending on the VLAN. However, REP is not per-VLAN in the sense that the hellos are
always sent on one VLAN, unlike PVST+/RPVST+ which send BPDUs per VLAN.
REP uses the concept of an administrative VLAN, which can be configured; the default is
VLAN 1.
Like any control plane protocol running in our networks, REP can be open to attack.
What would happen if someone faked REP PDUs to make the network converge in an
unexpected manner, or kept sending these PDUs at a very high rate to flap ports?
Obviously this could be a dangerous scenario. Cisco thought of this and implemented a key
mechanism that starts from the Alternate port. The key consists of a port ID and a randomly
generated number created when the port activates. This key is distributed through the
segment to the other devices, which can then use it to unblock the Alternate port.
REP is a Cisco proprietary protocol mainly used in metro ring networks. It is likely
to converge faster than STP and can achieve best-case convergence of around 50 ms. REP
can interact with STP by sending TCN BPDUs. REP is a similar technology to EAPS, with some
differences. REP is supported on Cisco ME switches.
In the future I think protocols like REP and EAPS will start to fade away as metro
networks go all MPLS/IP.
In today’s networks, reliability is critical. Reliability needs to be high and
convergence needs to be fast. There are several ways of detecting network failures,
but not all of them scale. This post takes a look at different methods of
detection and discusses when one or the other should be used.
Routing Convergence Components
There are mainly four components of routing convergence:
- Failure detection
- Failure propagation (flooding)
- Topology/Routing recalculation
- Update of the routing and forwarding table (RIB and FIB)
With modern networking equipment and CPUs it’s actually the first
one that takes the most time, not the flooding or recalculation of the topology.
Failure can be detected at different levels of the OSI model: layer 1, 2
or 3. When designing the network it’s important to weigh complexity and cost
against the convergence gain. A more complex solution could increase the Mean Time
Between Failures (MTBF) but also increase the Mean Time To Repair (MTTR), leading
to lower reliability in the end.
Layer 1 Failure Detection – Ethernet
Ethernet has built-in detection of link failure. This works by sending
pulses across the link to test its integrity. This is dependent on
autonegotiation, so don’t hardcode links unless you must! When
running a P2P link over a CWDM/DWDM network, make sure that link failure
detection is still operational, or use higher-layer methods of detecting failures.
Carrier Delay
- Runs in software
- Filters link up and down events, notifies protocols
- Most IOS versions default to 2 seconds to suppress flapping
- Not recommended to set it to 0 on SVIs
- Router feature
- Delays the link down event only
Debounce Timer
- Runs in firmware
- 100 ms default in NX-OS
- 300 ms default on copper in IOS and 10 ms for fiber
- Recommended to keep it at the default
- Switch feature
IP Event Dampening
If you modify the carrier delay and/or debounce timer, look at implementing IP
event dampening. Otherwise there is a risk of the interface flapping a lot
if the timers are too fast.
Layer 2 Failure Detection
Some layer 2 protocols, like Frame Relay and PPP, have their own keepalives. This
post only looks at Ethernet.
UDLD
- Detects one-way connections due to hardware failure
- Detects one-way connections due to soft failure
- Detects miswiring
- Runs on any single Ethernet link, even inside a bundle
- Typically a centralized implementation
UDLD is not a fast protocol. Detecting a failure can take more than 20 seconds, so
it shouldn’t be used for fast convergence. There is a fast version of UDLD, but it
still runs centralized, so it does not scale well and should only be used on a select
few ports. It supports sub-second convergence.
Spanning Tree Bridge Assurance
- Turns STP into a bidirectional protocol
- Ensures spanning tree fails “closed” rather than “open”
- If the port type is “network”, send BPDUs regardless of state
- If a network port stops receiving BPDUs, it is put in the BA-inconsistent state
Bridge Assurance (BA) can help protect against bridging loops where a port becomes
designated because it has stopped receiving BPDUs. This is similar to the function
of loop guard.
LACP
It’s not common knowledge that LACP has built-in mechanisms to detect failures.
This is why you should never hardcode Etherchannels between switches; always
use LACP. LACP is used to:
- Ensure configuration consistency across bundle members on both ends
- Ensure wiring consistency (bundle members between 2 chassis)
- Detect unidirectional links
- Provide a bundle member keepalive
LACP peers will negotiate the requested send rate through the use of PDUs.
If keepalives are not received a port will be suspended from the bundle.
LACP is not a fast protocol, default timers are usually 30 seconds for keepalive
and 90 seconds for dead. The timer can be tuned but it doesn’t scale well if you
have many links because it’s a control plane protocol. IOS XR has support for
sub second timers for LACP.
Layer 3 Failure Detection
There are plenty of protocol timers available at layer 3: OSPF, EIGRP, IS-IS,
HSRP and so on. Tuning these from their default values is common, and many of
these protocols support sub-second timers, but because they must be processed by the
RP/CPU they don’t scale well if you have many interfaces enabled. Tuning these
timers can work well in small and controlled environments, though. These are
some reasons not to tune layer 3 timers too low:
- Each interface may have several protocols running, like PIM, HSRP and OSPF
- Increased supervisor CPU utilization, leading to false positives
- More complex configuration and wasted bandwidth
- Might not support ISSU/SSO
Bidirectional Forwarding Detection (BFD) is a lightweight protocol designed to
detect liveness over links/bundles. BFD is:
- Designed for sub-second failure detection
- Any interested client (OSPF, HSRP, BGP) registers with BFD and is notified when BFD detects a loss
- All registered clients benefit from uniform failure detection
- Uses UDP port 3784/3785 (echo)
Because any interested protocol can register with BFD, fewer packets go
across the link, which wastes less bandwidth; the packets
are also smaller in size, which reduces this even more.
Many platforms also support offloading BFD to line cards which means that the
CPU does not get increased load when BFD is enabled. It also supports ISSU/SSO.
BFD negotiates the transmit and receive intervals. If we have a router R1
that wants to transmit at a 50 ms interval but R2 can only receive at 100 ms,
then R1 has to transmit at a 100 ms interval.
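The negotiation rule is simply “the slower side wins”. A tiny sketch (the function name is mine; the rule follows RFC 5880-style BFD negotiation):

```python
def negotiated_tx_interval(local_desired_min_tx_ms, remote_required_min_rx_ms):
    """Effective transmit interval: the slower of what we want to send
    and what the peer says it can receive."""
    return max(local_desired_min_tx_ms, remote_required_min_rx_ms)

# R1 wants to transmit every 50 ms, but R2 can only receive every 100 ms:
print(negotiated_tx_interval(50, 100))  # 100
```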
BFD can run in asynchronous mode or echo mode. In asynchronous mode the BFD
packets go to the control plane to detect liveness. This can be combined
with echo mode, which sends a packet whose source and destination IP are those of the
sending router itself. This way the packet is looped back at the other end,
testing the data plane. When echo mode is enabled, the control plane packets
are sent at a slower pace.
There can be challenges running BFD over link bundles. Due to CEF polarization,
control plane/data plane packets might always be sent over the same link. This
means that not all links in the bundle can be properly tested. There is
a per-link BFD mode, but it seems to have limited support so far.
Event Driven vs Polled
Generally, event-driven mechanisms are both faster and scale better than polling-based
mechanisms of detecting failure. Rely on event-driven mechanisms if you have the option
and only use polled mechanisms when necessary.
Detecting a network failure is a very important part of network convergence. It
is generally the step that takes the most time. Which protocols to use depends
on the network design and the platforms used. Don’t enable all protocols on a link
without knowing what they actually do. Don’t tune timers too low unless you
know why you are tuning them. Use BFD if you can, as it is faster and uses
fewer resources. For more information, refer to BRKRST-2333.
I’m planning to do a post on BPDUs sent by Cisco switches and analyze why they are sent. To fully understand the coming post we first need to understand the different versions of Ethernet. There is more than one version? Yes, there is, although mainly one is used for all communication.
Most people will know that Robert Metcalfe was one of the inventors of Ethernet. Robert was working for Xerox back then. Digital, Intel and Xerox worked together on standardizing Ethernet, which is why it is often referred to as a DIX frame. The DIX version 1 standard was published in 1980, and the version used today is version 2. This is why we refer to Ethernet II or Ethernet version 2. The DIX frame is the frame type that is most often used.
IEEE was also working on standardizing Ethernet. They began working on it in February 1980, which is why the standard is called 802, where 802.3 is the Ethernet standard. We refer to it as Ethernet, even though when the IEEE released their standard it was called “IEEE 802.3 Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications”. So here we see the term CSMA/CD for the first time.
I’m not here to give you a history lesson, but instead to explain the frame types and briefly discuss the fields in them. We start with the DIX frame, or Ethernet II frame. This is the frame most commonly used today. It looks like this.
The preamble is a pattern of alternating ones and zeroes ending with two ones. When this pattern is received, it is known that whatever comes after it is the actual frame.
The source and destination MAC addresses are used for switching based on MAC.
The EtherType field specifies the upper-layer protocol. Some of the most well-known values are:
0x0800 – IP
0x8100 – 802.1Q tagged frame
0x0806 – ARP
0x86DD – IPv6
After that follows the actual payload, which must be between 46 and 1500 bytes in size.
At the end there is a Frame Check Sequence (FCS), which is used to check the validity of the frame. If the CRC check fails, the frame is dropped.
In total the frame will be a maximum of 1514 bytes, or 1518 if counting the FCS.
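As a rough illustration, the 14-byte Ethernet II header can be unpacked like this (a sketch using Python’s struct module; the field names and example addresses are mine):

```python
import struct

def parse_ethernet_ii(frame: bytes):
    """Split a raw Ethernet II frame (preamble already stripped by the NIC)
    into destination MAC, source MAC, EtherType and payload."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    return {
        "dst": dst.hex(":"),
        "src": src.hex(":"),
        "ethertype": hex(ethertype),
        "payload": frame[14:],
    }

# Broadcast ARP frame with a minimum-size (46-byte) payload
frame = (bytes.fromhex("ffffffffffff")     # destination: broadcast
         + bytes.fromhex("00005e005301")   # source (example address)
         + b"\x08\x06"                     # EtherType 0x0806 = ARP
         + b"\x00" * 46)
hdr = parse_ethernet_ii(frame)
print(hdr["ethertype"])  # 0x806
```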
When it comes to 802.3 Ethernet there are actually two frame formats. One is 802.3 with an 802.2 LLC SAP header. It looks like this.
This was the original version from the IEEE. Many of the fields are the same. Let’s look at those that are not.
The preamble is now divided into a preamble and a Start Frame Delimiter (SFD), but the function is the same.
The length field is used to indicate how many bytes of data follow this field before the FCS. It can also be used to distinguish between a DIX frame and an 802.3 frame, as for DIX the values in this field will be higher, e.g. 0x0806 for ARP. If this value is 1536 (0x600) or greater, it is a DIX frame and the value is an EtherType value.
Then we have some interesting fields called DSAP, SSAP and Control. SAP stands for Service Access Point; the S and D in SSAP and DSAP stand for source and destination.
They have a similar function to the EtherType. The SAP is used to distinguish between different data exchanges on the same station. The SSAP indicates from which service the LLC data unit was sent and the DSAP indicates the service to which the LLC data unit is being sent. IP has a SAP of 0x06 and 802.1D (STP) has a SAP of 0x42. It would be very strange to have a different SSAP and DSAP, so these values should be the same: IP to IP would be an SSAP of 0x06 and a DSAP of 0x06. One bit (the LSB) in the DSAP is used to indicate whether it is a group address or an individual address. If it is set to zero, it refers to an individual address going to a Local SAP (LSAP). One bit (the LSB) in the SSAP indicates whether it is a command or a response packet. That leaves us with 64 possible different SAPs for the SSAP and DSAP.
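The two LSB flags can be illustrated with a couple of bit masks (the helper names are mine):

```python
def decode_dsap(dsap: int):
    """LSB of the DSAP: 0 = individual address (LSAP), 1 = group address."""
    return {"group": bool(dsap & 0x01), "address": dsap & 0xFE}

def decode_ssap(ssap: int):
    """LSB of the SSAP: 0 = command, 1 = response."""
    return {"response": bool(ssap & 0x01), "address": ssap & 0xFE}

print(decode_dsap(0x42))  # STP's SAP: an individual address
```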
The Control field is used to select whether communication should be connectionless or connection-oriented. Usually error recovery and flow control are performed by higher-level services such as TCP.
The IEEE had problems addressing all the layer 3 processes due to the short DSAP and SSAP fields in the header. This is why they introduced a new frame format called the Subnetwork Access Protocol (SNAP). Basically, this header reuses the Type field found in the DIX header. If the SSAP and DSAP are set to 0xAA and the Control field is set to 0x03, then SNAP encapsulation follows. SNAP is a five-byte extension to the standard 802.2 LLC header, consisting of a 3-byte OUI and a two-byte Type field.
From a vendor perspective this is good, because they can register an OUI and then create their own types to use. If we look at PVST+ BPDUs from a Cisco device, we will see that they are SNAP encapsulated, where the organization code is Cisco (0x00000c) and the PID is PVST+ (0x010b). CDP also uses SNAP, with a PID of 0x2000. I will talk more about BPDUs and STP in a following post, but first I wanted to provide the background on the Ethernet frame types used.
In summary, there are three major Ethernet frame types in use: the DIX frame, also called Ethernet II; IEEE 802.3 with LLC; and IEEE 802.3 with SNAP encapsulation. There are others out there as well, but these are the three major ones, and the DIX frame is by far the most common.
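The classification rules above (type vs. length, plus the 0xAA/0xAA/0x03 SNAP signature) can be sketched as follows (an assumed helper, not a full parser):

```python
def classify_frame(frame: bytes) -> str:
    """Classify a raw frame (preamble stripped) into one of the three
    major Ethernet frame types, using the rules described above."""
    type_or_len = int.from_bytes(frame[12:14], "big")
    if type_or_len >= 0x0600:  # 1536 or greater: it is an EtherType
        return "Ethernet II (DIX)"
    # Otherwise it is a length field; look at the 802.2 LLC header
    dsap, ssap, control = frame[14], frame[15], frame[16]
    if dsap == 0xAA and ssap == 0xAA and control == 0x03:
        return "802.3 with SNAP"
    return "802.3 with LLC"

arp_frame = b"\x00" * 12 + b"\x08\x06" + b"\x00" * 46
print(classify_frame(arp_frame))  # Ethernet II (DIX)
```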
RJ45 pinouts
10BASE-T and 100BASE-TX use pairs two and three; gigabit Ethernet uses all four pairs.
Pinout for straight cable: 1-1;2-2;3-3;6-6
Pinout for crossover cable: 1-3;2-6;3-1;6-2
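The pinouts can be expressed as simple pin maps; walking a PC’s transmit pins through the crossover map shows why it works between like devices:

```python
# Straight-through: each pin maps to itself; crossover swaps the
# transmit pair (pins 1, 2) with the receive pair (pins 3, 6).
STRAIGHT = {1: 1, 2: 2, 3: 3, 6: 6}
CROSSOVER = {1: 3, 2: 6, 3: 1, 6: 2}

# A PC transmits on pins 1 and 2; through a crossover those signals
# arrive on pins 3 and 6, where the other PC listens.
pc_tx_pins = [1, 2]
print([CROSSOVER[p] for p in pc_tx_pins])  # [3, 6]
```

Note that the crossover map is its own inverse, which is why the cable works the same in either direction.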
A standard PC transmits on pins one and two and receives on pins three and six. A switchport is
the opposite. If two alike devices are connected, a crossover cable should be used, although
auto MDI-X is standard today.
Cisco switches can detect the speed of a link through Fast Link Pulses (FLP) even if autonegotiation is disabled, but the duplex cannot be detected, which means that half duplex must be assumed. This is true for 10BASE-T and 100BASE-TX. Gigabit Ethernet uses all four pairs in the cable and can only use full duplex operation. Also note that for gigabit Ethernet autonegotiation is mandatory, although it is possible to hardcode speed and duplex.
Ethernet uses Carrier Sense Multiple Access/Collision Detection (CSMA/CD). Before a client can send a frame, it listens to the wire to check that it is not busy. It then sends the frame and listens to ensure a collision has not occurred. If a collision occurs, the stations that detect it send a jamming signal to ensure that all stations recognize the collision. The senders of the original collided frames wait a random amount of time before sending again.
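The random wait is the classic truncated binary exponential backoff; a minimal sketch (slot-time details omitted):

```python
import random

def backoff_slots(attempt: int) -> int:
    """Truncated binary exponential backoff: after the n-th collision,
    wait a random number of slot times in [0, 2**min(n, 10) - 1]."""
    k = min(attempt, 10)  # the exponent is capped at 10
    return random.randrange(2 ** k)

# After the 3rd collision a station waits between 0 and 7 slot times
print(backoff_slots(3) in range(8))  # True
```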
Deferred frames are frames that were meant to be sent but were held back because frames were being received at the moment; in half duplex, sending and receiving cannot occur at the same time.
Collisions detected while the first 64 bytes are being transmitted are called normal collisions; collisions detected after the first 64 bytes are called late collisions.
The preamble provides synchronization and signal transitions to allow proper clocking of the transmitted signal. It consists of 62 alternating ones and zeroes and ends with a pair of ones.
I/G bit and U/L bit
The I/G bit is the least significant bit of the most significant (first) byte of the MAC address. If set to zero it is an Individual (I) address; if set to one it is a Group (G) address. Multicast at layer two always sends to 01:00:5E, which means that the G bit is set. The bit next to the I/G bit is the U/L bit, which indicates whether it is a Universally (U) administered address or a Locally (L) assigned address. If it is a MAC address set by a manufacturer, this bit should be set to zero.
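A quick way to check these two bits on any MAC address (the helper name is mine):

```python
def mac_bits(mac: str):
    """Inspect the I/G and U/L bits in the first byte of a MAC address."""
    first_byte = int(mac.split(":")[0], 16)
    return {
        "group": bool(first_byte & 0x01),  # I/G: 1 = group (multicast/broadcast)
        "local": bool(first_byte & 0x02),  # U/L: 1 = locally administered
    }

print(mac_bits("01:00:5e:00:00:01"))  # {'group': True, 'local': False}
```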
SPAN and RSPAN
SPAN and RSPAN are used to mirror traffic. The source of traffic can be a VLAN, a switchport or a routed port. Traffic can be mirrored from both rx and tx or just one of them. SPAN sends the traffic to a local destination port; RSPAN sends the traffic to an RSPAN VLAN, which is used to transfer the traffic to its destination. Note that some layer two frames are not sent by default, including CDP, VTP, DTP, BPDU and PAgP; to include these, use the command encapsulation replicate. SPAN is configured with the monitor session command.
The previous post talked about autonegotiation. This time I will talk about cables and pinouts and how auto MDI-X works. Although I’m not very old, I still like to do it the old school way. I don’t rely on auto MDI-X; instead I use the right cable. Let’s look at a pinout for T568B:
A regular end device like a PC transmits on pins one and two and receives on pins three and six. Although we have four pairs, only two are actually used, unless we are using gigabit Ethernet, but that is another topic. A device like a switch does the opposite: it receives on pins one and two and sends on pins three and six. This is why we use a straight-through cable. When connecting similar devices, like a switch to a switch, we need to use a crossover cable, since they want to send on the same pins and receive on the same pins. So when choosing a cable, remember that similar devices require a crossover and different devices need a straight-through.
An engineer at HP developed the auto MDI-X standard because he was tired of looking for crossover cables. But how does it work?
The NIC expects to receive Fast Link Pulses (FLP) on pins three and six. If it receives FLPs, it knows that the configuration is correct. If it doesn’t receive FLPs, it will switch over to MDI-X mode. This is a very simplified view of it; the process involves different timers and an XOR algorithm. If you want to know more, check out the IEEE 802.3 specification, section 3, clause 40.4.4.
Autonegotiation – either you love it or you hate it, but pretty much everyone has an opinion on it. I was going to write something more lengthy at first but decided a blog was the wrong place.
Autonegotiation works by sending electrical pulses. In 10BASE-T these are called Normal Link Pulses (NLP). They are sent every 16 ms, with a tolerance of 8 ms. They are only sent when the Network Interface Card (NIC) is not receiving or sending traffic. They look like this:
In the Fast Ethernet standard (802.3u) these are called Fast Link Pulses (FLP) and they look like this:
These electrical pulses let the two ends determine the speed and duplex modes that are available in autonegotiation. The priority for choosing a speed and duplex mode goes like this:
- 1000Base-T – Full duplex
- 1000Base-T – Half duplex
- 100Base-T2 – Full duplex
- 100Base-TX – Full duplex
- 100Base-T2 – Half duplex
- 100Base-TX – Half duplex
- 10Base-T – Full duplex
- 10Base-T – Half duplex
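Selection can be sketched as walking the priority list and picking the first mode both ends advertise (the data structures are mine for illustration):

```python
# Priority list from highest to lowest, as above
PRIORITY = [
    ("1000Base-T", "full"), ("1000Base-T", "half"),
    ("100Base-T2", "full"), ("100Base-TX", "full"),
    ("100Base-T2", "half"), ("100Base-TX", "half"),
    ("10Base-T", "full"), ("10Base-T", "half"),
]

def best_common_mode(local: set, remote: set):
    """Both ends advertise their abilities via link pulses; the highest
    mode in the priority list supported by both ends is chosen."""
    for mode in PRIORITY:
        if mode in local and mode in remote:
            return mode
    return None  # no common mode

a = {("100Base-TX", "full"), ("100Base-TX", "half"), ("10Base-T", "full")}
b = {("100Base-TX", "half"), ("10Base-T", "full")}
print(best_common_mode(a, b))  # ('100Base-TX', 'half')
```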
If one side is set to auto and the other side is hardcoded, parallel detection kicks in. Parallel detection can determine the speed by looking at the format of the electrical pulses it receives from its link partner. Duplex can’t be detected, so it will default to half duplex. This is why we sometimes see links at 100/half duplex: if one side is auto and the other 100/full, the auto side will end up at 100/half. Such a duplex mismatch is of course very bad; it leads to frame errors, dropped packets and late collisions.
Ethernet is the most used layer 2 protocol today and its dominance is not likely to end anytime soon. I decided to make a section with some quick facts about Ethernet. There is a lot to know about Ethernet, but we usually neglect it because we are very focused on IP. Take a look at an Ethernet frame:
The preamble field is not known to many people. It won’t show up in a packet capture, since the network card will already have stripped it before the frame is available for capture. So what is the purpose of the preamble? The preamble field contains a synchronization pattern that consists of alternating ones and zeros and ends with two consecutive ones. It is used to synchronize node communication but also to indicate where the frame starts. Because it is not processed in the same way as the rest of the frame, we do not have to count the eight bytes of preamble when calculating the Ethernet frame size. This is what the preamble looks like: