Detecting Network Failure – Daniels Networking Blog

Introduction

In today’s networks, reliability is critical. Reliability needs to be high and
convergence needs to be fast. There are several ways of detecting network failure,
but not all of them scale. This post looks at different methods of
detection and discusses when each of them should be used.

Routing Convergence Components

There are mainly four components of routing convergence:

Failure detection
Failure propagation (flooding)
Topology/Routing recalculation
Update of the routing and forwarding table (RIB and FIB)

With modern networking equipment and CPUs it’s actually the first
one that takes the most time, not the flooding or recalculation of the topology.

Failure can be detected at different levels of the OSI model. It can be layer 1, 2
or 3. When designing the network it’s important to look at complexity and cost
vs the convergence gain. Adding redundancy could increase the Mean Time
Between Failures (MTBF), but too much redundancy makes the network less
deterministic and can increase the Mean Time To Repair (MTTR), leading
to lower reliability in the end.
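
One common way to put numbers on this trade-off is steady-state availability,
MTBF / (MTBF + MTTR). The sketch below uses made-up figures (assumptions, not
measurements) to show how a design with a higher MTBF can still end up less
available if it takes longer to repair.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, chosen only to illustrate the trade-off.
simple_design = availability(mtbf_hours=10_000, mttr_hours=1)
complex_design = availability(mtbf_hours=15_000, mttr_hours=4)

print(f"simple design:  {simple_design:.6f}")
print(f"complex design: {complex_design:.6f}")
```

With these numbers the more complex design fails less often but is still
down for a larger share of the time.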

Layer 1 Failure Detection – Ethernet

Ethernet has built-in detection of link failure. This works by sending
pulses across the link to test its integrity. This is dependent on
auto negotiation so don’t hard code links unless you must! When
running a P2P link over a CWDM/DWDM network, make sure that link failure
detection is still operational or use higher layer methods for detecting
failure.

Carrier Delay

Runs in software
Filters link up and down events, notifies protocols
Most IOS versions default to 2 seconds to suppress flapping
Not recommended to set it to 0 on SVI
Router feature
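
The filtering idea behind carrier delay can be sketched as follows. This is a
simplified model, not how any particular platform implements it: the function
name, the event format and the assumption that the link starts in the up state
are all made up for illustration. A raw state change is only passed on to the
protocols if the link stays in the new state for the whole delay.

```python
def filter_link_events(events, delay_ms, end_ms):
    """Model of a carrier-delay style filter.

    events:   time-sorted list of (time_ms, is_up) raw link state changes
    delay_ms: how long a new state must persist before it is reported
    end_ms:   end of the observation window
    Returns the (time_ms, is_up) notifications passed to the protocols.
    """
    reported = []
    reported_state = True  # assumption: the link starts in the up state
    for i, (t, is_up) in enumerate(events):
        # The new state lasts until the next raw event (or the window end).
        next_t = events[i + 1][0] if i + 1 < len(events) else end_ms
        stable_for = next_t - t
        if stable_for >= delay_ms and is_up != reported_state:
            reported.append((t + delay_ms, is_up))
            reported_state = is_up
    return reported


# A 500 ms glitch followed by a real failure at t = 3000 ms.
raw = [(0, False), (500, True), (3000, False)]
with_delay = filter_link_events(raw, delay_ms=2000, end_ms=10_000)
no_delay = filter_link_events(raw, delay_ms=0, end_ms=10_000)
print(with_delay)
print(no_delay)
```

With the 2 second delay the short glitch is hidden but the real failure is
also reported 2 seconds late, which is the trade-off to keep in mind when
tuning the timer.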

Debounce Timer

Delays link down event only
Runs in firmware
100 ms default in NX-OS
300 ms default on copper in IOS and 10 ms for fiber
Recommended to keep it at default
Switch feature

IP Event Dampening

If you modify the carrier delay and/or debounce timer, look at implementing IP
event dampening. Otherwise there is a risk of having the interface flap a lot
if the timers are set too low.
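
Dampening is commonly described as a penalty scheme: each flap adds a penalty,
the penalty decays exponentially over time, and the interface is suppressed
once the penalty crosses a suppress threshold until it has decayed below a
reuse threshold. Here is a minimal sketch of that idea. The class name and all
numeric values are assumptions for illustration, and features such as a
maximum suppress time are left out.

```python
class Dampener:
    """Toy model of penalty-based interface dampening."""

    def __init__(self, half_life_s=5.0, suppress=2000.0, reuse=1000.0,
                 flap_penalty=1000.0):
        self.half_life_s = half_life_s
        self.suppress = suppress
        self.reuse = reuse
        self.flap_penalty = flap_penalty
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # The penalty halves every half_life_s seconds.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update = now
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False

    def flap(self, now):
        """Record a flap at time `now` (seconds)."""
        self._decay(now)
        self.penalty += self.flap_penalty
        if self.penalty >= self.suppress:
            self.suppressed = True

    def is_suppressed(self, now):
        self._decay(now)
        return self.suppressed


d = Dampener()
d.flap(0)
d.flap(1)
print(d.is_suppressed(1))   # two quick flaps stay under the threshold
d.flap(2)
print(d.is_suppressed(2))   # the third flap crosses it
print(d.is_suppressed(12))  # after enough quiet time the interface is reused
```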

Layer 2 Failure Detection

Some layer 2 protocols, like Frame Relay and PPP, have their own keepalives. This
post only looks at Ethernet.

UDLD

Detects one-way connections due to hardware failure
Detects one-way connections due to soft failure
Detects miswiring
Runs on any single Ethernet link even inside a bundle
Typically centralized implementation

UDLD is not a fast protocol. Detecting a failure can take more than 20 seconds, so
it shouldn’t be used for fast convergence. There is a fast version of UDLD that
supports sub second detection, but it still runs centralized, so it does not scale
well and should only be used on a select few ports.

Spanning Tree Bridge Assurance

Turns STP into a bidirectional protocol
Ensures spanning tree fails “closed” rather than “open”
If port type is “network” send BPDU regardless of state
If network port stops receiving BPDU it’s put in BA-inconsistent state

Bridge Assurance (BA) can help protect against bridging loops where a port becomes
designated because it has stopped receiving BPDUs. This is similar to the function
of loop guard.

LACP

It’s not common knowledge that LACP has built-in mechanisms to detect failures.
This is why you should never hardcode Etherchannels between switches. Always
use LACP. LACP is used to:

Ensure configuration consistency across bundle members on both ends
Ensure wiring consistency (bundle members between 2 chassis)
Detect unidirectional links
Bundle member keepalive

LACP peers will negotiate the requested send rate through the use of PDUs.
If keepalives are not received, a port will be suspended from the bundle.
LACP is not a fast protocol. Default timers are usually 30 seconds for keepalive
and 90 seconds for dead. The timers can be tuned but this doesn’t scale well if you
have many links because it’s a control plane protocol. IOS XR has support for
sub second timers for LACP.
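
The relationship between the two timers can be sketched as a simple
multiplier: the dead timer is three keepalive intervals. The 30 second value
comes from the text above. The 1 second fast rate is the one defined in the
IEEE LACP standard. The function name is made up for this example.

```python
# A port is timed out after this many missed keepalive intervals.
TIMEOUT_MULTIPLIER = 3


def lacp_timeout_s(keepalive_interval_s: float) -> float:
    """Worst-case time before a silent bundle member is declared dead."""
    return TIMEOUT_MULTIPLIER * keepalive_interval_s


slow_timeout = lacp_timeout_s(30)  # default (slow) rate
fast_timeout = lacp_timeout_s(1)   # fast rate
print(slow_timeout, fast_timeout)
```

Even at the fast rate detection takes seconds, which is why LACP keepalives
are a safety net rather than a fast convergence tool.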

Layer 3 Failure Detection

There are plenty of protocol timers available at layer 3: OSPF, EIGRP, IS-IS,
HSRP and so on. Tuning these from their default values is common and many of
these protocols support sub second timers, but because their packets must be
processed by the RP/CPU they don’t scale well if you have many interfaces enabled.
Tuning these timers can work well in small and controlled environments though.
These are some reasons to not tune layer 3 timers too low:

Each interface may have several protocols like PIM, HSRP, OSPF running
Increased supervisor CPU utilization leading to false positives
More complex configuration and wasted bandwidth
Might not support ISSU/SSO
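
A rough back-of-the-envelope calculation shows why this matters. The sketch
below only counts hellos sent (the CPU has to process the received ones too),
and the interface count and timer values are illustrative assumptions rather
than recommendations.

```python
def hellos_per_second(num_interfaces, hello_intervals_s):
    """Estimate hellos sent per second by the control plane.

    hello_intervals_s maps a protocol name to its hello interval in seconds.
    Every protocol is assumed to run on every interface.
    """
    per_interface = sum(1.0 / interval for interval in hello_intervals_s.values())
    return num_interfaces * per_interface


# Hypothetical timer sets for a box with 200 routed interfaces.
relaxed_timers = {"OSPF": 10.0, "HSRP": 3.0, "PIM": 30.0}
aggressive_timers = {"OSPF": 0.25, "HSRP": 0.25, "PIM": 1.0}

relaxed_load = hellos_per_second(200, relaxed_timers)
aggressive_load = hellos_per_second(200, aggressive_timers)
print(f"relaxed:    {relaxed_load:.0f} hellos/s")
print(f"aggressive: {aggressive_load:.0f} hellos/s")
```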

BFD

Bidirectional Forwarding Detection (BFD) is a lightweight protocol designed to
detect liveness over links/bundles. BFD is:

Designed for sub second failure detection
Any interested client (OSPF, HSRP, BGP) registers with BFD and is notified when BFD detects loss
All registered clients benefit from uniform failure detection
Uses UDP port 3784 (control) and 3785 (echo)

Because all interested protocols share a single BFD session instead of each
running its own fast hellos, there are fewer packets going across the link, which
means less wasted bandwidth. The packets are also smaller in size, which reduces
this even more.

Many platforms also support offloading BFD to line cards which means that the
CPU does not get increased load when BFD is enabled. It also supports ISSU/SSO.

BFD negotiates the transmit and receive interval. If we have a router R1
that wants to transmit at a 50 ms interval but R2 can only receive at 100 ms,
then R1 has to transmit at a 100 ms interval.
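
The negotiation boils down to taking the slower of the two values. The sketch
below follows the parameter names used in the BFD specification (Desired Min
TX Interval, Required Min RX Interval, Detect Mult). The function names and
the multiplier of 3 are assumptions for this example.

```python
def negotiated_tx_interval_ms(local_desired_min_tx_ms, remote_required_min_rx_ms):
    """A system may not send faster than its peer is willing to receive."""
    return max(local_desired_min_tx_ms, remote_required_min_rx_ms)


def detection_time_ms(remote_detect_mult, local_required_min_rx_ms,
                      remote_desired_min_tx_ms):
    """Asynchronous mode: how long the local system waits before declaring
    the session down after it stops receiving packets from the remote."""
    agreed_remote_tx = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_remote_tx


# R1 wants to send every 50 ms, R2 can only receive every 100 ms.
r1_tx = negotiated_tx_interval_ms(local_desired_min_tx_ms=50,
                                  remote_required_min_rx_ms=100)

# Seen from R2, assuming R1 advertises a detect multiplier of 3.
r2_detect = detection_time_ms(remote_detect_mult=3,
                              local_required_min_rx_ms=100,
                              remote_desired_min_tx_ms=50)
print(r1_tx, r2_detect)
```

So in this example R2 would declare R1 down after 300 ms without packets,
not after 150 ms as R1's own 50 ms setting might suggest.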

BFD can run in asynchronous mode or echo mode. In asynchronous mode the BFD
packets go to the control plane to detect liveness. This can also be combined
with echo mode, which sends a packet with a source and destination IP of the
sending router itself. This way the packet is looped back at the other end,
testing the data plane. When echo mode is enabled the control plane packets
are sent at a slower pace.

Link bundles

There can be challenges running BFD over link bundles. Due to CEF polarization,
the control plane/data plane packets of a session might always be sent over the
same member link. This means that not all links in the bundle can be properly
tested. There is a per link BFD mode but it seems to have limited support so far.
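
The reason is that a bundle picks the member link from a hash of the packet
headers, and a BFD session always has the same headers. The sketch below uses
CRC32 purely as a stand-in for the platform's hash, which is vendor specific.
The function name, addresses and source port are made up for illustration.

```python
import zlib


def pick_member_link(src_ip, dst_ip, src_port, dst_port, num_links):
    """Pick a bundle member from a hash of the flow (stand-in hash function)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % num_links


# One BFD session is one flow, so every packet hashes to the same member.
bfd_flow = ("192.0.2.1", "192.0.2.2", 49152, 3784)
links_used = {pick_member_link(*bfd_flow, num_links=4) for _ in range(1000)}
print(f"member links exercised by BFD: {len(links_used)} of 4")
```

A failure on any of the other member links would go unnoticed by this
session, which is what per link BFD is meant to solve.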

Event Driven vs Polled

Generally, event driven mechanisms are both faster and scale better than polling
based mechanisms of detecting failure. Rely on event driven if you have the option
and only use polled mechanisms when necessary.

Conclusion

Detecting a network failure is a very important part of network convergence. It
is generally the step that takes the most time. Which protocols to use depends
on network design and the platforms used. Don’t enable all protocols on a link
without knowing what they actually do. Don’t tune timers too low unless you
know why you are tuning them. Use BFD if you can, as it is faster and uses
fewer resources. For more information refer to BRKRST-2333.

9 thoughts on “Detecting Network Failure”

Michael
September 27, 2013 at 8:55 am

Neat summary. It’d be nice to hear a bit more elaboration on Ethernet OAM and 802.1ag. AFAIK OAM and autonegotiation may be mutually exclusive (http://www.cisco.com/en/US/products/hw/routers/ps368/module_installation_and_configuration_guides_chapter09186a0080523f3c.html)
- reaper81 (post author)
  September 29, 2013 at 7:52 am
  
  Thanks Michael!
  
  There will be a separate post on Ethernet OAM but I need to read up on it more first.
oergun
September 29, 2013 at 6:31 pm

Good brief. When the complexity increases MTBF would be reduced, right, as opposed to MTBR or MTBM? 🙂
- reaper81 (post author)
  September 30, 2013 at 7:47 am
  
  Thanks. I worded it a bit badly. What I meant was that by increasing redundancy MTBF could be increased, but if we add too much redundancy then the network is not deterministic and MTTR may increase. So you are taking the lab soon, right?
oergun
September 30, 2013 at 8:06 am

Correct, adding back to back links decreases MTBF, while adding parallel links for redundancy (optimally 2) generally increases it. Since I need to design for five 9s for the customer nowadays, I am very careful about it :). Yes, it has been scheduled for November 22 in Chicago. What about you?
Deepak Arora
October 3, 2013 at 4:32 pm

Couple of quick questions

> What about Layer 1 protection mechanisms?
> How to balance between NSF and BFD as a selection, where NSF says route through vs BFD says route around? Does that mean go for NSF if single homed and BFD if dual homed?
> There are a couple of line cards I have seen that don’t understand loss of carrier, so carrier delay doesn’t work
> While LACP is an excellent choice at a high level, I encountered lots of issues while trying to bundle two metro links between two buildings in a campus in a cross stack environment. The links were from different providers and one of them somehow didn’t allow LACP packets to pass, so it ended up being suspended
> A problem with multilink at some point is if your physical interfaces have different latency, for example. So buffer depth tuning is another aspect there

Perhaps we should try to put up a dummy case study using all these and present it to the CCDE group and let’s see what existing CCDEs have to say? 🙂
