Why fast IGP timers aren’t always beneficial

Introduction

When tuning your IGP of choice, the first thing people look at is usually the hello
and dead interval. This is a flawed logic, it is true that it can help in certain
cases but convergence consists of much more than just hello timers.

Why tune timers?

Detecting that the other side of the link is down is an important part of converging.
That’s why your design should avoid putting any bump in the wires such as converters
or a L2 cloud between the L3 endpoints. If you avoid such things when one end of the
link goes down the other end will as well which provides fast detection of failure.

In rare cases you can have the link being up but traffic is not passing over it. For
such cases or for those cases where there was no chance of avoiding a converter or
L2 cloud, tuning the hello timers can help with failure detection. The answer is almost
always BFD though, if the platform supports it.

Topologies where tuning timers is bad

When using a topology where VSS is involved such as Catalyst 6500 or Catalyst 4500,
tuning the timers is very bad. A common topology might look like this:

The L3 switches are dually connected to the VSS. These L3 switches might be in the
distribution layer and the VSS is part of the core. The distribution switches run
LACP towards the VSS which acts as one device from an outside perspective.

The VSS runs Stateful Switchover (SSO) which syncs configuration, boots the standby
supervisor with the software and has the line cards ready to go in case of failure
of the primary chassis. Hardware forwarding tables are also synchronized, SSO
switchover takes somewhere up to 10 seconds.

The active VSS chassis runs the control plane. Routing protocols such as OSPF are not
HA aware, meaning that the state of the routing protocols is not synchronized between
the chassis.

When using fast timers and a switchover occurs, what happens is that OSPF detects that the
neighbor is not replying and tears down the adjacency. The secondary chassis then has to go
bring the adjacency back up by sending out hello packets, exchanging LSAs and updating
RIB/FIB. This may take as long as 20 seconds with the time included from the switchover.

Non Stop Forwarding (NSF)

NSF combined with graceful restart is a technology used to forward packets when
a switchover has occured. The goal of NSF is to delay the failure detection which
may sound strange from a convergence perspective. Remember though that the VSS acts
as one device.

With NSF the forwarding is done according to the last known FIB entries. After a
switchover the secondary VSS will use graceful restart to inform its neighbors that
it has restarted and needs to synchronize its LSDB. This is done by sending hello packets
with a special bit set and the synchronization is done Out Of Band (OOB) to not tear
down the existing adjacency. The neighbors exchange LSAs and run SPF as normal. The
RIB and FIB can then be updated and and normal forwarding ensues.

This process is dependant on that the neighbors are also NSF aware otherwise they
would tear down the adjacency when the secondary VSS is restarting its routing
processes. So the key here is that the adjacency must stay up and that’s why timers
should be left at default if running VSS. This goes for both the VSS and any routers
that are neighbors to the VSS.

Conclusion

When using VSS always leave IGP timers at the default. Fast timers ruins the NSF
process and will lead to much higher convergence times than leaving them at the
default.

Murad

March 31, 2014 at 10:58 pm

Good article Daniel,

I wonder why many people will like to play with the timers. Better to leave the default.
BFD is good but still not many people in SP networks will prefer it. Few seconds extra are better than complete breakdown 🙂 Old gurus call it fancy features. May be in data centers,enterprise networks etc
Thanks

Pete Lumbis

March 31, 2014 at 11:05 pm

Great as always Daniel! Even BFD will work with NSF/GR to tell the neighbor to suspend for a short period of time while the routing protocols reestablish without impact.

reaper81
April 1, 2014 at 8:13 pm

Thanks Pete!

That’s great information. So just another reason to go with BFD then.

Gabriel

August 12, 2014 at 12:48 pm

Hi,
Thanks for the post. While reading the Cisco “ARCH Foundation Learning Guide” i found a description of this situation but it was not really clear :

NSF attempts to maintain the flow of traffic through a router that has experienced a failure. NSF with SSO is designed to maintain a link-up Layer 3 up state during a routing convergence event. However, because an interaction occurs between the IGP timers and the NSF timers, the tuned IGP timers can cause NSF-aware neighbors to reset the neighbor relationships

It’s hard to really undersand the situation just with these few lines. Your post made it very clear

Why fast IGP timers aren’t always beneficial

4 thoughts on “Why fast IGP timers aren’t always beneficial”

Leave a Comment Cancel Reply