Introduction
This post will look at the steps involved in BGP convergence and how BGP interacts with the IGP to converge.
Any network of scale will use route reflectors (RRs), so this post will focus on deployments with RRs. Networks running a full mesh will have all paths available, which makes hot potato routing and fast convergence easily achievable, at the cost of scaling and management overhead. A combination of full mesh and RRs is also possible. One scenario would be to run a full mesh within a point of presence (PoP), with RRs in the PoP peering with central RRs.
BGP can be used for both internal (iBGP) and external (eBGP) peerings, and convergence and timers differ depending on whether the peering is internal or external.
BGP is a path vector protocol which means that it behaves as a distance vector protocol where it can only advertise routes that are installed into the RIB. There is an exception to the rule when BGP selective route download (SRD) is used to not download routes to the RIB but still advertise the routes. BGP will by default only install one path into the RIB even if there are multiple equal candidates and it will only advertise this best route.
Timers
BGP Update Delay
When BGP establishes its first peer, a timer called the update-delay is triggered. This is by default set to 120 seconds and the BGP best path algorithm will not run until this timer expires or until the peer signals that it has sent all routes. The peer can signal that it’s done by either sending a BGP Keepalive or the BGP End of RIB message which is normally used with graceful restart (GR). This timer is used to minimize the number of BGP best path runs, simplify update generation and to better pack routes into TCP segments. Do note though that it would take a lot of routes to make it take longer than 120 seconds and that this timer only runs when the first peer is established.
Minimum Route Advertisement Interval
There is another timer called the minimum route advertisement interval (MRAI) which is set to 0 seconds by default for iBGP but 30 seconds for eBGP. This means that iBGP updates can be immediately sent but eBGP updates will be delayed up to 30 seconds. The goal of this timer is to reduce route churn and to produce fewer BGP updates but it does slow down convergence. The MRAI can be set per neighbor. Do note though that this timer checks when the latest batch of updates was sent and if that was more than 30 seconds ago, then the routes can be sent immediately.
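As a rough illustration of the last point, here is a minimal sketch of how MRAI gates the sending of updates. The function name earliest_send_time and the way time is modelled (plain seconds) are assumptions made for this example, not how any particular BGP implementation is written.

```python
from typing import Optional

EBGP_MRAI = 30.0  # seconds, the eBGP default mentioned above
IBGP_MRAI = 0.0   # seconds, the iBGP default mentioned above


def earliest_send_time(now: float, last_batch_sent: Optional[float], mrai: float) -> float:
    """Return the earliest time (in seconds) the next batch of updates may be sent.

    Hypothetical helper: if nothing has been sent yet, or the last batch was
    sent more than `mrai` seconds ago, the update can go out immediately.
    """
    if last_batch_sent is None:
        return now
    return max(now, last_batch_sent + mrai)


# Last eBGP batch went out 5 seconds ago: the new update waits until t=125.
print(earliest_send_time(100.0, 95.0, EBGP_MRAI))
# Last eBGP batch went out 40 seconds ago: the new update is sent right away.
print(earliest_send_time(100.0, 60.0, EBGP_MRAI))
# iBGP with MRAI 0 never waits.
print(earliest_send_time(100.0, 95.0, IBGP_MRAI))
```

So the 30 seconds is a worst case, not a fixed delay added to every eBGP update.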
BGP Next Hop Tracking
BGP uses a process called the BGP Scanner. This process is responsible for walking the BGP table and confirming reachability of the BGP next-hops. With MPLS VPN, it is also responsible for importing and exporting routes into a VRF. The problem with this process is that it only runs every 60 seconds by default. Imagine a scenario with three routers, PE1, PE2 and PE3, where PE1 is learning prefix p1 via PE2 and PE3. The current best path is through PE2. Now imagine that PE2 goes down due to a power failure or hardware failure. The IGP will react quickly and within a few seconds the IGP will know that the next-hop for p1 is no longer valid. However, BGP is not yet aware that PE2 has gone missing. To detect the lost peer it has to rely on timers, which are normally 60 seconds for keepalives and 180 seconds for the hold time, so it would take up to three minutes for BGP to detect that the peer is gone. Through the BGP Scanner process, BGP will try to validate the next-hop, and in its next run it will notice that the next-hop via PE2 is no longer valid. This takes up to 60 seconds depending on when the BGP Scanner process last ran. So even with a super optimized IGP, BGP will be very slow to converge.
The problem is that the BGP Scanner process is not event driven. BGP next hop tracking (NHT) is the solution to this problem. Think of BGP NHT as an event driven version of the BGP Scanner process. BGP NHT will react to different events such as the next-hop becoming reachable or unreachable, the metric to the next-hop changing and so on. A change of metric to the next-hop is considered a non-critical event, while a change in the reachability of the next-hop is considered a critical event. Only IOS-XR makes this separation, though; IOS reacts the same way to both kinds of event. The job of BGP NHT is to:
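To put numbers on the PE1/PE2/PE3 scenario, here is a small back-of-the-envelope sketch. The function names and the 2 second IGP convergence figure are assumptions picked for illustration; only the 60 and 180 second defaults come from the text above.

```python
def worst_case_next_hop_detection(igp_convergence: float, scanner_interval: float = 60.0) -> float:
    """Worst case for the periodic scanner to notice a dead next-hop.

    Assumes the scanner ran just before the IGP removed the next-hop,
    so a full scanner interval passes before the next table walk.
    """
    return igp_convergence + scanner_interval


def worst_case_peer_detection(hold_time: float = 180.0) -> float:
    """Worst case for BGP to notice the dead peer using only the hold timer."""
    return hold_time


igp = 2.0  # assumed IGP convergence time in seconds
print(worst_case_next_hop_detection(igp))  # scanner based detection
print(worst_case_peer_detection())         # hold timer based detection
```

Even with the IGP done in a couple of seconds, the periodic scanner dominates the total, which is the motivation for making next-hop validation event driven.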
- Determine whether a next-hop is reachable
- Find the fully recursed IGP metric to the next-hop
- Validate the received next-hops
- Calculate the outgoing next-hops
- Verify the reachability and connectedness of neighbors
BGP receives notifications from the routing information base (RIB) when next-hop information changes. BGP will react to these changes reported from the IGP immediately. If a next-hop goes away for a route, then the route is no longer valid. What BGP NHT does, though, is impose a delay before walking the BGP table. If the IGP reports a failure of a next-hop at t0, BGP NHT will not notify BGP to walk the table until a number of seconds has passed. The number of seconds varies by platform and OS, but IOS by default uses 5 seconds for any event and IOS-XR uses 3 seconds for critical and 10 seconds for non-critical events. This delay is meant to give the IGP a chance to flood information, aggregate events and converge.
How does a feature like BGP prefix independent convergence (PIC) work together with BGP NHT? BGP normally selects a single best path. It’s possible to use multiple paths, but BGP still selects only one best path and advertises this to its neighbors. BGP PIC works by installing an additional path, meant as a backup path, in the BGP RIB and also in the forwarding information base (FIB). Because this repair path is already in the BGP RIB and in the FIB, it does not have to wait for BGP NHT to kick in. Once the IGP reacts, the primary path is no longer valid due to the unreachable next-hop. BGP immediately reacts to this and switches to the backup path. This means that convergence is very fast: as fast as your IGP, combined with how fast you can detect the failure, for example through loss of signal (LoS) or bidirectional forwarding detection (BFD). BGP NHT will still walk the table after reacting to the IGP, but your traffic will be protected before this happens, which is the important part.
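The effect of the delay can be sketched as follows. This is a simplified model and an assumption on my part, not vendor code: the first event starts a delay timer, every event that arrives before the timer fires is handled by the same table walk, and an event after the walk starts a new timer.

```python
from typing import Iterable, List


def table_walk_times(event_times: Iterable[float], delay: float) -> List[float]:
    """Return the times at which the BGP table would be walked.

    Hypothetical model: events arriving while a walk is already pending
    are aggregated into that walk instead of scheduling another one.
    """
    walks: List[float] = []
    for t in sorted(event_times):
        if not walks or t > walks[-1]:
            walks.append(t + delay)
    return walks


# Three IGP events within two seconds and a fourth much later, with the
# 5 second IOS default mentioned above: only two table walks are needed.
print(table_walk_times([0.0, 0.5, 2.0, 20.0], delay=5.0))
```

A shorter delay means faster reaction but less aggregation, which is the tradeoff to keep in mind when tuning this timer.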
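The idea of a pre-installed repair path can be illustrated with a toy FIB entry. FibEntry and forward are hypothetical names for this sketch; a real FIB is a hardware or kernel structure and looks nothing like a Python dataclass.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class FibEntry:
    """Toy forwarding entry with a pre-installed backup next-hop."""
    primary: str
    backup: Optional[str] = None


def forward(entry: FibEntry, reachable_next_hops: Set[str]) -> Optional[str]:
    """Pick the next-hop to use given what the IGP currently considers reachable.

    No best path run is needed: the backup is already in place, so the
    switch happens as soon as the primary next-hop is known to be gone.
    """
    if entry.primary in reachable_next_hops:
        return entry.primary
    if entry.backup is not None and entry.backup in reachable_next_hops:
        return entry.backup
    return None


p1 = FibEntry(primary="PE2", backup="PE3")
print(forward(p1, {"PE2", "PE3"}))  # PE2 while everything is up
print(forward(p1, {"PE3"}))         # PE3 as soon as the IGP loses PE2
```

Without the backup field, the second lookup would return nothing until BGP had walked the table and installed a new best path.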
There are other optimizations to NHT which are out of scope for this blog such as only tracking and installing next-hops of a certain length. If a /32 or /31 goes missing, you may not want to fall back to a /24 route or a default route as your next-hop. This brings us to another important part of NHT. If your IGP is divided into areas/levels and you do summarization, this can be problematic with NHT. If NHT is tracking a /32 but that /32 is hidden behind a /24 summary, the summary will still be advertised even if the /32 goes away. This means that NHT can’t react since this change in reachability is not flooded. This means that we have to fall back to BGP converging by itself, which can be very slow. One alternative to handle this could be doing multi hop BFD.
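The summarization problem is easy to see with a longest prefix match. The sketch below uses Python's standard ipaddress module; the helper names, the min_prefix_len parameter and the addresses (taken from the 192.0.2.0/24 documentation range) are assumptions for illustration.

```python
import ipaddress
from typing import Iterable, Optional


def resolve(next_hop: str, rib: Iterable[str]) -> Optional[ipaddress.IPv4Network]:
    """Return the longest matching prefix in the RIB for a next-hop, if any."""
    nh = ipaddress.ip_address(next_hop)
    networks = [ipaddress.ip_network(p) for p in rib]
    matches = [n for n in networks if nh in n]
    return max(matches, key=lambda n: n.prefixlen, default=None)


def is_reachable(next_hop: str, rib: Iterable[str], min_prefix_len: int = 0) -> bool:
    """Treat the next-hop as reachable only if the covering route is specific enough."""
    match = resolve(next_hop, rib)
    return match is not None and match.prefixlen >= min_prefix_len


rib_before = ["192.0.2.1/32", "192.0.2.0/24", "0.0.0.0/0"]
rib_after = ["192.0.2.0/24", "0.0.0.0/0"]  # the /32 is gone, the summary remains

print(resolve("192.0.2.1", rib_before))                      # 192.0.2.1/32
print(resolve("192.0.2.1", rib_after))                       # 192.0.2.0/24
print(is_reachable("192.0.2.1", rib_after))                  # True
print(is_reachable("192.0.2.1", rib_after, min_prefix_len=32))  # False
```

With no length restriction the next-hop still looks reachable through the summary, so nothing triggers. Requiring a /32 makes the loss visible, but only if the /32 is actually present in the local RIB to begin with.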
If your IGP is properly tuned for fast convergence, it may be safe to tune down the NHT a bit although the default values are often the ones that are recommended by Cisco.
Fast Peering Session Deactivation
eBGP sessions are normally established between directly adjacent routers. A keepalive is sent every 60 seconds and the hold time is 180 seconds. BGP uses peering session deactivation to tear down the BGP session if the outgoing interface associated with the BGP session goes down. The same holds true if the next-hop goes away for an eBGP multi hop peering. This is often referred to as bgp fast-external-fallover. This behavior is not desirable for iBGP because peering sessions are generally established to a loopback and there are often several paths, as calculated by the IGP, to reach that loopback. Therefore, this feature is not enabled for iBGP by default. It is desirable to have loss of signal (LoS) or some form of link down event on both sides of the link if the link or router fails. The second best option is to use BFD for detecting if a peer goes away.
Interacting with IGP
Based on the description above, it is clear that if BGP can react to the IGP, convergence will be fast. Convergence is much slower if we have to wait for BGP to send updates and withdraws. For this reason it is important to have diversity of paths, but we will discuss this point later. To achieve fast convergence, we need a tuned IGP. Most SPs will use either OSPF or ISIS, and the two behave very similarly in the way they converge. This post will not go into detail on how to tune your IGP, but the key point is to detect failures quickly. The best way is event driven detection such as LoS and the second best option is BFD. The LSA/LSP needs to be flooded quickly and the SPF algorithm needs to run quickly after receiving the LSA/LSP. It is important, though, to throttle the flooding and SPF runs if there is a lot of churn. Both OSPF and ISIS have timers for this. A feature such as IP dampening could also be used to further reduce churn. This is even more important when BGP NHT is in use because NHT is event driven while the BGP Scanner is periodic. Some platforms also support prioritizing prefixes in the IGP, based on a certain length such as /32 or on a certain tag, so that SPF is run for these prefixes first.
The Importance Of the Next-hop
BGP will normally modify the next-hop to itself on eBGP peerings unless third party next-hop modification is used, which is most common in IX scenarios. When sending updates over an iBGP peering, the next-hop is normally not modified unless next-hop-self is used. It is very common to use next-hop-self to not have to carry the external next-hops in your IGP. Next-hop-self is good in theory, but it does have some drawbacks. If a router sets the next-hop to itself, normally a loopback, all traffic is attracted to that next-hop. Assume that this router loses the eBGP peering over which it is learning all the external prefixes. Because the router is setting the next-hop to itself, and its loopback is still reachable, the IGP will not be aware of this event. Convergence will depend on BGP, where this router must send withdraw messages, then the other routers have to calculate a new best path and so on. If we instead had sent the prefixes with an unmodified next-hop, convergence would have depended on the IGP, where this exit link would be part of the IGP. The failure of the exit link would then trigger BGP NHT to find a new best path.
These are the kinds of decisions that are always involved in a design; there are always tradeoffs. Keep your IGP as small as possible or achieve faster convergence?
If you do put exit links into your IGP, you should either make the links passive or redistribute the network into your IGP. So which one is better? External prefixes have a wider flooding scope (domain) than internal prefixes and do not generally require a full SPF run. Internal prefixes generally require a full run, but incremental SPF (iSPF) has made this pretty much a moot point, also considering that SPF runs very quickly on modern CPUs. If ISIS is in use, then a partial route calculation (PRC) would be run and not a full SPF run, because the prefix would be considered a leaf of the SPF topology: only reachability would have changed, not the topology.
Achieving Diversity When Using Route Reflectors
Almost any network of scale will use RRs in some form. This however has some drawbacks: we will have less path diversity, suboptimal exit paths and a slower converging network. How can we work around these issues?
An RR will only reflect one best path, which means that a lot of the paths in the network will not get used. With MPLS VPN, it is quite easy to achieve path diversity for prefixes that connect to redundant PEs. The “trick” is to use a unique RD per VRF per PE. The RD together with the prefix makes up the VPNv4 prefix, and if the RD is unique, so is the VPNv4 prefix. This means that an RR will see the same network from different PEs as two different prefixes. This way path diversity and faster convergence can be achieved in an MPLS VPN network.
Achieving diversity in an IP network takes some more work. We can’t use the RD trick, which means that we need to use some other features. The first step is to actually get more than one path for a prefix in BGP to our PEs. This can be done in various ways such as shadow RR, shadow session or add path. Using a shadow RR or shadow session only requires support on the RR while add path requires support on the PEs as well. The second part is to have the router actually install the backup path. This is commonly referred to as BGP PIC Edge, where the PE will use a command such as bgp additional-paths install to install the backup path. With a backup path we can achieve fast convergence. If we want to do load sharing, that would require multi path and equal cost routes, or relaxing the BGP best path algorithm to install multiple paths.
In some situations we will not achieve diversity because a PE will not advertise its route due to policy. Imagine that we have two PEs, PE1 and PE2, with a customer prefix p1, and PE1 is the preferred exit by policy such as local preference. Even though PE2 learns p1 via eBGP, it will not install and use this path, because the path via PE1 has a higher local preference and local preference is compared early in the BGP best path algorithm. This means that this diverse path will never get advertised in the network. To overcome this, the best-external feature needs to be used. The PE will then advertise its best external path together with the real best path.
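A toy model of the RD trick: the RR keeps one best path per VPNv4 prefix, and the VPNv4 prefix is the pair (RD, IPv4 prefix). The function below is a sketch under that assumption, with "first path received wins" standing in for the real best path algorithm, and the RD values and PE names are made up.

```python
from typing import Dict, Iterable, Tuple

Advertisement = Tuple[str, str, str]  # (RD, IPv4 prefix, advertising PE)


def reflected_paths(advertisements: Iterable[Advertisement]) -> Dict[Tuple[str, str], str]:
    """Return what a simplified RR would reflect: one path per (RD, prefix)."""
    best: Dict[Tuple[str, str], str] = {}
    for rd, prefix, pe in advertisements:
        key = (rd, prefix)
        if key not in best:  # placeholder for the real best path comparison
            best[key] = pe
    return best


same_rd = [("65000:1", "10.1.0.0/24", "PE1"), ("65000:1", "10.1.0.0/24", "PE2")]
unique_rd = [("65000:101", "10.1.0.0/24", "PE1"), ("65000:102", "10.1.0.0/24", "PE2")]

print(len(reflected_paths(same_rd)))    # one path survives at the RR
print(len(reflected_paths(unique_rd)))  # both paths are reflected
```

With the same RD on both PEs, the remote PEs only ever learn one exit. With a unique RD per PE they learn both, which gives them something to fail over to.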
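Here is a heavily simplified sketch of the PE2 side of that example. It assumes best path selection looks only at local preference, and that a PE only advertises eBGP learned paths to its iBGP peers; the Path class, the function names and the local preference values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Path:
    learned_from: str  # "ebgp" or "ibgp"
    local_pref: int
    next_hop: str


def best_path(paths: List[Path]) -> Path:
    """Simplified best path: highest local preference wins."""
    return max(paths, key=lambda p: p.local_pref)


def advertised_to_ibgp(paths: List[Path], best_external: bool = False) -> Optional[Path]:
    """Return the path this PE would send to its iBGP peers, if any."""
    best = best_path(paths)
    if best.learned_from == "ebgp":
        return best
    if best_external:
        external = [p for p in paths if p.learned_from == "ebgp"]
        return best_path(external) if external else None
    return None


# PE2's view of p1: the iBGP path via PE1 wins on local preference.
pe2_paths = [Path("ibgp", 200, "PE1"), Path("ebgp", 100, "CE")]

print(advertised_to_ibgp(pe2_paths))                      # None: the eBGP path stays hidden
print(advertised_to_ibgp(pe2_paths, best_external=True))  # the eBGP path via CE is advertised
```

Without best-external, the rest of the network never learns that PE2 is a possible exit for p1, so there is no backup path to pre-install.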
The final feature I want to mention is BGP PIC Core which protects against core links/nodes failing. If a core link or node fails, the IGP will converge and the BGP next-hop will stay the same. This means that the path has changed but the next-hop is the same so BGP does not need to converge, only the IGP.
This post should give you a good insight into BGP convergence and which parts go into converging a BGP network.
Excellent post. Informative overview and very top of mind for me lately. Thanks.
Thanks, Phil!
Fun fact. Setting a different MRAI per neighbor will break the update-groups. This used to be the old-school way of isolating “slow-peers” before the actual feature was added. So, if you want to maintain the same update-group members, adjust MRAI for all of them accordingly.
Thanks. Good to know. It would be likely that you would have the same MRAI for all internal peers and same MRAI for all external peers, I suppose.
Great write up. thanks
Thanks!
Fantastic post Daniel.
There’s one more exception to routes being selected best and advertised to peers but not downloaded to RIB and that’s “inactive” routes.
Thanks, Michael! And thanks for that piece of information. So that would most likely be routes with RIB failure, meaning that they are available from some other source protocol and the BGP next-hop is not the same as the one reported by the protocol with the lower AD.
Great post Daniel!
Thanks a lot, Brent!
Nice read Daniel. Few questions
1) If you have to choose between faster convergence vs lot_of_churn, which one would you go with ?.
2) As I see it, BGP is not built to work in an environment where there is a lot of churn happening, but people still use it inside DCs these days with reduced time intervals. What are your thoughts on that?
Thanks. It all comes down to business requirements. There’s no point designing for sub second convergence if my requirements are sub 5 seconds. With BGP PIC, LFA, rLFA etc. we can do a lot of the work in advance by precalculating our backup paths. I would use features such as IP dampening and SPF throttling to not overwhelm my router.
BGP in the DC is an interesting design and I think you could make it work quite well. Dampening feature could be used there as well and timers such as MRAI can be used to not send updates too often. If you have flapping links it will generally cause issues for any protocol and you need to monitor and take care of such links.
In the DC, people moved from OSPF (or ISIS) to BGP in order to reduce the amount of churn in the network. Architecturally, BGP is less chatty, so it does not handle constant change well, while IGPs are built to handle constant change.
Thanks Daniel for sharing this useful info. Kindly do write another blog specially on MPLS AToM.
Thanks Daniel
Very useful post! thanks a lot.
Dear Daniel,
Do you think that in an MPLS environment, setting NHS on the PE-CE peer link can provide faster convergence?
You mean in a scenario where you are running eBGP towards the customer in the VRF?