Christmas is lurking around the corner, and in the spirit of Denise “Fish” Fishburne, I give you “The Tale of the Mysterious PIM Prune”.
I have been working a lot with multicast lately, which is also why I’ve been blogging about it. To start off this story, let’s begin with the network topology.
The multicast source is located in AS 65000, which contains two routers connected to the source. These routers run BFD, OSPF, iBGP, and PIM internally, and the RP is located on C1. There is a local receiver in AS 65000 and a remote one in AS 64512. The networks 10.0.1.0/24 and 10.0.21.0/24 come off the same physical interface. If you want to replicate this lab, all the configs are provided here.
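For orientation, the multicast-relevant configuration on the AS 65000 routers looks roughly like the sketch below. This is only a minimal outline; the interface name and the RP address are assumptions for illustration, so refer to the provided configs for the actual values.

! Enable multicast routing (some platforms require the 'distributed' keyword)
ip multicast-routing
!
interface GigabitEthernet1
 ip pim sparse-mode
!
! RP on C1 (address assumed for illustration)
ip pim rp-address 10.0.255.3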
This network requires fast convergence, and I have been troubleshooting a scenario where the active multicast router (R1) has its LAN interface go down, meaning that the traffic from the source must come in via R2. In this scenario I have seen convergence take up to 60 seconds, which is not acceptable. The BGP design is for R2 to still exit via R1 if R1’s link towards C1 is available. The picture below shows the normal multicast flow.
When R1 has its LAN interface go down, the traffic will pass from R2 over the link to R1 and out to C1.
R1 and R2 have a default route learned via BGP that points at C1. This will be an important piece of the puzzle later. Let’s go through, step by step, what happens when R1 has its LAN interface go down. To simulate the multicast traffic I have an Ubuntu machine acting as the source. The receivers are CSR1000v routers running debug ip icmp so I can see how often the traffic is coming in. I’m sending ICMP packets every 100 ms.
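For reference, this is roughly how the test traffic is generated and observed. The group address comes from the logs below; the join-group command on the receivers is an assumption about how the lab is set up.

# On the Ubuntu source: ping the group every 100 ms (root is needed for sub-200 ms intervals)
sudo ping -i 0.1 239.0.0.1

! On the CSR1000v receivers: join the group so they answer the pings, and watch the replies
interface GigabitEthernet1
 ip igmp join-group 239.0.0.1
!
debug ip icmp

! On R1: the debugs that produce the kind of output shown below
debug ip routing
debug ip mrouting
debug ip pim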
*Dec 14 19:48:08.922: HSRP: Gi3.1 Interface DOWN
The interface on R1 goes down.
*Dec 14 19:48:08.926: is_up: GigabitEthernet3.1 0 state: 6 sub state: 1 line: 1
*Dec 14 19:48:08.926: RT: interface GigabitEthernet3.1 removed from routing table
*Dec 14 19:48:08.926: RT: del 10.0.1.0 via 0.0.0.0, connected metric [0/0]
*Dec 14 19:48:08.926: RT: delete subnet route to 10.0.1.0/24
*Dec 14 19:48:08.926: is_up: GigabitEthernet3.1 0 state: 6 sub state: 1 line: 1
*Dec 14 19:48:08.927: RT(multicast): interface GigabitEthernet3.1 removed from routing table
*Dec 14 19:48:08.927: RT(multicast): del 10.0.1.0 via 0.0.0.0, connected metric [0/0]
*Dec 14 19:48:08.927: RT(multicast): delete subnet route to 10.0.1.0/24
*Dec 14 19:48:08.927: RT: del 10.0.1.2 via 0.0.0.0, connected metric [0/0]
*Dec 14 19:48:08.927: RT: delete subnet route to 10.0.1.2/32
*Dec 14 19:48:08.927: RT(multicast): del 10.0.1.2 via 0.0.0.0, connected metric [0/0]
*Dec 14 19:48:08.927: RT(multicast): delete subnet route to 10.0.1.2/32
It takes roughly 5 ms to remove the route from the RIB and the MRIB.
*Dec 14 19:48:08.935: RT: add 10.0.1.0/24 via 10.0.255.2, bgp metric [200/0]
*Dec 14 19:48:08.936: RT: updating bgp 10.0.21.0/24 (0x0) : via 10.0.255.2 0 1048577
R1 starts to install the route to the source via R2 into the RIB, but the route is not yet fully installed! This is a key concept.
*Dec 14 19:48:08.937: MRT(0): Delete GigabitEthernet1/239.0.0.1 from the olist of (10.0.1.10, 239.0.0.1)
*Dec 14 19:48:08.937: MRT(0): Reset the PIM interest flag for (10.0.1.10, 239.0.0.1)
*Dec 14 19:48:08.938: MRT(0): set min mtu for (10.0.1.10, 239.0.0.1) 1500->1500
*Dec 14 19:48:08.938: MRT(0): (10.0.1.10,239.0.0.1), RPF change from GigabitEthernet3.1/0.0.0.0 to GigabitEthernet1/10.0.11.1
*Dec 14 19:48:08.938: MRT(0): Reset the F-flag for (10.0.1.10, 239.0.0.1)
*Dec 14 19:48:08.938: PIM(0): Insert (10.0.1.10,239.0.0.1) join in nbr 10.0.11.1's queue
*Dec 14 19:48:08.938: PIM(0): Building Join/Prune packet for nbr 10.0.11.1
*Dec 14 19:48:08.938: PIM(0): Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Join
*Dec 14 19:48:08.938: PIM(0): Send v2 join/prune to 10.0.11.1 (GigabitEthernet1)
*Dec 14 19:48:08.938: RT: updating bgp 10.0.1.0/24 (0x0) : via 10.0.255.2 0 1048577
R1 cleans up some multicast state and then sends a PIM Join towards C1, but the source is not located in that direction! The giveaway is the RPF change message: with the connected route gone and the BGP route not yet installed, the RPF lookup falls back to the default route, which points towards C1. So for a brief period the RPF interface happens to be towards C1, and the PIM Join is sent out that interface. Keep in mind that the remote receiver is still located in that direction, though.
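You can check this kind of window manually with an RPF lookup on R1; during that brief moment it would resolve via the default route towards C1:

! On R1: show the current RPF interface and neighbor for the source
show ip rpf 10.0.1.10
! And the unicast route the lookup is based on
show ip route 10.0.1.10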
*Dec 14 19:48:08.938: RT: closer admin distance for 10.0.1.0, flushing 1 routes
*Dec 14 19:48:08.938: RT: add 10.0.1.0/24 via 10.0.255.2, bgp metric [200/0]
*Dec 14 19:48:08.939: MRT(0): (10.0.1.10,239.0.0.1), RPF change from GigabitEthernet1/10.0.11.1 to GigabitEthernet2/10.0.112.2
*Dec 14 19:48:08.939: PIM(0): Insert (10.0.1.10,239.0.0.1) join in nbr 10.0.112.2's queue
*Dec 14 19:48:08.939: PIM(0): Insert (10.0.1.10,239.0.0.1) prune in nbr 10.0.11.1's queue
*Dec 14 19:48:08.939: PIM(0): Building Join/Prune packet for nbr 10.0.11.1
*Dec 14 19:48:08.939: PIM(0): Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Prune
*Dec 14 19:48:08.939: PIM(0): Send v2 join/prune to 10.0.11.1 (GigabitEthernet1)
*Dec 14 19:48:08.939: PIM(0): Building Join/Prune packet for nbr 10.0.112.2
*Dec 14 19:48:08.939: PIM(0): Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Join
*Dec 14 19:48:08.939: PIM(0): Send v2 join/prune to 10.0.112.2 (GigabitEthernet2)
R1 then installs the route via R2 and tries to cover its tracks by pruning off the interface towards C1. Hold on a second, though: isn’t that where the receiver is located? Indeed it is. This means the (S,G) tree is broken until C1 sends a periodic Join, which could take anywhere from 0 to 60 seconds depending on when the last Join came in. The arrows on the topology show the events in order: first a Join towards C1, then a Prune towards C1, then a Join towards R2.
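The broken state is visible in the (S,G) entry on R1; after the Prune, the interface towards C1 is no longer in the outgoing interface list:

! On R1: inspect the (S,G) entry for this flow
show ip mroute 239.0.0.1 10.0.1.10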
*Dec 14 19:48:09.932: PIM(0): Insert (10.0.1.10,239.0.0.1) prune in nbr 10.0.112.2's queue - deleted
*Dec 14 19:48:09.932: PIM(0): Building Join/Prune packet for nbr 10.0.112.2
*Dec 14 19:48:09.932: PIM(0): Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Prune
*Dec 14 19:48:09.932: PIM(0): Send v2 join/prune to 10.0.112.2 (GigabitEthernet2)
R1 no longer has any receivers interested in the traffic; it just pruned off the only remaining receiver, located towards C1. It therefore sends a Prune towards R2.
*Dec 14 19:48:50.864: PIM(0): Received v2 Join/Prune on GigabitEthernet1 from 10.0.11.1, to us
*Dec 14 19:48:50.864: PIM(0): Join-list: (10.0.1.10/32, 239.0.0.1), S-bit set
*Dec 14 19:48:50.864: MRT(0): WAVL Insert interface: GigabitEthernet1 in (10.0.1.10,239.0.0.1) Successful
*Dec 14 19:48:50.864: MRT(0): set min mtu for (10.0.1.10, 239.0.0.1) 18010->1500
*Dec 14 19:48:50.864: MRT(0): Add GigabitEthernet1/239.0.0.1 to the olist of (10.0.1.10, 239.0.0.1), Forward state - MAC built
*Dec 14 19:48:50.864: PIM(0): Add GigabitEthernet1/10.0.11.1 to (10.0.1.10, 239.0.0.1), Forward state, by PIM SG Join
*Dec 14 19:48:50.864: MRT(0): Add GigabitEthernet1/239.0.0.1 to the olist of (10.0.1.10, 239.0.0.1), Forward state - MAC built
*Dec 14 19:48:50.864: MRT(0): Set the PIM interest flag for (10.0.1.10, 239.0.0.1)
*Dec 14 19:48:50.864: PIM(0): Insert (10.0.1.10,239.0.0.1) join in nbr 10.0.112.2's queue
*Dec 14 19:48:50.864: PIM(0): Building Join/Prune packet for nbr 10.0.112.2
*Dec 14 19:48:50.864: PIM(0): Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Join
*Dec 14 19:48:50.864: PIM(0): Send v2 join/prune to 10.0.112.2 (GigabitEthernet2)
C1 is still interested in this multicast traffic, so it sends a periodic Join towards R1, which in turn triggers R1 to send a Join towards R2. There was roughly a 40 second delay between these events. This can also be seen from the ping I had running.
*Dec 14 19:48:08.842: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:08.942: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:09.042: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:51.054: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:51.158: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:51.255: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:51.355: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
This story shows how the unicast routing table and a race condition can affect your multicast traffic. What would a story be without a happy ending, though? What can we do to solve the race condition?
The key here is that the default route is already installed in the RIB. To beat it we have to put a longer match into the RIB or into the MRIB. This can be done with a static unicast route or a static multicast route. I prefer to use a static multicast route, since that has no effect on unicast traffic.
ip mroute 10.0.1.0 255.255.255.0 10.0.112.2
or
ip route 10.0.1.0 255.255.255.0 10.0.112.2
The connected route is the best route until R1 has its LAN interface go down. The MRIB will then use the next best match, which is 10.0.1.0/24 via 10.0.112.2. Let’s run another test now that we have altered the MRIB.
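Before kicking the test off, a quick RPF check on R1 shows which path the multicast RPF lookup prefers; the output also indicates whether the result came from a static mroute or from the unicast RIB:

! On R1: verify the RPF interface, neighbor, and lookup source for 10.0.1.10
show ip rpf 10.0.1.10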
*Dec 14 20:29:03.128: HSRP: Gi3.1 Interface going DOWN
Interface goes down.
*Dec 14 20:29:03.142: is_up: GigabitEthernet3.1 0 state: 6 sub state: 1 line: 1
*Dec 14 20:29:03.142: RT: interface GigabitEthernet3.1 removed from routing table
*Dec 14 20:29:03.142: RT: del 10.0.1.0 via 0.0.0.0, connected metric [0/0]
*Dec 14 20:29:03.142: RT: delete subnet route to 10.0.1.0/24
*Dec 14 20:29:03.143: is_up: GigabitEthernet3.1 0 state: 6 sub state: 1 line: 1
*Dec 14 20:29:03.143: RT(multicast): interface GigabitEthernet3.1 removed from routing table
*Dec 14 20:29:03.143: RT(multicast): del 10.0.1.0 via 0.0.0.0, connected metric [0/0]
*Dec 14 20:29:03.143: RT(multicast): delete subnet route to 10.0.1.0/24
*Dec 14 20:29:03.143: RT: del 10.0.1.2 via 0.0.0.0, connected metric [0/0]
*Dec 14 20:29:03.143: RT: delete subnet route to 10.0.1.2/32
*Dec 14 20:29:03.143: RT(multicast): del 10.0.1.2 via 0.0.0.0, connected metric [0/0]
*Dec 14 20:29:03.143: RT(multicast): delete subnet route to 10.0.1.2/32
Remove 10.0.1.0/24 from the RIB and MRIB.
*Dec 14 20:29:03.150: RT: add 10.0.1.0/24 via 10.0.255.2, bgp metric [200/0]
*Dec 14 20:29:03.150: RT: updating bgp 10.0.21.0/24 (0x0) : via 10.0.255.2 0 1048577
*Dec 14 20:29:03.150: RT: add 10.0.21.0/24 via 10.0.255.2, bgp metric [200/0]
Start installing the BGP route.
*Dec 14 20:29:03.152: MRT(0): (10.0.1.10,239.0.0.1), RPF change from GigabitEthernet3.1/0.0.0.0 to GigabitEthernet2/10.0.112.2
*Dec 14 20:29:03.152: MRT(0): Reset the F-flag for (10.0.1.10, 239.0.0.1)
*Dec 14 20:29:03.152: PIM(0): Insert (10.0.1.10,239.0.0.1) join in nbr 10.0.112.2's queue
*Dec 14 20:29:03.152: PIM(0): Building Join/Prune packet for nbr 10.0.112.2
*Dec 14 20:29:03.152: PIM(0): Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Join
*Dec 14 20:29:03.152: PIM(0): Send v2 join/prune to 10.0.112.2 (GigabitEthernet2)
*Dec 14 20:29:03.153: RT: updating bgp 10.0.1.0/24 (0x0) : via 10.0.255.2 0 1048577
*Dec 14 20:29:03.153: RT: closer admin distance for 10.0.1.0, flushing 1 routes
*Dec 14 20:29:03.153: RT: add 10.0.1.0/24 via 10.0.255.2, bgp metric [200/0]
The RPF interface changes, but this time it points towards the correct upstream router, R2. The multicast traffic can flow again. How many packets did we lose?
*Dec 14 20:29:02.017: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.115: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.216: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.317: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.418: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.519: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.620: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.720: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.820: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.920: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.021: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.122: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.223: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.324: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.424: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.526: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.626: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.727: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.828: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.927: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
No packets lost! I would probably have to send packets more often than every 100 ms to catch the tree converging here. In a real network you would see some delay, because pulling a cable is different from shutting down an interface; the carrier delay would come into play. Here are some key concepts you should take away from this post.
- The unicast routing table impacts the multicast routing table
- Never assume anything, verify
- Installing routes takes time
There are a lot of moving parts here, so don’t worry if you don’t understand it all at once. It’s a complex scenario, and don’t be afraid to ask questions in the comments section. Here are the key steps of convergence in this scenario.
- Interface goes down (0-2 seconds)
- Remove stale route (14 ms)
- Install new route (3 ms)
- RPF change/PIM Join (2 ms)
These are values from my tests; you will see higher values on production equipment. Because PIM Joins are triggered by changes to the routing table, convergence can be very fast depending on how quickly you detect the failure. As you can see from the values above, you can realistically achieve convergence somewhere between 0.5 and 3 seconds, depending on how aggressively you tune your timers.
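As a rough sketch of the kind of tuning I mean (the values are examples rather than recommendations, and platform support varies):

interface GigabitEthernet3
 ! React to link loss immediately instead of waiting out the default carrier delay
 carrier-delay msec 0
!
interface GigabitEthernet2
 ! Aggressive BFD so the IGP detects a dead neighbor in roughly 150 ms
 bfd interval 50 min_rx 50 multiplier 3
 ip ospf bfd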
After reading this post you should have a better understanding of multicast, the RIB, the MRIB, and the impact the unicast routing table has on multicast flows.
Daniel, wouldn’t it be fair to say that the reason the static mroute takes precedence over the BGP route is that mroute lookups prefer the admin distance of the route, and not that the static mroute has a longer mask?
I enjoy your blog, please keep writing!
Seems you are correct. Normally AD would only come into play for equally long routes, but it seems the MRIB prefers static mroutes over longer prefixes from the RIB.