In a previous post, EVPN Deepdive Route Types 2 and 3, I covered route types 2 and 3. In this post I’ll cover route type 5 which is used for advertising IP prefixes. This route type is covered in RFC 9136.
There are two main use cases for advertising IP prefixes in EVPN route type 5:
- Advertising external prefixes into the VXLAN network.
- Advertising prefixes for connectivity towards silent hosts.
The first scenario is fairly obvious. There are other places in the network, such as remote offices reached via a WAN, partners and other external parties, as well as the internet. Routing towards these destinations requires a route type that can carry prefixes, and that is route type 5. Remember, route type 2 only provides host routing, which poses the following problems for external connectivity:
- Advertising everything as /32 and /128 would be highly inefficient.
- It requires an EVPN speaker to generate the RT2 and the external prefixes are originated from non-EVPN speakers.
- It would not be possible to advertise a default route.
- Without RT5, external connectivity would have to be advertised by a protocol other than EVPN.
The last bullet is worth expanding on. If the external prefixes aren’t advertised by EVPN through RT5, then another protocol is needed. This would probably still be BGP, but you would then need to either use VRF-lite, which requires a lot of configuration, or build a VPNv4 topology. With VPNv4 you would have to configure and manage both VXLAN and MPLS in the same network, which adds a lot of overhead.
RFC 9136 brings up another interesting reason why RT5 is needed. RT2 has a tight coupling between MAC and IP address, which suits end hosts. However, there may be other devices in the network, such as load balancers, firewalls, and appliances, which may have many interfaces and may share an IP in a failover pair, using for example VRRP or a proprietary protocol. With a tight coupling of MAC and IP, a change in ownership of the floating IP could cause a lot of route churn as all the RT2 routes are updated. RT5 decouples MAC from IP and only contains the prefix.
Before diving into the silent host scenario, let’s take a look at the structure of RT5:
The main differences from RT2 are:
- MAC is not advertised.
- The IP prefix length is variable.
- Only L3 VNI and no L2 VNI.
- Contains GW IP address.
The GW IP is an interesting part of RT5 that we will get back to in a future post, but it is used as an overlay identifier.
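To make the field layout concrete, here is a minimal Python sketch that packs an RT5 NLRI in the field order given in RFC 9136 (RD, ESI, Ethernet Tag ID, prefix length, prefix, GW IP, label). The class and helper names are my own, and I am assuming a type 1 RD (IP:number) and that the VNI fills the 3-byte label field as it does with VXLAN encapsulation. It is an illustration of the structure, not how any BGP implementation builds the route.

```python
import ipaddress
import struct
from dataclasses import dataclass


@dataclass
class IpPrefixRoute:
    """Hypothetical model of an EVPN RT5 NLRI (field order per RFC 9136)."""
    rd: bytes      # 8-byte Route Distinguisher
    esi: bytes     # 10-byte Ethernet Segment Identifier
    eth_tag: int   # 4-byte Ethernet Tag ID
    prefix: str    # variable-length prefix, e.g. "198.51.100.0/24"
    gw_ip: str     # GW IP address, all zeros when not used
    vni: int       # assumption: VNI carried in the 3-byte label field

    def encode(self) -> bytes:
        net = ipaddress.ip_network(self.prefix)
        gw = ipaddress.ip_address(self.gw_ip)
        if net.version != gw.version:
            raise ValueError("prefix and GW IP must be the same address family")
        if len(self.rd) != 8 or len(self.esi) != 10:
            raise ValueError("RD must be 8 bytes and ESI 10 bytes")
        return b"".join([
            self.rd,
            self.esi,
            struct.pack("!IB", self.eth_tag, net.prefixlen),
            net.network_address.packed,   # 4 bytes (IPv4) or 16 bytes (IPv6)
            gw.packed,                    # same size as the prefix field
            self.vni.to_bytes(3, "big"),
        ])


def type1_rd(ip: str, number: int) -> bytes:
    """Build a type 1 RD (2-byte type, 4-byte IP, 2-byte number)."""
    return struct.pack("!H4sH", 1, ipaddress.ip_address(ip).packed, number)


route = IpPrefixRoute(
    rd=type1_rd("192.0.2.3", 3),
    esi=bytes(10),
    eth_tag=0,
    prefix="198.51.100.0/24",
    gw_ip="0.0.0.0",
    vni=10001,
)
nlri = route.encode()
print(len(nlri))  # 34 bytes for an IPv4 prefix
```

Note that there is no MAC field anywhere in the encoding, and the prefix length is a field of its own rather than being fixed at /32 or /128.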
Now, I said that RT5 is used with silent hosts. There are two main scenarios with silent hosts:
- Hosts are on the same L2 VNI and traffic is bridged.
- Hosts are in different L2 VNIs and traffic is routed.
For the scenario where hosts are in the same L2 VNI, if a host does not know the MAC of the other host it can simply ARP for it and the leaf switch would use whatever method it uses for BUM flooding such as ingress replication or multicast in the underlay. There is no need for RT5 in this scenario.
When hosts are in different L2 VNIs, there are two cases to consider:
- Both L2 VNIs are on both leaf switches.
- The L2 VNIs are not on both leaf switches.
For the first scenario, the leaf switch can ARP for the destination host. For the second scenario, this is where RT5 comes into play. The following topology will be used:
This is the current route table of Leaf-2:
Leaf2# show ip route vrf Tenant1
IP Route Table for VRF "Tenant1"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

10.0.0.0/24, ubest/mbest: 1/0, attached
    *via 10.0.0.1, Vlan20, [0/0], 1w3d, direct
10.0.0.1/32, ubest/mbest: 1/0, attached
    *via 10.0.0.1, Vlan20, [0/0], 1w3d, local
10.0.0.22/32, ubest/mbest: 1/0, attached
    *via 10.0.0.22, Vlan20, [190/0], 00:10:23, hmm
198.51.100.11/32, ubest/mbest: 1/0
    *via 203.0.113.1%default, [200/0], 1w4d, bgp-65000, internal, tag 65000, segid: 10001 tunnelid: 0xcb007101 encap: VXLAN
Note that there is no entry for Server-4, while there is a host entry for Server-1 (198.51.100.11). Before diving into our scenario, there is one thing that we need to do. NX-OS does not advertise any RT5 routes until we configure it to. Let’s do this on Leaf-1 and Leaf-4 by configuring redistribute direct under the VRF:
Leaf1(config)# route-map PERMIT_ALL permit 10
Leaf1(config-route-map)# router bgp 65000
Leaf1(config-router)# vrf Tenant1
Leaf1(config-router-vrf)# address-family ipv4 unicast
Leaf1(config-router-vrf-af)# redistribute direct route-map PERMIT_ALL

Leaf4(config)# route-map PERMIT_ALL permit 10
Leaf4(config-route-map)# router bgp 65000
Leaf4(config-router)# vrf Tenant1
Leaf4(config-router-vrf)# address-family ipv4 unicast
Leaf4(config-router-vrf-af)# redistribute direct route-map PERMIT_ALL
Note that a route-map must be used. Either allow all as above or be more selective. It’s possible to tag prefixes and match on that in the route-map.
Leaf-2 will now have two routes available, one via Leaf-1 and one via Leaf-4:
Leaf2# show bgp l2vpn evpn route-type 5
BGP routing table information for VRF default, address family L2VPN EVPN
Route Distinguisher: 192.0.2.3:3
BGP routing table entry for [5]:[0]:[0]:[24]:[198.51.100.0]/224, version 6421
Paths: (2 available, best #2)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW

  Path type: internal, path is valid, not best reason: Neighbor Address, no labeled nexthop
  Gateway IP: 0.0.0.0
  AS-Path: NONE, path sourced internal to AS
    203.0.113.1 (metric 81) from 192.0.2.12 (192.0.2.2)
      Origin incomplete, MED 0, localpref 100, weight 0
      Received label 10001
      Extcommunity: RT:65000:10001 ENCAP:8 Router MAC:00ad.e688.1b08
      Originator: 192.0.2.3 Cluster list: 192.0.2.2

  Advertised path-id 1
  Path type: internal, path is valid, is best path, no labeled nexthop
             Imported to 2 destination(s)
             Imported paths list: Tenant1 L3-10001
  Gateway IP: 0.0.0.0
  AS-Path: NONE, path sourced internal to AS
    203.0.113.1 (metric 81) from 192.0.2.11 (192.0.2.1)
      Origin incomplete, MED 0, localpref 100, weight 0
      Received label 10001
      Extcommunity: RT:65000:10001 ENCAP:8 Router MAC:00ad.e688.1b08
      Originator: 192.0.2.3 Cluster list: 192.0.2.1

  Path-id 1 not advertised to any peer

Route Distinguisher: 192.0.2.6:3
BGP routing table entry for [5]:[0]:[0]:[24]:[198.51.100.0]/224, version 6423
Paths: (2 available, best #2)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW

  Path type: internal, path is valid, not best reason: Neighbor Address, no labeled nexthop
  Gateway IP: 0.0.0.0
  AS-Path: NONE, path sourced internal to AS
    203.0.113.4 (metric 81) from 192.0.2.12 (192.0.2.2)
      Origin incomplete, MED 0, localpref 100, weight 0
      Received label 10001
      Extcommunity: RT:65000:10001 ENCAP:8 Router MAC:00ad.7083.1b08
      Originator: 192.0.2.6 Cluster list: 192.0.2.2

  Advertised path-id 1
  Path type: internal, path is valid, is best path, no labeled nexthop
             Imported to 2 destination(s)
             Imported paths list: Tenant1 L3-10001
  Gateway IP: 0.0.0.0
  AS-Path: NONE, path sourced internal to AS
    203.0.113.4 (metric 81) from 192.0.2.11 (192.0.2.1)
      Origin incomplete, MED 0, localpref 100, weight 0
      Received label 10001
      Extcommunity: RT:65000:10001 ENCAP:8 Router MAC:00ad.7083.1b08
      Originator: 192.0.2.6 Cluster list: 192.0.2.1

  Path-id 1 not advertised to any peer
Because there are two RRs, Leaf-2 receives each route twice, once from each RR. For each of the two routes (one originated by Leaf-1 and one by Leaf-4), it first selects the copy from one of the RRs based on the neighbor address. Both selected paths are imported into the L3 VNI table, but only one of them becomes best there as we have not configured BGP for ECMP:
Route Distinguisher: 192.0.2.4:3 (L3VNI 10001)
BGP routing table entry for [5]:[0]:[0]:[24]:[198.51.100.0]/224, version 6424
Paths: (2 available, best #2)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW

  Path type: internal, path is valid, not best reason: Router Id, no labeled nexthop
             Imported from 192.0.2.6:3:[5]:[0]:[0]:[24]:[198.51.100.0]/224
  Gateway IP: 0.0.0.0
  AS-Path: NONE, path sourced internal to AS
    203.0.113.4 (metric 81) from 192.0.2.11 (192.0.2.1)
      Origin incomplete, MED 0, localpref 100, weight 0
      Received label 10001
      Extcommunity: RT:65000:10001 ENCAP:8 Router MAC:00ad.7083.1b08
      Originator: 192.0.2.6 Cluster list: 192.0.2.1

  Advertised path-id 1
  Path type: internal, path is valid, is best path, no labeled nexthop
             Imported from 192.0.2.3:3:[5]:[0]:[0]:[24]:[198.51.100.0]/224
  Gateway IP: 0.0.0.0
  AS-Path: NONE, path sourced internal to AS
    203.0.113.1 (metric 81) from 192.0.2.11 (192.0.2.1)
      Origin incomplete, MED 0, localpref 100, weight 0
      Received label 10001
      Extcommunity: RT:65000:10001 ENCAP:8 Router MAC:00ad.e688.1b08
      Originator: 192.0.2.3 Cluster list: 192.0.2.1

  Path-id 1 not advertised to any peer
The route from Leaf-1 is currently selected as best based on Router ID. We can see it in the routing table:
Leaf2# show ip route 198.51.100.0/24 vrf Tenant1
IP Route Table for VRF "Tenant1"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

198.51.100.0/24, ubest/mbest: 1/0
    *via 203.0.113.1%default, [200/0], 00:27:05, bgp-65000, internal, tag 65000, segid: 10001 tunnelid: 0xcb007101 encap: VXLAN
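The two "not best reason" values in the BGP output, Neighbor Address and Router Id, come from the last tie-breakers of BGP best path selection. The Python sketch below models only those final steps and assumes everything earlier (weight, local preference, AS path, origin, MED, IGP metric) is equal, as it is in this lab. For reflected routes the originator ID is compared in place of the router ID. The Path class and function names are hypothetical; this is not the NX-OS algorithm in full.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    next_hop: str
    originator_id: str   # stands in for the router ID on reflected routes
    cluster_list: tuple  # RRs the route has passed through
    neighbor: str        # address of the peer we learned the path from


def tie_break_key(path: Path) -> tuple:
    """Lower is better: originator ID, cluster list length, neighbor address."""
    return (
        int(ipaddress.ip_address(path.originator_id)),
        len(path.cluster_list),
        int(ipaddress.ip_address(path.neighbor)),
    )


def best_path(paths: list) -> Path:
    return min(paths, key=tie_break_key)


# Step 1: the Leaf-1 route arrives from both RRs; the lower neighbor address wins.
leaf1_via_rr1 = Path("203.0.113.1", "192.0.2.3", ("192.0.2.1",), "192.0.2.11")
leaf1_via_rr2 = Path("203.0.113.1", "192.0.2.3", ("192.0.2.2",), "192.0.2.12")
step1 = best_path([leaf1_via_rr1, leaf1_via_rr2])

# Step 2: after import, Leaf-1's path competes with Leaf-4's; the lower router ID wins.
leaf4_via_rr1 = Path("203.0.113.4", "192.0.2.6", ("192.0.2.1",), "192.0.2.11")
step2 = best_path([step1, leaf4_via_rr1])

print(step1.neighbor)   # 192.0.2.11
print(step2.next_hop)   # 203.0.113.1
```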
What if we wanted to enable ECMP for this route? We would then configure maximum paths under the VRF in BGP:
Leaf2(config)# router bgp 65000
Leaf2(config-router)# vrf Tenant1
Leaf2(config-router-vrf)# address-family ipv4 unicast
Leaf2(config-router-vrf-af)# maximum-paths ibgp 2
There are now two routes installed:
Leaf2# show ip route 198.51.100.0/24 vrf Tenant1
IP Route Table for VRF "Tenant1"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

198.51.100.0/24, ubest/mbest: 2/0
    *via 203.0.113.1%default, [200/0], 00:00:03, bgp-65000, internal, tag 65000, segid: 10001 tunnelid: 0xcb007101 encap: VXLAN
    *via 203.0.113.4%default, [200/0], 00:00:03, bgp-65000, internal, tag 65000, segid: 10001 tunnelid: 0xcb007104 encap: VXLAN
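With two next hops installed, the switch has to pick one per flow. The sketch below shows the general idea of hash-based ECMP: hash the flow's header fields and use the result to index into the list of next hops. The hash function and the fields used here are my assumptions for illustration only; the hashing done in the actual hardware is platform-specific and will not match this.

```python
import hashlib


def pick_next_hop(flow: tuple, next_hops: list) -> str:
    """Map a flow to one next hop so that every packet of the flow takes the same path."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


next_hops = ["203.0.113.1", "203.0.113.4"]

# (source IP, destination IP, IP protocol, source port, destination port)
# ICMP has no ports, so zeros are used as placeholders.
icmp_flow = ("10.0.0.22", "198.51.100.44", 1, 0, 0)

chosen = pick_next_hop(icmp_flow, next_hops)
print(chosen)
```

The important property is that the choice is deterministic per flow, which keeps packets of one flow in order, while different flows can spread across both leafs.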
Now let’s get back to our scenario. Server-2 is going to ping Server-4. Server-2 has the MAC of Leaf-2 in its ARP cache so there is no need to ARP for the MAC of the gateway:
server2:~$ ip neighbor | grep 10.0.0.1
10.0.0.1 dev ens192 lladdr 00:01:00:01:00:01 STALE
Server-2 sends the ICMP Echo:
When the packet arrives at Leaf-2, its destination MAC is that of the SVI for VLAN 20, so Leaf-2 accepts the packet for routing. It will do a lookup in its route table for 198.51.100.44:
Leaf2# show ip route 198.51.100.0/24 vrf Tenant1
IP Route Table for VRF "Tenant1"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

198.51.100.0/24, ubest/mbest: 2/0
    *via 203.0.113.1%default, [200/0], 00:00:03, bgp-65000, internal, tag 65000, segid: 10001 tunnelid: 0xcb007101 encap: VXLAN
    *via 203.0.113.4%default, [200/0], 00:00:03, bgp-65000, internal, tag 65000, segid: 10001 tunnelid: 0xcb007104 encap: VXLAN
This is where RT5 comes into play. Leaf-2 can now send the packet to either Leaf-1 or Leaf-4 (where Server-4 is) based on ECMP algorithm. In this case Leaf-4 was selected. This is shown below:
There are some interesting things to note here. In the outer Ethernet header (blue), the source MAC is the MAC of Leaf-2’s outgoing interface towards Spine-1 and the destination MAC is that of Spine-1’s interface. For the inner Ethernet header (gray), the destination MAC is the Router MAC of the remote VTEP, which is used to forward to the correct SVI. We can see the router MACs with the show nve peers command:
Leaf2# show nve peers
Interface Peer-IP                                State LearnType Uptime   Router-Mac
--------- -------------------------------------- ----- --------- -------- -----------------
nve1      203.0.113.1                            Up    CP        1w6d     00ad.e688.1b08
nve1      203.0.113.3                            Up    CP        1w6d     n/a
nve1      203.0.113.4                            Up    CP        1w6d     00ad.7083.1b08
The packet as seen in Wireshark:
Frame 42: 148 bytes on wire (1184 bits), 148 bytes captured (1184 bits) on interface ens161, id 0
Ethernet II, Src: 00:ad:b3:fd:1b:08, Dst: 00:ad:70:83:1b:08
Internet Protocol Version 4, Src: 203.0.113.2, Dst: 203.0.113.4
User Datagram Protocol, Src Port: 64182, Dst Port: 4789
Virtual eXtensible Local Area Network
    Flags: 0x0800, VXLAN Network ID (VNI)
    Group Policy ID: 0
    VXLAN Network Identifier (VNI): 10001
    Reserved: 0
Ethernet II, Src: 00:ad:f3:bb:1b:08, Dst: 00:ad:70:83:1b:08
Internet Protocol Version 4, Src: 10.0.0.22, Dst: 198.51.100.44
Internet Control Message Protocol
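The VXLAN header in the capture is only 8 bytes: a flags byte where the I bit (0x08) signals a valid VNI, reserved bits, the 24-bit VNI, and a final reserved byte. A small Python sketch of building and parsing it, per the RFC 7348 layout, shows where the L3 VNI 10001 sits in the packet. The function names are my own.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" bit from RFC 7348


def build_vxlan_header(vni: int) -> bytes:
    """Return the 8-byte VXLAN header for the given 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, the rest reserved.
    # Word 2: VNI in the top 24 bits, last byte reserved.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vxlan_header(data: bytes) -> int:
    """Return the VNI from a VXLAN header, checking the I bit first."""
    word1, word2 = struct.unpack("!II", data[:8])
    flags = word1 >> 24
    if not flags & VXLAN_FLAG_VNI_VALID:
        raise ValueError("I bit not set, VNI is not valid")
    return word2 >> 8


header = build_vxlan_header(10001)
print(header.hex())                # 0800000000271100
print(parse_vxlan_header(header))  # 10001
```

The leading 0800 matches the Flags: 0x0800 line in Wireshark, which displays the first 16 bits of the header, and 0x002711 is 10001.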
When the packet arrives at Leaf-4, it is consumed by SVI for VLAN 100. The packet is routed and as 198.51.100.0/24 is available locally, the packet could be forwarded to 198.51.100.44, but there is nothing in the ARP cache. Leaf-4 must ARP for 198.51.100.44. This frame can be seen below:
The frame as seen in Wireshark:
Frame 60: 60 bytes on wire (480 bits), 60 bytes captured (480 bits) on interface ens194, id 8
Ethernet II, Src: 00:01:00:01:00:01, Dst: ff:ff:ff:ff:ff:ff
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: 00:01:00:01:00:01
    Sender IP address: 198.51.100.1
    Target MAC address: ff:ff:ff:ff:ff:ff
    Target IP address: 198.51.100.44
Note that this frame does not only get flooded locally. It will also get flooded to all the members of L2 VNI 10000 via ingress replication. This packet is sent to all the NVE peers for that L2 VNI, 203.0.113.1, 203.0.113.2, and 203.0.113.3 in my lab:
It’s interesting to note that two of the packets had one destination MAC, that of Spine-1 (00:ad:b3:fd:1b:08) and the other had the destination MAC of Spine-2 (00:ad:7b:30:1b:08). This is due to ECMP towards the NVE IP addresses. Note that all packets had the same source MAC of 00:ad:70:83:1b:08. This is due to the IP unnumbered setup in my lab.
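Ingress replication is simple to model: the ingress VTEP makes one unicast VXLAN copy of the BUM frame for every remote VTEP in the flood list of that L2 VNI. The Python sketch below uses the peer addresses from my lab; the class, function, and variable names are hypothetical and the flood list is hard-coded here, whereas in the fabric it is built from EVPN route type 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VxlanPacket:
    outer_src: str      # local VTEP address
    outer_dst: str      # remote VTEP address
    vni: int
    inner_frame: bytes  # the original broadcast frame, unchanged


def ingress_replicate(frame: bytes, vni: int, local_vtep: str, flood_list: dict) -> list:
    """Create one unicast VXLAN copy of a BUM frame per remote VTEP in the VNI."""
    return [
        VxlanPacket(local_vtep, peer, vni, frame)
        for peer in flood_list.get(vni, [])
    ]


# Flood list for L2 VNI 10000 as seen from Leaf-4 (203.0.113.4).
flood_list = {10000: ["203.0.113.1", "203.0.113.2", "203.0.113.3"]}
arp_request = b"placeholder for the broadcast ARP frame"

copies = ingress_replicate(arp_request, 10000, "203.0.113.4", flood_list)
for packet in copies:
    print(packet.outer_src, "->", packet.outer_dst, "VNI", packet.vni)
```

Each copy is an ordinary unicast packet in the underlay, which is why the copies can be hashed to different spines.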
Server-4 will then send a response to the ARP request:
The frame as seen in Wireshark:
Frame 61: 60 bytes on wire (480 bits), 60 bytes captured (480 bits) on interface ens194, id 8
Ethernet II, Src: 00:50:56:ad:7d:68, Dst: 00:01:00:01:00:01
Address Resolution Protocol (reply)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: reply (2)
    Sender MAC address: 00:50:56:ad:7d:68
    Sender IP address: 198.51.100.44
    Target MAC address: 00:01:00:01:00:01
    Target IP address: 198.51.100.1
The ARP cache of Leaf-4 is now populated with the MAC of Server-4:
Leaf4# show ip arp vrf Tenant1

Flags: * - Adjacencies learnt on non-active FHRP router
       + - Adjacencies synced via CFSoE
       # - Adjacencies Throttled for Glean
       CP - Added via L2RIB, Control plane Adjacencies
       PS - Added via L2RIB, Peer Sync
       RO - Re-Originated Peer Sync Entry
       D - Static Adjacencies attached to down interface

IP ARP Table for context Tenant1
Total number of entries: 1
Address         Age       MAC Address     Interface       Flags
198.51.100.44   00:06:48  0050.56ad.7d68  Vlan10
This means that the ICMP Echo can be forwarded to Server-4:
Packet as seen by Wireshark:
Frame 91: 98 bytes on wire (784 bits), 98 bytes captured (784 bits) on interface ens194, id 8
Ethernet II, Src: 00:ad:70:83:1b:08, Dst: 00:50:56:ad:7d:68
Internet Protocol Version 4, Src: 10.0.0.22, Dst: 198.51.100.44
Internet Control Message Protocol
From here on forwarding will take place as usual. It’s interesting to note though that as Leaf-4 now knows both the MAC and the IP of Server-4, it will generate a BGP update of type RT2 which contains MAC and IP of Server-4. On Leaf-2 we can now see this route:
Leaf2# show bgp l2vpn evpn 198.51.100.44
Route Distinguisher: 192.0.2.6:32777
BGP routing table entry for [2]:[0]:[0]:[48]:[0050.56ad.7d68]:[32]:[198.51.100.44]/272, version 6444
Paths: (2 available, best #2)
Flags: (0x000202) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW

  Path type: internal, path is valid, not best reason: Neighbor Address, no labeled nexthop
  AS-Path: NONE, path sourced internal to AS
    203.0.113.4 (metric 81) from 192.0.2.12 (192.0.2.2)
      Origin IGP, MED not set, localpref 100, weight 0
      Received label 10000 10001
      Extcommunity: RT:65000:10000 RT:65000:10001 ENCAP:8 Router MAC:00ad.7083.1b08
      Originator: 192.0.2.6 Cluster list: 192.0.2.2

  Advertised path-id 1
  Path type: internal, path is valid, is best path, no labeled nexthop
             Imported to 3 destination(s)
             Imported paths list: Tenant1 L3-10001 L2-10000
  AS-Path: NONE, path sourced internal to AS
    203.0.113.4 (metric 81) from 192.0.2.11 (192.0.2.1)
      Origin IGP, MED not set, localpref 100, weight 0
      Received label 10000 10001
      Extcommunity: RT:65000:10000 RT:65000:10001 ENCAP:8 Router MAC:00ad.7083.1b08
      Originator: 192.0.2.6 Cluster list: 192.0.2.1

  Path-id 1 not advertised to any peer
This means that Leaf-2, and the other switches, can now import this route into the VRF:
Leaf2# show ip route 198.51.100.44 vrf Tenant1
IP Route Table for VRF "Tenant1"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

198.51.100.44/32, ubest/mbest: 1/0
    *via 203.0.113.4%default, [200/0], 05:49:47, bgp-65000, internal, tag 65000, segid: 10001 tunnelid: 0xcb007104 encap: VXLAN
The next time traffic needs to be sent towards 198.51.100.44, it can be sent directly to Leaf-4. To summarize:
- RT5 is decoupled from MAC.
- It is mainly used for external prefixes and handling of silent/shy hosts.
- Traffic towards a silent/shy host that follows the RT5 could land on a leaf where the host is not connected.
- When a leaf learns about the host, it sends a BGP update with an RT2 containing the host’s MAC and IP.
- Leafs can then install the /32 for more optimal forwarding.
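The reason the /32 gives more optimal forwarding is plain longest prefix match: once the host route from the RT2 is installed next to the /24 from the RT5, the more specific entry wins for that one host. A minimal Python sketch using the routes from this post (the lookup function and the dictionary-based table are my own simplification of a route table):

```python
import ipaddress


def lookup(rib: dict, destination: str):
    """Return (prefix, next hops) for the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [
        ipaddress.ip_network(prefix)
        for prefix in rib
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return str(best), rib[str(best)]


# Tenant1 on Leaf-2 after Leaf-4 has advertised the RT2 for Server-4.
rib = {
    "198.51.100.0/24": ["203.0.113.1", "203.0.113.4"],  # from RT5, ECMP
    "198.51.100.44/32": ["203.0.113.4"],                # from RT2
}

print(lookup(rib, "198.51.100.44"))  # ('198.51.100.44/32', ['203.0.113.4'])
print(lookup(rib, "198.51.100.45"))  # ('198.51.100.0/24', ['203.0.113.1', '203.0.113.4'])
```

A host in the subnet that is still silent, such as the made-up 198.51.100.45 above, keeps following the RT5 until its leaf learns it and advertises an RT2.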
I hope you enjoyed this deepdive! See you in the next one.
Hello,
Will you be discussing the GW IP address in EVPN Type 5 in your next post?
Regards,
Nabeel
Hello Daniel!
Absolutely majestic series! Very good for getting into and learning about VXLAN and the benefits of using alongside EVPN.
I have a request if you don’t mind, this series is about NX-os (obviously DC is the main usage) but what about Catalyst 9000 series? It differs quite a bit but is still supported and functional.
If you by any chance could take up a summary on some setups in catalyst switches and use cases that would be awesome!
//Cristian
Hi Cristian 🙂
I hope to cover Catalyst 9000 in the future. I’m not sure what the virtual platform supports, though.
Hi Daniel,
I’d like to echo Cristian’s comment above.
The documentation I’ve seen so far related to BGP EVPN mostly discusses DC environments where you have a leaf-spine topology. Even though there’s been support for BGP EVPN in the Catalyst 9000 switches, I haven’t really seen any detailed use cases or examples for Campus use (where, in my experience, the network topology is not so strict and stable).
Also, as far as I’m aware, Cat9Kv doesn’t yet support it (hopefully new releases will come this year that we can use for testing and labbing!).
I’m hoping to find out more about it at Cisco Live in Amsterdam.
Hi Daniel,
You wrote “Leaf-2 will now have two routes available, one via Leaf-1 and one via Leaf-4”. Can you please explain why Leaf-1 can also provide route to 198.51.100.44/24 this type-5 route?
(Please take the following with a grain of salt, as I’m still learning EVPN) This is not spelled directly in the post, but both Leaf-1 and Leaf-4 have the relevant L3 VNI and SVI configured. Therefore, both have directly connected routes to 198.51.100.44/24. Since we are redistributing directly connected routes into BGP, both will advertise it and Leaf-2 will receive both routes.
Hi Daniel, excellent post. I have one doubt, you said the following : There are some interesting things to note here. In the outer Ethernet header (blue) the source MAC is the MAC of outgoing interface of Leaf-2 towards Spine-1 and destination MAC of Spine-1’s interface. For the inner Ethernet header (gray), the source MAC is the Router MAC that is used to forward to the correct SVI. We can see the router MACs with the show nve peers command:
But in the Figure above this explanation, at the Leaf2, both inner and outer ethernet header are the same, while in the explanation you said: “In the outer Ethernet header (blue) the source MAC is the MAC of outgoing interface of Leaf-2 towards Spine-1 and destination MAC of Spine-1’s interface. For the inner Ethernet header (gray), the source MAC is the Router MAC that is used to forward to the correct SVI”
Can you explain me why both inner (S-MAC: 00ad.f3bb.1b08, D-MAC: 00ad.7083.1b08) and outer ethernet header (S-MAC: 00ad.f3bb.1b08, D-MAC: 00ad.7083.1b08) are shown the same ?
Hello my friend!
You have a great sense for details. Turns out I had a mistake in this image. Probably some copy/paste mistake. I have corrected it now.
Thank you!
Hello Daniel,
Just one clarification, if you’ll permit me. When you said
“For the inner Ethernet header (gray), the source MAC is the Router MAC that is used to forward to the correct SVI. We can see the router MACs with the show nve peers command:”
I think you should write “the destination MAC is the Router MAC….”, not “the source MAC”.
The source MAC of both the L2 outer and L2 inner is always the MAC of the outgoing interface towards Spine-1. The packet capture confirms this.
Best regards
Nice catch! Thanks!
Re-Hello Daniel, Thanks a lot for your atypical article.
By the way, as I wrote in my previous comment and based on my lab: the SOURCE MAC of both the L2 OUTER and L2 INNER is always the MAC of the outgoing interface towards Spine-1. The question I couldn’t find the answer to is why the leaf selects the MAC address of the physical outgoing interface as the source for both (inner and outer L2) instead of its local ROUTER MAC. Do you have any idea?