In a previous post, EVPN Deepdive Route Types 2 and 3, I covered route types 2 and 3. In this post I’ll cover route type 5 which is used for advertising IP prefixes. This route type is covered in RFC 9136.

There are two main use cases for advertising IP prefixes in EVPN route type 5:

  • Advertising external prefixes into the VXLAN network.
  • Advertising prefixes for connectivity towards silent hosts.

The first scenario is pretty obvious. There are other places in the network, such as remote offices via a WAN, partners and external parties, as well as the internet. Routing towards these destinations requires a route type that can carry prefixes, and this is route type 5. Remember, route type 2 only provides host routing, which poses the following problems for external connectivity:

  • Advertising everything as /32 and /128 would be highly inefficient.
  • An RT2 must be generated by an EVPN speaker, but the external prefixes are originated by non-EVPN speakers.
  • It would not be possible to advertise a default route.
  • Without RT5, external connectivity would have to be advertised by a protocol other than EVPN.

The last bullet is worth expanding on. If the external prefixes aren’t advertised by EVPN through RT5, another protocol is needed. This protocol would probably still be BGP, but you would then need to either use VRF-lite, which requires a lot of configuration, or establish a VPNv4 topology. With VPNv4, you would have to configure and manage both VXLAN and MPLS in the same network, which adds a lot of overhead.

RFC 9136 brings up another interesting reason why RT5 is needed. RT2 has a tight coupling between MAC and IP address, as this route type caters to end hosts. However, there may be other devices in the network, such as load balancers, firewalls, and other appliances, that have many interfaces and may share an IP in a failover pair using, for example, VRRP or a proprietary protocol. With a tight coupling of MAC and IP, there could be a lot of route churn as all the RT2 routes are updated when the owner of the floating IP changes. RT5 decouples MAC from IP and only contains the prefix.

Before diving into the silent host scenario, let’s take a look at the structure of RT5:

The main differences from RT2 are:

  • MAC is not advertised.
  • The IP prefix length is variable.
  • Only L3 VNI and no L2 VNI.
  • Contains GW IP address.

The GW IP is an interesting part of RT5 that we will get back to in a future post, but it is used as an overlay index, the term RFC 9136 uses for this field.
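
To make the structure more concrete, here is a minimal Python sketch that packs the fields of an IPv4 RT5 in the order RFC 9136 lists them: RD, ESI, Ethernet Tag, prefix length, prefix, GW IP, and label. The class name IpPrefixRoute and its field names are my own for illustration, and the sketch assumes a type 1 (IP:number) RD and VXLAN encapsulation where the 24-bit VNI fills the 3-byte label field. It is not how NX-OS builds the route.

```python
import ipaddress
import struct
from dataclasses import dataclass


@dataclass
class IpPrefixRoute:
    """Fields of an IPv4 EVPN IP Prefix route (RT5). Names are illustrative."""
    rd_admin: str            # IP part of a type 1 RD, e.g. "192.0.2.3"
    rd_assigned: int         # number part of the RD, e.g. 3
    prefix: str              # e.g. "198.51.100.0/24"
    gw_ip: str = "0.0.0.0"   # all zeros when no gateway IP is used
    vni: int = 0             # assumption: VNI carried in the 3-byte label field
    eth_tag: int = 0
    esi: bytes = bytes(10)   # all-zero Ethernet Segment Identifier

    def encode(self) -> bytes:
        net = ipaddress.ip_network(self.prefix)
        rd = struct.pack("!H4sH", 1,
                         ipaddress.ip_address(self.rd_admin).packed,
                         self.rd_assigned)
        body = (
            rd                                          # 8 bytes
            + self.esi                                  # 10 bytes
            + struct.pack("!I", self.eth_tag)           # 4 bytes
            + struct.pack("!B", net.prefixlen)          # 1 byte, variable length
            + net.network_address.packed                # 4 bytes for IPv4
            + ipaddress.ip_address(self.gw_ip).packed   # 4 bytes for IPv4
            + self.vni.to_bytes(3, "big")               # 3 bytes
        )
        # Route type 5 followed by the body length in bytes.
        return struct.pack("!BB", 5, len(body)) + body


route = IpPrefixRoute("192.0.2.3", 3, "198.51.100.0/24", vni=10001)
nlri = route.encode()
print(len(nlri), nlri.hex())
```

Note that there is no MAC field anywhere in the body, and that the IPv4 body comes out at 34 bytes, which matches the length given in RFC 9136.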

Now, I said that RT5 is used with silent hosts. There are two main scenarios with silent hosts:

  • Hosts are on the same L2 VNI and traffic is bridged.
  • Hosts are in different L2 VNIs and traffic is routed.

For the scenario where hosts are in the same L2 VNI, if a host does not know the MAC of the other host, it can simply ARP for it, and the leaf switch floods the request using whatever method it uses for BUM traffic, such as ingress replication or multicast in the underlay. There is no need for RT5 in this scenario.

For a scenario where hosts are in different L2 VNIs, there are actually two scenarios:

  • Both L2 VNIs are on both leaf switches.
  • The L2 VNIs are not on both leaf switches.

For the first scenario, the leaf switch can ARP for the destination host. For the second scenario, this is where RT5 comes into play. The following topology will be used:

This is the current route table of Leaf-2:

Leaf2# show ip route vrf Tenant1
IP Route Table for VRF "Tenant1"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

10.0.0.0/24, ubest/mbest: 1/0, attached
    *via 10.0.0.1, Vlan20, [0/0], 1w3d, direct
10.0.0.1/32, ubest/mbest: 1/0, attached
    *via 10.0.0.1, Vlan20, [0/0], 1w3d, local
10.0.0.22/32, ubest/mbest: 1/0, attached
    *via 10.0.0.22, Vlan20, [190/0], 00:10:23, hmm
198.51.100.11/32, ubest/mbest: 1/0
    *via 203.0.113.1%default, [200/0], 1w4d, bgp-65000, internal, tag 65000, segid: 10001 tunnelid: 0xcb007101 encap: VXLAN

Note that there is no entry for Server-4, but there is a host entry for Server-1 (198.51.100.11). Before diving into our scenario, there is one thing that we need to do. NX-OS does not advertise any RT5 routes until we configure it to. Let’s do this on Leaf-1 and Leaf-4 by configuring redistribute direct under the VRF:

Leaf1(config)# route-map PERMIT_ALL permit 10
Leaf1(config-route-map)# router bgp 65000
Leaf1(config-router)# vrf Tenant1
Leaf1(config-router-vrf)# address-family ipv4 unicast
Leaf1(config-router-vrf-af)# redistribute direct route-map PERMIT_ALL

Leaf4(config)# route-map PERMIT_ALL permit 10
Leaf4(config-route-map)# router bgp 65000
Leaf4(config-router)# vrf Tenant1
Leaf4(config-router-vrf)# address-family ipv4 unicast
Leaf4(config-router-vrf-af)# redistribute direct route-map PERMIT_ALL

Note that a route-map must be used. Either allow all as above or be more selective. It’s possible to tag prefixes and match on that in the route-map.

Leaf-2 will now have two routes available, one via Leaf-1 and one via Leaf-4:

Leaf2# show bgp l2vpn evpn route-type 5
BGP routing table information for VRF default, address family L2VPN EVPN
Route Distinguisher: 192.0.2.3:3
BGP routing table entry for [5]:[0]:[0]:[24]:[198.51.100.0]/224, version 6421
Paths: (2 available, best #2)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW

  Path type: internal, path is valid, not best reason: Neighbor Address, no labeled nexthop
  Gateway IP: 0.0.0.0
  AS-Path: NONE, path sourced internal to AS
    203.0.113.1 (metric 81) from 192.0.2.12 (192.0.2.2)
      Origin incomplete, MED 0, localpref 100, weight 0
      Received label 10001
      Extcommunity: RT:65000:10001 ENCAP:8 Router MAC:00ad.e688.1b08
      Originator: 192.0.2.3 Cluster list: 192.0.2.2 

  Advertised path-id 1
  Path type: internal, path is valid, is best path, no labeled nexthop
             Imported to 2 destination(s)
             Imported paths list: Tenant1 L3-10001
  Gateway IP: 0.0.0.0
  AS-Path: NONE, path sourced internal to AS
    203.0.113.1 (metric 81) from 192.0.2.11 (192.0.2.1)
      Origin incomplete, MED 0, localpref 100, weight 0
      Received label 10001
      Extcommunity: RT:65000:10001 ENCAP:8 Router MAC:00ad.e688.1b08
      Originator: 192.0.2.3 Cluster list: 192.0.2.1 

  Path-id 1 not advertised to any peer

Route Distinguisher: 192.0.2.6:3
BGP routing table entry for [5]:[0]:[0]:[24]:[198.51.100.0]/224, version 6423
Paths: (2 available, best #2)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW

  Path type: internal, path is valid, not best reason: Neighbor Address, no labeled nexthop
  Gateway IP: 0.0.0.0
  AS-Path: NONE, path sourced internal to AS
    203.0.113.4 (metric 81) from 192.0.2.12 (192.0.2.2)
      Origin incomplete, MED 0, localpref 100, weight 0
      Received label 10001
      Extcommunity: RT:65000:10001 ENCAP:8 Router MAC:00ad.7083.1b08
      Originator: 192.0.2.6 Cluster list: 192.0.2.2 

  Advertised path-id 1
  Path type: internal, path is valid, is best path, no labeled nexthop
             Imported to 2 destination(s)
             Imported paths list: Tenant1 L3-10001
  Gateway IP: 0.0.0.0
  AS-Path: NONE, path sourced internal to AS
    203.0.113.4 (metric 81) from 192.0.2.11 (192.0.2.1)
      Origin incomplete, MED 0, localpref 100, weight 0
      Received label 10001
      Extcommunity: RT:65000:10001 ENCAP:8 Router MAC:00ad.7083.1b08
      Originator: 192.0.2.6 Cluster list: 192.0.2.1 

  Path-id 1 not advertised to any peer

Because there are two RRs, Leaf-2 receives each route from both of them. For each originating leaf (Leaf-1 and Leaf-4), it first selects the copy from one of the RRs based on the neighbor address. Both routes are then imported under the local RD for the L3 VNI, but only one of them is selected as best, as we have not configured BGP for ECMP:

Route Distinguisher: 192.0.2.4:3    (L3VNI 10001)
BGP routing table entry for [5]:[0]:[0]:[24]:[198.51.100.0]/224, version 6424
Paths: (2 available, best #2)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW

  Path type: internal, path is valid, not best reason: Router Id, no labeled nexthop
             Imported from 192.0.2.6:3:[5]:[0]:[0]:[24]:[198.51.100.0]/224 
  Gateway IP: 0.0.0.0
  AS-Path: NONE, path sourced internal to AS
    203.0.113.4 (metric 81) from 192.0.2.11 (192.0.2.1)
      Origin incomplete, MED 0, localpref 100, weight 0
      Received label 10001
      Extcommunity: RT:65000:10001 ENCAP:8 Router MAC:00ad.7083.1b08
      Originator: 192.0.2.6 Cluster list: 192.0.2.1 

  Advertised path-id 1
  Path type: internal, path is valid, is best path, no labeled nexthop
             Imported from 192.0.2.3:3:[5]:[0]:[0]:[24]:[198.51.100.0]/224 
  Gateway IP: 0.0.0.0
  AS-Path: NONE, path sourced internal to AS
    203.0.113.1 (metric 81) from 192.0.2.11 (192.0.2.1)
      Origin incomplete, MED 0, localpref 100, weight 0
      Received label 10001
      Extcommunity: RT:65000:10001 ENCAP:8 Router MAC:00ad.e688.1b08
      Originator: 192.0.2.3 Cluster list: 192.0.2.1 

  Path-id 1 not advertised to any peer

The route from Leaf-1 is currently selected as best based on Router ID. We can see it in the routing table:

Leaf2# show ip route 198.51.100.0/24 vrf Tenant1
IP Route Table for VRF "Tenant1"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

198.51.100.0/24, ubest/mbest: 1/0
    *via 203.0.113.1%default, [200/0], 00:27:05, bgp-65000, internal, tag 65000, segid: 10001 tunnelid: 0xcb007101 encap: VXLAN
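
The two "not best reason" values in the BGP output can be reproduced with a small sketch. It assumes that all earlier best-path steps (weight, local preference, AS path, origin, MED, IGP metric) are equal, as they are in the output above, and only models the final tie-breakers for reflected routes: lowest originator ID (used in place of the router ID), shortest cluster list, then lowest neighbor address. The Path class and function names are made up for illustration and this is not the full NX-OS algorithm.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    next_hop: str          # VTEP address of the originating leaf
    originator_id: str     # stands in for the router ID on reflected routes
    cluster_list_len: int  # number of RRs the route passed through
    neighbor: str          # address of the RR the path was learned from


def tie_break_key(path: Path):
    """Lower tuple wins: originator ID, cluster list length, neighbor address."""
    return (
        int(ipaddress.ip_address(path.originator_id)),
        path.cluster_list_len,
        int(ipaddress.ip_address(path.neighbor)),
    )


def best_path(paths):
    return min(paths, key=tie_break_key)


# Same route from Leaf-1, received via both RRs: decided on neighbor address.
from_rr2 = Path("203.0.113.1", "192.0.2.3", 1, "192.0.2.12")
from_rr1 = Path("203.0.113.1", "192.0.2.3", 1, "192.0.2.11")
per_rd_best = best_path([from_rr2, from_rr1])

# Imported routes from Leaf-4 and Leaf-1: decided on router ID (originator).
via_leaf4 = Path("203.0.113.4", "192.0.2.6", 1, "192.0.2.11")
via_leaf1 = Path("203.0.113.1", "192.0.2.3", 1, "192.0.2.11")
vrf_best = best_path([via_leaf4, via_leaf1])

print(per_rd_best.neighbor, vrf_best.next_hop)  # 192.0.2.11 203.0.113.1
```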

What if we wanted to enable ECMP for this route? We would then configure maximum paths under the VRF in BGP:

Leaf2(config)# router bgp 65000
Leaf2(config-router)# vrf Tenant1
Leaf2(config-router-vrf)# address-family ipv4 unicast
Leaf2(config-router-vrf-af)# maximum-paths ibgp 2

Both next hops are now installed for the route:

Leaf2# show ip route 198.51.100.0/24 vrf Tenant1
IP Route Table for VRF "Tenant1"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

198.51.100.0/24, ubest/mbest: 2/0
    *via 203.0.113.1%default, [200/0], 00:00:03, bgp-65000, internal, tag 65000, segid: 10001 tunnelid: 0xcb007101 encap: VXLAN
    *via 203.0.113.4%default, [200/0], 00:00:03, bgp-65000, internal, tag 65000, segid: 10001 tunnelid: 0xcb007104 encap: VXLAN
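
With two next hops installed, the switch has to pick one of them per flow. The sketch below shows the general idea, assuming a hash over the flow's addresses, protocol, and ports. The real hash is platform and hardware specific, so the function name, the choice of SHA-256, and the tuple layout are all assumptions made for illustration.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Map a flow tuple to one next hop, the same one every time."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


next_hops = ["203.0.113.1", "203.0.113.4"]

# (source IP, destination IP, protocol, source port, destination port)
# ICMP is protocol 1 and has no ports, so zeros are used here.
flow = ("10.0.0.22", "198.51.100.44", 1, 0, 0)

chosen = pick_next_hop(flow, next_hops)
print(chosen)
```

The important property is that all packets of one flow take the same path, while different flows may be spread over both leafs.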

Now let’s get back to our scenario. Server-2 is going to ping Server-4. Server-2 has the MAC of Leaf-2 in its ARP cache so there is no need to ARP for the MAC of the gateway:

server2:~$ ip neighbor | grep 10.0.0.1
10.0.0.1 dev ens192 lladdr 00:01:00:01:00:01 STALE

Server-2 sends the ICMP Echo:

When the packet arrives at Leaf-2, it has the destination MAC of the SVI for VLAN 20, so the packet is consumed. Leaf-2 will do a lookup in its route table for 198.51.100.44:

Leaf2# show ip route 198.51.100.0/24 vrf Tenant1
IP Route Table for VRF "Tenant1"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

198.51.100.0/24, ubest/mbest: 2/0
    *via 203.0.113.1%default, [200/0], 00:00:03, bgp-65000, internal, tag 65000, segid: 10001 tunnelid: 0xcb007101 encap: VXLAN
    *via 203.0.113.4%default, [200/0], 00:00:03, bgp-65000, internal, tag 65000, segid: 10001 tunnelid: 0xcb007104 encap: VXLAN

This is where RT5 comes into play. Leaf-2 can now send the packet to either Leaf-1 or Leaf-4 (where Server-4 is), depending on the result of the ECMP hash. In this case Leaf-4 was selected. This is shown below:

There are some interesting things to note here. In the outer Ethernet header (blue), the source MAC is the MAC of the outgoing interface of Leaf-2 towards Spine-1 and the destination MAC is that of Spine-1’s interface. In the inner Ethernet header (gray), the source MAC is the router MAC of Leaf-2 and the destination MAC is the router MAC of Leaf-4 (00ad.7083.1b08), which Leaf-4 advertised in the Router MAC extended community and which is used to forward to the correct SVI. We can see the router MACs with the show nve peers command:

Leaf2# show nve peers
Interface Peer-IP                                 State LearnType Uptime   Router-Mac       
--------- --------------------------------------  ----- --------- -------- -----------------
nve1      203.0.113.1                             Up    CP        1w6d     00ad.e688.1b08   
nve1      203.0.113.3                             Up    CP        1w6d     n/a              
nve1      203.0.113.4                             Up    CP        1w6d     00ad.7083.1b08

The packet as seen in Wireshark:

Frame 42: 148 bytes on wire (1184 bits), 148 bytes captured (1184 bits) on interface ens161, id 0
Ethernet II, Src: 00:ad:b3:fd:1b:08, Dst: 00:ad:70:83:1b:08
Internet Protocol Version 4, Src: 203.0.113.2, Dst: 203.0.113.4
User Datagram Protocol, Src Port: 64182, Dst Port: 4789
Virtual eXtensible Local Area Network
    Flags: 0x0800, VXLAN Network ID (VNI)
    Group Policy ID: 0
    VXLAN Network Identifier (VNI): 10001
    Reserved: 0
Ethernet II, Src: 00:ad:f3:bb:1b:08, Dst: 00:ad:70:83:1b:08
Internet Protocol Version 4, Src: 10.0.0.22, Dst: 198.51.100.44
Internet Control Message Protocol
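
The VXLAN header in the capture is only 8 bytes. As a rough illustration, the sketch below builds and parses it following the RFC 7348 layout: 8 flag bits with the VNI-valid bit set, 24 reserved bits, the 24-bit VNI, and 8 more reserved bits. The function names are my own.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag from RFC 7348


def build_vxlan_header(vni: int) -> bytes:
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, then 24 reserved bits.
    # Word 2: VNI in the top 24 bits, then 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vxlan_header(header: bytes) -> int:
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag is not set")
    return word2 >> 8


header = build_vxlan_header(10001)
print(header.hex(), parse_vxlan_header(header))
```

Wireshark shows the first 16 bits as Flags: 0x0800, which is the same 0x08 flag byte followed by a zero byte.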

When the packet arrives at Leaf-4, it is consumed by the SVI for VLAN 100. The packet is routed, and as 198.51.100.0/24 is available locally, the packet could be forwarded to 198.51.100.44, but there is nothing in the ARP cache. Leaf-4 must ARP for 198.51.100.44. This frame can be seen below:

The frame as seen in Wireshark:

Frame 60: 60 bytes on wire (480 bits), 60 bytes captured (480 bits) on interface ens194, id 8
Ethernet II, Src: 00:01:00:01:00:01, Dst: ff:ff:ff:ff:ff:ff
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: 00:01:00:01:00:01
    Sender IP address: 198.51.100.1
    Target MAC address: ff:ff:ff:ff:ff:ff
    Target IP address: 198.51.100.44
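
For reference, the ARP payload above is a fixed 28-byte structure. The following sketch packs the same fields in the same order as the capture. The helper names are hypothetical, and the target MAC is passed in explicitly because the capture shows ff:ff:ff:ff:ff:ff there, while many other stacks send all zeros.

```python
import socket
import struct


def mac_to_bytes(mac: str) -> bytes:
    return bytes.fromhex(mac.replace(":", ""))


def build_arp_request(sender_mac, sender_ip, target_ip,
                      target_mac="00:00:00:00:00:00") -> bytes:
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,        # hardware type: Ethernet
        0x0800,   # protocol type: IPv4
        6,        # hardware size
        4,        # protocol size
        1,        # opcode: request
        mac_to_bytes(sender_mac), socket.inet_aton(sender_ip),
        mac_to_bytes(target_mac), socket.inet_aton(target_ip),
    )


arp = build_arp_request("00:01:00:01:00:01", "198.51.100.1",
                        "198.51.100.44", target_mac="ff:ff:ff:ff:ff:ff")
print(len(arp), arp.hex())
```

The 14-byte Ethernet header plus these 28 bytes is 42 bytes, so the frame is padded to the 60 bytes that Wireshark reports.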

Note that this frame does not only get flooded locally. It will also get flooded to all the members of L2 VNI 10000 via ingress replication. This packet is sent to all the NVE peers for that L2 VNI, 203.0.113.1, 203.0.113.2, and 203.0.113.3 in my lab:

It’s interesting to note that two of the packets had the destination MAC of Spine-1 (00:ad:b3:fd:1b:08) and the third had the destination MAC of Spine-2 (00:ad:7b:30:1b:08). This is due to ECMP towards the NVE IP addresses. Note that all packets had the same source MAC of 00:ad:70:83:1b:08. This is due to the IP unnumbered setup in my lab.
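
Ingress replication itself is conceptually simple: the leaf makes one unicast VXLAN copy of the BUM frame per remote VTEP in the flood list of the L2 VNI. The sketch below models only that step, using the peers from my lab. The function name and the dictionary layout are assumptions for illustration and do not reflect how the switch stores this state.

```python
def ingress_replicate(frame: bytes, vni: int, flood_lists: dict, local_vtep: str):
    """Return one encapsulation descriptor per remote VTEP for this VNI."""
    return [
        {"outer_src": local_vtep, "outer_dst": peer, "vni": vni, "inner": frame}
        for peer in flood_lists.get(vni, [])
    ]


# Flood list for L2 VNI 10000 as seen from Leaf-4 (203.0.113.4).
flood_lists = {10000: ["203.0.113.1", "203.0.113.2", "203.0.113.3"]}

copies = ingress_replicate(b"arp-request", 10000, flood_lists, "203.0.113.4")
for copy in copies:
    print(copy["outer_src"], "->", copy["outer_dst"], "VNI", copy["vni"])
```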

Server-4 will then send a response to the ARP request:

The frame as seen in Wireshark:

Frame 61: 60 bytes on wire (480 bits), 60 bytes captured (480 bits) on interface ens194, id 8
Ethernet II, Src: 00:50:56:ad:7d:68, Dst: 00:01:00:01:00:01
Address Resolution Protocol (reply)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: reply (2)
    Sender MAC address: 00:50:56:ad:7d:68
    Sender IP address: 198.51.100.44
    Target MAC address: 00:01:00:01:00:01
    Target IP address: 198.51.100.1

The ARP cache of Leaf-4 is now populated with the MAC of Server-4:

Leaf4# show ip arp vrf Tenant1

Flags: * - Adjacencies learnt on non-active FHRP router
       + - Adjacencies synced via CFSoE
       # - Adjacencies Throttled for Glean
       CP - Added via L2RIB, Control plane Adjacencies
       PS - Added via L2RIB, Peer Sync
       RO - Re-Originated Peer Sync Entry
       D - Static Adjacencies attached to down interface

IP ARP Table for context Tenant1
Total number of entries: 1
Address         Age       MAC Address     Interface       Flags
198.51.100.44   00:06:48  0050.56ad.7d68  Vlan10           

This means that the ICMP Echo can be forwarded to Server-4:

Packet as seen by Wireshark:

Frame 91: 98 bytes on wire (784 bits), 98 bytes captured (784 bits) on interface ens194, id 8
Ethernet II, Src: 00:ad:70:83:1b:08, Dst: 00:50:56:ad:7d:68
Internet Protocol Version 4, Src: 10.0.0.22, Dst: 198.51.100.44
Internet Control Message Protocol

From here on forwarding will take place as usual. It’s interesting to note though that as Leaf-4 now knows both the MAC and the IP of Server-4, it will generate a BGP update of type RT2 which contains MAC and IP of Server-4. On Leaf-2 we can now see this route:

Leaf2# show bgp l2vpn evpn 198.51.100.44
Route Distinguisher: 192.0.2.6:32777
BGP routing table entry for [2]:[0]:[0]:[48]:[0050.56ad.7d68]:[32]:[198.51.100.44]/272, version 6444
Paths: (2 available, best #2)
Flags: (0x000202) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW

  Path type: internal, path is valid, not best reason: Neighbor Address, no labeled nexthop
  AS-Path: NONE, path sourced internal to AS
    203.0.113.4 (metric 81) from 192.0.2.12 (192.0.2.2)
      Origin IGP, MED not set, localpref 100, weight 0
      Received label 10000 10001
      Extcommunity: RT:65000:10000 RT:65000:10001 ENCAP:8 Router MAC:00ad.7083.1b08
      Originator: 192.0.2.6 Cluster list: 192.0.2.2 

  Advertised path-id 1
  Path type: internal, path is valid, is best path, no labeled nexthop
             Imported to 3 destination(s)
             Imported paths list: Tenant1 L3-10001 L2-10000
  AS-Path: NONE, path sourced internal to AS
    203.0.113.4 (metric 81) from 192.0.2.11 (192.0.2.1)
      Origin IGP, MED not set, localpref 100, weight 0
      Received label 10000 10001
      Extcommunity: RT:65000:10000 RT:65000:10001 ENCAP:8 Router MAC:00ad.7083.1b08
      Originator: 192.0.2.6 Cluster list: 192.0.2.1 

  Path-id 1 not advertised to any peer

This means that Leaf-2, and the other switches, can now import this route into the VRF:

Leaf2# show ip route 198.51.100.44 vrf Tenant1
IP Route Table for VRF "Tenant1"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

198.51.100.44/32, ubest/mbest: 1/0
    *via 203.0.113.4%default, [200/0], 05:49:47, bgp-65000, internal, tag 65000, segid: 10001 tunnelid: 0xcb007104 encap: VXLAN

The next time traffic needs to be sent towards 198.51.100.44, it can be sent directly to Leaf-4. To summarize:

  • RT5 is decoupled from MAC.
  • It is mainly used for external prefixes and handling of silent/shy hosts.
  • Traffic towards a silent/shy host could go to a leaf where the host is not connected when following the RT5.
  • When the leaf learns about the host, it will send a BGP update with an RT2 containing the MAC and IP.
  • Leafs can then install the /32 for more optimal forwarding.
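
The last two bullets come down to longest-prefix match. As a minimal sketch, assuming a route table modeled as a plain dictionary (the lookup function is hypothetical), the /24 learned via RT5 is used until the /32 learned via RT2 shows up, at which point the more specific route wins:

```python
import ipaddress


def lookup(rib: dict, destination: str):
    """Return the next hops of the most specific matching prefix, if any."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), next_hops)
        for prefix, next_hops in rib.items()
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Only the RT5 prefix is known: traffic may land on either leaf.
rib = {"198.51.100.0/24": ["203.0.113.1", "203.0.113.4"]}
before = lookup(rib, "198.51.100.44")

# Leaf-4 has learned Server-4 and advertised an RT2 for it.
rib["198.51.100.44/32"] = ["203.0.113.4"]
after = lookup(rib, "198.51.100.44")

print(before, after)
```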

I hope you enjoyed this deepdive! See you in the next one.

4 thoughts on “EVPN Route Type 5”

  • January 24, 2024 at 10:58 pm

    Hello ,

    Will you be discussing GW IP Address in EVPN Type in your next post.

    Regards,
    Nabeel

  • January 25, 2024 at 9:09 pm

    Hello Daniel!
    Absolutely majestic series! Very good for getting into and learning about VXLAN and the benefits of using alongside EVPN.

    I have a request if you don’t mind, this series is about NX-os (obviously DC is the main usage) but what about Catalyst 9000 series? It differs quite a bit but is still supported and functional.

    If you by any chance could take up a summary on some setups in catalyst switches and use cases that would be awesome!

    //Cristian

    • January 26, 2024 at 2:39 pm

      Hi Cristian 🙂

      I hope to cover Catalyst 9000 in the future. I’m not sure what the virtual platform supports, though.

  • January 28, 2024 at 3:58 am

    Hi Daniel,
    I’d like to echo Cristian’s comment above.
    The documentation I’ve seen so far related to BGP EVPN mostly discusses DC environments where you have a leaf-spine topology. Even though there’s been support for BGP EVPN in the Catalyst 9000 switches, I haven’t really seen any detailed use cases or examples for Campus use (where, in my experience, the network topology is not so strict and stable).
    Also, as far as I’m aware, Cat9Kv doesn’t yet support it (hopefully new releases will come this year that we can use for testing and labbing!).
    I’m hoping to find out more about it at Cisco Live in Amsterdam.

