It is a common design to have an internet Edge router connected to two different internet service providers so that the failure of a single ISP does not bring the office down. The topology may look something like this:

Internet Edge HA scenario

The two ISPs are used in an active/standby fashion using static routes. This is normally implemented by using two default routes where one of the routes is a floating static route. It will look something like this:

ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY
ip route 0.0.0.0 0.0.0.0 203.0.113.9 200 name SECONDARY

With this configuration, if the interface to ISP1 goes down, the floating static route which has an administrative distance (AD) of 200 will be installed and traffic will flow via ISP2. The drawback to this configuration is that it only works if the physical interface goes down. What happens if ISP1’s CPE has the interface towards the customer up but the interface towards the ISP Core goes down? What happens if there is a failure in another part of the ISP’s network? What if all interfaces are up but they are having BGP issues in the network?

In scenarios like these, since the customer’s interface towards ISP1 is still up, traffic would flow to ISP1 but would never reach its final destination. The traffic is then blackholed. To prevent failures like these, the IP SLA feature can be used to track something of importance, such as a service provided by the ISP or one outside of the ISP’s network, so that the static route is only installed while the service is available. How do we select what to track, though?
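
The mechanism can be sketched in a few lines. This is plain Python pseudologic, not anything that runs on the router, and the data structures are made up for illustration: each candidate default route has an administrative distance (AD), a route may be guarded by a tracked object, and the RIB installs the lowest-AD route that is currently eligible.

```python
# Illustrative sketch (not router code): pick the eligible candidate
# default route with the lowest administrative distance. A route guarded
# by a tracker is only eligible while its tracked object is up.

def install_default(routes, track_state):
    """routes: dicts with next_hop, ad, and an optional track id.
    track_state: maps track id -> True (up) / False (down)."""
    eligible = [
        r for r in routes
        if r.get("track") is None or track_state.get(r["track"], False)
    ]
    return min(eligible, key=lambda r: r["ad"])["next_hop"] if eligible else None

routes = [
    {"next_hop": "203.0.113.1", "ad": 1, "track": 1},       # PRIMARY, tracked
    {"next_hop": "203.0.113.9", "ad": 200, "track": None},  # SECONDARY, floating
]

print(install_default(routes, {1: True}))   # tracker up   -> 203.0.113.1
print(install_default(routes, {1: False}))  # tracker down -> 203.0.113.9
```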

Selecting what to track

What services are important to an ISP? They normally provide a resolver service. If the resolvers are down, people can’t browse the internet, so the resolver service is very important to the ISP. What else? The ISP has a web page where customers order products and interact with them, for example verizon.com. If that site is down, they lose money, so it will be an important service to them. Using Verizon as an example, let’s find out what IP addresses are interesting to track. We can do this using dig on a Linux host. First, let’s see what verizon.com resolves to:

dig verizon.com

; <<>> DiG 9.16.1-Ubuntu <<>> verizon.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52122
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;verizon.com.                   IN      A

;; ANSWER SECTION:
verizon.com.            600     IN      A       192.16.31.89

;; Query time: 40 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: sön aug 21 08:54:45 CEST 2022
;; MSG SIZE  rcvd: 56

The IP address of interest is 192.16.31.89. When it comes to resolvers, the addresses will most likely vary per region and service, but we can also check which name servers Verizon uses for the verizon.com domain. These are certain to be important as well:

daniel@devasc:~$ dig ns verizon.com

; <<>> DiG 9.16.1-Ubuntu <<>> ns verizon.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55703
;; flags: qr rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;verizon.com.                   IN      NS

;; ANSWER SECTION:
verizon.com.            3600    IN      NS      s1ns1.verizon.com.
verizon.com.            3600    IN      NS      ns2.edgecastdns.net.
verizon.com.            3600    IN      NS      s3ns3.verizon.com.
verizon.com.            3600    IN      NS      s2ns2.verizon.com.
verizon.com.            3600    IN      NS      ns1.edgecastdns.net.
verizon.com.            3600    IN      NS      s4ns4.verizon.com.
verizon.com.            3600    IN      NS      ns3.edgecastdns.net.
verizon.com.            3600    IN      NS      ns4.edgecastdns.net.

;; Query time: 624 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: sön aug 21 09:00:26 CEST 2022
;; MSG SIZE  rcvd: 207

There are several servers here. Note that Verizon seems to be using a third-party DNS service to provide resiliency for their name servers. Pick a server and check the IP:

dig s1ns1.verizon.com

; <<>> DiG 9.16.1-Ubuntu <<>> s1ns1.verizon.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12917
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;s1ns1.verizon.com.             IN      A

;; ANSWER SECTION:
s1ns1.verizon.com.      3573    IN      A       192.16.16.5

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: sön aug 21 09:00:52 CEST 2022
;; MSG SIZE  rcvd: 62

The IP address 192.16.16.5 is what we are looking for here. We now have two IP addresses we can track. Conceptually, it looks like this:

Internet Edge ISP services

This is a good first step. We have identified important services to the ISP. However, this is not the best way of tracking availability of your internet service. Why?

  • The availability of these services does not guarantee availability towards the greater internet
  • These services may not respond to ICMP Echo packets, or the responses may be rate limited

While we did manage to measure availability beyond the ISP’s CPE, we want to make sure that we can reach services that are not local to the ISP. This is usually where people start tracking something like 8.8.8.8, Google’s well-known resolver service. That looks like this:

Internet Edge track Google resolver

Compared to the previous scenario, the health of 8.8.8.8 should be more relevant as it’s not local to the ISP. However, as this is a resolver service, responding to ICMP Echo is not in the job description, meaning that ICMP may get rate limited. Let’s implement tracking of 8.8.8.8 and then describe some of the challenges/caveats.

Basic implementation and caveats

Let’s start with a standard implementation of IP SLA tracking. Here is the basic configuration:

interface GigabitEthernet1
 ip address 203.0.113.2 255.255.255.248
!
interface GigabitEthernet2
 ip address 203.0.113.10 255.255.255.248
!
ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 1
ip route 0.0.0.0 0.0.0.0 203.0.113.9 200 name SECONDARY
!
ip sla 1
 icmp-echo 8.8.8.8 source-ip 203.0.113.2
ip sla schedule 1 life forever start-time now
!
track 1 ip sla 1 reachability

GigabitEthernet1 is towards ISP1 and GigabitEthernet2 is towards ISP2. The following commands can be used to verify the IP SLA setup:

Edge#show ip sla sum
IPSLAs Latest Operation Summary
Codes: * active, ^ inactive, ~ pending
All Stats are in milliseconds. Stats with u are in microseconds

ID           Type        Destination       Stats       Return      Last
                                                       Code        Run 
-----------------------------------------------------------------------
*1           icmp-echo   8.8.8.8           RTT=8       OK          44 seconds ago

Edge#show ip sla statistics 
IPSLAs Latest Operation Statistics

IPSLA operation id: 1
        Latest RTT: 7 milliseconds
Latest operation start time: 07:10:40 UTC Mon Aug 22 2022
Latest operation return code: OK
Number of successes: 14
Number of failures: 0
Operation time to live: Forever

Edge#show ip route track-table 
 ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 1 state is [up]

Edge#show ip route 0.0.0.0
Routing entry for 0.0.0.0/0, supernet
  Known via "static", distance 1, metric 0, candidate default path
  Routing Descriptor Blocks:
  * 203.0.113.1
      Route metric is 0, traffic share count is 1

This is behaving as expected. The IP SLA is up. The tracker is up. The default route towards ISP1 is installed. Now, let’s simulate a failure of ISP1. I will implement this in the background with an ACL in my lab that filters the ICMP Echo packets. Let’s check the logs:

Aug 22 07:37:22.171: %TRACK-6-STATE: 1 ip sla 1 reachability Down -> Up
Aug 22 07:37:32.171: %TRACK-6-STATE: 1 ip sla 1 reachability Up -> Down
Aug 22 07:37:42.171: %TRACK-6-STATE: 1 ip sla 1 reachability Down -> Up
Aug 22 07:37:52.171: %TRACK-6-STATE: 1 ip sla 1 reachability Up -> Down
Aug 22 07:38:02.171: %TRACK-6-STATE: 1 ip sla 1 reachability Down -> Up

This doesn’t look too good. Why is the reachability flapping? Let’s check some of the routes:

Edge#show ip route 0.0.0.0
Routing entry for 0.0.0.0/0, supernet
  Known via "static", distance 200, metric 0, candidate default path
  Routing Descriptor Blocks:
  * 203.0.113.9
      Route metric is 0, traffic share count is 1

Edge#show ip cef 8.8.8.8                         
0.0.0.0/0
  nexthop 203.0.113.9 GigabitEthernet2

Edge#show ip cef exact-route 203.0.113.2 8.8.8.8 
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet2, addr 203.0.113.9

This looks as expected. The default route is now pointing towards 203.0.113.9. Notice what the next-hop is for 8.8.8.8, though… I’ll come back to this, but first let’s check the routing a few seconds later:

Edge#show ip route 0.0.0.0
Routing entry for 0.0.0.0/0, supernet
  Known via "static", distance 1, metric 0, candidate default path
  Routing Descriptor Blocks:
  * 203.0.113.1
      Route metric is 0, traffic share count is 1

Edge#show ip cef 8.8.8.8
0.0.0.0/0
  nexthop 203.0.113.1 GigabitEthernet1

Edge#show ip cef exact-route 203.0.113.2 8.8.8.8 
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet1, addr 203.0.113.1

The routing is flapping between 203.0.113.1 and 203.0.113.9. Why is this happening? Initially, the SLA packets flow through GigabitEthernet1. This path then fails, so the default route via Gi1 is removed and the SLA packets are sent towards GigabitEthernet2. When the SLA packets are sent towards Gi2, they succeed. Since the SLA is successful, the tracked default route gets installed again. And the process repeats…
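
The feedback loop is simple enough to simulate. The sketch below is purely illustrative Python (no router involved): the probe egresses via whichever default route is currently installed, and the tracker, and therefore the primary route, follows the probe result.

```python
# Sketch of the flap described above: the SLA probe follows the very
# default route it controls, so a filtered primary path oscillates.

def probe_succeeds(primary_installed, primary_filtered):
    """The probe egresses via the installed default route; it only
    fails when it goes out the (filtered) primary path."""
    return not (primary_installed and primary_filtered)

installed = True  # primary default route currently in the RIB
history = []
for _ in range(6):
    # Tracker state (and thus the primary route) follows the probe result.
    installed = probe_succeeds(installed, primary_filtered=True)
    history.append("primary" if installed else "secondary")

print(history)  # alternates every probe cycle
```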

To prevent this from happening, we must ensure that SLA packets only get sent towards Gi1. How can we do this? What happens if we set the source interface in our SLA configuration?

ip sla 1
 icmp-echo 8.8.8.8 source-interface GigabitEthernet1
  frequency 10
ip sla schedule 1 life forever start-time now

Unfortunately, the results are still the same. We still have flapping. When configuring the SLA, the help text does say ingress interface, not egress:

Edge(config-ip-sla)#icmp-echo 8.8.8.8 ?
  source-interface  Source Interface (ingress icmp packet interface)
  source-ip         Source Address

We can verify that packets are still egressing via GigabitEthernet2:

Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet2, addr 203.0.113.9

Unfortunately, we can’t specify the egress interface of the SLA packets. So what can we do? The options we have are:

  • Create a static route for the destination of the SLA packets
  • Use policy-based routing to force SLA packets out GigabitEthernet1

Let’s try using the static route approach first. A static route for 8.8.8.8 is added:

Edge(config)#ip route 8.8.8.8 255.255.255.255 203.0.113.1 name SLA

Packets to 8.8.8.8 should only be flowing via 203.0.113.1 now, right? Initially, this looks promising:

Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet1, addr 203.0.113.1
Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet1, addr 203.0.113.1
Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet1, addr 203.0.113.1
Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet1, addr 203.0.113.1

Packets are only being sent out GigabitEthernet1. Not so fast, though! The current way of simulating a failure is by filtering packets. What happens if Gi1 actually goes down? I will shut down the interface to simulate a failure where the interface towards ISP1 goes down. I added some debugs to show what goes on in the background:

Aug 22 08:18:05.209: RT: interface GigabitEthernet1 removed from routing table
Aug 22 08:18:05.209: RT: del 203.0.113.0 via 0.0.0.0, connected metric [0/0]
Aug 22 08:18:05.209: RT: delete subnet route to 203.0.113.0/29
Aug 22 08:18:05.209: CONN: delete conn route, idb: GigabitEthernet1, addr: 203.0.113.2, mask: 255.255.255.248
Aug 22 08:18:05.210: CONN(multicast): connected_route: FALSE
Aug 22 08:18:05.210: RT: interface GigabitEthernet1 topo state DOWN, afi 0
Aug 22 08:18:05.210: IP-ST-EV(default): queued adjust on GigabitEthernet1
Aug 22 08:18:05.210: RT: del 203.0.113.2 via 0.0.0.0, connected metric [0/0]
Aug 22 08:18:05.210: RT: delete subnet route to 203.0.113.2/32
Aug 22 08:18:05.221: RT: del 8.8.8.8 via 203.0.113.1, static metric [1/0]
Aug 22 08:18:05.221: RT: delete subnet route to 8.8.8.8/32
Aug 22 08:18:06.108: %SYS-5-CONFIG_I: Configured from console by daniel on vty0 (10.254.255.2)
Aug 22 08:18:07.202: %LINK-5-CHANGED: Interface GigabitEthernet1, changed state to administratively down
Aug 22 08:18:07.208: CONN: connected_route: FALSE
Aug 22 08:18:07.208: is_up: GigabitEthernet1 0 state: 6 sub state: 1 line: 0
Aug 22 08:18:08.203: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1, changed state to down
Aug 22 08:18:08.204: CONN: connected_route: FALSE
Aug 22 08:18:08.204: is_up: GigabitEthernet1 0 state: 6 sub state: 1 line: 0

From the debug above, we can see that when Gi1 goes down, the connected subnet 203.0.113.0/29 is removed, but the route to 8.8.8.8 is removed as well. This means that the SLA packets are now flowing via Gi2:

Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet2, addr 203.0.113.9

The default route via Gi1 can’t be installed as the interface is down, but if we were tracking SLA statistics, the data would be skewed, as the probes now succeed via ISP2 even though ISP1 is down.

There is a more elegant solution, though. It is possible to add a permanent static route using the permanent keyword:

Edge(config)#ip route 8.8.8.8 255.255.255.255 203.0.113.1 permanent name SLA

Notice the permanent keyword in the output below:

Edge#show ip route 8.8.8.8
Routing entry for 8.8.8.8/32
  Known via "static", distance 1, metric 0
  Routing Descriptor Blocks:
  * 203.0.113.1, permanent
      Route metric is 0, traffic share count is 1

When the interface is shut down, this route will remain:

Edge(config)#int gi1
Edge(config-if)#sh
Edge(config-if)#^Z
Edge#show ip route 8.8.8.8
Routing entry for 8.8.8.8/32
  Known via "static", distance 1, metric 0
  Routing Descriptor Blocks:
  * 203.0.113.1, permanent
      Route metric is 0, traffic share count is 1

This ensures that SLA packets can only ever use Gi1. This works well. Keep one thing in mind, though: what we just configured applies to ALL packets towards 8.8.8.8, not only the SLA packets. If people are using 8.8.8.8 as their resolver, and ISP1 is down or having issues, packets towards 8.8.8.8 will NOT be able to use ISP2. We have essentially made our secondary link unavailable for any packets towards 8.8.8.8. This is not that great. What if we could change the routing towards 8.8.8.8 for only the packets generated by the router itself? We can, but it means we’ll have to use our old friend/foe, policy-based routing. Let’s configure PBR:

ip access-list extended G1-ICMP-TO-GOOGLE-DNS
 permit icmp host 203.0.113.2 host 8.8.8.8 echo
 !
route-map LOCAL-POLICY permit 10
 match ip address G1-ICMP-TO-GOOGLE-DNS
 set ip next-hop 203.0.113.1
!
ip local policy route-map LOCAL-POLICY
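
Conceptually, the local policy makes this decision for every router-generated packet. The sketch below is illustrative Python, not router code; the fall-through next-hop is just an assumed RIB result for the current default route:

```python
# Sketch of the local PBR decision above: only the SLA probe
# (ICMP echo from 203.0.113.2 to 8.8.8.8) gets the forced next-hop;
# everything else falls through to normal routing.

ROUTED_NEXT_HOP = "203.0.113.9"  # whatever the RIB currently says (assumed)

def local_pbr_next_hop(src, dst, proto, icmp_type=None):
    # route-map LOCAL-POLICY permit 10 / match G1-ICMP-TO-GOOGLE-DNS
    if (src, dst, proto, icmp_type) == ("203.0.113.2", "8.8.8.8", "icmp", "echo"):
        return "203.0.113.1"  # set ip next-hop 203.0.113.1
    return ROUTED_NEXT_HOP    # no match: normal destination-based routing

print(local_pbr_next_hop("203.0.113.2", "8.8.8.8", "icmp", "echo"))  # 203.0.113.1
print(local_pbr_next_hop("203.0.113.2", "8.8.8.8", "udp"))           # 203.0.113.9
```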

The policy is being used:

Edge#show ip local policy 
Local policy routing is enabled, using route map LOCAL-POLICY
route-map LOCAL-POLICY, permit, sequence 10
  Match clauses:
    ip address (access-lists): G1-ICMP-TO-GOOGLE-DNS 
  Set clauses:
    ip next-hop 203.0.113.1
  Policy routing matches: 2 packets, 128 bytes
Edge#show ip access-lists 
Extended IP access list G1-ICMP-TO-GOOGLE-DNS
    10 permit icmp host 203.0.113.2 host 8.8.8.8 echo (3 matches)

Packets to 8.8.8.8 can use the secondary path:

Edge#show ip cef 8.8.8.8
0.0.0.0/0
  nexthop 203.0.113.9 GigabitEthernet2

This all looks great. We have pinned the SLA packets to Gi1, but user traffic to 8.8.8.8 can still use the secondary path. So far we have only used ICMP Echo, which is not the best way of determining whether a path is healthy. Let’s look into some more advanced options.

IP SLA using DNS

Instead of sending ICMP Echo packets to DNS servers, what if we just sent DNS queries? Wouldn’t this be better? It would indeed, as the job of a DNS server is to respond to DNS queries, not ICMP Echo packets. It’s possible to configure IP SLA to send DNS queries. Define the name to be queried and the name server in the IP SLA configuration:

ip sla 1
 dns google.com name-server 8.8.8.8
 frequency 10
!
ip sla schedule 1 life forever start-time now

Edge#show ip sla statistics 
IPSLAs Latest Operation Statistics

IPSLA operation id: 1
        Latest RTT: 8 milliseconds
Latest operation start time: 13:08:24 SWE Mon Aug 22 2022
Latest operation return code: OK
Number of successes: 4
Number of failures: 0
Operation time to live: Forever


Edge#show ip route track-table 
 ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 1 state is [up]

The default route is now installed based on 8.8.8.8 responding to our query for the google.com name. This is a lot better! Consider this, though:

  • What happens if there is no response for google.com?
  • What happens if 8.8.8.8 does not respond at all?
  • What is the packet length of a DNS query?

Before answering the first two questions, why would we care about the packet size? What if DNS queries can go through but other user traffic can’t? Maybe the path does not allow for 1500-byte packets due to some upstream issue… First, let’s check the size of the SLA packet by using some packet capturing magic:

Edge#debug platform condition ipv4 8.8.8.8/32 both
Edge#debug platform packet-trace packet 256
 Please remember to turn on 'debug platform condition start' for packet-trace to work
Edge#debug platform condition start 
Edge#show platform packet-trace sum
Pkt   Input                     Output                    State  Reason
0     INJ.2                     Gi1                       FWD    
1     Gi1                       internal0/0/rp:0          PUNT   11  (For-us data)

Edge#show platform packet-trace packet 0
Packet: 0           CBUG ID: 0

IOSd Path Flow: 
  Feature: UDP
  Pkt Direction: OUT
  src=203.0.113.2(53883), dst=8.8.8.8(53), length=36

  Feature: UDP
  Pkt Direction: OUT
  FORWARDED 
        UDP: Packet Handoff to IP
        Source      : 203.0.113.2(53883)
        Destination : 8.8.8.8(53)


  Feature: IP
  Pkt Direction: OUT
  Route out the generated packet. srcaddr: 203.0.113.2, dstaddr: 8.8.8.8
Summary
  Input     : INJ.2  
  Output    : GigabitEthernet1
  State     : FWD 
  Timestamp
    Start   : 97239727399057 ns (08/22/2022 11:10:03.807558 UTC)
    Stop    : 97239727755013 ns (08/22/2022 11:10:03.807914 UTC)
Path Trace
  Feature: IPV4(Input)
    Input       : internal0/0/rp:0
    Output      : <unknown>
    Source      : 203.0.113.2
    Destination : 8.8.8.8
    Protocol    : 17 (UDP)
      SrcPort   : 53883
      DstPort   : 53
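
The length=36 in the UDP handoff above lines up with the DNS wire format. Here is a quick back-of-the-envelope check in plain Python (nothing IOS-specific); it assumes the trace shows the UDP datagram length, header included, and that the query carries no EDNS option:

```python
# A DNS query is a 12-byte header, the encoded QNAME, and 4 bytes of
# QTYPE + QCLASS; the UDP header adds another 8 bytes on top.

def dns_query_udp_length(name):
    # QNAME: one length byte per label plus the label itself, then a root byte.
    qname = sum(1 + len(label) for label in name.split(".")) + 1
    dns_message = 12 + qname + 4  # header + QNAME + QTYPE + QCLASS
    return 8 + dns_message        # UDP header + DNS payload

print(dns_query_udp_length("google.com"))  # -> 36, matching the trace
```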

This is quite a small packet, only 36 bytes. Let’s get back to this later. For our first problem, how can we track more than just google.com, and also send queries to more than one DNS server? That can be implemented using multiple IP SLA statements:

ip sla 1
 dns google.com name-server 8.8.8.8
  frequency 10
ip sla schedule 1 life forever start-time now
ip sla 2
 dns amazon.com name-server 208.67.220.220
  frequency 10
ip sla schedule 2 life forever start-time now
ip sla 3
 dns microsoft.com name-server 1.1.1.1
  frequency 10
ip sla schedule 3 life forever start-time now
!
track 2 ip sla 2 reachability
!
track 3 ip sla 3 reachability

Then, configure a track statement that uses Boolean logic for all of these SLA statements:

track 10 list boolean or
 object 1
 object 2
 object 3

Finally, update the default route to use the new tracker:

Edge(config)#no ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 1
Edge(config)#ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 10

Let’s verify:

Edge#show track 10
Track 10
  List boolean or
  Boolean OR is Up
    2 changes, last change 00:02:25
    object 1 Up
    object 2 Up
    object 3 Up
  Tracked by:
    Static IP Routing 0
Edge#show ip route 0.0.0.0
Routing entry for 0.0.0.0/0, supernet
  Known via "static", distance 1, metric 0, candidate default path
  Routing Descriptor Blocks:
  * 203.0.113.1
      Route metric is 0, traffic share count is 1
Edge#show ip route track-table 
 ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 10 state is [up]
Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet1, addr 203.0.113.1

This looks great! We now have three SLA statements, and one of them failing is not enough to move traffic to the secondary path. If just one of them fails, it’s more likely that there is a temporary issue with a single DNS server or DNS zone.
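
The Boolean evaluation of a track list is simple enough to sketch. Illustrative Python, not router code: a "boolean or" list is up while any member object is up, so the tracked route is only withdrawn when every probe fails, whereas a "boolean and" list needs every object up.

```python
# Sketch of track-list evaluation: "or" tolerates individual probe
# failures, "and" demands that every probe succeeds.

def track_list(objects, mode):
    states = objects.values()
    return any(states) if mode == "or" else all(states)

# track 10 list boolean or: one failing DNS probe does not flip the route
probes = {1: True, 2: False, 3: True}
print(track_list(probes, "or"))   # True: default route stays installed
print(track_list(probes, "and"))  # False: an AND list would withdraw it
```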

Remember I mentioned something about packet sizes? Let’s discuss this in the next section.

IP SLA using HTTP

Tracking connectivity using DNS is definitely more useful than a simple ICMP Echo. What if we could move even further up the stack? This can be achieved by using HTTP probes. Rather than just checking that we get responses to DNS queries, let’s try to actually connect to a web site. The syntax is similar to that of DNS:

ip sla 5
 http secure get https://amazon.com name-server 8.8.8.8 source-interface GigabitEthernet1
!
ip sla schedule 5 life forever start-time now
!
track 5 ip sla 5 reachability 
!
no ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 10
ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 5    

Then let’s verify:

Edge#show ip sla statistics 5
IPSLAs Latest Operation Statistics

IPSLA operation id: 5
        Latest RTT: 1175 milliseconds
Latest operation start time: 13:57:12 SWE Mon Aug 22 2022
Latest operation return code: OK
Latest DNS RTT: 7 ms
Latest HTTP Transaction RTT: 1168 ms
Number of successes: 2
Number of failures: 0
Operation time to live: Forever


Edge#show ip route track-table
 ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 5 state is [up]

The router is now sending an HTTP GET for the name amazon.com, first resolving the name through a DNS query to 8.8.8.8. This tests more of the stack than a DNS query alone. What’s the size of our SLA packets now? Let’s use another packet capture utility to check:

ip access-list extended CAP-HTTP
 10 permit tcp host 203.0.113.2 any
 20 permit tcp any host 203.0.113.2
!
Edge#monitor capture CAP interface GigabitEthernet1 both
Edge#monitor capture CAP access-list CAP-HTTP 
Edge#monitor capture CAP start
Started capture point : CAP
Edge#monitor capture CAP export flash:/CAP.pcap 
Exported Successfully
Edge#copy flash:/CAP.pcap ftp://redacted:[email protected]

Let’s have a look at the PCAP:

PCAP of SLA HTTP probe

The packets are definitely larger than when just using DNS. We see packets approaching 600 bytes in size. Not quite 1500 bytes, though! For production use, we should of course track more than one web server. What else can we do when it comes to IP SLA? Time to get fancy!

Getting fancy with IP SLA

There’s a lot we can do with IP SLA. Let’s implement this:

  • Resolve google.com via 8.8.8.8
  • Resolve amazon.com via 208.67.220.220
  • HTTP GET to amazon.com
  • Ping sdwan.measure.office.com (O365 beacon service) with 1500-byte ICMP Echo packets

If everything succeeds, the circuit is functioning, as we can send ICMP, resolve DNS, and connect to web sites. This is the configuration:

track 1 ip sla 1 reachability
!
track 2 ip sla 2 reachability
!
track 5 ip sla 5 reachability
!
track 20 ip sla 20 reachability
!
track 100 list boolean and
 object 1
 object 2
 object 5
 object 20
!
ip sla 1
 dns google.com name-server 8.8.8.8
  frequency 10
ip sla schedule 1 life forever start-time now
ip sla 2
 dns amazon.com name-server 208.67.220.220
  frequency 10
ip sla schedule 2 life forever start-time now
ip sla 5
 http secure get https://amazon.com name-server 8.8.8.8 source-interface GigabitEthernet1
ip sla schedule 5 life forever start-time now
ip sla 20
 icmp-echo sdwan.measure.office.com source-interface GigabitEthernet1
  request-data-size 1450
  frequency 10
ip sla schedule 20 life forever start-time now

Note that the request data size is set to 1450, which generates 1500-byte ICMP Echo packets. I’m not sure how the math works out here, but I verified it with a packet capture. When a DNS name is used for an ICMP Echo SLA, the name is resolved to an IP address at configuration time, and that IP address is what ends up in the running configuration. Let’s see if this all works:

Edge#show track 100
Track 100
  List boolean and
  Boolean AND is Up
    2 changes, last change 00:00:03
    object 1 Up
    object 2 Up
    object 5 Up
    object 20 Up
  Tracked by:
    Static IP Routing 0
Edge#show ip route track-table 
 ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 100 state is [up]
Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet1, addr 203.0.113.1

Pretty cool! Using only static routes and IP SLA, we now have a pretty good mechanism for verifying connectivity; a lot better than the simple ICMP Echos we started with.

Conclusion

You can make this as simple or as complex as you want using IP SLA. It all comes down to your requirements. Keep the following in mind:

  • Consider what you want to measure
  • How do you ensure SLA packets are flowing towards the correct ISP?
  • How many things do you want to measure to ensure the path is good?

I hope this post has been informative and that you have learned some new IP SLA tricks as well as some good debugging and packet capture commands.

Internet Edge IP SLA Deep Dive