It is a common design to have an internet Edge router connected to two different internet service providers to protect against the failure of an ISP bringing the office down. The topology may look something like this:
The two ISPs are used in an active/standby fashion using static routes. This is normally implemented by using two default routes where one of the routes is a floating static route. It will look something like this:
ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY
ip route 0.0.0.0 0.0.0.0 203.0.113.9 200 name SECONDARY
With this configuration, if the interface to ISP1 goes down, the floating static route which has an administrative distance (AD) of 200 will be installed and traffic will flow via ISP2. The drawback to this configuration is that it only works if the physical interface goes down. What happens if ISP1’s CPE has the interface towards the customer up but the interface towards the ISP Core goes down? What happens if there is a failure in another part of the ISP’s network? What if all interfaces are up but they are having BGP issues in the network?
In scenarios like these, since the customer’s interface towards ISP1 is still up, traffic would flow to ISP1 but would not reach its final destination. The traffic will then be blackholed. To prevent failures like these, the IP SLA feature can be used to track something of importance, such as a service provided by the ISP or one outside of the ISP network, so that the static route is only installed while that service is reachable. How do we select what to track, though?
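To make the limitation concrete, here is a minimal Python sketch of how the router chooses between the two default routes. It is a simplified model, not how IOS is implemented: the `StaticRoute` class, the `installed_default` function and the `Gi1`/`Gi2` interface labels are assumptions made for illustration. The point is that interface state is the only input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticRoute:
    name: str
    next_hop: str
    interface: str      # assumed egress interface, for illustration only
    distance: int = 1   # default administrative distance of a static route


def installed_default(routes, interface_up):
    """Return the usable route with the lowest administrative distance."""
    usable = [r for r in routes if interface_up[r.interface]]
    return min(usable, key=lambda r: r.distance, default=None)


ROUTES = [
    StaticRoute("PRIMARY", "203.0.113.1", "Gi1"),
    StaticRoute("SECONDARY", "203.0.113.9", "Gi2", distance=200),
]

# Link to ISP1 physically down: the floating static route takes over.
link_down = installed_default(ROUTES, {"Gi1": False, "Gi2": True})

# Failure deeper inside ISP1: our interface is still up, so nothing changes
# and traffic keeps being sent to ISP1.
upstream_failure = installed_default(ROUTES, {"Gi1": True, "Gi2": True})

print("link down        ->", link_down.name)
print("upstream failure ->", upstream_failure.name)
```

In this model an upstream failure is indistinguishable from a healthy link, which is the gap that tracking is meant to close.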
Selecting what to track
What services are important to an ISP? They normally provide a resolver service. If the resolvers are down, that prevents people from browsing the internet, so the resolver service is very important to the ISP. What else? The ISP has a web page where you order products and interact with them, for example verizon.com. If that site is down, they lose money, so it will be an important service to them. Using Verizon as an example, let’s find out what IP addresses are interesting to track. We can do this using dig on a Linux host. First let’s see what verizon.com resolves to:
dig verizon.com
; <<>> DiG 9.16.1-Ubuntu <<>> verizon.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52122
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;verizon.com.                   IN      A
;; ANSWER SECTION:
verizon.com.            600     IN      A       192.16.31.89
;; Query time: 40 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: sön aug 21 08:54:45 CEST 2022
;; MSG SIZE rcvd: 56
The IP address of interest is 192.16.31.89. When it comes to resolvers, the addresses will most likely vary per region and service. What we can do instead is check which name servers Verizon uses for the verizon.com domain. These are guaranteed to be important as well.
daniel@devasc:~$ dig ns verizon.com
; <<>> DiG 9.16.1-Ubuntu <<>> ns verizon.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55703
;; flags: qr rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;verizon.com.                   IN      NS
;; ANSWER SECTION:
verizon.com.            3600    IN      NS      s1ns1.verizon.com.
verizon.com.            3600    IN      NS      ns2.edgecastdns.net.
verizon.com.            3600    IN      NS      s3ns3.verizon.com.
verizon.com.            3600    IN      NS      s2ns2.verizon.com.
verizon.com.            3600    IN      NS      ns1.edgecastdns.net.
verizon.com.            3600    IN      NS      s4ns4.verizon.com.
verizon.com.            3600    IN      NS      ns3.edgecastdns.net.
verizon.com.            3600    IN      NS      ns4.edgecastdns.net.
;; Query time: 624 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: sön aug 21 09:00:26 CEST 2022
;; MSG SIZE rcvd: 207
There are several servers here. Note that it seems Verizon is using a 3rd party DNS service to provide resiliency for their name servers. Pick a server and check the IP:
dig s1ns1.verizon.com
; <<>> DiG 9.16.1-Ubuntu <<>> s1ns1.verizon.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12917
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;s1ns1.verizon.com.             IN      A
;; ANSWER SECTION:
s1ns1.verizon.com.      3573    IN      A       192.16.16.5
;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: sön aug 21 09:00:52 CEST 2022
;; MSG SIZE rcvd: 62
The IP address 192.16.16.5 is what we are looking for here. We now have two IP addresses we can track. Conceptually, it looks like this:
This is a good first step. We have identified important services to the ISP. However, this is not the best way of tracking availability of your internet service. Why?
- The availability of these services does not guarantee availability towards the greater internet
- These services may not respond to ICMP Echo packets, or may rate limit them
While we did achieve measuring availability beyond the CPE of the ISP, we want to make sure that we can reach services that are not local to the ISP. This is usually where people start tracking something like 8.8.8.8, which is Google’s well known resolver service. This would look like this:
Compared to the previous scenario, the health of 8.8.8.8 should be more relevant as it’s not local to the ISP. However, as this is a resolver service, responding to ICMP Echo is not in the job description, meaning that ICMP may get rate limited. Let’s implement tracking of 8.8.8.8 and then describe some of the challenges/caveats.
Basic implementation and caveats
Let’s start with a standard implementation of IP SLA tracking and then I’ll describe some of the challenges/caveats. Here is the basic configuration:
interface GigabitEthernet1
 ip address 203.0.113.2 255.255.255.248
!
interface GigabitEthernet2
 ip address 203.0.113.10 255.255.255.248
!
ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 1
ip route 0.0.0.0 0.0.0.0 203.0.113.9 200 name SECONDARY
!
ip sla 1
 icmp-echo 8.8.8.8 source-ip 203.0.113.2
ip sla schedule 1 life forever start-time now
!
track 1 ip sla 1 reachability
GigabitEthernet1 is towards ISP1 and GigabitEthernet2 is towards ISP2. The following commands can be used to verify the IP SLA setup:
Edge#show ip sla sum
IPSLAs Latest Operation Summary
Codes: * active, ^ inactive, ~ pending
All Stats are in milliseconds. Stats with u are in microseconds
ID           Type        Destination       Stats       Return      Last
                                                       Code        Run
-----------------------------------------------------------------------
*1           icmp-echo   8.8.8.8           RTT=8       OK          44 seconds ago
Edge#show ip sla statistics
IPSLAs Latest Operation Statistics
IPSLA operation id: 1
        Latest RTT: 7 milliseconds
Latest operation start time: 07:10:40 UTC Mon Aug 22 2022
Latest operation return code: OK
Number of successes: 14
Number of failures: 0
Operation time to live: Forever
Edge#show ip route track-table
 ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 1 state is [up]
Edge#show ip route 0.0.0.0
Routing entry for 0.0.0.0/0, supernet
  Known via "static", distance 1, metric 0, candidate default path
  Routing Descriptor Blocks:
  * 203.0.113.1
      Route metric is 0, traffic share count is 1
This is behaving as expected. The IP SLA is up. The tracker is up. The default route towards ISP1 is installed. Now, let’s simulate a failure of ISP1. I will implement this in the background using an ACL in my lab filtering the ICMP Echo packets. Let’s check the logs:
Aug 22 07:37:22.171: %TRACK-6-STATE: 1 ip sla 1 reachability Down -> Up
Aug 22 07:37:32.171: %TRACK-6-STATE: 1 ip sla 1 reachability Up -> Down
Aug 22 07:37:42.171: %TRACK-6-STATE: 1 ip sla 1 reachability Down -> Up
Aug 22 07:37:52.171: %TRACK-6-STATE: 1 ip sla 1 reachability Up -> Down
Aug 22 07:38:02.171: %TRACK-6-STATE: 1 ip sla 1 reachability Down -> Up
This doesn’t look too good. Why is the reachability flapping? Let’s check some of the routes:
Edge#show ip route 0.0.0.0
Routing entry for 0.0.0.0/0, supernet
  Known via "static", distance 200, metric 0, candidate default path
  Routing Descriptor Blocks:
  * 203.0.113.9
      Route metric is 0, traffic share count is 1
Edge#show ip cef 8.8.8.8
0.0.0.0/0
  nexthop 203.0.113.9 GigabitEthernet2
Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet2, addr 203.0.113.9
This looks as expected. The default route is now pointing towards 203.0.113.9. Notice what the next-hop is for 8.8.8.8, though… I’ll come back to this but first let’s check the routing a few seconds later:
Edge#show ip route 0.0.0.0
Routing entry for 0.0.0.0/0, supernet
  Known via "static", distance 1, metric 0, candidate default path
  Routing Descriptor Blocks:
  * 203.0.113.1
      Route metric is 0, traffic share count is 1
Edge#show ip cef 8.8.8.8
0.0.0.0/0
  nexthop 203.0.113.1 GigabitEthernet1
Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet1, addr 203.0.113.1
The routing is flapping between 203.0.113.1 and 203.0.113.9. Why is this happening? This is because initially the SLA packets flow through GigabitEthernet1. This path then fails so the SLA packets are sent towards GigabitEthernet2 as the default route towards Gi1 is removed. When the SLA packets are sent towards Gi2, they succeed. Since the SLA is successful, the tracked default route gets installed again. And the process repeats…
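The feedback loop is easy to reproduce in a few lines of Python. This is only a toy model of the behaviour described above, assuming one probe per cycle and a tracker that reacts immediately. The `simulate` function and its parameters are hypothetical names, not anything from IOS.

```python
def simulate(cycles, isp1_healthy=False, isp2_healthy=True):
    """Replay probe cycles where the probe simply follows the default route."""
    primary_installed = True  # the tracked PRIMARY route starts out installed
    history = []
    for _ in range(cycles):
        # The probe leaves via whichever default route is currently installed.
        egress = "Gi1" if primary_installed else "Gi2"
        probe_ok = isp1_healthy if egress == "Gi1" else isp2_healthy
        # The track object follows the probe result, which in turn installs
        # or removes the PRIMARY route for the next cycle.
        primary_installed = probe_ok
        history.append((egress, "Up" if probe_ok else "Down"))
    return history


for egress, state in simulate(4):
    print(f"probe via {egress}: track {state}")
```

With ISP1 failing, the probe alternates between Gi1 (fails) and Gi2 (succeeds), which mirrors the Up/Down pattern in the logs. The loop only stops if the probe is prevented from ever leaving via Gi2.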
To prevent this from happening, we must ensure that SLA packets only get sent towards Gi1. How can we do this? What happens if we set the source interface in our SLA configuration?
ip sla 1
 icmp-echo 8.8.8.8 source-interface GigabitEthernet1
 frequency 10
ip sla schedule 1 life forever start-time now
Unfortunately, the results are still the same. We still have flapping. When configuring the SLA, it does say ingress interface, not egress:
Edge(config-ip-sla)#icmp-echo 8.8.8.8 ?
  source-interface  Source Interface (ingress icmp packet interface)
  source-ip         Source Address
We can verify that packets are using GigabitEthernet2 as the egress interface:
Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet2, addr 203.0.113.9
Unfortunately we can’t specify the egress interface of the SLA packets. So what can we do? The options we have are:
- Create a static route for the destination of the SLA packets
- Use policy-based routing to force SLA packets out GigabitEthernet1
Let’s try using the static route approach first. A static route for 8.8.8.8 is added:
Edge(config)#ip route 8.8.8.8 255.255.255.255 203.0.113.1 name SLA
All packets to 8.8.8.8 should only be flowing via 203.0.113.1 now, right? Initially, this looks promising:
Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet1, addr 203.0.113.1
Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet1, addr 203.0.113.1
Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet1, addr 203.0.113.1
Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet1, addr 203.0.113.1
Packets are only being sent out GigabitEthernet1. Not so fast, though! The current way of simulating a failure is by filtering packets. What happens if Gi1 actually goes down? I will shut down the interface to simulate a failure where the interface towards ISP1 goes down. I added some debugs to show what goes on in the background:
Aug 22 08:18:05.209: RT: interface GigabitEthernet1 removed from routing table
Aug 22 08:18:05.209: RT: del 203.0.113.0 via 0.0.0.0, connected metric [0/0]
Aug 22 08:18:05.209: RT: delete subnet route to 203.0.113.0/29
Aug 22 08:18:05.209: CONN: delete conn route, idb: GigabitEthernet1, addr: 203.0.113.2, mask: 255.255.255.248
Aug 22 08:18:05.210: CONN(multicast): connected_route: FALSE
Aug 22 08:18:05.210: RT: interface GigabitEthernet1 topo state DOWN, afi 0
Aug 22 08:18:05.210: IP-ST-EV(default): queued adjust on GigabitEthernet1
Aug 22 08:18:05.210: RT: del 203.0.113.2 via 0.0.0.0, connected metric [0/0]
Aug 22 08:18:05.210: RT: delete subnet route to 203.0.113.2/32
Aug 22 08:18:05.221: RT: del 8.8.8.8 via 203.0.113.1, static metric [1/0]
Aug 22 08:18:05.221: RT: delete subnet route to 8.8.8.8/32
Aug 22 08:18:06.108: %SYS-5-CONFIG_I: Configured from console by daniel on vty0 (10.254.255.2)
Aug 22 08:18:07.202: %LINK-5-CHANGED: Interface GigabitEthernet1, changed state to administratively down
Aug 22 08:18:07.208: CONN: connected_route: FALSE
Aug 22 08:18:07.208: is_up: GigabitEthernet1 0 state: 6 sub state: 1 line: 0
Aug 22 08:18:08.203: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1, changed state to down
Aug 22 08:18:08.204: CONN: connected_route: FALSE
Aug 22 08:18:08.204: is_up: GigabitEthernet1 0 state: 6 sub state: 1 line: 0
From the debug above, we can see that when Gi1 goes down, the connected subnet 203.0.113.0/29 is removed, but so is the static route to 8.8.8.8, because its next-hop is no longer reachable. This means that the SLA packets are now flowing via Gi2:
Edge#show ip cef exact-route 8.8.8.8 203.0.113.2
8.8.8.8 -> 203.0.113.2 =>IP adj out of GigabitEthernet2, addr 203.0.113.9
The default route via Gi1 can’t be installed as the interface is down, but if we were tracking SLA statistics, it would skew the data as these packets are now going through even though ISP1 is down.
There is a more elegant solution, though. It is possible to add a permanent static route using the permanent keyword:
Edge(config)#ip route 8.8.8.8 255.255.255.255 203.0.113.1 permanent name SLA
Notice the permanent keyword in the output below:
Edge#show ip route 8.8.8.8
Routing entry for 8.8.8.8/32
  Known via "static", distance 1, metric 0
  Routing Descriptor Blocks:
  * 203.0.113.1, permanent
      Route metric is 0, traffic share count is 1
When the interface is shut down, this route will remain:
Edge(config)#int gi1
Edge(config-if)#sh
Edge(config-if)#^Z
Edge#show ip route 8.8.8.8
Routing entry for 8.8.8.8/32
  Known via "static", distance 1, metric 0
  Routing Descriptor Blocks:
  * 203.0.113.1, permanent
      Route metric is 0, traffic share count is 1
This ensures that SLA packets can only ever use Gi1. This works well. Keep in mind one thing, though. What we just configured will apply for ALL packets towards 8.8.8.8. Not only the SLA packets. If people are using 8.8.8.8 as their resolver, and ISP1 is down or having issues, packets towards 8.8.8.8 will NOT be able to use ISP2. We have essentially made our secondary link unavailable for any packets towards 8.8.8.8. This is not that great. What if we could change the routing towards 8.8.8.8 for only the packets generated by the router itself? We can, but it means we’ll have to use our old friend/foe policy based routing. Let’s configure PBR:
ip access-list extended G1-ICMP-TO-GOOGLE-DNS
 permit icmp host 203.0.113.2 host 8.8.8.8 echo
!
route-map LOCAL-POLICY permit 10
 match ip address G1-ICMP-TO-GOOGLE-DNS
 set ip next-hop 203.0.113.1
!
ip local policy route-map LOCAL-POLICY
The policy is being used:
Edge#show ip local policy
Local policy routing is enabled, using route map LOCAL-POLICY
route-map LOCAL-POLICY, permit, sequence 10
  Match clauses:
    ip address (access-lists): G1-ICMP-TO-GOOGLE-DNS
  Set clauses:
    ip next-hop 203.0.113.1
  Policy routing matches: 2 packets, 128 bytes
Edge#show ip access-lists
Extended IP access list G1-ICMP-TO-GOOGLE-DNS
    10 permit icmp host 203.0.113.2 host 8.8.8.8 echo (3 matches)
Packets to 8.8.8.8 can use the secondary path:
Edge#show ip cef 8.8.8.8
0.0.0.0/0
  nexthop 203.0.113.9 GigabitEthernet2
This all looks great. We have pinned the SLA packets to Gi1 but user traffic to 8.8.8.8 can still use the secondary path. So far we have only used ICMP Echo, which is not the best way of determining if a path is healthy. Let’s look into some more advanced options.
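The difference between the two approaches (permanent host route versus local policy) can be sketched in Python. This is a rough model under stated assumptions: ISP1 has failed so the default route points at ISP2, the `Packet` type and both `forward_*` functions are hypothetical, the LAN host 10.0.0.10 is made up for the example, and the ACL match is simplified to protocol only rather than ICMP echo specifically.

```python
import ipaddress
from typing import NamedTuple


class Packet(NamedTuple):
    src: str
    dst: str
    proto: str
    local: bool  # True when the router itself generated the packet


ISP1, ISP2 = "203.0.113.1", "203.0.113.9"


def rib_lookup(rib, dst):
    """Longest-prefix match over (prefix, next_hop) pairs."""
    addr = ipaddress.ip_address(dst)
    candidates = [(ipaddress.ip_network(prefix), nh) for prefix, nh in rib]
    matches = [(net, nh) for net, nh in candidates if addr in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]


def forward_with_host_route(pkt):
    """Permanent /32 route: every packet to 8.8.8.8 is pinned to ISP1."""
    rib = [("0.0.0.0/0", ISP2), ("8.8.8.8/32", ISP1)]
    return rib_lookup(rib, pkt.dst)


def forward_with_local_policy(pkt):
    """Local PBR: only the router's own ICMP probe is pinned to ISP1."""
    probe_match = ("203.0.113.2", "8.8.8.8", "icmp")
    if pkt.local and (pkt.src, pkt.dst, pkt.proto) == probe_match:
        return ISP1
    return rib_lookup([("0.0.0.0/0", ISP2)], pkt.dst)


probe = Packet("203.0.113.2", "8.8.8.8", "icmp", local=True)
user_dns = Packet("10.0.0.10", "8.8.8.8", "udp", local=False)

for label, forward in (("host route", forward_with_host_route),
                       ("local policy", forward_with_local_policy)):
    print(f"{label:12} probe -> {forward(probe)}  user -> {forward(user_dns)}")
```

In both cases the probe is pinned to 203.0.113.1, but only the local policy lets the user’s DNS query fail over to 203.0.113.9.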
IP SLA using DNS
Instead of sending ICMP Echo packets to DNS servers, what if we just sent DNS queries? Wouldn’t this be better? It would indeed, as the job of a DNS server is to respond to DNS queries, not ICMP Echo packets. It’s possible to configure IP SLA to send DNS queries. Define the name to be queried and the name server in the IP SLA configuration:
ip sla 1
 dns google.com name-server 8.8.8.8
 frequency 10
!
ip sla schedule 1 life forever start-time now
Edge#show ip sla statistics
IPSLAs Latest Operation Statistics
IPSLA operation id: 1
        Latest RTT: 8 milliseconds
Latest operation start time: 13:08:24 SWE Mon Aug 22 2022
Latest operation return code: OK
Number of successes: 4
Number of failures: 0
Operation time to live: Forever
Edge#show ip route track-table
 ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 1 state is [up]
The default route is now installed based on whether 8.8.8.8 responds to our query for the google.com name. This is a lot better! Consider this, though:
- What happens if there is no response for google.com?
- What happens if 8.8.8.8 does not respond at all?
- What is the packet length of a DNS query?
Before answering the first two questions, why would we care about the packet size? What if DNS queries can go through but other user traffic can’t? Maybe the path does not allow for 1500 bytes packets due to some upstream issue… First, let’s check what the size of the SLA packet is by using some packet capturing magic:
Edge#debug platform condition ipv4 8.8.8.8/32 both
Edge#debug platform packet-trace packet 256
Please remember to turn on 'debug platform condition start' for packet-trace to work
Edge#debug platform condition start
Edge#show platform packet-trace sum
Pkt   Input             Output            State  Reason
0     INJ.2             Gi1               FWD
1     Gi1               internal0/0/rp:0  PUNT   11 (For-us data)
Edge#show platform packet-trace packet 0
Packet: 0           CBUG ID: 0
IOSd Path Flow:
  Feature: UDP
  Pkt Direction: OUT
  src=203.0.113.2(53883), dst=8.8.8.8(53), length=36
  Feature: UDP
  Pkt Direction: OUT
  FORWARDED
  UDP: Packet Handoff to IP
  Source      : 203.0.113.2(53883)
  Destination : 8.8.8.8(53)
  Feature: IP
  Pkt Direction: OUT
  Route out the generated packet.
  srcaddr: 203.0.113.2, dstaddr: 8.8.8.8
Summary
  Input     : INJ.2
  Output    : GigabitEthernet1
  State     : FWD
  Timestamp
    Start   : 97239727399057 ns (08/22/2022 11:10:03.807558 UTC)
    Stop    : 97239727755013 ns (08/22/2022 11:10:03.807914 UTC)
Path Trace
  Feature: IPV4(Input)
    Input       : internal0/0/rp:0
    Output      : <unknown>
    Source      : 203.0.113.2
    Destination : 8.8.8.8
    Protocol    : 17 (UDP)
      SrcPort   : 53883
      DstPort   : 53
This is quite a small packet, only 36 bytes. Let’s get back to this later. For our first problem, how can we track something more than google.com and also send queries to more than one DNS server? That can be implemented using multiple IP SLA statements:
ip sla 1
 dns google.com name-server 8.8.8.8
 frequency 10
ip sla schedule 1 life forever start-time now
ip sla 2
 dns amazon.com name-server 208.67.220.220
 frequency 10
ip sla schedule 2 life forever start-time now
ip sla 3
 dns microsoft.com name-server 1.1.1.1
 frequency 10
ip sla schedule 3 life forever start-time now
!
track 2 ip sla 2 reachability
!
track 3 ip sla 3 reachability
Then, configure a track statement that uses Boolean logic for all of these SLA statements:
track 10 list boolean or
 object 1
 object 2
 object 3
Finally, update the default route to use the new tracker:
Edge(config)#no ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 1
Edge(config)#ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 10
Let’s verify:
Edge#show track 10
Track 10
  List boolean or
  Boolean OR is Up
    2 changes, last change 00:02:25
    object 1 Up
    object 2 Up
    object 3 Up
  Tracked by:
    Static IP Routing 0
Edge#show ip route 0.0.0.0
Routing entry for 0.0.0.0/0, supernet
  Known via "static", distance 1, metric 0, candidate default path
  Routing Descriptor Blocks:
  * 203.0.113.1
      Route metric is 0, traffic share count is 1
Edge#show ip route track-table
 ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 10 state is [up]
Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet1, addr 203.0.113.1
This looks great! We now have three SLA statements, and because the list uses boolean or, the tracker stays up as long as at least one of them succeeds. One of them failing is not enough to move traffic to the secondary path. If just one of them fails, it’s more likely that there is a temporary issue with a DNS server or DNS zone.
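How the list tracker combines its objects can be shown with a short Python sketch. The `track_list` function is a hypothetical stand-in for the track list evaluation, and the object states are an invented snapshot in which the probe behind object 2 is failing.

```python
def track_list(mode, objects):
    """Evaluate a track list: 'or' needs any object Up, 'and' needs all Up."""
    states = objects.values()
    return any(states) if mode == "or" else all(states)


# Hypothetical snapshot: the probe behind track object 2 is failing.
objects = {1: True, 2: False, 3: True}

print("boolean or :", "Up" if track_list("or", objects) else "Down")
print("boolean and:", "Up" if track_list("and", objects) else "Down")
```

With boolean or the list stays Up in this snapshot, so the primary default route stays installed. A boolean and list would go Down on the same input, which is the stricter behaviour to pick when every probe must succeed.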
Remember I mentioned something about packet sizes? Let’s discuss this in the next section.
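As a quick sanity check on the 36 bytes seen in the packet trace earlier, the size of a DNS query can be worked out by building one by hand. The sketch below assumes the probe sends a plain single-question A query without EDNS options, which I have not verified against the router. `build_dns_query` is a hypothetical helper that follows the RFC 1035 wire format.

```python
import struct


def build_dns_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in name.split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    question = qname + struct.pack("!HH", 1, 1)
    return header + question


UDP_HEADER = 8
IPV4_HEADER = 20  # without IP options

dns_len = len(build_dns_query("google.com"))
print("DNS message:", dns_len)
print("UDP length :", dns_len + UDP_HEADER)
print("IPv4 packet:", dns_len + UDP_HEADER + IPV4_HEADER)
```

Under those assumptions the DNS message is 28 bytes and the UDP datagram is 36 bytes, which lines up with the length reported in the UDP section of the trace. Either way, it is a tiny packet compared to typical user traffic.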
IP SLA using HTTP
Tracking connectivity using DNS is definitely more useful than a simple ICMP Echo. What if we could move even further up the stack? This can be achieved by using HTTP probes. Rather than just checking that we get responses to DNS queries, let’s try to actually connect to a web site. The syntax is similar to that of DNS:
ip sla 5
 http secure get https://amazon.com name-server 8.8.8.8 source-interface GigabitEthernet1
!
ip sla schedule 5 life forever start-time now
!
track 5 ip sla 5 reachability
!
no ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 10
ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 5
Then let’s verify:
Edge#show ip sla statistics 5
IPSLAs Latest Operation Statistics
IPSLA operation id: 5
        Latest RTT: 1175 milliseconds
Latest operation start time: 13:57:12 SWE Mon Aug 22 2022
Latest operation return code: OK
Latest DNS RTT: 7 ms
Latest HTTP Transaction RTT: 1168 ms
Number of successes: 2
Number of failures: 0
Operation time to live: Forever
Edge#show ip route track-table
 ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 5 state is [up]
The router is now sending HTTP GET for the name amazon.com by first resolving the name through a DNS query to 8.8.8.8. This tests more of the stack than a DNS query. What’s the size of our SLA packets now? Let’s use another packet capturing utility to check:
ip access-list extended CAP-HTTP
 10 permit tcp host 203.0.113.2 any
 20 permit tcp any host 203.0.113.2
!
Edge#monitor capture CAP interface GigabitEthernet1 both
Edge#monitor capture CAP access-list CAP-HTTP
Edge#monitor capture CAP start
Started capture point : CAP
Edge#monitor capture CAP export flash:/CAP.pcap
Exported Successfully
Edge#copy flash:/CAP.pcap ftp://redacted:[email protected]
Let’s have a look at the PCAP:
The packets are definitely larger than when just using DNS. We see packets approaching 600 bytes in size. Not quite 1500 bytes, though! For production use we should of course track more than one web server. What else can we do when it comes to IP SLA?
Getting fancy with IP SLA
There’s a lot we can do with IP SLA. Time to get fancy! Let’s implement this:
- Resolve google.com via 8.8.8.8
- Resolve amazon.com via 208.67.220.220
- HTTP GET to amazon.com
- Ping sdwan.measure.office.com (O365 beacon service) with 1500 bytes ICMP Echo
If everything succeeds, the circuit is functioning as we can send ICMP, resolve DNS, and browse to web sites. This is the configuration:
track 1 ip sla 1 reachability
!
track 2 ip sla 2 reachability
!
track 5 ip sla 5 reachability
!
track 20 ip sla 20 reachability
!
track 100 list boolean and
 object 1
 object 2
 object 5
 object 20
!
ip sla 1
 dns google.com name-server 8.8.8.8
 frequency 10
ip sla schedule 1 life forever start-time now
ip sla 2
 dns amazon.com name-server 208.67.220.220
 frequency 10
ip sla schedule 2 life forever start-time now
ip sla 5
 http secure get https://amazon.com name-server 8.8.8.8 source-interface GigabitEthernet1
ip sla schedule 5 life forever start-time now
ip sla 20
 icmp-echo sdwan.measure.office.com source-interface GigabitEthernet1
 request-data-size 1450
 frequency 10
ip sla schedule 20 life forever start-time now
Note that the request data size is set to 1450, which will generate 1500 byte ICMP Echo packets. I’m not sure how the math maths here but I verified it with a packet capture. When an ICMP Echo SLA is configured with a DNS name, the name is resolved at configuration time and the resulting IP address is what gets put into the running configuration. Let’s see if this all works:
Edge#show track 100
Track 100
  List boolean and
  Boolean AND is Up
    2 changes, last change 00:00:03
    object 1 Up
    object 2 Up
    object 5 Up
    object 20 Up
  Tracked by:
    Static IP Routing 0
Edge#show ip route track-table
 ip route 0.0.0.0 0.0.0.0 203.0.113.1 name PRIMARY track 100 state is [up]
Edge#show ip cef exact-route 203.0.113.2 8.8.8.8
203.0.113.2 -> 8.8.8.8 =>IP adj out of GigabitEthernet1, addr 203.0.113.1
Pretty cool! Using only static routes and IP SLA we now have a pretty good mechanism for verifying connectivity. A lot better than the simple ICMP Echos we started with.
Conclusion
You can make it as simple or complex as you want to using IP SLA. It all comes down to your requirements. Keep the following in mind:
- Consider what you want to measure
- How do you ensure SLA packets are flowing towards the correct ISP?
- How many things do you want to measure to ensure the path is good?
I hope this post has been informative and that you have learned some new IP SLA tricks as well as some good debugging and packet capture commands.
Thank you very much for this wonderful piece.
This was indeed a deep-dive. I’ve always thought the IP SLA feature was great but I’ve never seen it explored in all its glory as seen here. Looking forward to more educative and thorough breakdowns of technologies and their real-world applicability to solving real-life/production environment problems.
There are two potential gotchas when doing HTTP requests.
First, you will typically use DNS for hitting a highly available HTTP server like ‘google.com’. However, DNS will return both the IPv4 and IPv6 entries for that name. If your router is not configured for IPv6, or doesn’t have full IPv6 access, and the router tries to use the IPv6 address for IPSLA, the test will fail. In my experience, IPSLA always starts on IPv4, but it doesn’t stick and at some point switches to IPv6. (I’ve opened a TAC case on this, and there is no way to force IPSLA to IPv4 only.)
Second, most large sites are not HTTPS by default. The only HTTP response you will typically get is a 301 redirect response. Sometimes IPSLA is OK with this, but there seem to be some situations where it doesn’t work.
Oh, and be sure to set your response timers fairly high for determining a failure. Between the DNS lookup and HTTP response, it may take a few seconds if things are iffy on the servers.
This is great feedback, Greg. Thank you! There are a lot of caveats, for sure.
Oops, I meant ‘large sites ARE HTTPS by default’. If you could fix that when you post the comment, please.
This post is awesome. Thanks Daniel!
Thanks, Eric!
Great article! I have two questions/comments below, though.
1) Could we use PBR based approach for VRF? If yes, what would be commands for that, please?
2) When we do the fancy IP SLA, should we modify the PBR accordingly? If we don’t, the DNS and HTTPS would still be reachable via the second default route, wouldn’t they?