I’m currently working on a design and needed to verify some failover behavior of the Cisco ASA firewall.
The ASA can run in active/active or active/standby mode, and most deployments I see run active/standby. In a failover pair the firewalls share an IP address and MAC address, much like HSRP or VRRP, but they also synchronize the state of TCP sessions, IPSec SAs, routes and so on. The secondary firewall gets its config from the primary firewall, so both firewalls are configured exactly the same.
To verify that the other firewall is reachable and to synchronize state, a failover link is used between the firewalls. The firewalls use keepalives to verify that the peer is still there. This works just like a routing protocol running over a link: you expect to see a hello from your neighbor, and if you miss 3 hellos, the other firewall is considered gone. These timers can be configured. In my tests I used a hello of 333 ms and a holdtime of 999 ms, which means that convergence should happen within one second.
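The relationship between the hello and the holdtime can be sketched in a few lines. This is a simplified model, not the ASA’s documented algorithm: it assumes the holdtime timer restarts on every received hello and that the peer dies somewhere between two hellos. The function names are made up for illustration.

```python
def holdtime_ms(hello_ms, missed_hellos=3):
    """Holdtime expressed as a number of missed hellos."""
    return hello_ms * missed_hellos


def detection_window_ms(hello_ms, missed_hellos=3):
    """Return (best_case, worst_case) detection delay in milliseconds.

    Assumed model: the timer restarts at the last received hello.
    A peer that dies right after sending a hello is detected after
    almost the full holdtime; one that dies just before its next
    hello is detected one hello interval sooner.
    """
    hold = holdtime_ms(hello_ms, missed_hellos)
    return hold - hello_ms, hold


print(holdtime_ms(333))          # 999
print(detection_window_ms(333))  # (666, 999)
```

Under this model, the timers I used should detect a dead peer in roughly 666 to 999 ms, which matches the expectation of sub-second convergence.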
The first scenario I tested was manually triggering a failover. This is normally done when you need to upgrade the firewalls. You can then perform an (almost) hitless upgrade: trigger a failover to the secondary firewall, upgrade the primary firewall, trigger the failover again, and then upgrade the secondary firewall. It can also be done if a firewall seems to be misbehaving or you suspect that it’s faulty.
When triggering a failover, the primary firewall will send a message to the secondary firewall basically saying: “Hey, I need to go away. It’s your time to go active”. Because this message is sent, we don’t need to rely on timers for the failover to take place. This makes the process very fast. My tests indicated that convergence was below 200 ms.
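One common way to estimate convergence is to send probes through the firewall at a fixed interval and look at the longest gap between replies. The sketch below shows only that idea, and it is not necessarily how the numbers in this post were produced. The function name is hypothetical and the sample timestamps are made up.

```python
def longest_gap_ms(reply_times_ms):
    """Return the largest gap between consecutive reply timestamps."""
    if len(reply_times_ms) < 2:
        raise ValueError("need at least two replies")
    ordered = sorted(reply_times_ms)
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


# Made-up sample: probes every 50 ms, replies to the probes
# sent at 250 ms and 300 ms never arrived.
sample = [0, 50, 100, 150, 200, 350, 400, 450]
print(longest_gap_ms(sample))  # 150
```

The gap is an upper bound on the outage, and the resolution is limited by the probe interval, so a sub-200 ms convergence needs a probe interval well below that.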
The second scenario I wanted to test was a simulated power failure, or a device going up in flames. To do this I did a reload on the active firewall. My thought here was that a reload wouldn’t send a message to the other firewall to take over, which it doesn’t. So far so good, but I ran into another issue. My tests showed a convergence of 3 seconds. Out of these 3 seconds, 1 second was used for the holdtime. This is expected: since the active firewall doesn’t message the secondary firewall to take over, we have to rely on keepalives. What I couldn’t figure out was where the 2 additional seconds were coming from.
In my scenario I’m running IPSec. I had a feeling that the extra 2 seconds had something to do with IPSec, that the IPSec SA was being torn down even though SAs should be replicated between the firewalls. I tried a test without IPSec and it converged in 1 second, which meant that the extra 2 seconds were indeed due to IPSec.
A colleague suggested to me (thanks Johan) that I should check the lifetime of the IPSec SA and for how long the tunnel had been established. I was able to verify that the tunnel was being torn down, both based on how long the tunnel had been alive and by debugging the other side where I saw a message like the one below.
DELETE_REASONIKEv2-PROTO-5: Delete Reason received with error code:IKEV2_DELETE_SG_REBOOT severity:INFORMATIONAL
This further led me to believe that my active firewall was somehow tearing down the IPSec tunnel. I then looked at the message it prints out when rebooting.
***
*** --- START GRACEFUL SHUTDOWN ---
Shutting down isakmp
Shutting down webvpn
Shutting down License Controller
Shutting down File system
Shutting down isakmp… That’s not so good when you have an IPSec tunnel that is active. Then it hit me that doing a reload is not a good way of simulating a power failure. When doing a reload, the firewall will attempt to shut down everything gracefully. As a part of this, the firewall shuts down processes such as ISAKMP before reloading. This meant that my IPSec tunnel was being torn down, and it took an extra 2 seconds for the secondary firewall to establish the IPSec tunnel again. Mystery solved!
Normally you would trigger a failover before doing a reload, so it’s not really an issue. I still think it would make sense for the firewall to check whether it is active before reloading and, if it is, tell the secondary firewall to take over.
I then ran another test where I simply shut down the node (ASAv), and this time I saw the expected result. The network converged within one second, because now my IPSec tunnel stayed up as the primary firewall did not have the chance to gracefully shut it down.
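The results from all the tests fit a simple additive model: either the peer is told to take over and no timer is involved, or the holdtime has to expire, plus the time to re-establish IPSec if the tunnel was torn down. The sketch below only restates my measurements with rounded numbers. The constants and the function name are illustrative and not anything the ASA exposes.

```python
HOLDTIME_MS = 999             # holdtime used in these tests
IPSEC_REESTABLISH_MS = 2000   # extra delay seen when the tunnel was torn down
MANUAL_FAILOVER_MS = 200      # upper bound seen for a manual failover


def expected_convergence_ms(peer_notified, ipsec_torn_down):
    """Rough upper bound on convergence for one failover scenario."""
    if peer_notified:
        # Manual failover: the peer is told to take over, no timer involved.
        return MANUAL_FAILOVER_MS
    total = HOLDTIME_MS  # the peer has to miss hellos until the holdtime expires
    if ipsec_torn_down:
        total += IPSEC_REESTABLISH_MS
    return total


# (peer_notified, ipsec_torn_down) for each test in this post
scenarios = {
    "manual failover": (True, False),
    "reload with IPSec": (False, True),
    "reload without IPSec": (False, False),
    "power-off with IPSec": (False, False),
}
for name, (notified, torn_down) in scenarios.items():
    print(f"{name}: ~{expected_convergence_ms(notified, torn_down)} ms")
```

The only row where IPSec adds to the total is the reload, because that is the only case where the active firewall gets to shut down ISAKMP before it disappears.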
Lesson learned: When you do this kind of test, make sure that you are testing what you think you are testing, or you may see unexpected results.
I hope there is something to learn from this for my readers as well 🙂