I have done many network migrations over the years. Nowadays it’s a rarer event, but this weekend we migrated some core switches with very little downtime. What are some of the things that you should do to maximize the odds of a successful migration?
If your migration succeeded without planning, that doesn’t mean you are smart, just lucky. Every migration requires planning. What steps are involved in the migration? How do you validate each step? Who needs to be involved in the migration? Who needs to validate services when the migration is done? What are the criteria for a successful migration? How much time do you need to perform the migration? At what point do you roll back? What are the steps involved in rolling back?
A migration plan can have varying levels of detail. I’ve worked with some very critical networks where we have had to describe each and every step in detail including every command that is involved in the migration. This takes a lot of time but you can’t cut corners when you are working with networks that can affect people’s health and lives.
Prepare as much as you can. This involves things like racking the new gear, making sure it powers on, and applying the configuration to the devices. You don’t want to go into a change and discover that the devices don’t power on, or that the commands you were going to use have the incorrect syntax, or worse, are not recognized by the device at all. You also don’t want to discover that the SFPs you are using are not compatible with the device, or with the cabling where you have the gear. Have the device as connected and ready as is possible without it serving any traffic. Which leads us to the next point.
Connect the device
If you can, connect the device in advance. As mentioned previously, you don’t want to be in a time-pressured migration where you can’t get the interfaces to link up. There are many things that can prevent an interface from coming up, such as cabling, light levels, dirty connectors, incompatible settings with other devices, misconfigured LAG settings and many other things. Knowing that the interfaces are up will bring you some peace of mind and make the migration a lot faster than it otherwise would be. Just make sure you don’t attract any traffic to the device before you intend to.
Pre migration checks
There is nothing worse than being in a call where someone says “Application X is down and we have to roll back unless it comes up again”, blaming the network, and then it turns out that the application was already down before the migration started. Define what applications/systems you want to check and have them checked before the migration. You don’t want to have a system down and not know whether it is related to the migration or not. From a networking perspective, tools like pyATS can help in this step by capturing the state beforehand so that you can compare it to the state afterwards. You will want to know things like:
- What interfaces were up before?
- Roughly how much traffic is there across different interfaces?
- What routes are in the routing table?
- How many MAC addresses do you see?
- How many ARP entries?
- What routing adjacencies are up?
These are not exact measures, but they should give you clues about how the migration is going. If you saw thousands of MAC addresses before but only a few now, then something might not be right.
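To make the before/after comparison concrete, here is a minimal sketch in plain Python. It does not use pyATS; the `Snapshot` class, the `compare` function and the 20% tolerance are all my own assumptions for illustration, and how you actually collect the numbers from your devices is up to you.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical container for state collected from one device."""
    interfaces_up: set[str] = field(default_factory=set)
    adjacencies: set[str] = field(default_factory=set)
    route_count: int = 0
    mac_count: int = 0
    arp_count: int = 0


def compare(before: Snapshot, after: Snapshot, tolerance: float = 0.2) -> list[str]:
    """Return human-readable findings; an empty list means nothing stood out."""
    findings = []
    for name in sorted(before.interfaces_up - after.interfaces_up):
        findings.append(f"interface {name} was up before but is not up now")
    for name in sorted(before.adjacencies - after.adjacencies):
        findings.append(f"adjacency {name} was up before but is missing now")
    for label in ("route_count", "mac_count", "arp_count"):
        old, new = getattr(before, label), getattr(after, label)
        # Counts drift naturally, so only flag a drop beyond the tolerance.
        if old > 0 and new < old * (1 - tolerance):
            findings.append(f"{label} dropped from {old} to {new}")
    return findings


# Made-up example values, not real measurements.
before = Snapshot({"Eth1/1", "Eth1/2"}, {"10.0.0.2"}, 1200, 3500, 900)
after = Snapshot({"Eth1/1"}, {"10.0.0.2"}, 1190, 12, 880)
for finding in compare(before, after):
    print(finding)
```

The tolerance is there because these counts are clues rather than exact facts: a handful of missing ARP entries is normal, while a drop from thousands of MAC addresses to a dozen is worth a closer look.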
Access to making calls
This may sound stupid, but in some facilities you have no cell service. How do you make calls? Is there internet access in the facility? You may need it to set up a call, to do research during the migration, or to chat with people. Is the internet service also available during the migration? Or does it rely upon the infrastructure that you are migrating? Try to consider these scenarios in advance so you don’t end up in a situation where things are not going well and you can’t get hold of people due to no cell service or internet access.
Have the right people on the call or on standby
You want to have the right number of people in the call when you perform the migration. Too few is not good, but too many is not good either. That will just slow you down and perhaps confuse things. Define which people are critical to the process and have them on the call, but also have others be reachable in case you need them later. Sometimes the migration comes to a halt because you don’t have the right people to verify services are up.
Consider authentication scenarios
Most organizations have some form of central authentication using TACACS or RADIUS. During a migration, those services may not be reachable. Did you configure a fallback account? Do you remember the password for it? Do you also have logins for the other systems that you need to access? Do you know their IP addresses? Maybe DNS is not reachable during the migration… Also, you may have MFA set up for some services. Maybe even for authenticating to devices. What do you do if the MFA sends you an SMS but you have no cell service? Are there other means of authenticating via MFA?
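One small thing you can do for the DNS scenario is to resolve the names you depend on before the change and keep the result somewhere you can read offline. The sketch below is only an illustration: `build_address_sheet` is a name I made up, the hostnames are placeholders, and it only looks up IPv4 addresses through the standard library’s `socket.gethostbyname`.

```python
import socket
from typing import Callable


def build_address_sheet(
    hostnames: list[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> dict[str, str]:
    """Map each hostname to an IPv4 address, or to 'UNRESOLVED' on failure."""
    sheet = {}
    for name in hostnames:
        try:
            sheet[name] = resolve(name)
        except OSError:
            # socket.gaierror (name lookup failure) is a subclass of OSError.
            sheet[name] = "UNRESOLVED"
    return sheet


if __name__ == "__main__":
    # Placeholder names; replace with the systems you need during the change.
    for host, address in build_address_sheet(["localhost"]).items():
        print(f"{host}\t{address}")
```

Run it while DNS still works and save or print the output. Anything that shows up as UNRESOLVED is worth sorting out before the migration rather than during it.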
Post migration checks
As mentioned previously, make sure that you have defined the success criteria for the migration. Validate that you are fulfilling them and that you are seeing what you are expecting. Bonus points if you use some form of automation to run the checks for you.
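As a rough idea of what that automation could look like, here is a minimal sketch where each success criterion is a named function that returns true or false. The names `run_checks` and `migration_succeeded` and the example criteria are assumptions of mine; the real checks would query your devices or applications.

```python
from typing import Callable

Check = Callable[[], bool]


def run_checks(checks: dict[str, Check]) -> dict[str, bool]:
    """Run every check; a check that raises counts as failed."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def migration_succeeded(results: dict[str, bool]) -> bool:
    """All criteria must pass, and there must be at least one criterion."""
    return bool(results) and all(results.values())


# Hypothetical criteria with hard-coded answers, for illustration only.
criteria = {
    "uplinks are up": lambda: True,
    "routing adjacencies are up": lambda: True,
    "application X responds": lambda: False,
}
results = run_checks(criteria)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
print("success" if migration_succeeded(results) else "criteria not met")
```

Writing the criteria down as code ahead of time also forces you to agree on them before the change window, which is when you want that discussion to happen.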
Performing a successful migration involves a lot of planning and a bit of luck as well. I wish you the best for your next migration!