For the last year I have been working a lot with IWAN, which was Cisco's SD-WAN implementation before the Viptela acquisition.
One of the important aspects of SD-WAN is being able to load balance traffic. Load balancing traffic is not trivial in all situations, though. Why not?
If you have a site with two MPLS circuits or two internet circuits and they both have the same amount of bandwidth, then things are simple. Or at least, relatively simple. Let's say that you have a site with two 100 Mbit/s internet circuits. This means that we can do equal-cost multipathing (ECMP). Whether a flow ends up on link A or link B doesn't matter; the flow has an equal chance of utilizing as much bandwidth as it needs on either link. Now, there are still some things we need to consider even in the case of ECMP.
The size of flows – Some flows are going to be much larger than others, such as transferring files through CIFS or other protocols, or downloading something from the internet, versus something like Citrix traffic, which generally consists of smaller packets and doesn't consume a lot of bandwidth.
The number of flows – When doing ECMP, some form of hashing algorithm is used to decide which link a flow gets placed on. This is normally done by looking at the source and destination IP addresses, and in some cases more entropy is added by looking at port numbers as well. If we have just a few flows, the proportion of flows placed on each link may not be very even (a small sketch of hash-based placement follows after this list).
To have ECMP load share well, meaning that the links have a similar level of utilization, we need enough flows to increase the probability of having a balance of large flows on each of the links. ECMP is thus fairly straightforward, but we may have to monitor link utilization and factor it in when deciding which link to put a flow on.
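To make the hashing idea concrete, here is a minimal Python sketch of hash-based flow placement. The link names and the flow 5-tuple are made up for illustration; this is not how any particular vendor implements it.

```python
import hashlib

# Two equal 100 Mbit/s circuits (hypothetical names).
LINKS = ["link_a", "link_b"]

def ecmp_link(src_ip, dst_ip, src_port, dst_port, proto):
    """Pin a flow to a link by hashing its 5-tuple."""
    key = f"{src_ip}-{dst_ip}-{src_port}-{dst_port}-{proto}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return LINKS[digest % len(LINKS)]

# Every packet of the same flow hashes to the same link, so there is no
# reordering, but with only a handful of flows the split can be very uneven.
print(ecmp_link("10.0.0.5", "192.0.2.10", 51515, 443, "tcp"))
```

Because the decision is purely a function of the 5-tuple, the balance only evens out statistically once there are enough flows of mixed sizes.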
Now imagine that we have another situation where link A is 100 Mbit/s and link B is 10 Mbit/s. This means that we have to do unequal-cost multipathing (UCMP). Previously this wasn't really supported in protocols other than EIGRP, although BGP has some nerd knobs to do something similar. Running UCMP wasn't common before SD-WAN, though, so I hadn't run into the situation I describe below.
When we don't have equal bandwidth links, things become much more challenging. Assigning a flow to a link is done at the start of the flow, for example when a TCP session is about to be established. At that point we don't know how large the flow is going to become. It could be a background flow such as SSH, or it could be a large file being downloaded. If the flow gets placed on the 10 Mbit/s link, then the flow can never grow larger than 10 Mbit/s. If the users had an active/standby setup before moving to SD-WAN, they could always use up to 100 Mbit/s. With UCMP there is a total of 110 Mbit/s available to use, but if a user ends up on the "wrong" link, that flow ends up being "punished" because it can't grow to the size it would want to.
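As a rough illustration of UCMP placement, the same hashing idea can be weighted by link bandwidth so that about ten out of eleven flows land on the 100 Mbit/s link. The names and bucket scheme are assumptions for the sketch, not a vendor implementation.

```python
import hashlib

# Hypothetical links: name -> bandwidth in Mbit/s.
LINKS = {"link_a_100m": 100, "link_b_10m": 10}

# Expand each link into hash buckets proportional to its bandwidth (10:1).
BUCKETS = [name for name, bw in LINKS.items() for _ in range(bw // 10)]

def ucmp_link(src_ip, dst_ip, src_port, dst_port, proto):
    """Pin a flow to a link, weighted by link bandwidth."""
    key = f"{src_ip}-{dst_ip}-{src_port}-{dst_port}-{proto}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return BUCKETS[digest % len(BUCKETS)]

# A flow that hashes to link_b_10m can never grow past 10 Mbit/s, even though
# the site has 110 Mbit/s in total.
```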
Even if we monitor the link utilization, we can never achieve a perfect balance since we can't predict the size of flows. One option to work around this could be to move flows once they grow past a certain size. The challenge then becomes how to move the flows without moving them too often, which would cause churn and possibly packet loss. Do you move the flow again once it decreases in size? UCMP is a lot more complex than ECMP.
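One way to picture that re-pinning idea is a size threshold plus a hold-down timer, so that a large flow gets moved off the slow link but not moved repeatedly. The threshold, timer, and link names below are made-up values for illustration only.

```python
import time

ELEPHANT_THRESHOLD_MBPS = 5   # assumed threshold for a "large" flow
HOLD_DOWN_SECONDS = 30        # minimum time between moves, to limit churn

class FlowState:
    def __init__(self, link):
        self.link = link
        self.last_moved = 0.0

def maybe_move_flow(flow, measured_mbps, now=None):
    """Move a large flow off the slow link, but not too often."""
    now = time.time() if now is None else now
    if (measured_mbps > ELEPHANT_THRESHOLD_MBPS
            and flow.link == "link_b_10m"
            and now - flow.last_moved > HOLD_DOWN_SECONDS):
        flow.link = "link_a_100m"
        flow.last_moved = now
```

Whether to move the flow back once it shrinks again is exactly the kind of policy question that makes UCMP harder than ECMP.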
Another option is to do per-packet load balancing, but this is much more complex and can lead to packet reordering. To make this work, some additional intelligence is needed so that the receiving router knows in which order the packets were sent and can deliver them in that order. There are some SD-WAN implementations today doing this, but be aware that it is a much more complex feature.
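A simple way to think about that extra intelligence is a sequence number added by the sending router and a reorder buffer at the receiving router. The sketch below only shows the reordering part and ignores loss and timeouts, which a real implementation would have to handle.

```python
from heapq import heappush, heappop

class Reorderer:
    """Buffer packets sprayed across links and release them in order."""

    def __init__(self):
        self.expected = 0
        self.buffer = []  # min-heap of (sequence number, payload)

    def receive(self, seq, payload):
        """Return any payloads that can now be delivered in order."""
        heappush(self.buffer, (seq, payload))
        deliverable = []
        while self.buffer and self.buffer[0][0] == self.expected:
            deliverable.append(heappop(self.buffer)[1])
            self.expected += 1
        return deliverable
```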
When considering SD-WAN from different vendors you should think about how all of this works. Don't be afraid to ask the difficult questions. The situation above can be somewhat alleviated by selecting which types of applications end up on which link. The point of this post is that even with more intelligent solutions you still have to make design choices and consider what your most important traffic is.
Great post. These seem to be obvious things, but they typically come to light after a solution has been deployed and people are screaming. This is what we need – battle the magic dust with operations experience.
Thanks, Pavel.
Indeed. They are obvious, but people still forget them. I've enjoyed reading your posts in the past. Thanks for reading.
Hi Daniel,
Do you really have a client that is looking to solve this particular problem outlined above, i.e. actively balance traffic over two links, one of which has 10x more available bandwidth than the other? Because in that case I would argue that this is either a poor network design or that client education is needed. I.e. it's not worth solving, imo, by trying to utilize the slower link with dynamic load balancing.
Now, you could ask "But why not?". Because, for all the reasons you outlined above, the solution might well turn out to be costlier than simply making both links equal and configuring some form of ECMP.
There are presentations out there that show how you can solve this with a more intelligent SDN controller, but this is still out of reach for most enterprises.
Cheers