We are pleased to share with you all an interesting article contributed by Mina G. Nasry, who has solid experience in the data carrier/ISP field.
Mina G. Nasry, Network Engineer, Core Team at NOOR Data Network
Let's admit it: most of us, the networking folks, have been confused at some point when it comes to load-balancing, especially when the area of focus expands to a multi-vendor environment. The reason, I think, is that not all of its details are standardized in a way everyone can be familiar with. Besides, its behavior may differ from one network perspective to another, depending on where the traffic flow is passing and how it is processed.
Of course, most load-balancing mechanisms are standardized, but some of the not-so-easy-to-grasp tricks are vendor proprietary. In addition, many details change with different factors, such as the payload type and vendor implementation limits.
Today, through this long post, I'll try to scratch beneath the surface of two of the most well-known transport models [IP/MPLS] and how load-balancing works in different scenarios, using Cisco and Juniper implementations as an example of a multi-vendor environment.
In a nutshell
Generally, for load-balancing to work, an algorithm must be chosen that determines how traffic is distributed. Every algorithm has a set of inputs that must be supplied to produce a load-sharing decision. The most popular transport models that benefit from load-balancing are IPv4/IPv6 and MPLS, whether over segregated links or ether-channels. Forwarding in each model is treated according to its type and to where in the network the load-sharing decision is made, whether at the edge or in the core of a production network.
RIB/LIB installation into the data-plane
Because of the separation between the control and data planes, each vendor uses a proprietary technique to install active entries from the Routing/Label Information Base into the hardware ASICs. This is where forwarding decisions take place, and hence where load-sharing occurs.
Cisco devices use CEF for this translation. In CEF, each device has a number of hash buckets [depending on the hardware capability] for traffic forwarded to the installed FIB entries, and each bucket maps to one result of the hash combination. Say, for example, we have 16 buckets and 4 FIB entries; each FIB entry will then be assigned 4 buckets. These buckets are not actually used until incoming flows hash to different results.
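To make the bucket idea concrete, here is a minimal Python sketch of my own (an illustration of the concept, not Cisco's actual CEF code): 16 buckets are spread round-robin over 4 next hops, and a flow's hash result picks one bucket and therefore one exit interface.

```python
# Illustrative only: a toy model of hash buckets, not Cisco's actual CEF implementation.
NUM_BUCKETS = 16
next_hops = ["Gi0/0", "Gi0/1", "Gi0/2", "Gi0/3"]   # 4 ECMP FIB entries

# Assign buckets to next hops round-robin: 16 buckets / 4 paths = 4 buckets each.
buckets = [next_hops[i % len(next_hops)] for i in range(NUM_BUCKETS)]

def pick_next_hop(flow_hash: int) -> str:
    """Map a flow's hash result onto one bucket, and thus onto one exit interface."""
    return buckets[flow_hash % NUM_BUCKETS]
```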
Juniper devices use an alternative implementation to achieve the CEF functions, along with additional features. Routing/label information is installed into a reserved area of memory in the PFE [Packet Forwarding Engine] called JTREE memory, which is divided into two main segments: the first stores routing information and the second stores firewall filter information. Whenever a new RIB/LIB entry is installed into JTREE and a forwarding decision is required, a JTREE lookup is performed first.
Although the two vendors take completely different approaches to MPLS signaling, CEF and JTREE still work in a similar way.
IPv4 / IPv6 transport models
For an ordinary setup with only segregated links, achieving load-balancing for the IPv4/IPv6 transport models requires two or more ECMPs (Equal Cost Multiple Paths) to be available in the control-plane of a particular device, so that the forwarding-plane can install multiple FIB [Forwarding Information Base] entries, each with a particular exit interface for a given prefix. The load-balancing mechanism is chosen according to how the hardware sees and treats each FIB entry, and each mechanism is based on its own algorithm.
Per-packet load-sharing This mechanism simply sprays successive packets over the available FIB entries in a round-robin fashion, regardless of which flow they belong to. It gives the most even distribution possible, but since packets of the same flow can take different paths, it can cause out-of-order delivery and extra processing overhead on the receiving device.
Per-destination load-sharing This mechanism uses a different strategy to load-balance traffic flows over the available FIB entries. It delivers these flows to multiple exit interfaces more efficiently, without the overhead caused to the receiving device's CPU. It uses a simple hash algorithm that combines the source and destination IP addresses of each incoming packet. The result of this combination reflects on the variance of the load-sharing outcome: the more the source/destination addresses vary from one flow to another, the closer to even the load-sharing results will be, and vice versa. Load-balancing with this mechanism is not as evenly distributed as with the per-packet mechanism, but it fixes that mechanism's drawbacks.
Per-flow load-sharing This mechanism is even more efficient, but it introduces more inputs that can affect a device's load-sharing decision over multiple FIB entries. Two inputs are added to the hashing equation: the source and destination ports of every incoming traffic flow [source-address XOR destination-address XOR source-port XOR destination-port]. As with its predecessor, the more these inputs vary from one traffic flow to another, the closer to even the load-sharing results will be. The two additional inputs generally yield better load-balancing results than the per-destination mechanism, and they also mean that each traffic flow will stick to a specific FIB entry.
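As a rough illustration of the difference between the two hashes, here is a Python sketch of my own (real ASICs use vendor-specific hash functions): the per-destination variant hashes only the address pair, while the per-flow variant XORs in the L4 ports as well.

```python
# Illustrative sketch of the two hash variants; real hardware uses vendor-specific hashes.
import ipaddress

NUM_PATHS = 4  # number of ECMP FIB entries

def per_destination_hash(src_ip: str, dst_ip: str) -> int:
    """Hash on source/destination addresses only."""
    s = int(ipaddress.ip_address(src_ip))
    d = int(ipaddress.ip_address(dst_ip))
    return (s ^ d) % NUM_PATHS

def per_flow_hash(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    """Hash on addresses plus L4 ports, as in the XOR equation above."""
    s = int(ipaddress.ip_address(src_ip))
    d = int(ipaddress.ip_address(dst_ip))
    return (s ^ d ^ src_port ^ dst_port) % NUM_PATHS

# Two flows between the same hosts always land on the same path per-destination,
# but can land on different paths per-flow thanks to the differing ports.
print(per_destination_hash("10.0.0.1", "10.0.0.2"))
print(per_flow_hash("10.0.0.1", "10.0.0.2", 40000, 80))
print(per_flow_hash("10.0.0.1", "10.0.0.2", 40001, 443))
```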
Per-prefix load-sharing This mechanism is only available on JUNOS devices, where it is the default load-sharing behavior for BGP traffic unless told otherwise. The way it works is simple: the prefixes received from a given peer are distributed over the available FIB entries. In other words, the more BGP prefixes you receive from a peer, the better the distribution you can achieve for the traffic directed to those prefixes.
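Conceptually, and again as a toy Python sketch rather than anything JUNOS actually runs, each prefix is pinned to one of the equal-cost next hops, so the spread improves as the prefix count grows.

```python
# Toy model of per-prefix load-sharing: each prefix is pinned to a single next hop.
next_hops = ["ge-0/0/0", "ge-0/0/1"]
prefixes = ["203.0.113.0/24", "198.51.100.0/24", "192.0.2.0/24"]

fib = {p: next_hops[hash(p) % len(next_hops)] for p in prefixes}
for prefix, nh in fib.items():
    print(prefix, "->", nh)
```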
IPv4 / IPv6 models load-balancing behavior over ether-channels Load-balancing on ether-channels is a bit trickier than on segregated links, and it is performed according to the traffic type. The default behavior for IP traffic on an ether-channel is to load-balance it between members using a suitable load-balancing algorithm that we can choose.
The main difference is that the distribution of the traffic load is not the same as on segregated links. Unlike the latter, we can only achieve near-to-perfect load-sharing rates if the ether-channel has a power-of-two (2, 4, 8, etc.) member count; otherwise, IP traffic will be unevenly load-shared between members. Per-destination and per-flow load-balancing mechanisms are still available in this case, keeping that member-count limitation in mind.
When it comes to L2 nodes, some Ethernet switches will let you load-balance incoming non-IP traffic flows (e.g. PPP) with a per-destination mechanism, using a hashing algorithm that combines the source/destination MAC addresses. This works because the L2 headers are the only relevant headers in this case. The hashing result is encoded into buckets as mentioned above and distributed across ether-channel members, again keeping the member-count limitation in mind.
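The member-count limitation is easy to see with a small Python sketch (my own illustration, with an assumed fixed bucket count): a power-of-two number of hash buckets simply cannot be split equally across a 3-member bundle.

```python
# Illustration of why non-power-of-two bundles load-share unevenly:
# a fixed power-of-two bucket count cannot be split equally across 3 members.
from collections import Counter

NUM_BUCKETS = 8

def bucket_share(member_count: int) -> Counter:
    """Count how many hash buckets each ether-channel member ends up owning."""
    return Counter(b % member_count for b in range(NUM_BUCKETS))

print(bucket_share(4))  # Counter({0: 2, 1: 2, 2: 2, 3: 2})  -> even split
print(bucket_share(3))  # Counter({0: 3, 1: 3, 2: 2})        -> member 2 gets less traffic
```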
MPLS transport model
MPLS traffic also benefits from load-balancing, whether at the edges or in the core. Bear in mind that forwarding MPLS traffic requires IGP recursion first, so that each entry can be installed in the forwarding plane. That means the more ECMPs are available for that IGP recursion, the more exit interfaces will be available for each MPLS LFIB entry in the forwarding-plane. The MPLS load-sharing decision is generally based on only one input: the bottom-most label value for a given FEC [Forwarding Equivalence Class].
In general, most of the load-balancing mechanisms mentioned above are also used for MPLS; it is the algorithm that changes. For each FEC, the bottom-most label of the stack is evaluated every time a load-sharing decision needs to be made. The more variation we have in the bottom-most label values, the more variation we get in the hashing results, and hence the closer to perfect the load-sharing rate we can achieve.
Some vendors, such as Cisco and Juniper, have refinements to that hashing algorithm, including a way to guess the MPLS payload type, whether it is IPv4/IPv6 or non-IP traffic. Since both the IPv4 and IPv6 headers begin with the "version" field, Cisco by default looks at the first four bits of the MPLS payload. If they are not "0x4" or "0x6", the MPLS payload is considered non-IP traffic, and the load-sharing decision is made using only the bottom-most label.
If the first four bits are "0x4" or "0x6", the MPLS payload is considered IPv4 or IPv6 respectively, and the hashing algorithm will also include the source and destination addresses combined with the bottom-most label value. This results in more variance in the hashing equation, which translates into load-balancing results closer to perfect. On other implementations, such as Juniper, you have to manually choose which labels are included in the hash algorithm; for example, you may choose only the bottom-most label and the one directly above it. You also have to manually choose to include the MPLS payload's source/destination addresses in the hashing equation. Otherwise, all incoming MPLS traffic on Juniper boxes will be load-shared using only the bottom-most label by default.
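The payload-guessing heuristic boils down to inspecting the first nibble after the label stack. Here is a hedged Python sketch of that logic (an illustration of the idea, not vendor code):

```python
# Sketch of the first-nibble heuristic used to guess the MPLS payload type.
def classify_mpls_payload(payload: bytes) -> str:
    """Look at the first four bits after the label stack to guess the payload type."""
    if not payload:
        return "non-ip"
    version = payload[0] >> 4          # first nibble of the payload
    if version == 0x4:
        return "ipv4"
    if version == 0x6:
        return "ipv6"
    return "non-ip"

def hash_inputs(bottom_label: int, payload: bytes) -> tuple:
    """Bottom label only for non-IP payloads; add addresses when the payload looks like IP."""
    kind = classify_mpls_payload(payload)
    if kind == "ipv4":
        src, dst = payload[12:16], payload[16:20]   # IPv4 header address offsets
        return (bottom_label, src, dst)
    if kind == "ipv6":
        src, dst = payload[8:24], payload[24:40]    # IPv6 header address offsets
        return (bottom_label, src, dst)
    return (bottom_label,)
```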
Bear in mind also that some vendors have implementation limits for that hashing algorithm. For example, Cisco skips the load-sharing decision if the total number of labels on a given stack exceeds 4.
L2VPNs & VPLS load-sharing dilemma in the core One of the well-known issues with forwarding L2VPN & VPLS traffic is that, unlike regular and L3VPN MPLS traffic, there is no load-sharing support on P/PE LSRs by default. Focusing on P routers, the reason is that when a pseudowire/site-id is signaled, the bottom-most label, in this case the AC [Attachment Circuit] label, will always be the same on each P/PE along the two [or more] LSPs for that pseudowire/site-id. So when the hashing equation computes its result, it is always a combination of the same label count and values, which obviously leads to the same hashing result; consequently, the pseudowire/site traffic will be forwarded over one link by default, even if two or more ECMPs exist.
Considering the Cisco & Juniper implementations, as mentioned, an additional input is processed in the hashing equation: the source/destination addresses, if the MPLS payload has been identified as IPv4 or IPv6 [Interworking-IP]. But since L2VPN pseudowires always represent a point-to-point network between their ends, the hashing result will not change either. The workaround that makes the load-sharing decision change is the FAT (Flow-Aware Transport) label.
The FAT label is an LDP/BGP enhancement that allows an additional, variable label value to be signaled along with the regular AC label for each pseudowire/site-id. The variation in this label value comes from an additional hashing algorithm run at the ingress PE. The new FAT label sits at the bottom of the stack, which is why all PEs must support this feature: they need to distinguish the FAT label being signaled from the actual AC label used for a certain pseudowire/site-id.
This new algorithm categorizes traffic flows based on source/destination addresses, ports, and MAC addresses. The ingress PE can then assign a distinct flow label to each identified traffic flow, which in turn allows load-sharing for L2VPN pseudowires & VPLS sites, since a varying label value now exists in the stack.
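A rough Python sketch of the idea (illustrative only; the label range and hash are my own assumptions, and real flow-label allocation happens in the PE's forwarding hardware): the ingress PE derives a per-flow label and pushes it below the AC label, so the core hash finally sees something that varies.

```python
# Illustrative FAT-label sketch: the ingress PE derives a per-flow label and
# pushes it below the pseudowire (AC) label, giving core LSRs a varying hash input.
import zlib

FLOW_LABEL_BASE = 100000   # hypothetical local label range for flow labels

def flow_label(src_mac: str, dst_mac: str, src_ip: str, dst_ip: str,
               src_port: int, dst_port: int) -> int:
    """Derive a flow label from the L2/L3/L4 fields that identify the flow."""
    key = f"{src_mac}{dst_mac}{src_ip}{dst_ip}{src_port}{dst_port}".encode()
    return FLOW_LABEL_BASE + (zlib.crc32(key) % 1000)

def label_stack(transport_label: int, ac_label: int, flow: int) -> list:
    """The FAT label goes at the bottom of the stack, below the AC label."""
    return [transport_label, ac_label, flow]

print(label_stack(24001, 16, flow_label("aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02",
                                        "10.0.0.1", "10.0.0.2", 40000, 80)))
```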
Regarding the Cisco & Juniper implementations, one problem still remains with L2VPN pseudowires. As mentioned, incoming traffic is inspected so that the first four bits of the MPLS payload identify its type. Frankly, there are many cases nowadays where an MPLS payload may start with "0x4" or "0x6" while it is not IPv4/IPv6 traffic (for example, an Ethernet frame whose destination MAC address happens to begin with 4 or 6). The workaround is to include a control word in the pseudowire encapsulation, since its first four bits are mandated to be "0x0". That avoids any confusion when an LSR tries to identify the incoming MPLS payload.
MPLS model load-balancing behavior over ether-channels For L3 ether-channels, the hashing equation for load-sharing MPLS traffic will always include the bottom-most label; additionally, some vendors such as Cisco include MPLS payload header information in the equation by default as a refinement. Some L2 nodes, such as Cisco Ethernet switches, also allow an "MPLS or IP" algorithm for MPLS load-balancing over ether-channels by default, which works exactly the same way as on segregated links.
However, it is important to remember that near-to-perfect load-balancing on ether-channels is only possible if the member count is a power of two.
Load-balancing RSVP-TE traffic over ether-channels In an ordinary scenario with only segregated links, the load-balancing decision is skipped for RSVP-TE traffic, since the path options are limited to either explicitly configuring a manual path to the LSP termination or letting CSPF dynamically choose a path according to the available IGP constraints toward that termination; there is always exactly one calculated path at a time for RSVP-TE traffic. However, load-balancing RSVP-TE traffic over multiple ether-channel members has proved possible, but its behavior is not predictable unless the RSVP tunnel payload is already known. For further refinement, you have to set a load-balancing algorithm that suits the setup and includes your choice of the available inputs, such as source/destination MAC address, IP address, and port. The default Cisco behavior is to combine the bottom-most label value with the source/destination IP addresses, while the default Juniper behavior is to use only the bottom-most label.
This means that for a given RSVP tunnel whose payload is pure IP, the Cisco hashing equation by default combines the bottom-most label with the source/destination IP addresses of the MPLS payload's IP header. On the Juniper implementation, you have to add the IP payload to the hash-key for load-sharing to take place, as the RSVP label will always be the same. Load-balancing is only efficient in this case if the source/destination IP addresses vary, so that the hashing results vary as well.
As for an L3VPN payload within an RSVP tunnel, both the Cisco and Juniper implementations are fine by default, since the bottom-most label here varies most of the time. It is still a good idea to add at least the IP header information to the hash-key to refine the hashing results.
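To see why the payload has to be brought into the hash for a plain-IP RSVP tunnel, consider this small Python sketch (my own illustration, with made-up label and address values): with a constant transport label as the only input, every packet hashes to the same member, while adding the payload addresses restores variance.

```python
# Sketch: a constant RSVP transport label alone cannot spread traffic;
# only by hashing payload fields do different flows map to different members.
MEMBERS = 2
RSVP_LABEL = 24005                      # same label for every packet on this LSP

flows = [("10.0.1.1", "10.0.2.1"), ("10.0.1.2", "10.0.2.2"), ("10.0.1.3", "10.0.2.3")]

label_only   = {hash((RSVP_LABEL,)) % MEMBERS for _ in flows}
with_payload = {hash((RSVP_LABEL, src, dst)) % MEMBERS for src, dst in flows}

print(label_only)     # one member only: a single hash result for the whole LSP
print(with_payload)   # typically both members once the addresses are included
```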
Regarding an L2VPN payload within an RSVP tunnel, load-sharing obviously will not take place, for the same reason it does not take place on segregated links: none of the inputs actually varies.
Entropy Label Newer Cisco & Juniper implementations now support this feature. When the Entropy Label [EL] is used, no deep inspection of the MPLS payload header needs to be performed at each MPLS hop, as is the case with regular MPLS forwarding. This saves a significant amount of LSR processing power: with the Entropy label, instead of considering only the bottom-most label, a device inspects and counts all the labels and combines their values into the hashing equation. This should significantly improve the load-sharing performance of any MPLS application.
Entropy label values are assigned from the same range as regular MPLS labels, so there is a potential for overlap. To solve this, an Entropy Label Indicator [ELI] is pushed right before the Entropy label itself, so it can be distinguished easily. The ELI sits immediately below the LDP/RSVP transport LSP label, and the EL is usually popped at the egress LSR.
Similar to the FAT label, the Entropy label is mainly signaled by the LER via BGP/LDP, with each label value characterizing a flow. This gives Entropy-label-capable LSRs a varying load-sharing result, as the decision is now made after considering the whole label stack, not just the bottom-most label.
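Here is a hedged Python sketch of the resulting stack and hash (the entropy-label derivation is my own toy assumption; only the ELI value and the position below the transport label follow the description above): the ingress LER inserts the ELI/EL pair, and transit LSRs can then hash on the labels alone without looking at the payload.

```python
# Illustrative layout of an entropy-label stack and the transit hash it enables.
ELI = 7                                 # reserved Entropy Label Indicator value

def push_entropy(transport_labels: list, app_labels: list, flow_key: tuple) -> list:
    """Ingress LER inserts ELI + EL right below the transport label(s) (toy derivation)."""
    el = 100000 + (hash(flow_key) % 1000)   # hypothetical entropy-label value
    return transport_labels + [ELI, el] + app_labels

def transit_hash(stack: list, members: int) -> int:
    """A transit LSR can hash on the whole label stack; no payload inspection needed."""
    return hash(tuple(stack)) % members

stack = push_entropy([24001], [16], ("10.0.0.1", "10.0.0.2", 40000, 80))
print(stack, "->", transit_hash(stack, 4))
```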
Although this feature is MPLS-application agnostic, its only drawback is that all of the LSRs must support it in order to benefit from its presence.
At this point, I should say that I'm finished. Thank you if you've read this far, and I hope I've made something useful out of such roughly padded, no-pictures-included content.
Please feel free to share your thoughts!