Internet Engineering Task Force D. Thaler INTERNET-DRAFT Microsoft Expires September 1999 C. Hopps Merit Network 14 April 1999 Multipath Issues in Unicast and Multicast Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. This document is an Internet Draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts. Internet Drafts are valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet Drafts as reference material or to cite them other than as a "work in progress". To view the list Internet-Draft Shadow Directories, see http://www.ietf.org/shadow.html. Copyright Notice Copyright (C) The Internet Society (1999). All Rights Reserved. 1. Introduction Various routing protocols, including OSPF [1] and ISIS, explicitly allow "Equal-Cost Multipath" routing. Some router implementations also allow equal-cost multipath usage with RIP and other routing protocols. Using equal-cost multipath means that if multiple equal-cost routes to the Expires September 1999 [Page 1] Draft Multipath Issues April 1999 same destination exist, they can be discovered and used to provide load balancing among redundant paths. The effect of multipath routing on a forwarder is that the forwarder potentially has several next-hops for any given destination and must use some method to choose which next-hop should be used for a given data packet. This memo summarizes current practices, problems, and solutions. 2. Concerns Several router implementations allow multipath forwarding. This is sometimes done naively via round-robin, where each packet matching a given destination route is forwarded using the subsequent next-hop, in a round-robin fashion. This does provide a form of load balancing, but there are several problems with approaches such as round-robin or random: Variable Path MTU Since each of the redundant paths may have a different MTU, this means that the overall path MTU can change on a packet-by-packet basis, negating the usefulness of path MTU discovery. Variable Bandwidth Since each of the redundant paths may have a different amount of bandwidth available, bandwidth may also change on a packet-by- packet basis. Rate-adaptive protocols such as TCP are designed to optimize their performance to adapt to the available bandwidth. Varying the bandwidth on a packet-by-packet basis causes problems with TCP's congestion control mechanisms, resulting in much lower throughputs. Variable Latencies Since each of the redundant paths may have a different latency involved, having packets take separate paths can cause packets to always arrive out of order, increasing delivery latency and buffering requirements. Debugging Common debugging utilities such as ping and traceroute are much Expires September 1999 [Page 2] Draft Multipath Issues April 1999 less reliable in the presence of multiple paths and may even present completely wrong results. In multicast routing, the problem with multiple paths is that multicast routing protocols prevent loops and duplicates by constructing a single tree to all receivers of the same group address. Multicast routing protocols deployed today (DVMRP, PIM-DM, PIM-SM) [2] construct shortest- path trees rooted at either the source, or another router known as a Core or Rendezvous Point. Hence, the way they ensure that duplicates will not arise is that a given tree must use only a single next-hop towards the root of the tree. 3. Requirements All of the problems outlined in the previous section arise when packets in the same unicast or multicast "flow" (or session) are split among multiple paths. The natural solution is therefore to ensure that packets for the same flow always use the same path. Two additional features are desirable: Minimal disruption When multipath is used, meaning that multiple routes contribute valid next-hops, the chances are higher of routes being added and deleted from consideration than when only the "best" route is used (in which case metric changes in alternate routes have no effect on traffic paths). Hence, it is desirable to minimize the number of active flows affected by the addition or deletion of another next- hop. Fast implementation The amount of additional computation required to forward a packet must be as small as possible. For example, when doing round-robin, this computation might consist of incrementing (modulo the number of next-hops) a next-hop index. 4. Solutions We now provide three possible methods for improving the performance of multipath and then discuss their applicability to unicast and multicast forwarding. Expires September 1999 [Page 3] Draft Multipath Issues April 1999 Modulo-N Hash To select a next-hop from the list of N next-hops, the router performs a modulo-N hash over the packet header fields that identify a flow. This has the advantage of being fast, at the expense of (N-1)/N of all flows changing paths whenever a next-hop is added or removed. Hash-Threshold The router first selects a key by performing a hash (e.g., modulo-K where K is large, or CRC16) over the packet header fields that identify the flow. The N next-hops have been assigned unique regions in the key space. By comparing the key against region boundaries the router can determine which region the key belongs to and thus which next-hop to use. This method has the advantage of only affecting flows near the region boundaries (or thresholds) when next-hops are added or removed. Hash-threshold's lookup can be done in software using a binary search yielding O(logN), or in hardware in parallel for O(1). When a next-hop is added or removed, between 1/4 and 1/2 of all flows change paths. An analysis of this method can be found in [3]. Highest Random Weight (HRW) The router uses a simple pseudo-random number function seeded with the packet header fields that identify a flow, as well as a next- hop identifier (address or index), to assign a weight to each of the N next-hops. The next-hop receiving the highest weight is chosen as the next-hop. This has the advantage of minimizing the number of flows affected by a next-hop addition or deletion (only 1/N of them), but is approximately N times as expensive as a modulo-N hash. An analysis of various deterministic weight functions can be found in [4]. The applicability of these three alternatives depends on (at least) two factors: whether the forwarder maintains per-flow state, and how precious CPU is to a multipath forwarder. If per-flow state is maintained in a multipath forwarder, then computation of the next-hop can be done by the router at state creation time. This entails no additional computations at packet forwarding time, since the next-hop is precomputed. In this case, any method can be used, including round-robin, random, modulo-N, hash-threshold or HRW. Hash functions such as modulo-N, hash-threshold and HRW are better if the forwarder state may be deleted for any reason during the lifetime of Expires September 1999 [Page 4] Draft Multipath Issues April 1999 a flow since subsequent next-hop computations by the router will always select the same path. This also improves the usefulness of debugging utilities such as traceroute. Finally, to maximize the stability of paths (and hence the usefulness of traceroute, etc.), the use of HRW is recommended over the other methods mentioned herein. If per-flow state is not maintained by the forwarder, then using multiple next-hops requires that the next-hop be calculated at packet arrival time. When CPU is more precious than stability of flow paths, hash-threshold is recommended over the other methods mentioned herein. 4.1. Unicast Forwarding Depending on the implementation, unicast forwarding may or may not keep per-flow state. We recommend that where forwarder implementations keep flow state, routers should use HRW at state creation time (and next-hop deletion time) to select the next-hop, and that forwarders without per- flow state use hash-threshold. 4.2. Multicast Forwarding Today's multicast forwarding engines use a cache of forwarding entries indexed by group (or group prefix) and source (or source prefix). This means that today's multicast forwarder's always keep per-flow state, although for some multicast routing protocols, the "flow" may be fairly coarse (e.g., traffic from all sources to the same destination). Since per-flow state is kept by the forwarder, it is recommended that the router always use HRW to select the next-hop. Routers using explicit-joining protocols such as PIM-SM [5] should thus use the multipath information when determining to which neighbor a join message should be sent. For example, when multiple next-hops exist for a given Rendezvous Point (RP) toward which a (*,G) Join should be sent, it is recommended that HRW be used to select the next-hop to use for each group. 5. Applicability The algorithms discussed above (except round-robin) all rely on some form of hash function. Equal flow distribution is achieved when the hash function is uniformly distributed. Since the commonly used hash functions only become uniformly distributed when the number of inputs is Expires September 1999 [Page 5] Draft Multipath Issues April 1999 relatively large, these algorithms are more applicable to routers used to route many flows, than in, for example, a small business setting. 6. Redundant Parallel Links A related problem occurs when multiple parallel links are used between the same pair of routers. A common solution is to bundle the two links together into a "super"-link when is then used for routing. For multicast forwarding, this results in the two links being reduced to a single next-hop (over the combined link) which can be used to prevent duplicates. When a unicast or multicast packet is queued to the combined link, some method, such as those discussed earlier, is still required to determine the physical link on which to transmit the packet. If the parallel links are identical, then most of the concerns discussed in this document are avoided with the combined link. The exception is packet reordering, which can still occur with round-robin, adversely affecting TCP. 7. Security Considerations This document discusses issues with various methods of choosing a next- hop from among multiple valid next-hops. As such, it does not directly impact the security of the Internet infrastructure or its applications. 8. References [1] Moy, J., "OSPF Version 2", RFC 2178, July 1997. [2] Maufer, T., "Deploying IP Multicast in the Enterprise", Prentice- Hall, 1998. [3] Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm",, draft-hopps-ecmp-algo-analysis-03.txt, April 1999. [4] Thaler, D., and C.V. Ravishankar, "Using Name-Based Mappings to Increase Hit Rates", IEEE/ACM Transactions on Networking, February 1998. [5] Estrin, D., Farinacci, D., Helmy, A., Thaler, D., Deering, S., Handley, M., Jacobson, V., Liu, C., Sharma, P., and L. Wei, "Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol Specification", RFC 2362, June 1998. Expires September 1999 [Page 6] Draft Multipath Issues April 1999 9. Authors' Addresses Dave Thaler Microsoft One Microsoft Way Redmond, WA 98052 Phone: +1 425 703 8835 EMail: dthaler@microsoft.com Christian E. Hopps Merit Network 4251 Plymouth Road, Suite C. Ann Arbor, MI 48105 Phone: +1 734 936 0291 EMail: chopps@merit.edu 10. Full Copyright Statement Copyright (C) The Internet Society (1999). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet languages other than English. The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns. This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR Expires September 1999 [Page 7] Draft Multipath Issues April 1999 FITNESS FOR A PARTICULAR PURPOSE." Expires September 1999 [Page 8]