<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- $Id$ -->
<!DOCTYPE rfc SYSTEM 'rfc2629.dtd' [
	  <!ENTITY rfc2119 PUBLIC ''
		   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml'>
	  ]>
<?rfc toc="yes"?>
<?rfc tocompact="no"?>
<?rfc tocdepth="6"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc strict="yes" ?>


<rfc category="std" docName="draft-mohanty-bess-weighted-hrw-02"
     ipr="trust200902" updates="">

  <front>
    <title abbrev="Weighted HRW and its Applications">
      Weighted HRW and Its Applications
    </title>

    <author fullname="Satya Ranjan Mohanty" initials="S R."
            surname="Mohanty">
      <organization>Cisco Systems, Inc.</organization>
      <address>
        <postal>
          <street>225 West Tasman Drive</street>
          <street/>
          <city>San Jose</city>
          <code>95134</code>
          <region>CA</region>
          <country>USA</country>
        </postal>
        <email>satyamoh@cisco.com</email>
      </address>
    </author>

    <author fullname="Mankamana Misra" initials="M."
            surname="Misra">
      <organization>Cisco Systems, Inc.</organization>
      <address>
        <postal>
          <street>170 W. Tasman Drive</street>
          <street/>
          <city>San Jose</city>
          <code>95134</code>
          <region>CA</region>
          <country>USA</country>
        </postal>
        <email>mankamis@cisco.com</email>
      </address>
    </author>
    <author fullname="Acee Lindem" initials="A."
            surname="Lindem">
      <organization>Cisco Systems, Inc.</organization>
      <address>
        <postal>
          <street>170 West Tasman Drive</street>
          <street/>
          <city>San Jose</city>
          <code>95134</code>
          <region>CA</region>
          <country>USA</country>
        </postal>
        <email>acee@cisco.com</email>
      </address>
    </author>
    <author fullname="Ali Sajassi" initials="A."
            surname="Sajassi">
      <organization>Cisco Systems, Inc.</organization>
      <address>
        <postal>
          <street>170 West Tasman Drive</street>
          <street/>
          <city>San Jose</city>
          <code>95134</code>
          <region>CA</region>
          <country>USA</country>
        </postal>
        <email>sajassi@cisco.com</email>
      </address>
    </author>
    <author fullname="John Drake" initials="J."
            surname="Drake">
      <organization>Juniper Networks, Inc.</organization>
      <address>
        <postal>
          <street>1194 N. Mathilda Drive</street>
          <street/>
          <city>Sunnyvale</city>
          <code>94089</code>
          <region>CA</region>
          <country>USA</country>
        </postal>
        <email>jdrake@juniper.net</email>
      </address>
    </author>

    <date month="December" day="08" year="2020"/>
    <area>Routing</area>

    <workgroup>BESS Working Group</workgroup>

<abstract>
<t>Rendezvous Hashing, also known as Highest Random Weight (HRW), has been used in many load-balancing applications where the central problem is how to map an object to a server such that the mapping is uniform and is minimally affected by changes in the server set. Recently, it has found use in Designated Forwarder (DF) election algorithms in the EVPN context and in load balancing using DMZ link bandwidth. This draft deals with the problem of achieving load balancing with minimal disruption when the servers have different weights. It provides an algorithm to do so and describes a few use cases where this technique applies.
</t>
</abstract>



  </front>


  <middle>
<section anchor="Req" title="Requirements Language">
	<t>The key words &quot;MUST&quot;, &quot;MUST NOT&quot;,
          &quot;REQUIRED&quot;, &quot;SHALL&quot;,
          &quot;SHALL NOT&quot;, &quot;SHOULD&quot;,
          &quot;SHOULD NOT&quot;, &quot;RECOMMENDED&quot;,
          &quot;MAY&quot;, and &quot;OPTIONAL&quot; in this document
          are to be interpreted as described in <xref target="RFC2119"/>.
	</t>

      </section> <!-- EO Req -->
    <section anchor="Intro" title="Introduction">
      <t>
	Given an object O, a set of servers, and a set of clients, a fundamental problem is how the clients can independently and unanimously agree, in a distributed framework, on the server to which O is assigned. This is the distributed hash table problem. The assignment should be "minimally disruptive", meaning that objects are remapped as little as possible whenever a server goes down, a new server comes up, or the object set changes. This problem is common in practice: in Internet load balancing and web caching, as described in the 'Akamai' paper <xref target='CHASH'/>, in databases <xref target='DYNAMODB'/>, and in networking.
       <figure anchor="figure_equal">
           <preamble></preamble>
           <artwork>
                            
               +----+     +----+      +----+       +-----+
               |    |     |    |      |    |       |     | 
               | S0 |     | S1 |      | S2 |       |  Sn |
               |    |     |    |      |    |       |     |
               +----+     +----+      +----+       +-----+
                  |          |          |             |
                  |          |          |             |
                  |          |          |             |
                  |          |          |             |
                  |          |          |             |
                  +----------+----------+-------------|   
                                           
                        O0, O1, O2 ... ON   
          A set of objects needs to be assigned to the set of servers.
          All the servers have the same capacity.
                                          

             Figure 1 The object to server assignment problem


           </artwork>
           <postamble></postamble>
       </figure>
</t>
<t><xref target='figure_equal'/> shows a set of servers, S0, ..., Sn, and a pool of objects, O0, ..., ON. The requirement is to assign each object Oi to a server Sj such that the servers are uniformly loaded.
In addition, when a server goes down or a new one is introduced, as few objects as possible should be reassigned.
</t>
<t>
There are two standard techniques to address this problem.
<list style="numbers"><t>Consistent Hashing</t><t>Rendezvous Hashing</t></list>
</t>

</section> <!-- EO Intro -->
<section anchor="HRW-A" title="HRW Introduction">
 <t>
Highest Random Weight (HRW), as defined in <xref target="HRW1999"/>, was originally proposed in the context of Internet caching and
proxy server load balancing. Given an object name and a set of servers, HRW maps a request to a server using the object-id (Oi) and the server-id (Sj) rather than the state of the servers. HRW computes a hash, Hash(Oi, Sj), from the server-id and the object-id. This hash value is treated as a score, and the servers are ordered by decreasing score. The server with the highest score is the primary server responsible for that object, and the server with the next-highest score is the backup server. HRW always maps a given object name to the same server within a given cluster; consequently, it can be used at client sites to achieve global consensus on object-to-server mappings. When the primary server goes down, the backup server takes over responsibility for the object.
</t>
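<t>
The ranking described above can be sketched as follows. This is a minimal illustration, not a normative procedure: the use of SHA-256 truncated to 64 bits as Hash(Oi, Sj), the helper names hrw_hash and hrw_rank, and the server and object names are all assumptions made for the example.
</t>
<figure>
<artwork><![CDATA[
```python
import hashlib


def hrw_hash(object_id: str, server_id: str) -> int:
    """Hash(Oi, Sj): a 64-bit score derived from both identifiers."""
    data = f"{object_id}|{server_id}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def hrw_rank(object_id: str, servers: list) -> list:
    """Servers in decreasing score order: [0] primary, [1] backup."""
    return sorted(servers,
                  key=lambda s: hrw_hash(object_id, s),
                  reverse=True)


servers = ["S0", "S1", "S2", "S3"]
objects = [f"O{i}" for i in range(1000)]

# Any client that knows the server set computes the same ranking.
before = {o: hrw_rank(o, servers) for o in objects}

# S2 goes down: every client recomputes over the survivors.
survivors = [s for s in servers if s != "S2"]
after = {o: hrw_rank(o, survivors)[0] for o in objects}

# Objects whose responsible server changed.
moved = [o for o in objects if after[o] != before[o][0]]
```
]]></artwork>
</figure>
<t>
In this sketch, the only objects that move are those whose primary server was S2, and each of them moves to the server that was previously its backup.
</t>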

<t>
Choosing a hash function that is statistically oblivious to the key distribution and produces a uniformly distributed output is an important aspect of the algorithm. The original HRW paper <xref target="HRW1999"/> provides pseudorandom functions based on the Unix utilities rand and srand, as well as easily constructed XOR functions, that perform well. Any good uniform hash function, such as the Jenkins hash, will also work. HRW is already used in multicast and in ECMP <xref target='RFC2991'/>, <xref target='RFC2992'/>.
</t>
      
    </section> <!-- EO HRW-A -->
<section anchor="HRW-U" title="HRW with weights">
<t>Servers of unequal capacity are also common in practice. However, this problem has not received as much attention as it deserves. In such a case, an obvious approach is to take the normalized weight factor into account, fj = wj/Sum(wk), where the sum runs over all servers, and to multiply Hash(Oi, Sj) by that factor, giving the score fj*Hash(Oi, Sj). The Cache Array Routing Protocol <xref target='CARP'/> used this method. However, there is a problem with this approach: a change in the weight of any one server changes the normalized weights of all servers. This necessitates recomputing all the weighted hash values. Therefore, this approach does not have the minimal disruption property of HRW.
This draft addresses weighted HRW with minimal disruption.
</t>
<t>Instead of re-normalizing the weights (in other words, scaling them relative to one another), the approach taken by <xref target='WHRW'/> is to transform the hash-based score before applying the weight. When a server is added, removed, or modified (its weight changes), only the score for that server changes. That server may win or lose some objects. Other servers remain unaffected: there is no needless transfer of objects between servers whose weights did not change. <xref target='WHRW'/> accomplishes this by defining the score function as:
<list style="numbers"><t>Score(Oi, Sj) = -wj/log(Hash(Oi, Sj)/Hmax), where wj is the weight of server Sj and Hmax is the maximum hash value.</t>
</list>
Because Hash(Oi, Sj)/Hmax lies between 0 and 1, its logarithm is negative, so the score is positive and scales linearly with wj. The author provides a mathematical proof of why this choice of score function works under very mild assumptions on the probability distribution of the hash function.
</t>
       <figure anchor="figure_unequal">
           <preamble></preamble>
           <artwork>
                            
               +----+     +----+      +----+       +----+
               |    |     |    |      |    |       |    | 
               | S0 |     | S1 |      | S2 |-------| Sn |
               | w0 |     | w1 |      | w2 |       | wn |
               +----+     +----+      +----+       +----+
                  |          |          |             |
                  |          |          |             |
                  |          |          |             |
                  |          |          |             |
                  |          |          |             |
                  +----------+----------+-------------|   
                                           
                        O1, O2 ... ON   
          A set of objects needs to be assigned to the set of servers.
          Each server is now associated with a weight.
                                          

             Figure 2 The weighted object to server assignment problem


           </artwork>
           <postamble></postamble>
       </figure>
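<t>
The weighted score function given above can be sketched as follows. This is an illustration under stated assumptions rather than the method of <xref target='WHRW'/> verbatim: SHA-256 is used as the hash, the hash is reduced to 52 bits and offset by one half so that the ratio Hash/Hmax falls strictly inside (0, 1) and the logarithm is finite and non-zero, and the names unit_hash, score, and assign are hypothetical.
</t>
<figure>
<artwork><![CDATA[
```python
import hashlib
import math


def unit_hash(object_id: str, server_id: str) -> float:
    """Map Hash(Oi, Sj) to a float strictly inside (0, 1)."""
    data = f"{object_id}|{server_id}".encode()
    h64 = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
    return ((h64 >> 12) + 0.5) / 2**52


def score(object_id: str, server_id: str, weight: float) -> float:
    """Score = -weight / log(Hash / Hmax) for one server."""
    return -weight / math.log(unit_hash(object_id, server_id))


def assign(object_id: str, weights: dict) -> str:
    """Return the server with the highest weighted score."""
    return max(weights,
               key=lambda s: score(object_id, s, weights[s]))


objects = [f"O{i}" for i in range(10000)]
weights = {"S0": 1.0, "S1": 2.0, "S2": 1.0}
old = {o: assign(o, weights) for o in objects}
share_s1 = sum(1 for o in objects if old[o] == "S1") / len(objects)

# Only S1's weight changes; the scores of S0 and S2 are untouched.
new_weights = dict(weights, S1=3.0)
new = {o: assign(o, new_weights) for o in objects}
moved = [o for o in objects if new[o] != old[o]]
```
]]></artwork>
</figure>
<t>
With weights 1:2:1, S1 is expected to receive roughly half of the objects. After S1's weight is raised, every object that moves is one that S1 gains; no object is transferred between S0 and S2.
</t>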
</section> <!-- EO HRW-U -->
     <section anchor="HRW-CH"
             title="HRW and Consistent Hashing">
      <t>
	HRW is not the only algorithm that
	addresses the object to server mapping problem with goals of
	fair load distribution, redundancy and fast access. There is
	another family of algorithms that also addresses this problem;
	these fall under the umbrella of the Consistent Hashing
	Algorithms <xref target='CHASH'/>. These will not be considered here.  </t>
    </section> <!-- EO HRW-CH -->
<section anchor="HRW-EVPN"
             title="Weighted HRW and its application to the EVPN DF Election">
 <t>The notion of and the need for a Designated Forwarder are described in <xref target='RFC7432'/>.
Consider a CE, either a host or a router, that is multi-homed directly to more than one PE in an EVPN instance on a given Ethernet segment.
One or more Ethernet Tags may be configured on the Ethernet segment. In this scenario, only one of the PEs, referred to as the Designated Forwarder (DF), is responsible for certain actions:
<list style="letters">
   <t>Sending multicast and broadcast traffic on a given Ethernet Tag on a particular Ethernet segment to the CE.</t>
   <t>Flooding unknown unicast traffic (i.e., traffic for which a PE does not know the destination MAC address) on a given Ethernet Tag on a particular Ethernet segment to the CE, if the environment requires flooding of unknown unicast traffic.</t>
</list></t>
       <figure anchor="figure_example">
           <preamble></preamble>
           <artwork>
                             +---------------+
                             |   IP/MPLS     |
                             |   CORE        |
               +----+ ES1 +----+           +----+
               | CE1|-----|    |-----------|    |____ES2
               +----+     | PE1|           | PE2|    \
                          |    |--------   +----+     \+----+
                          +----+        |    |         | CE2|
                             |          |  +----+     /+----+
                             |          |__|    |____/   |
                             |             | PE3|    ES2 /
                             |             +----+       /
                             |               |         /
                             +-------------+----+     /
                                           | PE4|____/ES2
                                           |    |
                                           +----+


                 Figure 3 Multi-homing Network of EVPN


           </artwork>
           <postamble></postamble>
       </figure>

       <t>  <xref target='figure_example'/> illustrates a case with two Ethernet Segments, ES1 and ES2.
       PE1 is attached to CE1 via Ethernet Segment ES1, whereas PE2, PE3, and PE4 are attached to CE2 via ES2; that is, PE2,
       PE3, and PE4 form a redundancy group. Since CE2 is multi-homed to different PEs on the same Ethernet Segment, it
       is necessary for PE2, PE3, and PE4 to agree on a DF to satisfy the above-mentioned requirements.
       </t>
<t>The use of HRW in EVPN DF Election is described in <xref target="I-D.ietf-bess-evpn-df-election-framework"/>. That draft explains how HRW DF Election performs better than the modulo DF Election algorithm in <xref target='RFC7432'/>. However, it implicitly assumes that all the PEs have the same capacity (equal weights).
</t>
<t>DMZ link bandwidth for load balancing flows across multiple EBGP egress points is described in <xref target="I-D.ietf-idr-link-bandwidth"/>. It has been extended to cumulative DMZ load balancing <xref target="I-D.mohanty-bess-ebgp-dmz"/> for an all-EBGP network in the data center. <xref target="I-D.ietf-bess-evpn-unequal-lb"/> describes the use of DMZ link bandwidth in EVPN DF Election. The argument is made there that, ideally, one should be able to change the link bandwidth on one or more of the multi-homed PEs rather than having to change it on all of them simultaneously. That draft describes the bandwidth increments to be taken into consideration and proposes an iterative way to assign the score function. The solution described in Section 4.3.2 of <xref target="I-D.ietf-bess-evpn-unequal-lb"/> is non-optimal and somewhat empirical. It does not have the minimal disruption property of HRW.
</t>
<t>In contrast to the procedures for weighted HRW in Section 4.3.2 of <xref target="I-D.ietf-bess-evpn-unequal-lb"/>, an optimal solution can be achieved by using the score function described in <xref target="HRW-U"/>, which obviates the need to take bandwidth increments into account. It is an order of magnitude faster, more efficient, and minimally disruptive.
</t>
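<t>
As an illustration of how the weighted score could drive DF election, the sketch below elects a DF and a backup DF per Ethernet Tag among PE2, PE3, and PE4 on ES2, using link bandwidth as the weight. It is a sketch under assumptions, not the procedure of any referenced draft: the key format (ESI and Ethernet Tag joined as a string), the SHA-256 hash, the bandwidth values, and the names score and elect are all hypothetical.
</t>
<figure>
<artwork><![CDATA[
```python
import hashlib
import math


def score(key: str, pe: str, bandwidth: float) -> float:
    """Weighted HRW score of one PE for one (ESI, Ethernet Tag)."""
    data = f"{key}|{pe}".encode()
    h64 = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
    u = ((h64 >> 12) + 0.5) / 2**52   # strictly inside (0, 1)
    return -bandwidth / math.log(u)


def elect(esi: str, tag: int, bandwidth: dict) -> tuple:
    """Return (DF, backup DF) for one Ethernet Tag on one segment."""
    key = f"{esi}:{tag}"
    ranked = sorted(bandwidth,
                    key=lambda pe: score(key, pe, bandwidth[pe]),
                    reverse=True)
    return ranked[0], ranked[1]


# Hypothetical link bandwidths (Gbit/s) of the PEs attached to ES2.
bw = {"PE2": 10.0, "PE3": 40.0, "PE4": 10.0}
tags = range(1, 501)
old = {t: elect("ES2", t, bw) for t in tags}

# PE3 loses part of its bandwidth; PE2 and PE4 are unchanged.
reduced = dict(bw, PE3=20.0)
new = {t: elect("ES2", t, reduced) for t in tags}

# Ethernet Tags whose DF changed.
moved = [t for t in tags if new[t][0] != old[t][0]]
```
]]></artwork>
</figure>
<t>
In this sketch, only Ethernet Tags for which PE3 was the DF can move, and each one that moves goes to its previous backup DF. The DF for tags served by PE2 or PE4 is unchanged, and no bandwidth increments or re-normalization are involved.
</t>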
</section> <!-- EO HRW-EVPN -->
<section anchor="HRW-RESILIENT"
             title="Weighted HRW and its application to Resilient Hashing">
<t>
With the exponential increase in the number of physical links used in data centers, the number of failed physical links can be expected to increase as well. In systems that employ static hashing to load balance flows across the members of port channels or Equal Cost Multipath (ECMP) groups, each flow is hashed to a link. When a link fails, all flows, including those that were mapped to the non-failed links, are rehashed across the remaining working links. This causes packet reordering in flows that were not mapped to the failed link. Similar rehashing, with packet reordering, also happens when a link is added to the port channel or ECMP group. Resilient hashing, which avoids remapping flows on unaffected links, is therefore very important.
</t>

<t>However, when the links are not of the same speed, resilient hashing for ECMP does not apply per se. In that case, the method explained in <xref target="HRW-U"/> can be used to achieve resilient hashing even in the Unequal Cost Multipath (UCMP) case or when member links have different bandwidths.</t>
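<t>
The difference between static hashing and weighted HRW on a link failure can be sketched as follows. The example is illustrative only: the modulo-based static_pick stands in for a generic static hashing scheme, and the link names, speeds, flow identifiers, SHA-256 hash, and helper names are all assumptions.
</t>
<figure>
<artwork><![CDATA[
```python
import hashlib
import math


def flow_hash(flow: str, salt: str = "") -> int:
    """64-bit hash of a flow identifier, optionally salted."""
    data = f"{flow}|{salt}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def static_pick(flow: str, links: list) -> str:
    """Static hashing: index into the current member list."""
    return links[flow_hash(flow) % len(links)]


def whrw_pick(flow: str, speed: dict) -> str:
    """Weighted HRW: highest -speed / log(u) over the members."""
    def score(link):
        u = ((flow_hash(flow, link) >> 12) + 0.5) / 2**52
        return -speed[link] / math.log(u)
    return max(speed, key=score)


# Hypothetical UCMP group with member speeds in Gbit/s.
speed = {"L0": 10.0, "L1": 25.0, "L2": 40.0, "L3": 100.0}
flows = [f"10.0.0.{i % 250}:{1024 + i}" for i in range(2000)]
up = {l: s for l, s in speed.items() if l != "L1"}   # L1 fails


def disrupted(pick_before, pick_after) -> int:
    """Flows that moved although they were not on the failed link."""
    return sum(1 for f in flows
               if pick_before(f) != "L1"
               and pick_before(f) != pick_after(f))


static_moves = disrupted(lambda f: static_pick(f, list(speed)),
                         lambda f: static_pick(f, list(up)))
whrw_moves = disrupted(lambda f: whrw_pick(f, speed),
                       lambda f: whrw_pick(f, up))
```
]]></artwork>
</figure>
<t>
In this sketch, weighted HRW leaves every flow on a surviving link in place, because removing L1 does not change the scores of the other links. The modulo scheme also remaps flows that were never on L1.
</t>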

</section> <!-- EO HRW-RESILIENT -->
<section anchor="HRW-MULTICAST"
             title="Weighted HRW and its application to Multicast DR Election">
<t><xref target="I-D.mankamana-pim-bdr"/> proposes a mechanism to elect a backup DR on a shared LAN. A backup DR on the LAN is useful for faster convergence. When the access bandwidth differs among the PIM routers and the DR/backup DR functionality for the various (S,G) flows is to be load balanced among them, a technique similar to <xref target="HRW-U"/> can be applied. The details of this problem are outside the scope of this draft and are being worked on separately.
</t>
</section> <!-- EO HRW-MULTICAST -->
   <section anchor="proto"
               title="Protocol Considerations">
   <t>A request needs to be made to IANA to register the weighted HRW EVPN DF Election Algorithm in the DF Alg field of the DF Election Extended Community defined in <xref target="I-D.ietf-bess-evpn-df-election-framework"/>.
   </t>
   </section>


      <section anchor="Oper"
               title="Operational Considerations">
      <t>
        TBD.
      </t>
      </section> <!-- EO Oper -->



    <section anchor="Security"
             title="Security Considerations">
      <t>
	This document raises no new security issues for EVPN.
      </t>

    </section> <!-- EO Security -->



    <section anchor="Acknowledgements"
             title="Acknowledgements">
      <t>
	The authors would like to thank Shyam Sethuram and Peter Psenak for useful discussions related to this draft.
      </t>

    </section> <!-- Ack -->


  </middle>
  <back>
    <references title="Normative References">

      <reference anchor="HRW1999">
	<front>
          <title>
	   Using Name-Based Mappings to Increase Hit Rates
	  </title> 
          <author initials='D.' surname='Thaler' fullname='David Thaler'>
	    <organization >Univ. of Michigan, Ann Arbor</organization>
	  </author>
          <author initials='C.' surname='Ravishankar' fullname='Chinya Ravishankar'>
	    <organization >Univ. of Michigan, Ann Arbor</organization>
	  </author>
          <date year='1998' month='February' /> 
          <area>General</area> 
          <keyword>keyword</keyword> 
	</front>
        <seriesInfo name="IEEE/ACM Transactions on Networking" value="Volume 6 Issue 1"/>
      </reference>
     <reference anchor="WHRW">
	<front>
          <title>
	   New Hashing Algorithms for Data Storage
	  </title> 
          <author initials='J.' surname='Resch' fullname='Jason Resch'>
	    <organization >Cleversafe</organization>
	  </author>
          <date year='2015' month='November' /> 
          <area>General</area> 
          <keyword>keyword</keyword> 
	</front>
        <seriesInfo name="Storage Developer Conference" value="18"/>
      </reference>
      <?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml"?>
      <?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.4271.xml"?>
      <?rfc include="http://xml.resource.org/public/rfc/bibxml3/reference.I-D.draft-ietf-idr-extcomm-iana-02.xml"?>
      <?rfc include="http://xml.resource.org/public/rfc/bibxml3/reference.I-D.draft-ietf-bess-evpn-df-election-framework-09.xml"?>
      <?rfc include="http://xml.resource.org/public/rfc/bibxml3/reference.I-D.draft-ietf-idr-link-bandwidth-07.xml"?>
      <?rfc include="http://xml.resource.org/public/rfc/bibxml3/reference.I-D.draft-ietf-bess-evpn-unequal-lb-00.xml"?> 
      <?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.7432.xml"?>  
    </references>

    <references title="Informative References">
      <reference anchor="CLRS2009">
	<front>
          <title>
	   Introduction to Algorithms (3rd ed.)
	  </title> 
          <author initials='T.' surname='Cormen' fullname='Thomas Cormen'>
	  </author>
          <author initials='C.' surname='Leiserson' fullname='Charles Leiserson'>
	  </author>
          <author initials='R.' surname='Rivest' fullname='Ronald Rivest'>
	  </author>          
          <author initials='C.' surname='Stein' fullname='Clifford Stein'>
	  </author>
          <date year='2009' month='February' /> 
          <area>General</area> 
          <keyword>keyword</keyword> 
	</front>
        <seriesInfo name="MIT Press and McGraw-Hill" value="ISBN 0-262-03384-4."/>
      </reference>
     <reference anchor="CHASH">
	<front>
          <title>
	   Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
	  </title> 
          <author initials='D.' surname='Karger' fullname='David Karger'>
	    <organization >Massachusetts Institute of Technology, Cambridge</organization>
	  </author>
          <author initials='E.' surname='Lehman' fullname='Eric Lehman'>
	    <organization >Massachusetts Institute of Technology, Cambridge</organization>
	  </author>
          <author initials='T.' surname='Leighton' fullname='Tom Leighton'>
	    <organization >Massachusetts Institute of Technology, Cambridge</organization>
	  </author>
          <author initials='R.' surname='Panigrahy' fullname='Rina Panigrahy'>
	    <organization >Massachusetts Institute of Technology, Cambridge</organization>
	  </author>
          <author initials='M.' surname='Levine' fullname='Matthew Levine'>
	    <organization >Massachusetts Institute of Technology, Cambridge</organization>
	  </author>
          <author initials='D.' surname='Lewin' fullname='Daniel Lewin'>
	    <organization >Massachusetts Institute of Technology, Cambridge</organization>
	  </author>

          <date year='1997' month='May' /> 
          <area>General</area> 
          <keyword>keyword</keyword> 
	</front>
        <seriesInfo name="ACM Symposium on Theory of Computing" value="ACM Press New York"/>
      </reference>
     <reference anchor="CARP">
	<front>
          <title>
	   Cache Array Routing Protocol v1.1
	  </title> 
          <author initials='V.' surname='Valloppillil' fullname='Vinod Valloppillil'>
	    <organization >Microsoft Corporation</organization>
	  </author>
          <author initials='K.' surname='Ross' fullname='Keith Ross'>
	    <organization >Univ. of Pennsylvania</organization>
	  </author>
          <date year='1998' month='February' /> 
          <area>General</area> 
          <keyword>keyword</keyword> 
	</front>
        <seriesInfo name="IEEE/ACM Transactions in networking" value="Volume 6 Issue 1"/>
      </reference>

     <reference anchor="DYNAMODB">
	<front>
          <title>
	   Dynamo: Amazon’s Highly Available Key-value Store
	  </title> 
          <author initials='G.' surname='DeCandia' fullname='Giuseppe DeCandia'>
          <organization >Amazon</organization>
          </author>
          <author initials='D.' surname='Hastorun' fullname='Deniz Hastorun'>
          <organization >Amazon</organization>
          </author>
          <author initials='M.' surname='Jampani' fullname='Madan Jampani'>
          <organization >Amazon</organization>
          </author>
          <author initials='G.' surname='Kakulapati' fullname='Gunavardhan Kakulapati'>
          <organization >Amazon</organization>
          </author>
          <author initials='A.' surname='Lakshman' fullname='Avinash Lakshman'>
          <organization >Amazon</organization>
          </author>
          <author initials='A.' surname='Pilchin' fullname='Alex Pilchin'>
          <organization >Amazon</organization>
          </author>
          <author initials='S.' surname='Sivasubramanian' fullname='Swaminathan Sivasubramanian'>
          <organization >Amazon</organization>
          </author>
          <author initials='P.' surname='Vosshall' fullname='Peter Vosshall'>
          <organization >Amazon</organization>
          </author>
          <author initials='W.' surname='Vogels' fullname='Werner Vogels'>
	    <organization >Amazon</organization>
	  </author>
          <date year='2007' month='October' /> 
          <area>General</area> 
          <keyword>keyword</keyword> 
	</front>
        <seriesInfo name="SOSP" value="07"/>
      </reference>
      <?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.2991.xml"?>
      <?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.2992.xml"?>
      <?rfc include="http://xml.resource.org/public/rfc/bibxml3/reference.I-D.draft-mohanty-bess-ebgp-dmz-00.xml"?>
      <?rfc include="http://xml.resource.org/public/rfc/bibxml3/reference.I-D.draft-mankamana-pim-bdr-00.xml"?>
    </references>

  </back>
</rfc>
