<?xml version="1.0" encoding="US-ASCII"?>
<!-- Convert to HTML and Text with xml2rfc: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!--   <!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml"> -->
  <!ENTITY RFC6824 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6824.xml">
  <!ENTITY RFC0793 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.0793.xml">
  <!ENTITY RFC6181 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6181.xml">
  <!ENTITY RFC5681 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5681.xml">
  <!ENTITY RFC3465 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3465.xml">
  <!ENTITY RFC0791 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.0791.xml">
  <!ENTITY RFC7323 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7323.xml">
  <!ENTITY I-D.ietf-mptcp-multiaddressed SYSTEM "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.draft-ietf-mptcp-multiaddressed-04">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc strict="yes" ?>
<?rfc toc="yes"?>
<?rfc tocdepth="4"?>
<?rfc symrefs="yes"?>

<?rfc sortrefs="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>

<rfc category="exp" docName="draft-olteanu-mptcp-loadbalance-00" ipr="trust200902">
  <front>  
    <title abbrev="Layer 4 Loadbalancing for MPTCP">Layer 4 Loadbalancing for MPTCP</title>
    
    <author fullname="Vladimir Olteanu" initials="V." surname="Olteanu">
      <organization>University Politehnica of Bucharest</organization>
      <address>
        <postal>
          <street>Splaiul Independentei 313</street>
          <city>Bucharest</city>
          <code></code>
          <country>Romania</country>
        </postal>
        <email>vladimir.olteanu@cs.pub.ro</email>
      </address>
    </author>
        
    <author fullname="Costin Raiciu" initials="C." surname="Raiciu">
      <organization>University Politehnica of Bucharest</organization>
      <address>
        <postal>
          <street>Splaiul Independentei 313</street>
          <city>Bucharest</city>
          <code></code>
          <country>Romania</country>
        </postal>
        <email>costin.raiciu@cs.pub.ro</email>
      </address>
    </author>

    <date day="8" month="July" year="2016" />

    <area>General</area>
    
    <workgroup>Internet Engineering Task Force</workgroup>
    
    <keyword>tcp mptcp loadbalancing</keyword>
    
    <abstract>
<!-- 	     	123456789012345678901234567890123456789012345678901234567890123456789-->
	<t>
		Layer 4 loadbalancers are widely used in the deployment of large-scale
		web services. A large number of servers accept incoming connections from
		clients, while multiple loadbalancers make sure that traffic is spread evenly
		across the servers.
	</t>
	<t>
		Because it uses multiple subflows, Multipath TCP complicates the design
		of a scalable layer 4 loadbalancer that supports it.
		This document presents two ways in which MPTCP connections can be loadbalanced across
		a large pool of servers. Both approaches entail using a slightly modified server stack
		and work well with unmodified MPTCP clients.
	</t>
    </abstract>
    
  </front>

  <middle>
  
<!--  <section title="Requirements Language">
  <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
  "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
  document are to be interpreted as described in <xref target="RFC2119">RFC 2119</xref>.</t>
  </section>-->
  
  <section title="Introduction" anchor="sec_intro">
<!-- 	     	123456789012345678901234567890123456789012345678901234567890123456789-->
	<t>
		Layer 4 loadbalancing is widely used in datacenters. In order to ensure the smooth
		operation of a high-capacity web service, incoming requests from clients
		must be evenly spread across a large number of servers that service said requests.
	</t>
	
	<t>
		Datacenter operators place a number of loadbalancers between the border routers and
		the servers. The loadbalancers are tasked with ensuring that all packets belonging
		to a connection reach the same server. This is typically achieved via a hashing-based
		scheme: a hash function is applied to each packet's 5-tuple and a destination server is
		chosen based on the result.
	</t>
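	<t>
		The hashing-based scheme described above can be sketched as follows. This is
		only an illustration: the choice of SHA-1 as the hash function, the modulo
		mapping onto the server list and the name pick_server are assumptions made
		for the example, not part of any particular loadbalancer.
	</t>
	<figure>
		<artwork align="left"><![CDATA[
```python
import hashlib
import socket
import struct


def pick_server(src_ip, src_port, dst_ip, dst_port, proto, servers):
    """Map a 5-tuple to one entry of `servers` via a stable hash.

    Assumption: SHA-1 and a modulo mapping stand in for whatever
    hash a real loadbalancer would use.
    """
    packed = struct.pack(
        "!4sH4sHB",
        socket.inet_aton(src_ip), src_port,
        socket.inet_aton(dst_ip), dst_port,
        proto,
    )
    digest = hashlib.sha1(packed).digest()
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]
```
		]]></artwork>
	</figure>
	<t>
		Every packet of a given TCP connection carries the same 5-tuple and therefore
		maps to the same server, without the loadbalancer keeping any per-connection state.
	</t>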
	
	<t>
		Multipath TCP <xref target="NSDI-12" /> is an extension to TCP designed to take advantage
		of the multiple paths
		that may exist between two hosts by using an arbitrary number of subflows.
		Current loadbalancer designs do not work with Multipath TCP, and this has hampered
		the adoption of the protocol, especially on the server side.
	</t>
	
	<t>
		Loadbalancing MPTCP connections is not trivial because a connection's
		individual subflows look like independent TCP connections. There is no discernible relationship
		between their 5-tuples, and loadbalancing them based on a hash of the 5-tuple will most likely
		result in them reaching different servers.
		Furthermore, connection-identifying information can only be extracted from the initial 3-way handshake
		of each flow.
	</t>
	  
	<t>
		This document proposes two mutually exclusive solutions to these problems.
		They rely to varying degrees on getting the client to embed connection- or server-identifying
		information in the packets that it sends out. The loadbalancers can use this extra information statelessly.
		
		Both solutions require modifications only to the server stack and work well with
		existing MPTCP clients.
	</t>
      
  </section>
  
  <section title="Proposal 1" anchor="sec_port">
<!-- 	     	123456789012345678901234567890123456789012345678901234567890123456789-->
	<t>
		Our first proposal revolves around controlling the destination port
		that the client uses in all subflows aside from the initial one.
		It is possible for the server to advertise an additional port via the
		ADD_ADDR option <xref target="RFC6824" />. This informs the client
		that it can send an MP_JOIN to this new port and initiate a new subflow.
	</t>
	  
	<t>
		To take advantage of this,
		each server is assigned a unique 16-bit ID, which must
		be different from the port on which the service is being hosted (e.g. 80).
		As soon as a connection is initiated, the server sends an
		ADD_ADDR to the client advertising a new port equal to said ID.
		<!--Packets belonging to the subflows that use the new port can be
		treated statelessly by the loadbalancer.-->
	</t>
	
	<t>
		Packets that arrive at the loadbalancer are treated as follows:
		<list style="symbols">
			<t>
				Packets destined to the port that the service is being hosted on will be forwarded
				to a server based on a hash of the 5-tuple.
			</t>
			<t>
				Packets destined to any other port are forwarded to the server whose ID matches
				the destination port.
			</t>
		  </list>
	</t>
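	<t>
		A minimal sketch of these two forwarding rules is given below. The names
		SERVICE_PORT, servers_by_id and server_list are hypothetical, and CRC-32
		merely stands in for an unspecified 5-tuple hash.
	</t>
	<figure>
		<artwork align="left"><![CDATA[
```python
import zlib

SERVICE_PORT = 80  # port hosting the service (example value from the text)


def forward(five_tuple, servers_by_id, server_list):
    """Pick the server for a packet according to its destination port.

    five_tuple is (src_ip, src_port, dst_ip, dst_port, proto).
    servers_by_id maps each server's 16-bit ID to the server.
    """
    dst_port = five_tuple[3]
    if dst_port == SERVICE_PORT:
        # Initial subflow: plain 5-tuple hashing (CRC-32 is an assumption).
        key = repr(five_tuple).encode()
        return server_list[zlib.crc32(key) % len(server_list)]
    # Additional subflow: the destination port is the server's ID.
    # Returns None if no server has been assigned this ID.
    return servers_by_id.get(dst_port)
```
		]]></artwork>
	</figure>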
	
	<t>
		  This approach has two drawbacks:
		  <list style="symbols">
			<t>
				The client will most likely also try to initiate subflows using the server's original port.
				Because these subflows are loadbalanced based on a hash of their 5-tuple, they will almost certainly reach a different server and break.
				(Using REMOVE_ADDR to prevent the creation of these subflows would entail
				the destruction of the original subflow.)
			</t>
			<t>
				If the client is behind a firewall that restricts access to certain destination ports,
				it might not succeed in establishing any new subflows.
			</t>
		  </list>
		  
	</t>
      
  </section>
  
  <section title="Proposal 2" anchor="sec_ts">
<!-- 	     	123456789012345678901234567890123456789012345678901234567890123456789-->
	<t>
		Our second proposal is to loadbalance packets based on the server's token.
		
		The token's most significant 14 bits are treated as a hash value for the connection.
		They are embedded in all outgoing TCP timestamps, and subsequently echoed back by the client.
		Incoming packets that do not contain timestamps (such as FINs) are dealt with
		via redirection between the servers.
	</t>
	
	<section title="Connection Initiation" anchor="sec_mpcapable">
<!-- 		     	123456789012345678901234567890123456789012345678901234567890123456789-->
		<t>
			The client initiates an MPTCP connection by sending a SYN with the MP_CAPABLE option.
			Under normal operation, the server then picks a random 64-bit key for the connection,
			and uses it to compute its token.
		</t>
		<t>
			To forward the packet appropriately, the loadbalancer must know the token before
			deciding which server to send it to. To accomplish this, we move the key generation
			to the loadbalancer, which can then compute the connection's token from the key it generated.
		</t>
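		<t>
			As an illustration, the loadbalancer's handling of an MP_CAPABLE SYN might look
			like the sketch below. The token derivation (most significant 32 bits of the SHA-1
			hash of the key) follows <xref target="RFC6824" />; the modulo mapping from the
			14-bit hash value to a server and all function names are assumptions made for
			the example.
		</t>
		<figure>
			<artwork align="left"><![CDATA[
```python
import hashlib
import os


def token_from_key(key: bytes) -> int:
    """32-bit token: most significant 32 bits of SHA-1(key)."""
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big")


def connection_hash(token: int) -> int:
    """The token's 14 most significant bits."""
    return token >> 18


def handle_mp_capable_syn(servers):
    """Generate a key, derive the token and choose a server.

    Assumption: the 14-bit hash is mapped to a server by modulo.
    """
    key = os.urandom(8)  # random 64-bit key
    token = token_from_key(key)
    server = servers[connection_hash(token) % len(servers)]
    return key, token, server
```
			]]></artwork>
		</figure>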
		<t>
			The loadbalancer places the generated key, along with the IP of the server that would
			be responsible for the subflow under normal 5-tuple hashing (which we call the alternate
			server IP) in an IP option and forwards the SYN to the server.
		</t>
		<figure>
			<artwork align="left"><![CDATA[           

                            1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +---------------+---------------+---------------+---------------+
      |   Type = 96   |  Length = 16  |             Unused            |
      +---------------+---------------+---------------+---------------+
      |                                                               |
      +                          Server Key                           +
      |                                                               |
      +---------------+---------------+---------------+---------------+
      |                      Alternate Server IP                      |
      +---------------+---------------+---------------+---------------+
         
             Figure 1: IP Option Used for MP_CAPABLE packets

			]]></artwork>
<!-- 		<postamble>IP Option Used by the Loadbalancer</postamble> -->
		</figure>
		
		
		<t>
			The figure above depicts the IP option that is inserted into the MP_CAPABLE packet
			before it is sent to the server. We have chosen an IP option despite the fact
			that the data contained therein pertains to the transport layer, because TCP
			option space is very limited. IP option type 96 is currently classified as reserved
			<xref target="RFC0791" />.
		</t>
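		<t>
			The layout of this option can be expressed compactly with a packing routine.
			The sketch below is illustrative only; the function and constant names are
			hypothetical, and the Unused field is simply set to zero.
		</t>
		<figure>
			<artwork align="left"><![CDATA[
```python
import socket
import struct

OPT_MP_CAPABLE = 96      # option type
OPT_MP_CAPABLE_LEN = 16  # total option length in bytes
_LAYOUT = "!BBH8s4s"     # type, length, unused, server key, alternate IP


def pack_mp_capable_option(server_key: bytes, alt_server_ip: str) -> bytes:
    """Build the 16-byte option carried by forwarded MP_CAPABLE SYNs."""
    if len(server_key) != 8:
        raise ValueError("server key must be 8 bytes (64 bits)")
    return struct.pack(_LAYOUT, OPT_MP_CAPABLE, OPT_MP_CAPABLE_LEN, 0,
                       server_key, socket.inet_aton(alt_server_ip))


def unpack_mp_capable_option(option: bytes):
    """Return (server_key, alternate_server_ip) from a packed option."""
    opt_type, length, _unused, key, alt = struct.unpack(_LAYOUT, option)
    if opt_type != OPT_MP_CAPABLE or length != OPT_MP_CAPABLE_LEN:
        raise ValueError("not an MP_CAPABLE loadbalancer option")
    return key, socket.inet_ntoa(alt)
```
			]]></artwork>
		</figure>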
		
		<t>
			Upon receipt of the packet, the server uses the key provided to compute the token
			for the connection. If no connection with the same token exists, the server uses
			the key provided. Otherwise, it takes a brute-force approach: it randomly generates
			keys until one yields an unused token with the same 14 highest-order bits.
		</t>
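		<t>
			A minimal sketch of this key selection is shown below, assuming the server
			tracks the tokens of its live connections in a set (called tokens_in_use here,
			a hypothetical name). The token derivation follows <xref target="RFC6824" />.
		</t>
		<figure>
			<artwork align="left"><![CDATA[
```python
import hashlib
import os


def token_from_key(key: bytes) -> int:
    """32-bit token: most significant 32 bits of SHA-1(key)."""
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big")


def choose_key(proposed_key: bytes, tokens_in_use: set):
    """Return (key, token), keeping the token's 14 highest-order bits.

    The loadbalancer's key is used unless its token collides with an
    existing connection; in that case random keys are tried until one
    yields an unused token with the same 14 highest-order bits.
    """
    token = token_from_key(proposed_key)
    if token not in tokens_in_use:
        return proposed_key, token
    wanted = token >> 18
    while True:
        candidate = os.urandom(8)
        candidate_token = token_from_key(candidate)
        if (candidate_token >> 18 == wanted
                and candidate_token not in tokens_in_use):
            return candidate, candidate_token
```
			]]></artwork>
		</figure>
		<t>
			Since a random key matches a given 14-bit prefix with probability 1 in 2 ^ 14,
			the search needs roughly 16384 attempts on average when a collision does occur.
		</t>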
		<t>
			The use of the alternate server IP will be discussed in a later section.
		</t>
	
	</section>
	
	<section title="Handling MP_JOIN packets" anchor="sec_mpjoin">
		<t>
			Additional subflows are initiated by the client by sending MP_JOIN packets.
			These packets contain the server's token.
		</t>
		
		<t>
			Similarly to how MP_CAPABLE packets are treated, the loadbalancer uses an IP option
			to inform the server about which other server would be responsible for the subflow
			under normal 5-tuple hashing.
		</t>
		
		
		<figure>
			<artwork align="left"><![CDATA[           

                            1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +---------------+---------------+---------------+---------------+
      |   Type = 97   |   Length = 8  |             Unused            |
      +---------------+---------------+---------------+---------------+
      |                      Alternate Server IP                      |
      +---------------+---------------+---------------+---------------+
         
               Figure 2: IP Option Used for MP_JOIN packets

			]]></artwork>
		</figure>
		
		<t>
			IP option type 97 is also classified as reserved <xref target="RFC0791" />.
		</t>
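		<t>
			The handling of an MP_JOIN at the loadbalancer can be sketched as follows.
			The option layout mirrors Figure 2; the modulo mapping from the token's
			14 highest-order bits to a server and the function names are assumptions
			made for the example.
		</t>
		<figure>
			<artwork align="left"><![CDATA[
```python
import socket
import struct

OPT_MP_JOIN = 97     # option type
OPT_MP_JOIN_LEN = 8  # total option length in bytes


def pack_mp_join_option(alt_server_ip: str) -> bytes:
    """Build the 8-byte option: type, length, unused, alternate IP."""
    return struct.pack("!BBH4s", OPT_MP_JOIN, OPT_MP_JOIN_LEN, 0,
                       socket.inet_aton(alt_server_ip))


def route_mp_join(server_token: int, servers, alt_server_ip: str):
    """Choose the server from the token carried in the MP_JOIN.

    alt_server_ip is the server that plain 5-tuple hashing would
    have selected; it is passed along in the IP option.
    """
    server = servers[(server_token >> 18) % len(servers)]
    return server, pack_mp_join_option(alt_server_ip)
```
			]]></artwork>
		</figure>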
		
	</section>
	
	<section title="Embedding the token in the timestamp" anchor="sec_tsecr">
		<t>
			The TCP timestamp option <xref target="RFC7323" /> is present in most packets
			and consists of two fields: the TSval, which is set by the packet's sender,
			and the TSecr, which contains a timestamp recently received from the other end.
		</t>
		<t>
			Taking advantage of the fact that timestamps set by the server are echoed back
			by the client, the server shifts its timestamp clock left by 14 bits, and
			embeds the 14 highest-order bits of the token into the
			14 lowest-order bits of the TSval.
			When a packet with the ACK flag set and with the TS option present arrives at the
			loadbalancer, it is forwarded based on the 14 least significant bits of the TSecr field.
		</t>
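		<t>
			The bit manipulation involved is small enough to show in full. The sketch below
			is illustrative; the function names are hypothetical, and clock_ticks stands for
			the server's unshifted timestamp clock.
		</t>
		<figure>
			<artwork align="left"><![CDATA[
```python
HASH_BITS = 14
HASH_MASK = (1 << HASH_BITS) - 1


def make_tsval(clock_ticks: int, token: int) -> int:
    """Server side: shifted clock in the upper 18 bits,
    the token's 14 highest-order bits in the lower 14 bits."""
    conn_hash = token >> (32 - HASH_BITS)
    return ((clock_ticks << HASH_BITS) | conn_hash) & 0xFFFFFFFF


def hash_from_tsecr(tsecr: int) -> int:
    """Loadbalancer side: recover the 14-bit hash from an echoed TSval."""
    return tsecr & HASH_MASK


def clock_from_tsecr(tsecr: int) -> int:
    """Server side: recover the clock value from an echoed TSval."""
    return tsecr >> HASH_BITS
```
			]]></artwork>
		</figure>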
		
		<section title="Impact on PAWS" anchor="sec_paws">
		<t>
			Timestamps supplied by the server are used by the client for
			protection against wrapped sequence numbers (PAWS).
		</t>
		<t>
			We assume that the server uses a timestamp clock frequency of 1 tick per ms,
			which is the highest frequency recommended by <xref target="RFC7323" />.
			The recycling time of the timestamp clock is required
			to be greater than the Maximum Segment Lifetime, whose worst-case value is 255 seconds.
			Given that the clock ticks once every ms in increments of 2 ^ 14, the 32-bit
			timestamp wraps around after 2 ^ 18 ms, or roughly 262 s, which is within the bounds set by the standard.
		</t>
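		<t>
			The 262 s figure follows from the 18 bits that remain for the clock once 14 bits
			are given to the hash. The short calculation below reproduces it; the function
			name and parameters are hypothetical.
		</t>
		<figure>
			<artwork align="left"><![CDATA[
```python
TIMESTAMP_BITS = 32
MSL_SECONDS = 255  # worst-case Maximum Segment Lifetime


def wrap_time_seconds(hash_bits=14, ticks_per_second=1000):
    """Time until the shifted 32-bit timestamp wraps around."""
    clock_bits = TIMESTAMP_BITS - hash_bits  # bits left for the clock
    return (1 << clock_bits) / ticks_per_second


# With a 14-bit hash and 1 tick per ms: 2 ^ 18 ms = 262.144 s
assert wrap_time_seconds() > MSL_SECONDS
```
			]]></artwork>
		</figure>
		<t>
			Under the same assumptions, a 15-bit hash would leave only 17 bits for the clock
			and a wrap time of about 131 s, which would no longer exceed 255 s.
		</t>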
		<t>
			While the quickly increasing timestamp is benign to active subflows,
			PAWS will still cause segments to be dropped if the subflow in question has been idle for
			a period longer than the clock's recycling time.
			To solve this, the server periodically sends keepalive messages during idle periods.
		</t>
		</section>
		
<!--		<section title="Impact on RTT Measurements" anchor="sec_rtt">
		<t>
			Timestamps echoed by the client are used by the server to measure the RTT. 
		</t>
		</section>-->
	</section>
	
	<section title="Redirecting packets without timestamps" anchor="sec_redir">
		<t>
			Some packets (most notably FINs) do not contain timestamps or any other
			connection-identifying information.
			As such, they are forwarded to a server based on a hash of the 5-tuple.
		</t>
		<t>
			As seen in <xref target="sec_mpcapable" /> and <xref target="sec_mpjoin" />,
			whenever a new subflow is set up, the server responsible for it (A) also knows
			which other server (B) would receive the packets if 5-tuple hashing were used.
		</t>
		<t>
			A will use a simple peer-to-peer protocol to instruct B to set up a redirection rule for
			the 5-tuple in question. The redirection rule will be deleted by B either at A's request,
			after the subflow has finished, or after a timeout. We do not discuss the specifics of the
			protocol in this document.
		</t>
		<t>
			Redirection of a packet is performed using IP-in-IP encapsulation.
		</t>
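		<t>
			Since the peer-to-peer protocol is left unspecified, the sketch below only
			illustrates the state that B might keep: a table of redirection rules keyed by
			5-tuple, with explicit removal and a timeout. The class name, method names and the
			default timeout value are all assumptions made for the example.
		</t>
		<figure>
			<artwork align="left"><![CDATA[
```python
import time


class RedirectTable:
    """Redirection rules held by server B, keyed by 5-tuple."""

    def __init__(self, timeout_seconds=600.0, clock=time.monotonic):
        self._timeout = timeout_seconds  # assumed value, not from the text
        self._clock = clock
        self._rules = {}

    def install(self, five_tuple, owner_ip):
        """Called when A asks B to redirect this 5-tuple to A."""
        expires_at = self._clock() + self._timeout
        self._rules[five_tuple] = (owner_ip, expires_at)

    def remove(self, five_tuple):
        """Called when A asks B to delete the rule."""
        self._rules.pop(five_tuple, None)

    def lookup(self, five_tuple):
        """Return the IP to redirect to, or None if no live rule exists."""
        rule = self._rules.get(five_tuple)
        if rule is None:
            return None
        owner_ip, expires_at = rule
        if self._clock() >= expires_at:
            del self._rules[five_tuple]  # rule timed out
            return None
        return owner_ip
```
			]]></artwork>
		</figure>
		<t>
			A packet without a timestamp that reaches B would be looked up in this table and,
			on a hit, encapsulated and sent to the returned address.
		</t>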
	</section>
  </section>
  
  <section title="Conclusions" anchor="sec_conclusions">
<!-- 	     	123456789012345678901234567890123456789012345678901234567890123456789-->
	<t>
		The ability to perform layer 4 loadbalancing in a scalable manner is crucial
		for the adoption of Multipath TCP. This document explored two ways in which
		this can be accomplished.
		
		We argue that loadbalancing is feasible with the current version of MPTCP
		and that significant changes to the protocol for loadbalancing support are unnecessary.
	</t>
  </section>

  </middle>
  
  <back>
    <references title="Normative References">
<!--       &RFC2119; -->
	    &RFC6824;
	    &RFC0791;
	    &RFC7323;
    </references>
    
    <references title="Informative References">
<!--       <reference anchor="RFC5944"><front><title>IP Mobility Support for IPv4, Revised</title><author initials="C." surname="Perkins" fullname="C. Perkins"><organization/></author><date year="2010" month="November"/><abstract><t>This document specifies protocol enhancements that allow transparent routing of IP datagrams to mobile nodes in the Internet.  Each mobile node is always identified by its home address, regardless of its current point of attachment to the Internet.  While situated away from its home, a mobile node is also associated with a care-of address, which provides information about its current point of attachment to the Internet.  The protocol provides for registering the care-of address with a home agent.  The home agent sends datagrams destined for the mobile node through a tunnel to the care-of address.  After arriving at the end of the tunnel, each datagram is then delivered to the mobile node. [STANDARDS-TRACK]</t></abstract></front><seriesInfo name="RFC" value="5944"/><format type="TXT" octets="239935" target="http://www.rfc-editor.org/rfc/rfc5944.txt"/></reference> -->
      <reference anchor="NSDI-12" target="http://dl.acm.org/citation.cfm?id=2228298.2228338"><front><title>How hard can it be? designing and implementing a deployable multipath tcp</title><author fullname="Costin Raiciu" initials="C." surname="Raiciu" /><author fullname="Christoph Paasch" initials="C." surname="Paasch" /><author fullname="Sebastien Barre" initials="S." surname="Barre" /><author fullname="Alan Ford" initials="A." surname="Ford" /><author fullname="Michio Honda" initials="M." surname="Honda" /><author fullname="Fabien Duchene" initials="F." surname="Duchene" /><author fullname="Olivier Bonaventure" initials="O." surname="Bonaventure" /><author fullname="Mark Handley" initials="M." surname="Handley" /><date year="2012" /></front></reference>
<!--       <reference anchor="IMC-11" target="http://doi.acm.org/10.1145/2068816.2068834"><front><title>Is it still possible to extend tcp?</title><author fullname="Michio Honda" initials="M." surname="Honda" /><author fullname="Yoshifumi Nishida" initials="Y." surname="Nishida" /><author fullname="Costin Raiciu" initials="C." surname="Raiciu" /><author fullname="Adam Greenhalgh" initials="A." surname="Greenhalgh" /><author fullname="Mark Handley" initials="M." surname="Handley" /><author fullname="Hideyuki Tokuda" initials="H." surname="Tokuda" /><date year="2011" /><keyword>TCP</keyword><keyword> measurements</keyword><keyword> middleboxes</keyword><keyword> protocol design</keyword></front></reference> -->
    </references>
  </back>
</rfc>
