<?xml version="1.0" encoding="UTF-8"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
     which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
     There has to be one entity for each item to be referenced. 
     An alternate method (rfc include) is described in the references. -->

<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC5666 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5666.xml">
]>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<?rfc strict="yes" ?>
<?rfc toc="yes"?>
<?rfc tocdepth="3"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>
<rfc ipr="trust200902" 
     category="info"
     docName="draft-dnoveck-nfsv4-rpcrdma-rtissues-02">
  <front>
    <title abbrev="RPC/RDMA Round-trip Issues">
      Issues Related to RPC-over-RDMA Internode Round Trips
    </title>
    <author initials="D." surname="Noveck" fullname="David Noveck">
      <organization abbrev="HPE">
        Hewlett Packard Enterprise
      </organization>
      <address>
        <postal>
          <street>165 Dascomb Road</street> 
          <city>Andover</city>
          <region>MA</region>
          <code>01810</code>
          <country>USA</country>
        </postal>
        <phone>+1 781-572-8038</phone>
        <email>davenoveck@gmail.com</email>
      </address>
    </author>
    <date year="2017"/>

    <area>Transport</area>
    <workgroup>Network File System Version 4</workgroup>
    <abstract>
      <t>
        As currently designed and implemented, the RPC-over-RDMA 
        protocol requires use of multiple internode round trips to 
        process some common operations.  For example,
        NFS WRITE operations require use of three internode
        round trips.  This document looks at this issue and discusses
        what can and what should be done to address it, both within the
        context of an extensible version of RPC-over-RDMA and potentially 
        outside that framework.
      </t>
    </abstract>
  </front>
  <middle>
    <section title="Preliminaries" anchor="PRELIM">	
      <section title="Requirements Language" anchor="INTRO-req">
        <t>
          The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", 
          "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", 
          "MAY", and "OPTIONAL" in this document are to be interpreted 
          as described in <xref target="RFC2119"/>.  
        </t>
      </section>
      <section title="Introduction" anchor="PRELIM-intro">
        <t>
          When many common operations are performed using RPC-over-RDMA,
          additional internode round trips are required in order to take
          advantage of the performance benefits that RDMA functionality
          provides.
        </t>	
        <t>
          While the latencies involved are generally small, they are a cause
          for concern for two reasons.
        <list style="symbols">
          <t>
            With the ongoing improvement of persistent-memory 
            technologies, such internode latencies, which are fixed,
            can be expected to consume an increasing portion of the
            total latency of processing NFS requests using 
            RPC-over-RDMA.
          </t>
          <t>
            High-performance transfers using NFS may be needed outside
            of a machine-room environment.  As RPC-over-RDMA is used in
            networks of campus and metropolitan scale, the internode
            round-trip time of sixteen microseconds per mile becomes 
            an issue.
          </t>
        </list>
        </t>	
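        <t>
          The round-trip figure cited above can be checked with a short
          back-of-envelope sketch (a hypothetical illustration, assuming
          signal propagation in fiber at roughly two-thirds of the speed
          of light, about 200,000 km/s; the function name is ours):
        </t>
        <figure><artwork>
```python
# Round-trip internode latency per mile of fiber, in microseconds.
# Assumes propagation at about 200,000 km/s (two-thirds of c).
def round_trip_us_per_mile(propagation_km_per_s=200_000.0):
    km_per_mile = 1.609344
    one_way_seconds = km_per_mile / propagation_km_per_s
    return 2.0 * one_way_seconds * 1e6  # both directions, microseconds

# A 10-mile campus-scale link adds roughly 160 microseconds
# of propagation delay to every round trip.
campus_penalty_us = 10 * round_trip_us_per_mile()
```
        </artwork></figure>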
        <t>
          Given this background, round trips beyond the minimum necessary
          need to be justified by corresponding benefits.  If they are
          not, work needs to be done to eliminate those excess round trips.
        </t>	
        <t>
          We will look at the existing situation with regard
          to round-trip latency and make some suggestions as to
          how the issue might best be addressed.  We will consider
          things that could be done in the near
          future and also explore further possibilities that would
          require adoption of a longer-term approach.
        </t>	
      </section>
    </section>
    <section title="Review of the Current Situation" anchor="CUR">
      <section title="Troublesome Requests" anchor="CUR-trouble">
        <t>
          We will be looking at four sorts of situations:
        <list style="symbols">
          <t>
            An RPC operation involving Direct Data Placement of request 
            data (e.g., an NFSv3 WRITE or corresponding NFSv4 COMPOUND).  
          </t>
          <t>
            An RPC operation involving Direct Data Placement of response 
            data (e.g., an NFSv3 READ or corresponding NFSv4 COMPOUND).  
          </t>
          <t>
            An RPC operation where the request data is longer than the 
            inline buffer limit.
          </t>
          <t>
            An RPC operation where the response data is longer than the 
            inline buffer limit.
          </t>
        </list>
        </t>
        <t>
          These are all simple examples of situations in which explicit RDMA
          operations are used, either to effect Direct Data Placement or to
          respond to message size limits that derive from a limited 
          receive buffer
          size. 
        </t>
        <t>
          We will survey the resulting latency and overhead issues
          in an RPC-over-RDMA
          Version One environment in Sections 
          <xref target="CUR-wdetails" format="counter"/> and
          <xref target="CUR-rdetails" format="counter"/>
          below.  
        </t>
      </section>
      <section title="WRITE Request Processing Details" 
               anchor="CUR-wdetails">
        <t>
          We'll start with the case of a request involving direct placement 
          of request data.  In this case, an RDMA READ is used to transfer
          a DDP-eligible data item (e.g. the data to be written) from its
          location in requester memory to a location selected by the 
          responder.
        </t>
        <t>
          Processing proceeds as described below.  Although
          we are focused on internode latency, the time to perform
          a request also includes such things as interrupt latency, overhead
          involved in interacting with the RNIC, and the time for the server
          to execute the requested operation.
        <list style="symbols">
          <t>
            First, the memory to be accessed remotely is registered.  
            This is a local operation. 
          </t>
          <t>
            Once the registration has been done, 
            the initial send of the request
            can proceed.  Since this is in the
            context of connected operation, there is an internode 
            round trip involved.  However, the next step can proceed
            after the initial transmission is received by the responder.  
            As a result, only
            the responder-bound side of the transmission contributes to
            overall operation latency.
          </t>
          <t>
            The responder, after being notified of the receipt of the request,
            uses RDMA READ to fetch the bulk data.  This
            involves an internode round-trip latency.  After the fetch of the
            data, the responder
            needs to be notified of the completion of the explicit RDMA
            operation.
          </t>
          <t>
            The responder (after performing the requested operation) 
            sends the response.  Again,  as this is in the
            context of connected operation, there is an internode 
            round trip involved.  However, the next step can proceed
            after the initial transmission is received by the requester.
          </t>
          <t>
            The memory registered before the request was issued needs to be
            deregistered, before the request is considered complete and
            the sending process restarted.  When remote invalidation is
            not available, the requester, after being notified of the 
            receipt of the response, performs a local operation to
            deregister the memory in question.  Alternatively, the 
            responder will use Send With Invalidate and the requester's
            RNIC will effect the deregistration before notifying the
            requester of the receipt of the response.
          </t>
        </list>
        </t>
        <t>
          To summarize, if we exclude the actual server execution of the 
          request,  the latency consists of two internode round-trip
          latencies plus two responder-side interrupt latencies 
          plus one requester-side interrupt latency plus any necessary 
          registration/deregistration overhead.  This is in contrast
          to a request not using explicit RDMA operations, in which 
          there is a single internode round-trip latency and one 
          interrupt latency on each of the requester and the responder. 
        </t>
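        <t>
          The accounting above can be expressed as a small latency model
          (a sketch for illustration only; the numeric values are
          hypothetical placeholders, not measurements):
        </t>
        <figure><artwork>
```python
# Latency model for the WRITE-side accounting above.
# All values are illustrative placeholders, in microseconds.
RTT = 10.0    # one internode round-trip latency
INTR = 3.0    # one interrupt latency, on either node
REG = 1.0     # memory registration plus deregistration overhead

def write_request_latency(server_exec=0.0):
    # Two round trips (Send of the request, then RDMA READ), two
    # responder-side interrupts, one requester-side interrupt,
    # plus registration/deregistration overhead.
    return 2 * RTT + 2 * INTR + 1 * INTR + REG + server_exec

def inline_request_latency(server_exec=0.0):
    # Baseline: one round trip and one interrupt on each node.
    return 1 * RTT + 2 * INTR + server_exec
```
        </artwork></figure>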
        <t>
          The processing of the other sorts of requests mentioned in 
          <xref target="CUR-trouble" /> shows both similarities and
          differences: 
        <list style="symbols">
          <t>
            Handling of a long request is similar to the above.  The
            memory associated with a position-zero read chunk is registered,
            transferred using RDMA READ, and deregistered.  As a result, 
            we have
            the same overhead and latency issues noted in the case of
            direct data placement, without the corresponding benefits.
          </t>
          <t>
            The case of direct data placement of response data follows 
            a similar pattern.  The important difference is that 
            the transfer of the
            bulk data is performed using RDMA WRITE, rather than RDMA READ.
            However, because of the way that RDMA WRITE is effected over the
            wire, the latency consequences are different.  
            See <xref target="CUR-rdetails" /> for a detailed discussion.
          </t>
          <t>
            Handling of a long response is similar to the previous case.
          </t>
        </list>
        </t>
      </section>
      <section title="READ Request Processing Details" 
               anchor="CUR-rdetails">
        <t>
          We'll now discuss the case of a request involving direct placement 
          of response data.  In this case, an RDMA WRITE is used to transfer
          a DDP-eligible data item (e.g. the data being read) from its
          location in responder memory to a location previously selected 
          by the requester.
        </t>
        <t>
          Processing proceeds as described below.  Although
          we are focused on internode latency, the time to perform
          a request also includes such things as interrupt latency, overhead
          involved in interacting with the RNIC, and the time for the server
          to execute the requested operation.
        <list style="symbols">
          <t>
            First, the memory to be accessed remotely is registered.  
            This is a local operation. 
          </t>
          <t>
            Once the registration has been done, 
            the initial send of the request
            can proceed.  Since this is in the
            context of connected operation, there is an internode 
            round trip involved.  However, the next step can proceed
            after the initial transmission is received.  As a result, only
            the responder-bound side of the transmission contributes to
            overall operation latency.
          </t>
          <t>
            The responder, after being notified of the receipt of the request,
            proceeds to process the request until the data to be read 
            is available in its own memory, with its location determined and
            fixed.
            It then uses RDMA WRITE to transfer the bulk data to the
            location in requester memory selected previously.  This
            involves an internode latency, but there is no round trip 
            and thus no round-trip latency.
          </t>
          <t>
            The responder continues processing and sends the inline
            portion of the response.  Again,  as this is in the
            context of connected operation, there is an internode 
            round trip involved.  However, the next step can proceed
            immediately.  If the RDMA WRITE or the send of the inline
            portion of the response were to fail, the responder can
            be notified subsequently. 
          </t>
          <t>
            The requester, after being notified of the receipt of the response,
            can analyze it and can access the data written into its 
            memory. Deregistration of the memory originally registered before 
            the request was issued can be done using remote invalidation
            or can be done by the requester as a local 
            operation.
          </t>
        </list>
        </t>
        <t>
          To summarize, in this case the additional latency that we saw in
          <xref target="CUR-wdetails" /> does not arise.  Except
          for the additional overhead due to memory registration and
          invalidation, the situation is the same as for a request
          not using explicit RDMA operations, in which 
          there is a single internode round-trip latency and one 
          interrupt latency on each of the requester and the responder. 
        </t>
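        <t>
          A companion sketch for the read-side summary above (again with
          hypothetical placeholder values, in microseconds) shows that,
          apart from registration-related overhead, the read case matches
          the inline baseline:
        </t>
        <figure><artwork>
```python
# Latency model for the read-side case above; values are
# illustrative placeholders, in microseconds.
RTT = 10.0    # one internode round-trip latency
INTR = 3.0    # one interrupt latency, on either node
REG = 1.0     # registration plus (possibly remote) invalidation

def read_request_latency(server_exec=0.0):
    # The RDMA WRITE adds no round trip, so the total is one round
    # trip plus one interrupt on each node, plus the
    # registration-related overhead.
    return 1 * RTT + 2 * INTR + REG + server_exec
```
        </artwork></figure>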
      </section>
    </section>
 
    <section title="Near-term Work" anchor="NEAR">
      <t>
        We are going to consider how the latency and overhead issues discussed
        in <xref target="CUR"/> might be addressed in the context of
        an extensible version of RPC-over-RDMA, such as that proposed 
        in <xref target="rpcrdmav2"/>.
      </t>
      <t>
        In <xref target="NEAR-target"/>, we will establish a performance
        target for the troublesome requests, based on the performance of
        requests that do not involve long messages or direct data placement. 
      </t>
      <t>
        We will then consider how extensions 
        might be defined to bring latency and overhead for the requests 
        discussed in <xref target="CUR-trouble"/> into line with those
        for other requests.  There will be two specific classes of 
        requests to address:
      <list style="symbols">
        <t>
          Those that do not involve direct data placement will be addressed
          in <xref target="NEAR-cont"/>.  In this case, there are no 
          compensating benefits justifying the higher overhead and, in some
          cases, latency. 
        </t>
        <t>
          The more complicated case of requests that do involve direct data
          placement is discussed in <xref target="NEAR-sbddp"/>.  In this case,
          direct data placement could serve as a compensating benefit, and the 
          important question to be addressed is whether Direct Data Placement
          can be effected without the use of explicit RDMA operations.
        </t>
      </list>
      </t>
      <t>
        The optional features to deal with each of the classes of messages
        discussed above could be implemented separately.  However, in 
        the handling of RPCs with very large amounts of bulk data, the
        two features are synergistic.  This fact makes it desirable to 
        define the
        features as part of the same extension.  See Sections
        <xref target="NEAR-syn" format="counter"/> and
        <xref target="NEAR-sel" format="counter"/>
        for details.
      </t>
      <section title="Target Performance" anchor="NEAR-target"> 
        <t>
          As our target, we will look at the latency and overhead 
          associated with other sorts of RPC requests, i.e. those that 
          do not use DDP and that have request and response messages 
          that fit within the receive buffer limit.
        </t>
        <t>
          Processing proceeds as follows: 
        <list style="symbols">
          <t>
            The initial send of the request is done.  Since this is in the
            context of connected operation, there is an internode 
            round-trip involved.  However, the next step can proceed
            after the initial transmission is received.  As a result, only
            the responder-bound side of the transmission contributes to
            overall operation latency.
          </t>
          <t>
            The responder, after being notified of the receipt of the request,
            performs the requested operation and sends the reply.
            As in the case of the request, there is an internode round trip
            involved. However, the request can be considered complete upon
            receipt of the requester-bound transmission.  The 
            responder-bound acknowledgment does not contribute to request
            latency.
          </t>
        </list>
        </t>
        <t>
          In this case, there is only a single internode round-trip latency 
          necessary to effect the RPC.  Total request latency includes 
          this round-trip
          latency plus interrupt latency on the requester and responder, plus
          the time for the responder to actually perform the requested 
          operation.
        </t>
        <t>
          Thus the delta between the operations discussed in 
          <xref target="CUR"/> and our baseline consists of two portions:
          one applies to all the requests we are concerned with,
          while the second applies only to requests that involve
          use of RDMA READ, as discussed in <xref target="CUR-wdetails" />.
          The latter category consists of: 
        <list style="symbols">
          <t>
            One additional internode round-trip latency.
          </t>
          <t>
            One additional instance of responder-side interrupt latency.
          </t>
        </list>
        </t> 
        <t>
          The additional overhead necessary to do memory registration and
          deregistration applies to all requests using explicit RDMA 
          operations.  The costs will vary with implementation
          characteristics.  As a result, in some implementations it may be
          desirable to replace use of RDMA Write with send-based alternatives,
          while in others use of RDMA Write may be preferable.   
        </t>
          
      </section>
      <section title="Message Continuation" anchor="NEAR-cont">
        <t>
          Using multiple RPC-over-RDMA transmissions, in sequence, to
          send a single RPC message avoids the additional latency 
          and overhead associated with the use of explicit RDMA operations 
          to transfer position-zero read chunks.  In the case of reply chunks,
          only overhead is reduced.
        </t>
        <t>
          Although transfer of a single request or reply in N transmissions
          will involve N+1 internode latencies, overall request 
          latency is not increased
          by the requirement that operations involving multiple
          nodes be serialized, since these transmissions are generally
          pipelined.
        </t>
        <t>
          As an illustration, let's consider the case of a request involving 
          a response consisting of two RPC-over-RDMA transmissions.  Even
          though each of these transmissions is acknowledged, that 
          acknowledgement does not contribute to request latency.  The second
          transmission can be received by the requester and acted upon without
          waiting for either acknowledgment.
        </t>
        <t>
          This situation would require multiple receive-side interrupts but
          it is unlikely to result in extended interrupt latency.  With 1K
          sends (Version One), the second receive will complete about 200
          nanoseconds after the first, assuming a 40 Gb/s transmission rate.
          Given likely interrupt latencies, the first interrupt routine
          would be able 
          to note that the completion of the second receive had already 
          occurred.
        </t>
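        <t>
          The timing claim above can be checked directly: the gap between
          the two receive completions is dominated by the wire-serialization
          time of the second send (the function below is an illustrative
          sketch, not part of the protocol):
        </t>
        <figure><artwork>
```python
# Wire-serialization time of one send, in nanoseconds.  With 1K
# sends at 40 Gb/s, the second receive completes about 200 ns
# after the first.
def serialization_ns(payload_bytes=1024, link_gbps=40.0):
    bits = payload_bytes * 8
    # bits divided by gigabits-per-second yields nanoseconds
    return bits / link_gbps
```
        </artwork></figure>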
      </section>
      <section title="Send-based DDP" anchor="NEAR-sbddp">
        <t>
          In order to effect proper placement of request or reply
          data within the context of individual RPC-over-RDMA transmissions, 
          receive buffers
          need to be structured to accommodate this function.
        </t>
        <t>
          To illustrate the considerations that could lead clients and servers 
          to choose particular buffer structures, we will use, as examples,
          the cases of NFS READs and WRITEs of 8K data blocks (or the 
          corresponding NFSv4 COMPOUNDs).
        </t>
        <t>
          In such cases, the client and server need to have the DDP-eligible
          bulk data placed in appropriately aligned 8K buffer segments.  
          Rather than
          being transferred in separate transmissions using explicit RDMA
          operations, a message can be sent so that bulk data is received
          into an appropriate buffer segment.  In this case, it would be
          excised from the XDR payload stream, just as it is in the case of
          existing DDP facilities.
        </t>
        <t> 
          Consider a server expecting write requests which are usually 
          X bytes long or less, exclusive of an 8K bulk data area.   
          In this case
          the payload stream will most likely be less 
          than X bytes and will fit in a 
          buffer segment devoted to that purpose.  The bulk data needs to
          be placed in the subsequent buffer segment so that it arrives
          with the appropriate alignment in the DDP 
          target buffer.  In
          order to place the data appropriately, the sender (in this case, 
          the client) needs to 
          add padding of length X-Y bytes, where Y is the length of the
          payload stream for the current request.  The case of reads is
          exactly the same, except that the sender adding the padding is
          the server.
        </t>
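        <t>
          The padding rule described above can be sketched as follows
          (a hypothetical illustration; the names are ours and not taken
          from the protocol):
        </t>
        <figure><artwork>
```python
# Pad the payload stream out to the buffer-segment size X so that
# the bulk data lands at the start of the next (aligned) segment.
def padding_needed(segment_size_x, payload_len_y):
    if payload_len_y > segment_size_x:
        # The simple rule above covers only payload streams that
        # fit within a single buffer segment.
        raise ValueError("payload stream does not fit in one segment")
    return segment_size_x - payload_len_y
```
        </artwork></figure>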
        <t>                
          To provide send-based DDP as an RPC-over-RDMA extension, the 
          framework defined in <xref target="rpcrdmav2" /> could be used.
          A new "transport characteristic" could be defined which  
          allowed a participant to expose the structure of its receive 
          buffers and to identify the buffer segments capable of being
          used as DDP targets.  In addition, a new optional message header
          would have to be defined.  It would be defined to provide:
        <list style="symbols">
          <t>
            A way to designate a DDP-eligible data item as 
            corresponding to target buffer segments, rather than memory 
            registered for RDMA.
          </t>
          <t>
            A way to indicate to the responder that it should place
            DDP-eligible data items in DDP-targetable buffer segments, rather 
            than in memory registered for RDMA.
          </t>
          <t>
            A way to designate a limited portion of an RPC-over-RDMA 
            transmission
            as constituting the payload stream.
          </t>
        </list>

        </t>
      </section>
      <section title="Feature Synergy" 
               anchor="NEAR-syn">
        <t>
          While message continuation and send-based DDP each address an
          important class of commonly used messages, their combination
          allows simpler handling of some important classes of messages:
        <list style="symbols">
          <t>
            READs and WRITEs transferring larger IOs
          </t>
          <t>
            COMPOUNDs containing multiple IO operations.
          </t>
          <t>
            Operations whose associated payload stream is longer than
            the typical value. 
          </t>
        </list>
        </t>
        <t>
          To accommodate these situations, it would be best to have 
          the definition of the headers supporting message continuation 
          interact with the data
          structures supporting send-based DDP as follows:
        <list style="symbols">
          <t>
            The header type used for the initial transmission of a 
            message continued across multiple transmissions would contain
            DDP-directing structures which support both send-based DDP
            and DDP using explicit RDMA operations.
          </t>
          <t>
            Buffer references for Send-based DDP should be relative to
            the start of the group of transmissions and 
            should allow transitions
            between buffer segments in different receive buffers. 
          </t>
          <t> 
            The header type for messages continuing a group of 
            transmissions should not 
            have DDP-related fields but should rely on the initial
            transmission
            of the group for DDP-related functions.
          </t>
          <t>
            The portion of each received transmission devoted to the 
            payload stream should be specified in the header of each message 
            within a group of transmissions devoted to a single
            RPC message.  The payload stream for the message 
            as a whole should be the concatenation of the streams for each
            transmission.
          </t>
        </list>
        </t>
        <t>
          A potential extension supporting these features interacting
          as described above can be found in <xref target="rtrext"/>.
        </t>
      </section>
      <section title="Feature Selection and Negotiation" 
               anchor="NEAR-sel">
        <t> 
          Given that an appropriate extension is likely to
          support multiple OPTIONAL features, special
          attention will have to be given to defining how implementations
          which might not support the same subset of OPTIONAL features can
          successfully interact.  The goal is to allow interacting
          implementations to get the benefit of
          features that they both support, while allowing implementation
          pairs
          that do not share support for any of the OPTIONAL features to 
          operate just as base Version Two implementations would in the
          absence of the potential extension.   
        </t>
        <t>
          It is helpful if each implementation provides characteristics
          defining its level of feature support which the peer implementation
          can test before attempting to use a particular feature.  In
          other similar contexts, the support level concerns the 
          implementation in its role as responder, i.e. whether it is
          prepared to execute a given request.  In the case of the potential
          extension discussed here, most characteristics concern an 
          implementation in its role as receiver.  One might define
          characteristics which indicate:
        <list style="symbols">
          <t>
            The ability of the implementation, in its role as receiver,
            to process messages continued across multiple RPC-over-RDMA
            transmissions.
          </t>
          <t>
            The ability of the implementation, in its role as receiver,
            to process messages containing DDP-eligible data items,
            directly placed using a send-based DDP approach. 
          </t>
        </list>
        </t>
        <t>
          Use of such characteristics might allow asymmetric implementations.
          For example, a client might send requests containing DDP-eligible
          data items using send-based DDP without being able to accept
          messages containing data items using send-based DDP.  That is a
          likely implementation pattern, given the greater performance
          benefits of avoiding use of RDMA Read.
        </t>
        <t>
          Further useful characteristics would apply to the implementation
          in its role of responder.  For instance,
        <list style="symbols">
          <t>
            The ability of the implementation, in its role as responder,
            to accept and process requests which REQUIRE that DDP-eligible
            data items in the response be sent using send-based DDP.  The
            presence of this characteristic would allow a requester to avoid
            registering memory to be used to accommodate DDP-eligible data
            items in the response. 
          </t>
          <t>
            The ability of the implementation, in its role as responder,
            to send responses using message continuation, as opposed to 
            using a reply chunk.
          </t>
        </list>
        </t>
        <t>
          Because of the potentially different needs of operations in the
          forward and backward directions, it may be desirable to separate
          the receiver-based characteristics according to the direction of
          operation that they apply to.
        </t>
        <t>
          A further issue relates to the role of explicit RDMA operations 
          in connection with backward-direction operation.  Although no
          current
          protocols require support for DDP or transfer of large messages when
          operating in the backward direction, the protocol is designed to
          allow such support to be developed in the future.  Since the
          protocol, with the extension discussed here, is 
          likely to have multiple
          methods of providing these functions, there are a number of 
          possible choices regarding the role of chunk-based methods of
          providing these functions:
        <list style="symbols">
          <t>
            Support for chunk-based operation remains a REQUIREMENT 
            for responders, and requesters always have the option of
            using it, regardless of the direction of operation. 
          <vspace blankLines='1'/>
            Requesters could select alternatives to the use of explicit
            RDMA operations when these are supported 
            by the responder.
          </t>
          <t>
            When operating in the forward direction, support for chunk-based 
            operation remains a REQUIREMENT for responders (i.e. servers), 
            and requesters (i.e. clients).
          <vspace blankLines='1'/>
            When operating in the backward direction, support for chunk-based
            operation is OPTIONAL for responders (i.e. clients), allowing
            requesters (i.e. servers) to select use of explicit RDMA
            operations or
            alternatives when each of these is supported by the responder.
          </t>
          <t>
            Support for chunk-based operation is treated as OPTIONAL
            for responders, regardless of the direction of operation. 
          <vspace blankLines='1'/>
            In this case, requesters would select use of explicit RDMA 
            operations or alternatives when each of these is supported 
            by the responder. For a considerable time, support for explicit 
            RDMA operations would be a practical necessity, even if not
            a REQUIREMENT, for 
            operation in the forward direction.
          </t> 

        </list>
        </t>

      </section>


    </section>
    <section title="Possible Future Development" anchor="FUTURE">
      <t>
          Although reducing the use of explicit RDMA operations reduces the
          number of internode round trips and eliminates sequences of 
          operations in which multiple round-trip latencies are serialized
          with server interrupt latencies, the use of connected operation
          means that round-trip latencies will always be present, since each
          message is acknowledged.
        </t>
        <t>
          One avenue that has been considered is the use of
          unreliable-datagram (UD) transmission in environments where the
          "unreliable" transmission is sufficiently reliable that RPC
          replay can deal with a very low rate of message loss.  
          For example, UD in 
          InfiniBand specifies a low enough rate of frame loss to make
          this a viable approach, particularly for use in supporting
          protocols, such as NFSv4.1, that contain their own facilities to 
          ensure exactly-once semantics.
        </t>
        <t>
          With this sort of arrangement, request latency is unchanged.
          However, since the acknowledgments serve no 
          substantial function, it is tempting to consider removing them,
          as they take up transmission bandwidth that might otherwise be 
          put to use if the protocol is to reach the goal of 
          using the underlying medium effectively.
        </t>
        <t>
          The amount of wasted transmission bandwidth depends on the 
          average message size and on many implementation considerations
          regarding how acknowledgments are generated.  In any case, given
          expected message sizes, the wasted transmission bandwidth will
          be very small.
        </t>
        <t>
          When RPC messages are quite small, acknowledgments may be of
          concern.  However, in that situation, a better response would
          be to transfer multiple RPC messages within a single RPC-over-RDMA
          transmission. 
        </t>
        <t>
          When multiple RPC messages are combined into a single transmission,
          the overhead of interfacing with the RNIC, particularly the 
          interrupt handling overhead, is amortized over multiple RPC
          messages. 
        </t>
        <t>
          Although this technique is quite outside the spirit of existing
          RPC-over-RDMA implementations, it appears possible to define new
          header types capable of supporting this sort of transmission, 
          using the extension framework described in 
          <xref target="rpcrdmav2" />.
        </t>
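        <t>
          As a purely illustrative sketch, the following XDR suggests what
          such a header type might look like.  All names here are
          hypothetical; none are part of any specified or proposed header
          type.
        </t>
        <figure>
          <artwork>
/* Hypothetical sketch only; not part of any RPC-over-RDMA version */
struct rpcrdma2_batch_entry {
        opaque be_message&lt;&gt;;    /* one complete RPC message */
};

struct rpcrdma2_batch_hdr {
        /* the RPC messages carried in this transmission */
        rpcrdma2_batch_entry bh_entries&lt;&gt;;
};
          </artwork>
        </figure>
        <t>
          With such a header, a single receive completion, and thus a
          single interrupt, would make all of the contained RPC messages
          available at once.
        </t>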
      </section>


    <section title="Summary" anchor="CON">
      <t>
        We have examined the issue of round-trip latency and concluded:
      <list style="symbols">
        <t>
          That the number of round trips per se is not as important as the
          contribution of any extra round trips to overall request latency.
        </t>
        <t>
          That the latency issue can be addressed using the extension
          mechanism provided for in <xref target="rpcrdmav2" />.
        </t>
        <t>
          That in many cases in which latency is not an issue, there may be 
          overhead issues that can be addressed using the same sorts
          of techniques as those useful in latency reduction, again using
          the extension
          mechanism provided for in <xref target="rpcrdmav2" />.
        </t>
      </list>
      </t>
      <t>
        Since the features sketched out could return internode 
        latencies and overhead for a large class of requests 
        to the baseline values 
        for the RPC paradigm, a more detailed definition of the required
        extension functionality is in order.
      </t>
      <t>
        We have also looked at round trips at the physical level, in that
        acknowledgments are sent in circumstances where there is no obvious
        need for them.  With regard to these, we have concluded:
      <list style="symbols">
        <t>
          That these acknowledgements do not contribute to request latency.
        </t>
        <t>
          That while UD transmission can remove 
          acknowledgements of limited value, the
          performance benefits are not sufficient to justify the disruption
          that this would entail.
        </t>
        <t>
          That issues with transmission bandwidth overhead in a small-message
          environment are better addressed by combining 
          multiple RPC messages in
          a single RPC-over-RDMA transmission.  
          This is particularly so because 
          such a step is likely to reduce per-message overhead in such 
          environments as well.
        </t>
      </list>
      </t>
      <t>
        As the features described involve the use of alternatives to 
        explicit RDMA
        operations for performing direct data placement and for transferring
        messages that are larger than the receive buffer limit, it is 
        appropriate to understand the role that such operations
        are expected to have once the extensions discussed in this document 
        are fully specified and implemented.
      </t>
      <t>
        It is important to note that these extensions are OPTIONAL and are 
        expected to remain so, while support for explicit RDMA operations
        will remain an integral part of RPC-over-RDMA.
      </t>
      <t>
        Given this framework, the degree to which explicit RDMA operations 
        will be used will reflect future implementation choices and needs.  
        While
        we have been focusing on cases in which other options might be more 
        efficient, it is worth looking also at the cases in 
        which explicit RDMA operations are likely to remain preferable:
      <list style="symbols">
        <t>
          In environments in which direct data placement into memory of
          a particular alignment does not meet application requirements 
          and data needs to
          be read into a particular address on the client.  Similarly, 
          large physically contiguous
          buffers may be required in some environments.  In these situations,
          send-based DDP is not an option. 
        </t>
        <t>
          Where large transfers are to be done, there will be limits to the
          capacity of send-based DDP to provide the required functionality,
          since the basic pattern using send/receive is to allocate 
          a pool of memory to contain 
          receive buffers in advance of issuing requests.
          While this issue can be mitigated by use of message continuation,
          tying up large numbers of credits for a single request can cause
          difficulties as well.  As a result, send-based DDP may 
          be restricted to I/Os of limited size, although the specific limits 
          will depend on the details of the particular implementation.

        </t>
      </list>
      </t>
    </section>
    <section title="Security Considerations" anchor="SEC">
      <t>
        This document does not raise any security issues.
      </t>
    </section>
    <section title="IANA Considerations" anchor="IANA">
      <t>
        This document does not require any actions by IANA.
      </t>
    </section>
  </middle>
  <back>
    <references title="Normative References">
      &RFC2119;
      <reference anchor="rfc5666bis" 
                 target="http://www.ietf.org/id/draft-ietf-nfsv4-rfc5666bis-10.txt">
        <front>
          <title>
            Remote Direct Memory Access Transport for Remote Procedure Call
          </title>

          <author initials="C." surname="Lever" role="editor">
            <organization>Oracle</organization>
          </author>
          <author initials="W." surname="Simpson">
            <organization>DayDreamer</organization>
          </author>
          <author initials="T." surname="Talpey">
            <organization>Oracle</organization>
          </author>
          <date month="February" year="2017" />
        </front>
        <annotation>
          Work in progress.
        </annotation>

      </reference>
    </references>
    <references title="Informative References">
      &RFC5666;
      <reference anchor="rpcrdmav2" 
                 target="http://www.ietf.org/id/draft-cel-nfsv4-rpcrdma-version-two-03.txt">
        <front>
          <title>
            RPC-over-RDMA Version Two
          </title>

          <author initials="C." surname="Lever" role="editor">
            <organization>Oracle</organization>
          </author>
          <author initials="D." surname="Noveck">
            <organization>Hewlett Packard Enterprise</organization>
          </author>
          <date month="December" year="2016" />
        </front>
        <annotation>
          Work in progress.
        </annotation>
      </reference>
      <reference anchor="rfc5667bis" 
                 target="http://www.ietf.org/id/draft-ietf-nfsv4-rfc5667bis-06.txt">
        <front>
          <title>
            Network File System (NFS) Upper Layer Binding To RPC-Over-RDMA
          </title>

          <author initials="C." surname="Lever" role="editor">
            <organization>Oracle</organization>
          </author>
          <date month="February" year="2017" />
        </front>
        <annotation>
          Work in progress.
        </annotation>
      </reference>
      <reference anchor="rtrext" 
                 target="http://www.ietf.org/id/draft-dnoveck-nfsv4-rpcrdma-rtrext-01.txt">
        <front>
          <title>
            RPC-over-RDMA Extensions to Reduce Internode Round-trips
          </title>

          <author initials="D." surname="Noveck">
            <organization>Hewlett Packard Enterprise</organization>
          </author>
          <date month="December" year="2016" />
        </front>
        <annotation>
          Work in progress.
        </annotation>
      </reference>
    </references>
    <section title="Acknowledgements" anchor="ACK">
      <t>
        The author gratefully acknowledges the work of Brent Callaghan and
        Tom Talpey in producing the original RPC-over-RDMA Version One 
        specification <xref target="RFC5666" /> and also Tom's work in
        helping to clarify that specification. 
      </t>
      <t>
        The author also wishes to thank Chuck Lever for his work reviving 
        RDMA support for NFS in <xref target="rfc5666bis"/>, 
        <xref target="rfc5667bis"/>, and <xref target="rpcrdmav2"/>, and for 
        helpful discussion regarding RPC-over-RDMA latency issues.
    
      </t>
    </section>
  </back>
</rfc>
