<?xml version="1.0" encoding="UTF-8"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
     which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
     There has to be one entity for each item to be referenced. 
     An alternate method (rfc include) is described in the references. -->

<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC5666 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5666.xml">
]>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<?rfc strict="yes" ?>
<?rfc toc="yes"?>
<?rfc tocdepth="3"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>
<rfc ipr="trust200902" 
     category="info"
     docName="draft-dnoveck-nfsv4-rpcrdma-rtissues-02">
  <front>
    <title abbrev="RPC/RDMA Round-trip Issues">
      Issues Related to RPC-over-RDMA Internode Round Trips
    </title>
    <author initials="D." surname="Noveck" fullname="David Noveck">
      <organization abbrev="HPE">
        Hewlett Packard Enterprise
      </organization>
      <address>
        <postal>
          <street>165 Dascomb Road</street> 
          <city>Andover</city>
          <region>MA</region>
          <code>01810</code>
          <country>USA</country>
        </postal>
        <phone>+1 781-572-8038</phone>
        <email>davenoveck@gmail.com</email>
      </address>
    </author>
    <date year="2017"/>

    <area>Transport</area>
    <workgroup>Network File System Version 4</workgroup>
    <abstract>
      <t>
        As currently designed and implemented, the RPC-over-RDMA 
        protocol requires use of multiple internode round trips to 
        process some common operations.  For example,
        NFS WRITE operations require use of three internode
        round trips.  This document looks at this issue and discusses
        what can and what should be done to address it, both within the
        context of an extensible version of RPC-over-RDMA and potentially 
        outside that framework.
      </t>
    </abstract>
  </front>
  <middle>
    <section title="Preliminaries" anchor="PRELIM">	
      <section title="Requirements Language" anchor="INTRO-req">
        <t>
          The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", 
          "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", 
          "MAY", and "OPTIONAL" in this document are to be interpreted 
          as described in <xref target="RFC2119"/>.  
        </t>
      </section>
      <section title="Introduction" anchor="PRELIM-intro">
        <t>
          When many common operations are performed using RPC-over-RDMA,
          additional internode round trips are required in order to take
          advantage of the performance benefits that RDMA functionality
          provides.
        </t>	
        <t>
          While the latencies involved are generally small, they are a cause
          for concern for two reasons.
        <list style="symbols">
          <t>
            With the ongoing improvement of persistent-memory 
            technologies, such internode latencies, which are fixed,
            can be expected to consume an increasing portion of the
            total latency of processing NFS requests using 
            RPC-over-RDMA.
          </t>
          <t>
            High-performance transfers using NFS may be needed outside
            of a machine-room environment.  As RPC-over-RDMA is used in
            networks of campus and metropolitan scale, the internode
            round-trip time of sixteen microseconds per mile becomes 
            an issue.
          </t>
        </list>
        </t>	
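        <t>
          The round-trip figure cited above can be checked with a short
          back-of-envelope sketch (a hypothetical illustration, assuming
          signal propagation in fiber at roughly two-thirds of the speed
          of light, about 200,000 km/s; the function name is ours):
        </t>
        <figure><artwork>
```python
# Round-trip internode latency per mile of fiber, in microseconds.
# Assumes propagation at about 200,000 km/s (two-thirds of c).
def round_trip_us_per_mile(propagation_km_per_s=200_000.0):
    km_per_mile = 1.609344
    one_way_seconds = km_per_mile / propagation_km_per_s
    return 2.0 * one_way_seconds * 1e6  # both directions, microseconds

# A 10-mile campus-scale link adds roughly 160 microseconds
# of propagation delay to every round trip.
campus_penalty_us = 10 * round_trip_us_per_mile()
```
        </artwork></figure>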
        <t>
          Given this background, round trips beyond the minimum necessary
          need to be justified by corresponding benefits.  If they are
          not, work needs to be done to eliminate those excess round trips.
        </t>	
        <t>
          We will look at the existing situation with regard
          to round-trip latency and make some suggestions as to
          how the issue might best be addressed.  We will consider
          things that could be done in the near
          future and also explore further possibilities that would
          require adoption of a longer-term approach.
        </t>	
      </section>
    </section>
    <section title="Review of the Current Situation" anchor="CUR">
      <section title="Troublesome Requests" anchor="CUR-trouble">
        <t>
          We will be looking at four sorts of situations:
        <list style="symbols">
          <t>
            An RPC operation involving Direct Data Placement of request 
            data (e.g., an NFSv3 WRITE or corresponding NFSv4 COMPOUND).  
          </t>
          <t>
            An RPC operation involving Direct Data Placement of response 
            data (e.g., an NFSv3 READ or corresponding NFSv4 COMPOUND).  
          </t>
          <t>
            An RPC operation where the request data is longer than the 
            inline buffer limit.
          </t>
          <t>
            An RPC operation where the response data is longer than the 
            inline buffer limit.
          </t>
        </list>
        </t>
        <t>
          These are all simple examples of situations in which explicit RDMA
          operations are used, either to effect Direct Data Placement or to
          respond to message size limits that derive from a limited 
          receive buffer
          size. 
        </t>
        <t>
          We will survey the resulting latency and overhead issues
          in an RPC-over-RDMA
          Version One environment in Sections 
          <xref target="CUR-wdetails" format="counter"/> and
          <xref target="CUR-rdetails" format="counter"/>
          below.  
        </t>
      </section>
      <section title="WRITE Request Processing Details" 
               anchor="CUR-wdetails">
        <t>
          We'll start with the case of a request involving direct placement 
          of request data.  In this case, an RDMA READ is used to transfer
          a DDP-eligible data item (e.g. the data to be written) from its
          location in requester memory to a location selected by the 
          responder.
        </t>
        <t>
          Processing proceeds as described below.  Although
          we are focused on internode latency, the time to perform
          a request also includes such things as interrupt latency, overhead
          involved in interacting with the RNIC, and the time for the server
          to execute the requested operation.
        <list style="symbols">
          <t>
            First, the memory to be accessed remotely is registered.  
            This is a local operation. 
          </t>
          <t>
            Once the registration has been done, 
            the initial send of the request
            can proceed.  Since this is in the
            context of connected operation, there is an internode 
            round trip involved.  However, the next step can proceed
            after the initial transmission is received by the responder.  
            As a result, only
            the responder-bound side of the transmission contributes to
            overall operation latency.
          </t>
          <t>
            The responder, after being notified of the receipt of the request,
            uses RDMA READ to fetch the bulk data.  This
            involves an internode round-trip latency.  After the fetch of the
            data, the responder
            needs to be notified of the completion of the explicit RDMA
            operation.
          </t>
          <t>
            The responder (after performing the requested operation) 
            sends the response.  Again,  as this is in the
            context of connected operation, there is an internode 
            round trip involved.  However, the next step can proceed
            after the initial transmission is received by the requester.
          </t>
          <t>
            The memory registered before the request was issued needs to be
            deregistered, before the request is considered complete and
            the sending process restarted.  When remote invalidation is
            not available, the requester, after being notified of the 
            receipt of the response, performs a local operation to
            deregister the memory in question.  Alternatively, the 
            responder will use Send With Invalidate and the requester's
            RNIC will effect the deregistration before notifying the
            requester of the receipt of the response.
          </t>
        </list>
        </t>
        <t>
          To summarize, if we exclude the actual server execution of the 
          request,  the latency consists of two internode round-trip
          latencies plus two responder-side interrupt latencies 
          plus one requester-side interrupt latency plus any necessary 
          registration/deregistration overhead.  This is in contrast
          to a request not using explicit RDMA operations, in which 
          there is a single internode round-trip latency and one 
          interrupt latency on each of the requester and the responder. 
        </t>
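        <t>
          The accounting above can be expressed as a small latency model
          (a sketch for illustration only; the numeric values are
          hypothetical placeholders, not measurements):
        </t>
        <figure><artwork>
```python
# Latency model for the WRITE-side accounting above.
# All values are illustrative placeholders, in microseconds.
RTT = 10.0    # one internode round-trip latency
INTR = 3.0    # one interrupt latency, on either node
REG = 1.0     # memory registration plus deregistration overhead

def write_request_latency(server_exec=0.0):
    # Two round trips (Send of the request, then RDMA READ), two
    # responder-side interrupts, one requester-side interrupt,
    # plus registration/deregistration overhead.
    return 2 * RTT + 2 * INTR + 1 * INTR + REG + server_exec

def inline_request_latency(server_exec=0.0):
    # Baseline: one round trip and one interrupt on each node.
    return 1 * RTT + 2 * INTR + server_exec
```
        </artwork></figure>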
        <t>
          The processing of the other sorts of requests mentioned in 
          <xref target="CUR-trouble" /> shows both similarities and
          differences: 
        <list style="symbols">
          <t>
            Handling of a long request is similar to the above.  The
            memory associated with a position-zero read chunk is registered,
            transferred using RDMA READ, and deregistered.  As a result, 
            we have
            the same overhead and latency issues noted in the case of
            direct data placement, without the corresponding benefits.
          </t>
          <t>
            The case of direct data placement of response data follows 
            a similar pattern.  The important difference is that 
            the transfer of the
            bulk data is performed using RDMA WRITE, rather than RDMA READ.
            However, because of the way that RDMA WRITE is effected over the
            wire, the latency consequences are different.  
            See <xref target="CUR-rdetails" /> for a detailed discussion.
          </t>
          <t>
            Handling of a long response is similar to the previous case.
          </t>
        </list>
        </t>
      </section>
      <section title="READ Request Processing Details" 
               anchor="CUR-rdetails">
        <t>
          We'll now discuss the case of a request involving direct placement 
          of response data.  In this case, an RDMA WRITE is used to transfer
          a DDP-eligible data item (e.g. the data being read) from its
          location in responder memory to a location previously selected 
          by the requester.
        </t>
        <t>
          Processing proceeds as described below.  Although
          we are focused on internode latency, the time to perform
          a request also includes such things as interrupt latency, overhead
          involved in interacting with the RNIC, and the time for the server
          to execute the requested operation.
        <list style="symbols">
          <t>
            First, the memory to be accessed remotely is registered.  
            This is a local operation. 
          </t>
          <t>
            Once the registration has been done, 
            the initial send of the request
            can proceed.  Since this is in the
            context of connected operation, there is an internode 
            round trip involved.  However, the next step can proceed
            after the initial transmission is received.  As a result, only
            the responder-bound side of the transmission contributes to
            overall operation latency.
          </t>
          <t>
            The responder, after being notified of the receipt of the request,
            proceeds to process the request until the data to be read 
            is available in its own memory, with its location determined and
            fixed.
            It then uses RDMA WRITE to transfer the bulk data to the
            location in requester memory selected previously.  This
            involves an internode latency, but there is no round trip 
            and thus no round-trip latency.
          </t>
          <t>
            The responder continues processing and sends the inline
            portion of the response.  Again,  as this is in the
            context of connected operation, there is an internode 
            round trip involved.  However, the next step can proceed
            immediately.  If the RDMA WRITE or the send of the inline
            portion of the response were to fail, the responder can
            be notified subsequently. 
          </t>
          <t>
            The requester, after being notified of the receipt of the response,
            can analyze it and can access the data written into its 
            memory. Deregistration of the memory originally registered before 
            the request was issued can be done using remote invalidation
            or can be done by the requester as a local 
            operation.
          </t>
        </list>
        </t>
        <t>
          To summarize, in this case the additional latency that we saw in
          <xref target="CUR-wdetails" /> does not arise.  Except
          for the additional overhead due to memory registration and
          invalidation, the situation is the same as for a request
          not using explicit RDMA operations, in which 
          there is a single internode round-trip latency and one 
          interrupt latency on each of the requester and the responder. 
        </t>
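        <t>
          A companion sketch for the read-side summary above (again with
          hypothetical placeholder values, in microseconds) shows that,
          apart from registration-related overhead, the read case matches
          the inline baseline:
        </t>
        <figure><artwork>
```python
# Latency model for the read-side case above; values are
# illustrative placeholders, in microseconds.
RTT = 10.0    # one internode round-trip latency
INTR = 3.0    # one interrupt latency, on either node
REG = 1.0     # registration plus (possibly remote) invalidation

def read_request_latency(server_exec=0.0):
    # The RDMA WRITE adds no round trip, so the total is one round
    # trip plus one interrupt on each node, plus the
    # registration-related overhead.
    return 1 * RTT + 2 * INTR + REG + server_exec
```
        </artwork></figure>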
      </section>
    </section>
 
    <section title="Near-term Work" anchor="NEAR">
      <t>
        We are going to consider how the latency and overhead issues discussed
        in <xref target="CUR"/> might be addressed in the context of
        an extensible version of RPC-over-RDMA, such as that proposed 
        in <xref target="rpcrdmav2"/>.
      </t>
      <t>
        In <xref target="NEAR-target"/>, we will establish a performance
        target for the troublesome requests, based on the performance of
        requests that do not involve long messages or direct data placement. 
      </t>
      <t>
        We will then consider how extensions 
        might be defined to bring latency and overhead for the requests 
        discussed in <xref target="CUR-trouble"/> into line with those
        for other requests.  There will be two specific classes of 
        requests to address:
      <list style="symbols">
        <t>
          Those that do not involve direct data placement will be addressed
          in <xref target="NEAR-cont"/>.  In this case, there are no 
          compensating benefits justifying the higher overhead and, in some
          cases, latency. 
        </t>
        <t>
          The more complicated case of requests that do involve direct data
          placement is discussed in <xref target="NEAR-sbddp"/>.  In this case,
          direct data placement could serve as a compensating benefit, and the 
          important question to be addressed is whether Direct Data Placement
          can be effected without the use of explicit RDMA operations.
        </t>
      </list>
      </t>
      <t>
        The optional features to deal with each of the classes of messages
        discussed above could be implemented separately.  However, in 
        the handling of RPCs with very large amounts of bulk data, the
        two features are synergistic.  This fact makes it desirable to 
        define the
        features as part of the same extension.  See Sections
        <xref target="NEAR-syn" format="counter"/> and
        <xref target="NEAR-sel" format="counter"/>
        for details.
      </t>
      <section title="Target Performance" anchor="NEAR-target"> 
        <t>
          As our target, we will look at the latency and overhead 
          associated with other sorts of RPC requests, i.e. those that 
          do not use DDP and that have request and response messages 
          that fit within the receive buffer limit.
        </t>
        <t>
          Processing proceeds as follows: 
        <list style="symbols">
          <t>
            The initial send of the request is done.  Since this is in the
            context of connected operation, there is an internode 
            round-trip involved.  However, the next step can proceed
            after the initial transmission is received.  As a result, only
            the responder-bound side of the transmission contributes to
            overall operation latency.
          </t>
          <t>
            The responder, after being notified of the receipt of the request,
            performs the requested operation and sends the reply.
            As in the case of the request, there is an internode round trip
            involved. However, the request can be considered complete upon
            receipt of the requester-bound transmission.  The 
            responder-bound acknowledgment does not contribute to request
            latency.
          </t>
        </list>
        </t>
        <t>
          In this case, there is only a single internode round-trip latency 
          necessary to effect the RPC.  Total request latency includes 
          this round-trip
          latency plus interrupt latency on the requester and responder, plus
          the time for the responder to actually perform the requested 
          operation.
        </t>
        <t>
          Thus the delta between the operations discussed in 
          <xref target="CUR"/> and our baseline consists of two portions:
          one applies to all the requests we are concerned with,
          while the second applies only to requests that involve
          use of RDMA READ, as discussed in <xref target="CUR-wdetails" />.
          The latter category consists of: 
        <list style="symbols">
          <t>
            One additional internode round-trip latency.
          </t>
          <t>
            One additional instance of responder-side interrupt latency.
          </t>
        </list>
        </t> 
        <t>
          The additional overhead necessary to do memory registration and
          deregistration applies to all requests using explicit RDMA 
          operations.  The costs will vary with implementation
          characteristics.  As a result, in some implementations it may be
          desirable to replace use of RDMA Write with send-based alternatives,
          while in others use of RDMA Write may be preferable.   
        </t>
          
      </section>
      <section title="Message Continuation" anchor="NEAR-cont">
        <t>
          Using multiple RPC-over-RDMA transmissions, in sequence, to
          send a single RPC message avoids the additional latency 
          and overhead associated with the use of explicit RDMA operations 
          to transfer position-zero read chunks.  In the case of reply chunks,
          only overhead is reduced.
        </t>
        <t>
          Although transfer of a single request or reply in N transmissions
          will involve N+1 internode latencies, overall request 
          latency is not increased
          by the requirement that operations involving multiple
          nodes be serialized, since these transmissions are generally
          pipelined.
        </t>
        <t>
          As an illustration, let's consider the case of a request involving 
          a response consisting of two RPC-over-RDMA transmissions.  Even
          though each of these transmissions is acknowledged, that 
          acknowledgement does not contribute to request latency.  The second
          transmission can be received by the requester and acted upon without
          waiting for either acknowledgment.
        </t>
        <t>
          This situation would require multiple receive-side interrupts but
          it is unlikely to result in extended interrupt latency.  With 1K
          sends (Version One), the second receive will complete about 200
          nanoseconds after the first, assuming a 40 Gb/s transmission rate.
          Given likely interrupt latencies, the first interrupt routine
          would be able 
          to note that the completion of the second receive had already 
          occurred.
        </t>
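        <t>
          The timing claim above can be checked directly: the gap between
          the two receive completions is dominated by the wire-serialization
          time of the second send (the function below is an illustrative
          sketch, not part of the protocol):
        </t>
        <figure><artwork>
```python
# Wire-serialization time of one send, in nanoseconds.  With 1K
# sends at 40 Gb/s, the second receive completes about 200 ns
# after the first.
def serialization_ns(payload_bytes=1024, link_gbps=40.0):
    bits = payload_bytes * 8
    # bits divided by gigabits-per-second yields nanoseconds
    return bits / link_gbps
```
        </artwork></figure>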
      </section>
      <section title="Send-based DDP" anchor="NEAR-sbddp">
        <t>
          In order to effect proper placement of request or reply
          data within the context of individual RPC-over-RDMA transmissions, 
          receive buffers
          need to be structured to accommodate this function.
        </t>
        <t>
          To illustrate the considerations that could lead clients and servers 
          to choose particular buffer structures, we will use, as examples,
          the cases of NFS READs and WRITEs of 8K data blocks (or the 
          corresponding NFSv4 COMPOUNDs).
        </t>
        <t>
          In such cases, the client and server need to have the DDP-eligible
          bulk data placed in appropriately aligned 8K buffer segments.  
          Rather than
          being transferred in separate transmissions using explicit RDMA
          operations, a message can be sent so that bulk data is received
          into an appropriate buffer segment.  In this case, it would be
          excised from the XDR payload stream, just as it is in the case of
          existing DDP facilities.
        </t>
        <t> 
          Consider a server expecting write requests which are usually 
          X bytes long or less, exclusive of an 8K bulk data area.   
          In this case
          the payload stream will most likely be less 
          than X bytes and will fit in a 
          buffer segment devoted to that purpose.  The bulk data needs to
          be placed in the subsequent buffer segment so that it arrives
          with the appropriate alignment in the DDP 
          target buffer.  In
          order to place the data appropriately, the sender (in this case, 
          the client) needs to 
          add padding of length X-Y bytes, where Y is the length of the
          payload stream for the current request.  The case of reads is
          exactly the same, except that the sender adding the padding is
          the server.
        </t>
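        <t>
          The padding rule described above can be sketched as follows
          (a hypothetical illustration; the names are ours and not taken
          from the protocol):
        </t>
        <figure><artwork>
```python
# Pad the payload stream out to the buffer-segment size X so that
# the bulk data lands at the start of the next (aligned) segment.
def padding_needed(segment_size_x, payload_len_y):
    if payload_len_y > segment_size_x:
        # The simple rule above covers only payload streams that
        # fit within a single buffer segment.
        raise ValueError("payload stream does not fit in one segment")
    return segment_size_x - payload_len_y
```
        </artwork></figure>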
        <t>                
          To provide send-based DDP as an RPC-over-RDMA extension, the 
          framework defined in <xref target="rpcrdmav2" /> could be used.
          A new "transport characteristic" could be defined which  
          allowed a participant to expose the structure of its receive 
          buffers and to identify the buffer segments capable of being
          used as DDP targets.  In addition, a new optional message header
          would have to be defined.  It would be defined to provide:
        <list style="symbols">
          <t>
            A way to designate a DDP-eligible data item as 
            corresponding to target buffer segments, rather than memory 
            registered for RDMA.
          </t>
          <t>
            A way to indicate to the responder that it should place
            DDP-eligible data items in DDP-targetable buffer segments, rather 
            than in memory registered for RDMA.
          </t>
          <t>
            A way to designate a limited portion of an RPC-over-RDMA 
            transmission
            as constituting the payload stream.
          </t>
        </list>

        </t>
      </section>
      <section title="Feature Synergy" 
               anchor="NEAR-syn">
        <t>
          While message continuation and send-based DDP each address an
          important class of commonly used messages, their combination
          allows simpler handling of some important classes of messages:
        <list style="symbols">
          <t>
            READs and WRITEs transferring larger IOs
          </t>
          <t>
            COMPOUNDs containing multiple IO operations.
          </t>
          <t>
            Operations whose associated payload stream is longer than
            the typical value. 
          </t>
        </list>
        </t>
        <t>
          To accommodate these situations, it would be best to have 
          the definition of the headers supporting message continuation 
          interact with the data
          structures supporting send-based DDP as follows:
        <list style="symbols">
          <t>
            The header type used for the initial transmission of a 
            message continued across multiple transmissions would contain
            DDP-directing structures which support both send-based DDP
            and DDP using explicit RDMA operations.
          </t>
          <t>
            Buffer references for Send-based DDP should be relative to
            the start of the group of transmissions and 
            should allow transitions
            between buffer segments in different receive buffers. 
          </t>
          <t> 
            The header type for messages continuing a group of 
            transmissions should not 
            have DDP-related fields but should rely on the initial
            transmission
            of the group for DDP-related functions.
          </t>
          <t>
            The portion of each received transmission devoted to the 
            payload stream should be specified in the header of each message 
            within a group of transmissions devoted to a single
            RPC message.  The payload stream for the message 
            as a whole should be the concatenation of the streams for each
            transmission.
          </t>
        </list>
        </t>
        <t>
          A potential extension supporting these features interacting
          as described above can be found in <xref target="rtrext"/>.
        </t>
      </section>
      <section title="Feature Selection and Negotiation" 
               anchor="NEAR-sel">
        <t> 
          Given that an appropriate extension is likely to
          support multiple OPTIONAL features, special
          attention will have to be given to defining how implementations
          which might not support the same subset of OPTIONAL features can
          successfully interact.  The goal is to allow interacting
          implementations to get the benefit of
          features that they both support, while allowing implementation
          pairs
          that do not share support for any of the OPTIONAL features to 
          operate just as base Version Two implementations would in the
          absence of the potential extension.   
        </t>
        <t>
          It is helpful if each implementation provides characteristics
          defining its level of feature support which the peer implementation
          can test before attempting to use a particular feature.  In
          other similar contexts, the support level concerns the 
          implementation in its role as responder, i.e. whether it is
          prepared to execute a given request.  In the case of the potential
          extension discussed here, most characteristics concern an 
          implementation in its role as receiver.  One might define
          characteristics which indicate:
        <list style="symbols">
          <t>
            The ability of the implementation, in its role as receiver,
            to process messages continued across multiple RPC-over-RDMA
            transmissions.
          </t>
          <t>
            The ability of the implementation, in its role as receiver,
            to process messages containing DDP-eligible data items,
            directly placed using a send-based DDP approach. 
          </t>
        </list>
        </t>
        <t>
          Use of such characteristics might allow asymmetric implementations.
          For example, a client might send requests containing DDP-eligible
          data items using send-based DDP without being able to accept
          messages containing data items using send-based DDP.  That is a
          likely implementation pattern, given the greater performance
          benefits of avoiding use of RDMA Read.
        </t>
        <t>
          Further useful characteristics would apply to the implementation
          in its role of responder.  For instance,
        <list style="symbols">
          <t>
            The ability of the implementation, in its role as responder,
            to accept and process requests which REQUIRE that DDP-eligible
            data items in the response be sent using send-based DDP.  The
            presence of this characteristic would allow a requester to avoid
            registering memory to be used to accommodate DDP-eligible data
            items in the response. 
          </t>
          <t>
            The ability of the implementation, in its role as responder,
            to send responses using message continuation, as opposed to 
            using a reply chunk.
          </t>
        </list>
        </t>
        <t>
          Because of the potentially different needs of operations in the
          forward and backward directions, it may be desirable to separate
          the receiver-based characteristics according to the direction of
          operation that they apply to.
        </t>
        <t>
          A further issue relates to the role of explicit RDMA operations 
          in connection with backward-direction operation.  Although no
          current
          protocols require support for DDP or transfer of large messages when
          operating in the backward direction, the protocol is designed to
          allow such support to be developed in the future.  Since the
          protocol, with the extension discussed here, is 
          likely to have multiple
          methods of providing these functions, there are a number of 
          possible choices regarding the role of chunk-based methods of
          providing these functions:
        <list style="symbols">
          <t>
            Support for chunk-based operation remains a REQUIREMENT 
            for responders, and requesters always have the option of
            using it, regardless of the direction of operation. 
          <vspace blankLines='1'/>
            Requesters could select alternatives to the use of explicit
            RDMA operations when these are supported 
            by the responder.
          </t>
          <t>
            When operating in the forward direction, support for chunk-based 
            operation remains a REQUIREMENT for responders (i.e. servers), 
            and requesters (i.e. clients).
          <vspace blankLines='1'/>
            When operating in the backward direction, support for chunk-based
            operation is OPTIONAL for responders (i.e. clients), allowing
            requesters (i.e. servers) to select use of explicit RDMA
            operations or
            alternatives when each of these is supported by the responder.
          </t>
          <t>
            Support for chunk-based operation is treated as OPTIONAL
            for responders, regardless of the direction of operation. 
          <vspace blankLines='1'/>
            In this case, requesters would select use of explicit RDMA 
            operations or alternatives when each of these is supported 
            by the responder. For a considerable time, support for explicit 
            RDMA operations would be a practical necessity, even if not
            a REQUIREMENT, for 
            operation in the forward direction.
          </t> 

        </list>
        </t>

      </section>


    </section>
    <section title="Possible Future Development" anchor="FUTURE">
      <t>
          Although reducing the use of explicit RDMA operations reduces the
          number of internode round trips and eliminates sequences of 
          operations in which multiple round-trip latencies are serialized
          with server interrupt latencies, the use of connected operation
          means that round-trip latencies will always be present, since each
          message is acknowledged.
        </t>
        <t>
          One avenue that has been considered is the use of
          unreliable-datagram (UD) transmission in environments where the
          "unreliable" transmission is sufficiently reliable that RPC
          replay can deal with a very low rate of message loss.  
          For example, UD in 
          InfiniBand specifies a low enough rate of frame loss to make
          this a viable approach, particularly for use in supporting
          protocols, such as NFSv4.1, that contain their own facilities to 
          ensure exactly-once semantics.
        </t>
        <t>
          With this sort of arrangement, request latency is unchanged.
          However, since the acknowledgments serve no 
          substantial function, it is tempting to consider removing them,
          as they take up transmission bandwidth that might otherwise be 
          put to use if the protocol is to reach the goal of 
          using the underlying medium effectively.
        </t>
        <t>
          The amount of wasted transmission bandwidth depends on the 
          average message size and on many implementation considerations
          regarding how acknowledgments are generated.  In any case, given
          expected message sizes, the wasted transmission bandwidth will
          be very small.
        </t>
        <t>
          When RPC messages are quite small, acknowledgments may be of
          concern.  However, in that situation, a better response would
          be to transfer multiple RPC messages within a single RPC-over-RDMA
          transmission. 
        </t>
        <t>
          When multiple RPC messages are combined into a single transmission,
          the overhead of interfacing with the RNIC, particularly the 
          interrupt handling overhead, is amortized over multiple RPC
          messages. 
        </t>
        <t>
          Although this technique is quite outside the spirit of existing
          RPC-over-RDMA implementations, it appears possible to define new
          header types capable of supporting this sort of transmission, 
          using the extension framework described in 
          <xref target="rpcrdmav2" />.
        </t>
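        <t>
          As a purely illustrative sketch, the following XDR suggests what
          such a header type might look like.  All names here are
          hypothetical; none are part of any specified or proposed header
          type.
        </t>
        <figure>
          <artwork>
/* Hypothetical sketch only; not part of any RPC-over-RDMA version */
struct rpcrdma2_batch_entry {
        opaque be_message&lt;&gt;;    /* one complete RPC message */
};

struct rpcrdma2_batch_hdr {
        /* the RPC messages carried in this transmission */
        rpcrdma2_batch_entry bh_entries&lt;&gt;;
};
          </artwork>
        </figure>
        <t>
          With such a header, a single receive completion, and thus a
          single interrupt, would make all of the contained RPC messages
          available at once.
        </t>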
      </section>


    <section title="Summary" anchor="CON">
      <t>
        We have examined the issue of round-trip latency and concluded:
      <list style="symbols">
        <t>
          That the number of round trips per se is not as important as the
          contribution of any extra round trips to overall request latency.
        </t>
        <t>
          That the latency issue can be addressed using the extension
          mechanism provided for in <xref target="rpcrdmav2" />.
        </t>
        <t>
          That in many cases in which latency is not an issue, there may be 
          overhead issues that can be addressed using the same sorts
          of techniques as those useful in latency reduction, again using
          the extension
          mechanism provided for in <xref target="rpcrdmav2" />.
        </t>
      </list>
      </t>
      <t>
        Since the features sketched out could return internode 
        latencies and overhead for a large class of requests 
        to the baseline values 
        for the RPC paradigm, a more detailed definition of the required
        extension functionality is in order.
      </t>
      <t>
        We have also looked at round trips at the physical level, in that
        acknowledgments are sent in circumstances where there is no obvious
        need for them.  With regard to these, we have concluded:
      <list style="symbols">
        <t>
          That these acknowledgements do not contribute to request latency.
        </t>
        <t>
          That while UD transmission can remove 
          acknowledgements of limited value, the
          performance benefits are not sufficient to justify the disruption
          that this would entail.
        </t>
        <t>
          That issues with transmission bandwidth overhead in a small-message
          environment are better addressed by combining 
          multiple RPC messages in
          a single RPC-over-RDMA transmission.  
          This is particularly so because 
          such a step is likely to reduce per-message overhead in such 
          environments as well.
        </t>
      </list>
      </t>
      <t>
        As the features described involve the use of alternatives to 
        explicit RDMA
        operations for performing direct data placement and for transferring
        messages that are larger than the receive buffer limit, it is 
        appropriate to understand the role that such operations
        are expected to have once the extensions discussed in this document 
        are fully specified and implemented.
      </t>
      <t>
        It is important to note that these extensions are OPTIONAL and are 
        expected to remain so, while support for explicit RDMA operations
        will remain an integral part of RPC-over-RDMA.
      </t>
      <t>
        Given this framework, the degree to which explicit RDMA operations 
        will be used will reflect future implementation choices and needs.  
        While
        we have been focusing on cases in which other options might be more 
        efficient, it is worth looking also at the cases in 
        which explicit RDMA operations are likely to remain preferable:
      <list style="symbols">
        <t>
          In environments in which direct data placement into memory of
          a particular alignment does not meet application requirements 
          and data needs to
          be read into a particular address on the client.  Similarly, 
          large physically contiguous
          buffers may be required in some environments.  In these situations,
          send-based DDP is not an option. 
        </t>
        <t>
          Where large transfers are to be done, there will be limits to the
          capacity of send-based DDP to provide the required functionality,
          since the basic pattern using send/receive is to allocate 
          a pool of memory to contain 
          receive buffers in advance of issuing requests.
          While this issue can be mitigated by use of message continuation,
          tying up large numbers of credits for a single request can cause
          difficulties as well.  As a result, send-based DDP may 
          be restricted to I/Os of limited size, although the specific limits 
          will depend on the details of the particular implementation.

        </t>
      </list>
      </t>
    </section>
    <section title="Security Considerations" anchor="SEC">
      <t>
        This document does not raise any security issues.
      </t>
    </section>
    <section title="IANA Considerations" anchor="IANA">
      <t>
        This document does not require any actions by IANA.
      </t>
    </section>
  </middle>
  <back>
    <references title="Normative References">
      &RFC2119;
      <reference anchor="rfc5666bis" 
                 target="http://www.ietf.org/id/draft-ietf-nfsv4-rfc5666bis-10.txt">
        <front>
          <title>
            Remote Direct Memory Access Transport for Remote Procedure Call
          </title>

          <author initials="C." surname="Lever" role="editor">
            <organization>Oracle</organization>
          </author>
          <author initials="W." surname="Simpson">
            <organization>DayDreamer</organization>
          </author>
          <author initials="T." surname="Talpey">
            <organization>Oracle</organization>
          </author>
          <date month="February" year="2017" />
        </front>
        <annotation>
          Work in progress.
        </annotation>

      </reference>
    </references>
    <references title="Informative References">
      &RFC5666;
      <reference anchor="rpcrdmav2" 
                 target="http://www.ietf.org/id/draft-cel-nfsv4-rpcrdma-version-two-03.txt">
        <front>
          <title>
            RPC-over-RDMA Version Two
          </title>

          <author initials="C." surname="Lever" role="editor">
            <organization>Oracle</organization>
          </author>
          <author initials="D." surname="Noveck">
            <organization>Hewlett Packard Enterprise</organization>
          </author>
          <date month="December" year="2016" />
        </front>
        <annotation>
          Work in progress.
        </annotation>
      </reference>
      <reference anchor="rfc5667bis" 
                 target="http://www.ietf.org/id/draft-ietf-nfsv4-rfc5667bis-06.txt">
        <front>
          <title>
            Network File System (NFS) Upper Layer Binding To RPC-Over-RDMA
          </title>

          <author initials="C." surname="Lever" role="editor">
            <organization>Oracle</organization>
          </author>
          <date month="February" year="2017" />
        </front>
        <annotation>
          Work in progress.
        </annotation>
      </reference>
      <reference anchor="rtrext" 
                 target="http://www.ietf.org/id/draft-dnoveck-nfsv4-rpcrdma-rtrext-01.txt">
        <front>
          <title>
            RPC-over-RDMA Extensions to Reduce Internode Round-trips
          </title>

          <author initials="D." surname="Noveck">
            <organization>Hewlett Packard Enterprise</organization>
          </author>
          <date month="December" year="2016" />
        </front>
        <annotation>
          Work in progress.
        </annotation>
      </reference>
    </references>
    <section title="Acknowledgements" anchor="ACK">
      <t>
        The author gratefully acknowledges the work of Brent Callaghan and
        Tom Talpey in producing the original RPC-over-RDMA Version One 
        specification <xref target="RFC5666" /> and also Tom's work in
        helping to clarify that specification. 
      </t>
      <t>
        The author also wishes to thank Chuck Lever for his work reviving 
        RDMA support for NFS in <xref target="rfc5666bis"/>, 
        <xref target="rfc5667bis"/>, and <xref target="rpcrdmav2"/>, and for 
        helpful discussion regarding RPC-over-RDMA latency issues.
    
      </t>
    </section>
  </back>
</rfc>
