<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2367 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2367.xml">
<!ENTITY RFC4034 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4034.xml">
<!ENTITY RFC4301 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4301.xml">
<!ENTITY RFC5890 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5890.xml">
<!ENTITY RFC6698 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6698.xml">
<!ENTITY RFC6982 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6982.xml">
<!ENTITY RFC7296 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7296.xml">
<!ENTITY RFC8174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8174.xml">
<!ENTITY RFC7942 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7942.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc strict="yes" ?>
<?rfc toc="yes"?>
<?rfc tocdepth="4"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>
<rfc ipr="trust200902" updates="" obsoletes="" category="std" docName="draft-pwouters-ipsecme-multi-sa-performance-00">
  <front>
    <title>IKEv2 support for per-queue Child SAs</title>
    <author fullname="Antony Antony" initials="A." surname="Antony">
      <organization abbrev="secunet">secunet Security Networks AG</organization>
      <address>
        <email>antony.antony@secunet.com</email>
      </address>
    </author>
    <author initials="T." surname="Brunner" fullname="Tobias Brunner">
      <organization abbrev="codelabs">codelabs GmbH</organization>
      <address>
        <email>tobias@codelabs.ch</email>
      </address>
    </author>
    <author fullname="Steffen Klassert" initials="S." surname="Klassert">
      <organization abbrev="secunet">secunet Security Networks AG</organization>
      <address>
        <email>steffen.klassert@secunet.com</email>
      </address>
    </author>
    <author initials="P." surname="Wouters" fullname="Paul Wouters">
      <organization>Aiven</organization>
      <address>
        <email>paul.wouters@aiven.io</email>
      </address>
    </author>
    <date/>
    <area>General</area>
    <workgroup>Network</workgroup>
    <keyword>IKEv2</keyword>
    <keyword>IPsec</keyword>
    <abstract>
      <t>
       This document defines four Notify Message Type Payloads for the Internet
       Key Exchange Protocol Version 2 (IKEv2) indicating support for
       the negotiation of multiple identical Child SAs to optimize
       performance.
      </t>
      <t>
       The CPU_QUEUES notification indicates support for multiple queues
       or CPUs. The QOS_QUEUES notification indicates support for different
       Quality of Service (QoS) levels. The CPU_QUEUE_INFO and QOS_QUEUE_INFO
       notifications are used to confirm and optionally convey information
       about the specific queue, such as QoS level.
      </t>
      <t>
       Using multiple identical Child SAs has the benefit that each
       stream has its own Sequence Number Counter, ensuring that CPUs don't
       have to synchronize their crypto state or disable their packet
       replay protection.
      </t>
    </abstract>
  </front>
  <middle>
    <section title="Introduction">
      <t>
       IPsec implementations are currently limited to using one queue
       or CPU per Child SA. The result is that a machine with many
       queues/CPUs is limited to only using one of these per Child SA. This
       severely limits the throughput that can be attained. An unencrypted
       link of 10 Gbps or more is commonly reduced to 2-5 Gbps when IPsec
       is used to encrypt the link using AES-GCM. By using the implementation
       specified in this document, aggregate throughput increased from 5 Gbps
       using 1 CPU to 40-60 Gbps using 25-30 CPUs.
      </t>
      <t>
       Furthermore, IPsec implementations are currently limited to using the
       same Child SA for all Quality of Service (QoS) types because the
       QoS type is not a part of the Traffic Selector (TS) payload. The
       result is that IPsec cannot support active Quality of Service
       prioritization without disabling the anti-replay protection.
      </t>
      <t>
       While this could be (partially) mitigated by setting up multiple
       narrowed Child SAs, for example using Populate From Packet (PFP)
       as specified in <xref target="RFC4301"/>, this IPsec feature is
       not widely implemented. Some route-based IPsec implementations
       might be able to implement this with specific rules into separate
       network interfaces, but these methods might not be available for
       policy-based IPsec implementations.
      </t>
      <t>
       To make better use of multiple network queues and CPUs, it can
       be beneficial to negotiate and install multiple identical Child
       SAs. Although IKEv2 <xref target="RFC7296"/> already allows
       installing multiple identical Child SAs, it offers no method to
       negotiate the number of Child SAs or to indicate the purpose of the
       multiple Child SAs that are requested.
      </t>
      <t>
       When two IKEv2 peers want to negotiate multiple Child SAs, it
       is useful to be able to convey how many Child SAs are required
       for optimized traffic. This avoids triggering CREATE_CHILD_SA
       exchanges that will only be rejected by the peer.
      </t>
      <section title="Requirements Language">
        <t>
       The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
       "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
       "OPTIONAL" in this document are to be interpreted as described in BCP
       14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only
       when, they appear in all capitals, as shown here.
      </t>
      </section>
    </section>
    <section title="Performance bottlenecks" anchor="performance">
      <t>
       Currently, most IPsec implementations are limited by using one CPU
       or network queue per Child SA. There are a number of practical
       reasons for this, but a key limitation is that sharing the crypto
       state, counters and sequence numbers between multiple CPUs is
       not feasible without a significant performance penalty. There
       is a need to negotiate and establish multiple Child SAs with
       identical TSi/TSr on a per-queue or per-CPU basis.
        </t>
    </section>
    <section title="Negotiation of CPU specific Child SAs" anchor="neg_cpu">
      <t>
       When negotiating CPU-specific Child SAs, the first SA negotiated,
       either in an IKE_AUTH or a CREATE_CHILD_SA exchange, is called the
       Fallback SA. This Child SA is similar to a regular Child SA in
       that it is not bound to a single resource (CPU or QoS queue). This
       Fallback Child SA (or its rekeyed successors) MUST remain active
       for the lifetime of the IPsec session to ensure that there is
       always a Child SA that can be selected to send traffic over,
       in case a per-resource Child SA is not available. Additional
       Child SAs are installed bound to a specific resource (CPU or
       QoS queue). These Child SAs are responsible for the bulk of
       the traffic.
      </t>
      <t>
       The CPU_QUEUES notification payload is sent in the IKE_AUTH or
       CREATE_CHILD_SA exchange to indicate that the negotiated Child SA is
       a Fallback SA.
      </t>
      <t>
       The CPU_QUEUES notification value refers to the number of
       additional resource-specific Child SAs that may be installed for
       this particular TSi/TSr combination excluding the Fallback Child
       SA. Both peers send the preferred minimum number of additional
       Child SAs to install. Both peers pick the maximum of the two
       numbers (within reason). That is, if the initiator prefers 16
       and the responder prefers 48, then the number negotiated is
       48. The responder may at any time reject additional Child SAs
       by returning TS_UNACCEPTABLE. It should not return NO_ADDITIONAL_SAS,
       as there might be other Child SAs with different Traffic Selectors
       that would still be allowed by the peer.
      </t>
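      <t>
       The selection rule above can be sketched as follows. This is an
       illustrative Python sketch, not part of the protocol. The function
       name and the local_cap parameter (a stand-in for "within reason")
       are hypothetical and not defined by this document.
      </t>
      <figure align="center">
        <artwork align="left"><![CDATA[
```python
def negotiate_additional_sas(sent: int, received: int,
                             local_cap: int = 256) -> int:
    """Return the number of additional Child SAs both peers aim for.

    sent      -- preferred minimum this peer put in CPU_QUEUES
    received  -- preferred minimum received from the peer
    local_cap -- hypothetical local policy limit ("within reason")
    """
    # A received value of 0 is interpreted as 1.
    received = max(received, 1)
    # Both peers pick the larger of the two preferred minimums.
    wanted = max(sent, received)
    # Local policy may still bound the result.
    return min(wanted, local_cap)
```
            ]]></artwork>
      </figure>
      <t>
       With the values from the example above (16 and 48), both peers
       arrive at 48 regardless of which side sent which number. The
       Fallback Child SA is not included in this count.
      </t>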
      <t>[Antony: Valery's feedback was not to use TS_UNACCEPTABLE, but instead
       to create a new notify or use TEMPORARY_FAILURE.
       TEMPORARY_FAILURE because the situation may change again if you try
       again. I have a preference for defining a new NO_CPU_QUEUE_INFO_SA]
      </t>
      <t>
       Resource-specific Child SAs are negotiated as regular Child
       SAs using the CREATE_CHILD_SA exchange and are identified by a
       CPU_QUEUE_INFO notification. Upon installation, each Child SA
       is associated with an additional local selector, such as CPU
       or queue.  These additional Child SAs MUST be negotiated with
       identical Child SA properties that were negotiated for the Fallback
       SA. This includes cryptographic algorithms, Traffic Selectors, Mode
       (e.g. transport mode), compression usage, etc. However, the Child
       SAs do have their own individual keying material that is derived
       according to the regular IKEv2 process. The CPU_QUEUE_INFO can
       be empty or contain some identifying data that could be useful
       for debugging purposes.
      </t>
      <t>
       Additional Child SAs can be started on-demand or can be started
       all at once. Peers may also delete specific per-resource Child SAs if
       they deem the associated resource to be idle. The Fallback SA MUST
       NOT be deleted while any per-resource Child SAs are still present.
      </t>
      <t>
       During the CREATE_CHILD_SA rekey for the Child SA, the
       CPU_QUEUE_INFO notification MAY be included, but regardless of whether
       or not it is included, the rekeyed Child SA MUST be bound to the same
       resource(s) as the Child SA that is being rekeyed.
      </t>
      <t>
       As with regular Child SA rekeying, the new Child SA may not be
       different from the rekeyed Child SA with respect to cryptographic
       algorithms and MUST cover the original Traffic Selector ranges.
      </t>
      <t>
       If a CREATE_CHILD_SA exchange request containing both a
       CPU_QUEUE_INFO and a CPU_QUEUES notification is received, the responder
       MUST ignore the CPU_QUEUE_INFO payload. If a CREATE_CHILD_SA
       exchange reply is received with both CPU_QUEUE_INFO and CPU_QUEUES
       notifications, the initiator MUST ignore the notification that it
      did not send in the request.</t>
      <t>
       [Steffen: I tend to treat these cases as an error.]
      </t>
      <t>[Tobias: That's currently how I implemented it (being lenient on what
          I accept). But we could also treat those cases as errors. The
          question would just be what we should return (NO_PROPOSAL_CHOSEN
          and keep IKE and other Child SAs or even INVALID_SYNTAX and kill the
          whole IKE_SA - and as initiator we either have to terminate the Child
          or the IKE_SA actively if we receive both notifies).]
      </t>
      <t>
       The CPU_QUEUES notification, even when it is sent in the IKE_AUTH
       exchange, is not an attribute of the IKE peer. It is an attribute
       of the Child SA, similar to the USE_TRANSPORT notification.
       That is, an IKE peer can have multiple Child SAs covering
       different traffic selectors and selectively decide whether or
       not to enable additional per-resource Child SAs for each of these
       Child SAs covering different Traffic Selectors.
      </t>
    </section>
    <section title="Negotiation of QoS specific Child SAs" anchor="neg_qos">
      <t>
       To install multiple Child SAs for different QoS levels, a similar
       negotiation method is used. The QOS_QUEUES notification is sent with
       the negotiation of the Fallback Child SA that is used for all
       QoS levels not matched by more specific Child SAs. Additional
       Child SAs are installed per QoS level by including the QOS_QUEUE_INFO
       notification describing the specific QoS level that this additional Child SA
       will cover. This allows both peers to install the Child SA using the
       same QoS level.
      </t>
      <t>
       [Steffen: Maybe mention IPv6 flow label too]
      </t>
      <t>
       If a certain QoS level proposed by the peer is not acceptable to
       the responder, TS_UNACCEPTABLE MUST be returned.
      </t>
      <t>[Tobias: Would a more specific error notify make sense here?]</t>
      <t>[Antony: We need a specific error if QOS_QUEUE_INFO is rejected]</t>
      </section>
    <section title="Implementation specifics" anchor="implementation">
      <t>
       There are various considerations that an implementation can
       use to determine the best way to install multiple Child SAs.
       Below are examples of such strategies.
      </t>
      <section title="per-CPU Child SAs" anchor="impl_pcpu">
        <t>
         A simple distribution could be to install one additional Child
         SA on each CPU. The Fallback Child SA ensures that any CPU
         generating traffic to be encrypted has an available (if not
         optimal) Child SA to use. Any subsequent Child SAs with identical
         TSi/TSr Traffic Selectors are installed in such a way to only be
         used by a single CPU or network queue.
        </t>
        <t>
         Performing per-CPU Child SA negotiations can result in both peers
         initiating additional Child SAs at once. This is especially
         likely if per-CPU Child SAs are triggered by individual
         SADB_ACQUIRE <xref target="RFC2367"/> messages. Responders should
         install the additional Child SA on the CPU with the fewest
         additional Child SAs for this TSi/TSr pair. They should count outstanding
         SADB_ACQUIREs as assigned additional Child SAs. It is still possible
         that, when the peers only have one slot left to assign, both peers send
         a CREATE_CHILD_SA request at the same time. [Paul: Is there
         anything we can do at the protocol level to terminate one of
         these without race conditions?] [Antony: if CPU_QUEUE_INFO is
         a MUST, that info could be used for better one-to-one mapping,
         as well as delete the extra SAs. Also, keep in mind the general
         case IKE window > 1]
        </t>
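        <t>
         The responder-side placement described above could look roughly
         like the following Python sketch. The function name and the
         representation of installed SAs and outstanding SADB_ACQUIREs as
         lists of CPU ids are assumptions made for illustration only.
        </t>
        <figure align="center">
          <artwork align="left"><![CDATA[
```python
from collections import Counter

def pick_cpu(num_cpus, installed, pending_acquires):
    """Pick the CPU with the fewest additional Child SAs.

    installed        -- CPU ids of additional Child SAs already
                        installed for this TSi/TSr pair
    pending_acquires -- CPU ids with an outstanding SADB_ACQUIRE,
                        counted like assigned Child SAs
    """
    load = Counter(installed)
    load.update(pending_acquires)
    # min() returns the first minimum, so ties go to the lowest id.
    return min(range(num_cpus), key=lambda cpu: load[cpu])
```
            ]]></artwork>
        </figure>
        <t>
         This sketch does not address the simultaneous-initiation race
         discussed above. It only shows how outstanding acquires can be
         folded into the per-CPU count.
        </t>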
        <t>
         As an optimization, additional Child SAs that see little traffic
         MAY be deleted. The Fallback Child SA MUST NOT be deleted when
         idle, as it is likely to be idle if enough per-CPU Child SAs
         are installed. However, if one of those per-CPU Child SAs is
         deleted because it was idle, and subsequently that CPU starts
         to generate traffic again, that traffic does not have a per-CPU
         Child SA and will be encrypted using the Fallback Child SA. Meanwhile,
         the IKE daemon might be negotiating to bring up a new per-CPU Child SA.
        </t>
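        <t>
         The outbound selection behaviour described above can be
         illustrated with a short Python sketch. The dictionary keyed by
         CPU id and the string SA handles are hypothetical stand-ins for
         whatever SAD representation an implementation uses.
        </t>
        <figure align="center">
          <artwork align="left"><![CDATA[
```python
def select_outbound_sa(cpu, per_cpu_sas, fallback_sa):
    """Return the SA used to encrypt a packet generated on cpu.

    per_cpu_sas -- mapping of CPU id to its per-CPU Child SA
    fallback_sa -- the Fallback Child SA, which is always present
    """
    # Use the per-CPU Child SA when one is installed for this CPU,
    # otherwise fall back to the Fallback Child SA.
    return per_cpu_sas.get(cpu, fallback_sa)


per_cpu = {0: "sa-cpu0", 2: "sa-cpu2"}
chosen = select_outbound_sa(1, per_cpu, "sa-fallback")
```
            ]]></artwork>
        </figure>
        <t>
         In this sketch, CPU 1 has no per-CPU Child SA, so its traffic is
         sent over the Fallback Child SA until a per-CPU Child SA has
         been negotiated for it.
        </t>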
        <t>
         When the number of queues or CPUs differs between the
         peers, the peer with the fewest queues or CPUs MAY
         decide to not install a second outbound Child SA for the same
         resource as it will never use it to send traffic. However, it MUST
         install all inbound Child SAs as it has committed to receiving traffic
         on these negotiated Child SAs.
        </t>
        <t>
         If per-CPU SADB_ACQUIRE messages are implemented (see <xref target="Operations"/>),
         the Traffic Selector (TSi) entry containing the information of the
         trigger packet should still be included in the TS set.  This information
         MAY be used by the peer to select the most optimal target CPU to install
         the additional Child SA on. For example, if the trigger packet was for a
         TCP destination to port 25 (SMTP), it might be able to install the Child
         SA on the CPU that is also running the mail server process. Trigger packet
         Traffic Selectors are documented in <xref target="RFC7296"/> Section 2.9.
        </t>
        <t>
         As per RFC 7296, rekeying a Child SA SHOULD use the same (or wider) Traffic
         Selectors to ensure that the new Child SA covers everything that the
         rekeyed Child SA covers. This includes Traffic Selectors negotiated
         via Configuration Payloads (CP) such as INTERNAL_IP4_ADDRESS which may
         use the original wide TS set or use the narrowed TS set.
        </t>
      </section>
      <section title="per-QoS Child SAs" anchor="impl_qos">
        <t>[Paul: is there anything we need to say here?]</t>
        <t>[Steffen: If we want to say something about that case, maybe this:]</t>
        <t>
         Most considerations from the per-CPU case apply to the per-QoS case as well.
         The main difference between these two cases is that the number of possible
         QoS types is always the same for both peers (e.g. 64 types for IPv4).
         Unlike the per-CPU case, handling different numbers of QoS types
         is not necessary.
      </t>
      <t>
       [Paul: I was hoping we could negotiate things like "only 2 different levels needed",
       and not just a "we want to install SAs for all theoretical possible levels"]
      </t>
    </section>
      <section title="Combining per-CPU and per-QoS level Child SAs" anchor="impl_mixed">
        <t>
           It is unlikely, but not disallowed, to use both per-CPU and per-QoS-level Child SAs.
          Any conflicts between the performance improving types of SAs would need to be
          handled by local policies. For some, the QoS might be more important to honour as
          best as possible, while for others, CPU distribution might be more important. There
          is currently no operational experience with combining these two types of Child SAs.
         </t>
         <t>[Tobias: What would this look like? Would you send both notifies on
             the same set of SAs (CPU/QOS_QUEUE on the fallback SA and INFO on
             the others)? (So each SA would be for a specific CPU AND QoS class.)
             Or would you negotiate separate per-CPU and per-QoS SAs all with
             the same TS? (e.g. if you already bound certain classes to certain
             CPUs anyway and use a QoS specific SA for that, but still want to
             use multiple CPUs for the other traffic and negotiate per-CPU SAs
             without QoS identifier for that)]
        </t>
        <t>
         [Paul: I don't really know - perhaps we should remove QoS until we have
         someone who actually wants to run this and can provide guidance for
         standardization ? ]
        </t>
      </section>
    </section>
    <section title="Payload Format" anchor="payload_formats">
      <t>
      All multi-octet fields representing integers are laid out in big
      endian order (also known as "most significant byte first", or
      "network byte order").
     </t>
      <section title="CPU_QUEUES Notify Message Payload" anchor="payload_pcpuq">
        <figure align="center">
          <artwork align="left"><![CDATA[
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+-+-------------+-------------------------------+
! Next Payload  !C!  RESERVED   !         Payload Length        !
+---------------+---------------+-------------------------------+
!  Protocol ID  !   SPI Size    !      Notify Message Type      !
+---------------+---------------+-------------------------------+
!  Minimum number of IPsec SAs                                  !
+-------------------------------+-------------------------------+
            ]]></artwork>
        </figure>
        <t>
          <list style="symbols">
            <t>Protocol ID (1 octet) - MUST be 0. MUST be ignored if not 0.</t>
            <t>SPI Size (1 octet) - MUST be 0. MUST be ignored if not 0.</t>
            <t>Notify Message Type (2 octets) - set to [TBD1]</t>
            <t>Minimum number of per-CPU IPsec SAs (4 octets).
               MUST be greater than 0. If 0 is received, it MUST be interpreted as 1.</t>
          </list>
        </t>
        <t>
       Note: The Fallback Child SA that is not bound to a single CPU is not counted as part of these numbers.
       </t>
      </section>
      <section title="QOS_QUEUES Notify Message Payload" anchor="payload_qosq">
        <figure align="center">
          <artwork align="left"><![CDATA[
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+-+-------------+-------------------------------+
! Next Payload  !C!  RESERVED   !         Payload Length        !
+---------------+---------------+-------------------------------+
!  Protocol ID  !   SPI Size    !      Notify Message Type      !
+---------------+---------------+-------------------------------+
!  Minimum number of IPsec SAs                                  !
+-------------------------------+-------------------------------+
            ]]></artwork>
        </figure>
        <t>
          <list style="symbols">
            <t>Protocol ID (1 octet) - MUST be 0. MUST be ignored if not 0.</t>
            <t>SPI Size (1 octet) - MUST be 0. MUST be ignored if not 0.</t>
            <t>Notify Message Type (2 octets) - set to [TBD2]</t>
            <t>Maximum number of QoS level IPsec SAs (4 octets).
            MUST be greater than 0. If 0 is received, it MUST be interpreted as 1.</t>
            <t>[Steffen: Does it make sense to negotiate the max. number of QoS types?
                Unlike the per-CPU case, there is no tradeoff between the peers.
                Both peers always support the same number of QoS types (64 on IPv4)]</t>
            <t>[Tobias: I agree with Steffen. This doesn't seem necessary and might
               even be confusing as reducing the number would not tell the peer
               what classes should actually be sent.]</t>
            <t>[Paul: I was hoping to send the desired number of different levels, not the
               theoretical maximum of used levels]</t>
          </list>
        </t>
        <t>
       Note: The Fallback Child SA that is not bound to a single QoS is not counted as part of these numbers.
       </t>
      </section>
      <section title="CPU_QUEUE_INFO Notify Message Payload" anchor="payload_info_cpu">
        <figure align="center">
          <artwork align="left"><![CDATA[
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+-+-------------+-------------------------------+
! Next Payload  !C!  RESERVED   !         Payload Length        !
+---------------+---------------+-------------------------------+
!  Protocol ID  !   SPI Size    !      Notify Message Type      !
+---------------+---------------+-------------------------------+
!                                                               !
~               Optional queue identifier                       ~
!                                                               !
+-------------------------------+-------------------------------+
            ]]></artwork>
        </figure>
        <t>
          <list style="symbols">
            <t>Protocol ID (1 octet) - MUST be 0. MUST be ignored if not 0.</t>
            <t>SPI Size (1 octet) - MUST be 0. MUST be ignored if not 0.</t>
            <t>Notify Message Type (2 octets) - set to [TBD3]</t>
            <t>Optional Payload Data. This value MAY be set to convey the local identity of the queue.
               The value SHOULD be a unique identifier and the peer SHOULD only use it for debugging purposes.</t>
          </list>
        </t>
      </section>
      <section title="QOS_QUEUE_INFO Notify Message Payload" anchor="payload_info_qos">
        <figure align="center">
          <artwork align="left"><![CDATA[
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+-+-------------+-------------------------------+
! Next Payload  !C!  RESERVED   !         Payload Length        !
+---------------+---------------+-------------------------------+
!  Protocol ID  !   SPI Size    !      Notify Message Type      !
+---------------+---------------+-------------------------------+
!                                                               !
~               Mandatory QoS level specifier                   ~
!                                                               !
+-------------------------------+-------------------------------+
            ]]></artwork>
        </figure>
        <t>
          <list style="symbols">
            <t>Protocol ID (1 octet) - MUST be 0. MUST be ignored if not 0.</t>
            <t>SPI Size (1 octet) - MUST be 0. MUST be ignored if not 0.</t>
            <t>Notify Message Type (2 octets) - set to [TBD4]</t>
            <t>Mandatory Payload Data. This value MUST be set to identify the QoS level.
               [Paul: Can we say 'one byte for each level of QoS included for this SA' ?]
               [Steffen: I don't understand that? Do we support more than one QoS type per SA?
                I think we need space to cover either a 6 bit IPv4 QoS type or a 20 bit IPv6
                flow label.]
                [Tobias: Hm, one problem here is that CHILD_SAs can have
                   traffic selectors of both address families.  So how could
                   we negotiate that we need a QoS type AND a flow label? Would
                   that require two notifies (QOS_4|6_QUEUE_INFO types) or could
                   we have two fields in the notify that may be set to 0?
                   Or should that just not be allowed? I don't even know if it
                   makes sense and whether QoS classes and flow labels are
                   combinable in that way (I guess a dual-stack VoIP client
                   would classify traffic in a comparable way for each family).
                   And I also wonder if there is a mechanism to apply a flow
                   label to an outer IPv4 header's TOS field and vice-versa.
                   If multiple classes/labels should be supported per SA we
                   could also send multiple notifies (but I guess that would
                   mean that on-path routers had to treat all these
                   classes/labels the same way, which begs the question why
                   different values would get assigned to the packets in the
                   first place).]
            </t>
          </list>
        </t>
      </section>
    </section>
    <section anchor="Operations" title="Operational Considerations">
     <t>
      Implementations supporting per-CPU SAs SHOULD extend their local
      SPD selector, and the mechanism of on-demand negotiation that is
      triggered by traffic to include a CPU (or queue) identifier in
      their SADB_ACQUIRE message from the SPD to the IKE daemon. If the IKEv2
      extension defined in this document is negotiated with the peer, a
      node which does not support receiving per-CPU SADB_ACQUIRE messages MAY
      initiate all its Child SAs immediately upon receiving the (only)
      SADB_ACQUIRE it will receive from the IPsec stack. Such implementations
      also need to be careful when receiving a Delete Notify request for a
      per-CPU Child SA, as they have no method to detect when they should bring
      up such a per-CPU Child SA again later. And bringing the deleted
      per-CPU Child SA up again immediately after receiving the Delete
      Notify might cause an infinite loop between the peers. Another
      issue of not bringing up all its per-CPU Child SAs is that if
      the peer acts similarly, the two peers might end up with only the
      Fallback SA without ever activating any per-CPU Child SAs. It is
      therefore RECOMMENDED to implement per-CPU SADB_ACQUIRE messages.
      [ Antony: It would be nice to add manual/scripts for starting of connection
       and bringing up per-CPU SAs. It could be very simple, an external
       program decides to start a per-CPU SA. ]
     </t>
     <t>
      The minimum number of Child SAs negotiated should not be treated
      as the maximum number of allowed Child SAs. Peers SHOULD
      be lenient with this number to account for corner cases. For
      example, during Child SA rekeying, there might be a large number
      of additional Child SAs created before the old Child SAs are torn
      down. Similarly, when using on-demand Child SAs, both ends could
      trigger multiple Child SA requests as the initial packet causing
      the Child SA negotiation might have been transported to the peer
      via the Fallback SA where its reply packet might also trigger an
      on-demand Child SA negotiation to start. A peer may want to allow
      up to double the negotiated minimum number of Child SAs, and rely on
      idleness of Child SAs to tear down any unused Child SAs gradually to
      reach an optimal number of Child SAs. Adding too many SAs may slow
      down per-packet SAD lookup.
     </t>
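     <t>
      One possible reading of the leniency described above is sketched
      below in Python. The function name and the factor parameter are
      hypothetical. The default of 2 reflects the "up to double"
      suggestion and is a local policy choice, not a protocol
      requirement.
     </t>
     <figure align="center">
       <artwork align="left"><![CDATA[
```python
def accept_additional_sa(current: int, negotiated_min: int,
                         factor: int = 2) -> bool:
    """Decide whether one more additional Child SA is acceptable.

    current        -- additional Child SAs currently installed for
                      this TSi/TSr pair (Fallback SA not counted)
    negotiated_min -- negotiated minimum from CPU_QUEUES
    factor         -- hypothetical local leniency multiplier
    """
    # The negotiated minimum is not a hard maximum. Allow headroom
    # for rekeying and simultaneous on-demand negotiations.
    return current < factor * negotiated_min
```
            ]]></artwork>
     </figure>
     <t>
      Child SAs accepted beyond the negotiated minimum can later be torn
      down when idle, as described above.
     </t>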
     <t>
     Implementations might support dynamically moving per-CPU Child
     SAs from one CPU to another CPU. If this method is supported,
     implementations must be careful to move both the inbound and outbound
     SAs. If the IPsec endpoint is a gateway, it can move the inbound SA
     and outbound SA independently from each other. It is likely that
     for a gateway, IPsec traffic would be asymmetric.  If the IPsec
     endpoint is the same host responsible for generating the traffic,
     the inbound and outbound SAs SHOULD remain as a pair on the same CPU.
     If a host previously skipped installing an outbound SA because it
     would be an unused duplicate outbound SA, it will have to create
     and add the previously skipped outbound SA to the SAD with the new
     CPU ID. The inbound SA may not have a CPU ID in the SAD.  Adding the
     outbound SA to the SAD requires access to the key material, whereas
     updating the CPU selector on an existing outbound SA might not
     require access to key material.  To support this, the IKE
     software might have to hold on to the key material longer than it
     normally would, as it might actively attempt to destroy key material
     from memory that it no longer needs access to.
     </t>
    </section>
    <section anchor="Security" title="Security Considerations">
      <t>
      [TO DO]
     </t>
    </section>
    <section title="Implementation Status" anchor="impl_status">
      <t>
      [Note to RFC Editor: Please remove this section and the reference to
      <xref target="RFC7942"/> before publication.]
     </t>
      <t>
      This section records the status of known implementations of the
      protocol defined by this specification at the time of posting of
      this Internet-Draft, and is based on a proposal described in
      <xref target="RFC7942"/>. The description of implementations in this
      section is intended to assist the IETF in its decision processes
      in progressing drafts to RFCs. Please note that the listing of
      any individual implementation here does not imply endorsement
      by the IETF. Furthermore, no effort has been spent to verify the
      information presented here that was supplied by IETF contributors.
      This is not intended as, and must not be construed to be, a catalog
      of available implementations or their features. Readers are advised
      to note that other implementations may exist.
     </t>
      <t>
      According to <xref target="RFC7942"/>, "this will allow reviewers
      and working groups to assign due consideration to documents that
      have the benefit of running code, which may serve as evidence of
      valuable experimentation and feedback that have made the implemented
      protocols more mature.  It is up to the individual working groups
      to use this information as they see fit".
     </t>
      <section anchor="section.impl-status.xfrm" title="Linux XFRM">
        <t>
          <list style="hanging">
            <t hangText="Organization: ">Linux kernel XFRM</t>
            <t hangText="Name: ">XFRM-PCPU-v1 https://git.kernel.org/pub/scm/linux/kernel/git/klassert/linux-stk.git/log/?h=xfrm-pcpu-v1</t>
            <t hangText="Description: "> An initial Kernel IPsec implementation
             of the per-CPU method.</t>
            <t hangText="Level of maturity: ">Alpha</t>
            <t hangText="Coverage: ">
            Implements Fallback Child SA and per-CPU Child SAs. It only supports
            the NETLINK API. The PFKEYv2 API is not supported.</t>
            <t hangText="Licensing: ">GPLv2</t>
            <t hangText="Implementation experience: "> The Linux XFRM
             implementation adds a new SAD attribute and a new SPD policy
             flag to support per-CPU SAs.

             The new attribute XFRMA_SA_PCPU, of type u32, belongs to the SAD
             entry. It should be present on the outgoing per-CPU Child SAs,
             with values starting from 0. This attribute MUST NOT be present
             on the Fallback XFRM SA. The kernel uses it only for outgoing
             traffic (clear to encrypted).
             The incoming SAs, both the Fallback and the per-CPU SAs, do not
             need the XFRMA_SA_PCPU attribute. The XFRM stack cannot use the
             CPU ID on an incoming SA.
             The kernel internally sets the value to 0xFFFFFF for the
             incoming SA and the Fallback SA.

             However, one may add XFRMA_SA_PCPU to an incoming per-CPU SA to
             steer the ESP flow to a specific queue or CPU, e.g., using an
             ethtool ntuple configuration.

             The SPD entry has a new flag, XFRM_POLICY_CPU_ACQUIRE.
             It should be set only on the "out" policy. The flag should
             be disabled when the policy is a trap policy, without SPD entries.
             After a successful negotiation of CPU_QUEUES, while adding the
             Fallback Child SA, the SPD entry can be updated with the
             XFRM_POLICY_CPU_ACQUIRE flag.
             When XFRM_POLICY_CPU_ACQUIRE is set, any XFRM_MSG_ACQUIRE that
             is generated will include the XFRMA_SA_PCPU attribute.
            </t>
            <t hangText="Contact: ">Steffen Klassert steffen.klassert@secunet.com</t>
          </list>
        </t>
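        <t>
        The outbound behaviour described above can be pictured with the
        following toy model in Python. It is not kernel code. The function
        select_outbound_sa, the dictionary layout, and the returned tuple
        are assumptions made for illustration. Only the names XFRMA_SA_PCPU
        and XFRM_POLICY_CPU_ACQUIRE come from the implementation. The sketch
        also assumes, as described elsewhere in this document, that a CPU
        without its own SA sends via the Fallback SA.
        </t>
        <figure>
          <artwork><![CDATA[
```python
from typing import Dict, Optional, Tuple

# Key used in this model for the SA without XFRMA_SA_PCPU.
FALLBACK = None


def select_outbound_sa(
    sas: Dict[Optional[int], str],
    cpu: int,
    cpu_acquire: bool,
) -> Tuple[str, Optional[dict]]:
    """Return (SA to use, ACQUIRE message to emit or None)."""
    if cpu in sas:
        # A per-CPU SA exists for this CPU: use it.
        return sas[cpu], None
    acquire = None
    if cpu_acquire:
        # Models XFRM_POLICY_CPU_ACQUIRE: the ACQUIRE carries
        # the CPU ID so IKE can negotiate an SA for that CPU.
        acquire = {"XFRMA_SA_PCPU": cpu}
    # Meanwhile traffic is sent via the Fallback SA.
    return sas[FALLBACK], acquire
```
]]></artwork>
        </figure>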
      </section>
      <section anchor="section.impl-status.libreswan" title="Libreswan">
        <t>
          <list style="hanging">
            <t hangText="Organization: ">The Libreswan Project</t>
            <t hangText="Name: ">pcpu-3 https://libreswan.org/wiki/XFRM_pCPU</t>
            <t hangText="Description: ">
           An initial IKE implementation of the per-CPU method.</t>
            <t hangText="Level of maturity: ">Alpha</t>
            <t hangText="Coverage: ">
            Implements the Fallback Child SA and additional per-CPU Child SAs.</t>
            <t hangText="Licensing: ">GPLv2</t>
            <t hangText="Implementation experience: ">TBD</t>
            <t hangText="Contact: ">Libreswan Development: swan-dev@libreswan.org</t>
          </list>
        </t>
      </section>
      <section anchor="section.impl-status.strongswan" title="strongSwan">
        <t>
          <list style="hanging">
            <t hangText="Organization: ">The strongSwan Project</t>
            <t hangText="Name: ">strongSwan https://github.com/strongswan/strongswan/tree/per-cpu-sas-poc/</t>
            <t hangText="Description: ">
           An initial IKE implementation of the per-CPU method.</t>
            <t hangText="Level of maturity: ">Alpha</t>
            <t hangText="Coverage: ">
            Implements the Fallback Child SA and additional per-CPU Child SAs.</t>
            <t hangText="Licensing: ">GPLv2</t>
            <t hangText="Implementation experience: ">
             strongSwan uses private-use values for the notifications
             CPU_QUEUES (40970) and QUEUE_INFO (40971).
            </t>
            <t hangText="Contact: ">Tobias Brunner tobias@strongswan.org</t>
          </list>
        </t>
      </section>
      <section anchor="section.impl-status.iproute2" title="iproute2">
       <t>
        <list style="hanging">
         <t hangText="Organization: ">The iproute2 Project</t>
         <t hangText="Name: "> iproute2 https://github.com/antonyantony/iproute2/tree/pcpu-v1</t>
         <t hangText="Description: ">Implemented the per-CPU attributes for the "ip xfrm" command.</t>
         <t hangText="Level of maturity: ">Alpha</t>
         <t hangText="Licensing: ">GPLv2</t>
         <t hangText="Implementation experience: ">TBD</t>
         <t hangText="Contact: ">Antony Antony antony.antony@secunet.com</t>
        </list>
       </t>
      </section>
    </section>
    <section anchor="IANA" title="IANA Considerations">
      <t>
        This document defines four new IKEv2 Notify Message Types for the IANA "IKEv2 Notify Message Types - Status Types" registry.
        </t>
      <figure align="center" anchor="iana_requests">
        <artwork align="left"><![CDATA[
      Value    Notify Message Types - Status Types    Reference
      ------   -----------------------------------    ---------------
      [TBD1]   CPU_QUEUES                             [this document]
      [TBD2]   QOS_QUEUES                             [this document]
      [TBD3]   CPU_QUEUE_INFO                         [this document]
      [TBD4]   QOS_QUEUE_INFO                         [this document]
            ]]></artwork>
      </figure>
    </section>
  </middle>
  <back>
    <references title="Normative References">
     &RFC2119;
     &RFC7296;
     &RFC8174;
     &RFC2367;
    </references>
    <references title="Informative References">
     &RFC4301;
     &RFC6982;
     &RFC7942;
    </references>
  </back>
</rfc>
