<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="bcp" docName="draft-ietf-intarea-frag-fragile-07"
     ipr="trust200902">
  <front>
    <title abbrev="IP Fragmentation Fragile">IP Fragmentation Considered
    Fragile</title>

    <author fullname="Ron Bonica" initials="R." surname="Bonica">
      <organization>Juniper Networks</organization>

      <address>
        <postal>
          <street>2251 Corporate Park Drive</street>

          <city>Herndon</city>

          <code>20171</code>

          <region>Virginia</region>

          <country>USA</country>
        </postal>

        <email>rbonica@juniper.net</email>
      </address>
    </author>

    <author fullname="Fred Baker" initials="F." surname="Baker">
      <organization>Unaffiliated</organization>

      <address>
        <postal>
          <street/>

          <city>Santa Barbara</city>

          <region>California</region>

          <code>93117</code>

          <country>USA</country>
        </postal>

        <email>FredBaker.IETF@gmail.com</email>
      </address>
    </author>

    <author fullname="Geoff Huston" initials="G." surname="Huston">
      <organization>APNIC</organization>

      <address>
        <postal>
          <street>6 Cordelia St</street>

          <city>Brisbane</city>

          <region>4101 QLD</region>

          <code/>

          <country>Australia</country>
        </postal>

        <email>gih@apnic.net</email>
      </address>
    </author>

    <author fullname="Robert M. Hinden" initials="R." surname="Hinden">
      <organization>Check Point Software</organization>

      <address>
        <postal>
          <street>959 Skyway Road</street>

          <city>San Carlos</city>

          <region>California</region>

          <code>94070</code>

          <country>USA</country>
        </postal>

        <email>bob.hinden@gmail.com</email>
      </address>
    </author>

    <author fullname="Ole Troan" initials="O." surname="Troan">
      <organization>Cisco</organization>

      <address>
        <postal>
          <street>Philip Pedersens vei 1</street>

          <city>N-1366 Lysaker</city>

          <country>Norway</country>
        </postal>

        <email>ot@cisco.com</email>
      </address>
    </author>

    <author fullname="Fernando Gont" initials="F." surname="Gont">
      <organization>SI6 Networks</organization>

      <address>
        <postal>
          <street>Evaristo Carriego 2644</street>

          <city>Haedo</city>

          <region>Provincia de Buenos Aires</region>

          <country>Argentina</country>
        </postal>

        <email>fgont@si6networks.com</email>
      </address>
    </author>

    <date day="30" month="January" year="2019"/>

    <area>Internet Area</area>

    <workgroup>Internet Area WG</workgroup>

    <keyword>IPv6</keyword>

    <keyword>Fragmentation</keyword>

    <abstract>
      <t>This document describes IP fragmentation and explains how it reduces
      the reliability of Internet communication.</t>

      <t>This document also proposes alternatives to IP fragmentation and
      provides recommendations for developers and network operators.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="into" title="Introduction">
      <t><xref target="Kent">Operational experience </xref> <xref
      target="Huston"/> <xref target="RFC7872"/> reveals that IP fragmentation
      reduces the reliability of Internet communication. This document
      describes IP fragmentation and explains how it reduces the reliability
      of Internet communication. This document also proposes alternatives to
      IP fragmentation and provides recommendations for developers and network
      operators.</t>

      <t>While this document identifies issues associated with IP
      fragmentation, it does not recommend deprecation. Some applications (see
      <xref target="rely"/>) require IP fragmentation. Furthermore,
      fragmentation is expected to work in limited domains where security and
      interoperability issues can be addressed.</t>

      <t>Rather than deprecating IP fragmentation, this document recommends
      that upper-layer protocols address the problem of fragmentation at their
      layer, reducing their reliance on IP fragmentation to the greatest
      degree possible.</t>
    </section>

    <section title="IP Fragmentation">
      <section anchor="pmtu" title="Links, Paths, MTU and PMTU">
        <t>An Internet path connects a source node to a destination node. A
        path can contain links and routers. If a path contains more than one
        link, the links are connected in series and a router connects each
        link to the next.</t>

        <t>Internet paths are dynamic. Assume that the path from one node to
        another contains a set of links and routers. If the network topology
        changes, that path can also change so that it includes a different set
        of links and routers.</t>

        <t>Each link is constrained by the number of bytes that it can convey
        in a single IP packet. This constraint is called the link Maximum
        Transmission Unit (MTU). <xref target="RFC0791">IPv4</xref> requires
        every link to support a specified MTU (see NOTE 1). <xref
        target="RFC8200">IPv6</xref> requires every link to support an MTU of
        1280 bytes or greater. These are called the IPv4 and IPv6 minimum link
        MTUs.</t>

        <t>Likewise, each Internet path is constrained by the number of bytes
        that it can convey in a single IP packet. This constraint is called
        the Path MTU (PMTU). For any given path, the PMTU is equal to the
        smallest of its link MTUs. Because Internet paths are dynamic, PMTU
        is also dynamic.</t>

        <t>For reasons described below, source nodes estimate the PMTU between
        themselves and destination nodes. A source node can produce extremely
        conservative PMTU estimates in which:</t>

        <t><list style="symbols">
            <t>The estimate for each IPv4 path is equal to the IPv4 minimum
            link MTU.</t>

            <t>The estimate for each IPv6 path is equal to the IPv6 minimum
            link MTU.</t>
          </list>While these conservative estimates are guaranteed to be less
        than or equal to the actual PMTU, they are likely to be much less than
        the actual PMTU. This may adversely affect upper-layer protocol
        performance.</t>

        <t>By executing <xref target="RFC1191">Path MTU Discovery
        (PMTUD)</xref> <xref target="RFC8201"/> procedures, a source node can
        maintain a less conservative estimate of the PMTU between itself and a
        destination node. In PMTUD, the source node produces an initial PMTU
        estimate. This initial estimate is equal to the MTU of the first link
        along the path to the destination node. It can be greater than the
        actual PMTU.</t>

        <t>Having produced an initial PMTU estimate, the source node sends
        non-fragmentable IP packets to the destination node (see NOTE 2). If
        one of these packets is larger than the actual PMTU, a downstream
        router will not be able to forward the packet through the next link
        along the path. Therefore, the downstream router drops the packet and
        sends an <xref target="RFC0792">Internet Control Message Protocol
        (ICMP)</xref> <xref target="RFC4443"/> Packet Too Big (PTB) message to
        the source node (see NOTE 3). The ICMP PTB message indicates the MTU
        of the link through which the packet could not be forwarded. The
        source node uses this information to refine its PMTU estimate.</t>

        <t>PMTUD produces a running estimate of the PMTU between a source node
        and a destination node. Because PMTU is dynamic, at any given time,
        the PMTU estimate can differ from the actual PMTU. In order to detect
        PMTU increases, PMTUD occasionally resets the PMTU estimate to its
        initial value and repeats the procedure described above.</t>
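
        <t>The procedure above can be sketched as a small simulation. This is
        an illustration only, not part of any specification: the function
        name, the list-of-link-MTUs model and the static path are assumptions
        made for the example, and ICMP PTB delivery is assumed to be
        perfectly reliable.</t>

        <figure>
          <artwork><![CDATA[
```python
def pmtud(link_mtus):
    """Simulate classic PMTUD over a static path.

    link_mtus lists the MTU of each link, first link first.
    Returns the final PMTU estimate and the number of ICMP
    PTB messages the source received along the way.
    """
    estimate = link_mtus[0]  # initial estimate: first-link MTU
    ptb_count = 0
    while True:
        # Send a non-fragmentable packet of size `estimate`.
        # The first link that is too small for it makes the
        # router in front of that link send a PTB message.
        too_small = [m for m in link_mtus if m < estimate]
        if not too_small:
            return estimate, ptb_count  # packet was delivered
        estimate = too_small[0]  # MTU reported in the PTB
        ptb_count += 1


print(pmtud([1500, 1400, 1492, 1280]))
```
]]></artwork>
        </figure>

        <t>In this sketch the estimate converges on the smallest link MTU
        after one PTB message per bottleneck encountered. If any of those PTB
        messages were lost, the loop would never refine the estimate.</t>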

        <t>Ideally, PMTUD operates as described above. However, in some
        scenarios, PMTUD fails. For example:</t>

        <t><list style="symbols">
            <t>PMTUD relies on the network's ability to deliver ICMP PTB
            messages to the source node. If the network cannot deliver ICMP
            PTB messages to the source node, PMTUD fails.</t>

            <t>PMTUD is susceptible to attack because ICMP messages are easily
            <xref target="RFC5927">forged</xref>. Such attacks can cause PMTUD
            to produce unnecessarily conservative PMTU estimates.</t>
          </list></t>

        <t>NOTE 1: In IPv4, every host must be capable of receiving a packet
        whose length is equal to 576 bytes. However, the IPv4 minimum link MTU
        is not 576. Section 3.2 of RFC 791 explicitly states that the IPv4
        minimum link MTU is 68 bytes. But for practical purposes, many network
        operators consider the IPv4 minimum link MTU to be 576 bytes. So, for
        the purposes of this document, we assume that the IPv4 minimum link
        MTU is 576 bytes.</t>

        <t>NOTE 2: A non-fragmentable packet can be fragmented at its source.
        However, it cannot be fragmented by a downstream node. An IPv4 packet
        whose DF-bit is set to zero is fragmentable. An IPv4 packet whose
        DF-bit is set to one is non-fragmentable. All IPv6 packets are also
        non-fragmentable.</t>

        <t>NOTE 3: The ICMP PTB message has two instantiations. In <xref
        target="RFC0792">ICMPv4</xref>, the ICMP PTB message is a Destination
        Unreachable message with Code equal to (4) fragmentation needed and DF
        set. This message was augmented by <xref target="RFC1191"/> to
        indicate the MTU of the link through which the packet could not be
        forwarded. In <xref target="RFC4443">ICMPv6</xref>, the ICMP PTB
        message is a Packet Too Big Message with Code equal to (0). This
        message also indicates the MTU of the link through which the packet
        could not be forwarded.</t>
      </section>

      <section title="Fragmentation Procedures">
        <t>When an upper-layer protocol submits data to the underlying IP
        module, and the resulting IP packet's length is greater than the PMTU,
        the packet is divided into fragments. Each fragment includes an IP
        header and a portion of the original packet.</t>

        <t><xref target="RFC0791"/> describes IPv4 fragmentation procedures.
        An IPv4 packet whose DF-bit is set to one can be fragmented by the
        source node, but cannot be fragmented by a downstream router. An IPv4
        packet whose DF-bit is set to zero can be fragmented by the source
        node or by a downstream router. When an IPv4 packet is fragmented, all
        IP options appear in the first fragment, but only options whose "copy"
        bit is set to one appear in subsequent fragments.</t>

        <t><xref target="RFC8200"/> describes IPv6 fragmentation procedures.
        An IPv6 packet can be fragmented at the source node only. When an IPv6
        packet is fragmented, all extension headers appear in the first
        fragment, but only per-fragment headers appear in subsequent
        fragments. Per-fragment headers include the following:</t>

        <t><list style="symbols">
            <t>The IPv6 header.</t>

            <t>The Hop-by-Hop Options header (if present).</t>

            <t>The Destination Options header (if present and if it precedes a
            Routing header).</t>

            <t>The Routing header (if present).</t>

            <t>The Fragment header.</t>
          </list></t>

        <t>In both IPv4 and IPv6, the upper-layer header appears in the first
        fragment only. It does not appear in subsequent fragments.</t>
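
        <t>As an illustration of the IPv4 case, the following sketch splits a
        payload into fragments. It is a simplified model, not the procedure
        text of RFC 791: the function name is hypothetical, the header is
        assumed to be 20 bytes with no options, and only offsets, lengths and
        the More Fragments flag are computed.</t>

        <figure>
          <artwork><![CDATA[
```python
def fragment_ipv4(payload_len, mtu, header_len=20):
    """Split an IPv4 payload into fragments that fit `mtu`.

    Returns (offset, length, more_fragments) tuples, where
    offset is in 8-byte units as in the IPv4 header.
    """
    # Every fragment but the last must carry a multiple of
    # 8 bytes, because offsets are expressed in 8-byte units.
    max_data = (mtu - header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small for any payload")
    fragments = []
    sent = 0
    while sent < payload_len:
        length = min(max_data, payload_len - sent)
        more = sent + length < payload_len
        fragments.append((sent // 8, length, more))
        sent += length
    return fragments


for frag in fragment_ipv4(4000, 1500):
    print(frag)
```
]]></artwork>
        </figure>

        <t>Because the upper-layer header sits at the start of the payload,
        only the fragment with offset zero can contain it.</t>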
      </section>

      <section anchor="upper" title="Upper-Layer Reliance on IP Fragmentation">
        <t>Upper-layer protocols can operate in the following modes:</t>

        <t><list style="symbols">
            <t>Do not rely on IP fragmentation.</t>

            <t>Rely on IP fragmentation by the source node only.</t>

            <t>Rely on IP fragmentation by any node.</t>
          </list></t>

        <t>Upper-layer protocols running over IPv4 can operate in all of the
        above-mentioned modes. Upper-layer protocols running over IPv6 can
        operate in the first and second modes only.</t>

        <t>Upper-layer protocols that operate in the first two modes (above)
        require access to the PMTU estimate. In order to fulfil this
        requirement, they can:</t>

        <t><list style="symbols">
            <t>Estimate the PMTU to be equal to the IPv4 or IPv6 minimum link
            MTU.</t>

            <t>Access the estimate that PMTUD produced.</t>

            <t>Execute PMTUD procedures themselves.</t>

            <t>Execute <xref target="RFC4821">Packetization Layer PMTUD
            (PLPMTUD)</xref> <xref target="I-D.ietf-tsvwg-datagram-plpmtud"/>
            procedures.</t>
          </list>According to PLPMTUD procedures, the upper-layer protocol
        maintains a running PMTU estimate. It does so by sending probe packets
        of various sizes to its upper-layer peer and receiving
        acknowledgements. This strategy differs from PMTUD in that it relies
        on acknowledgement of received messages, as opposed to ICMP PTB
        messages concerning dropped messages. Therefore, PLPMTUD does not rely
        on the network's ability to deliver ICMP PTB messages to the
        source.</t>
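
        <t>The probing idea can be sketched as a search over packet sizes.
        The sketch below is an assumption-laden illustration: the binary
        search strategy, the function name and the probe callback are chosen
        for the example and are not mandated by the PLPMTUD documents. It
        also treats every unacknowledged probe as "too large", whereas a real
        implementation has to distinguish that from ordinary loss.</t>

        <figure>
          <artwork><![CDATA[
```python
def plpmtud_search(probe, low, high):
    """Return the largest size in [low, high] that `probe` acks.

    probe(size) returns True when a probe packet of `size`
    bytes is acknowledged by the upper-layer peer. `low` is
    assumed to work (e.g., the protocol minimum link MTU).
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up to make progress
        if probe(mid):
            low = mid       # acknowledged: PMTU is at least mid
        else:
            high = mid - 1  # lost: treat mid as too large
    return low


actual_pmtu = 1400  # unknown to the sender in a real network
estimate = plpmtud_search(lambda s: s <= actual_pmtu, 1280, 1500)
print(estimate)
```
]]></artwork>
        </figure>

        <t>No ICMP message appears anywhere in the sketch; the only input is
        whether the peer acknowledged a probe.</t>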
      </section>
    </section>

    <section title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
      "OPTIONAL" in this document are to be interpreted as described in <xref
      target="RFC2119">BCP 14</xref> <xref target="RFC8174"/> when, and only
      when, they appear in all capitals, as shown here.</t>
    </section>

    <section anchor="dr" title="Reduced Reliability">
      <t>This section explains how IP fragmentation reduces the reliability of
      Internet communication.</t>

      <section anchor="mb" title="Policy-Based Routing">
        <t>IP fragmentation causes problems for routers that implement
        policy-based routing.</t>

        <t>When a router receives a packet, it identifies the next-hop on
        the route to the packet's destination and forwards the packet to that
        next-hop. In order to identify the next-hop, the router interrogates a
        local data structure called the Forwarding Information Base (FIB).</t>

        <t>Normally, the FIB contains destination-based entries that map a
        destination prefix to a next-hop. Policy-based routing allows
        destination-based and policy-based entries to coexist in the same FIB.
        A policy-based FIB entry maps multiple fields, drawn from either the
        IP or transport-layer header, to a next-hop.</t>

        <t/>

        <texttable anchor="FIB" style="full" title="Policy-Based Routing FIB">
          <ttcol align="center">Entry</ttcol>

          <ttcol align="left">Type</ttcol>

          <ttcol>Dest. Prefix</ttcol>

          <ttcol align="left">Next Hdr / Dest. Port</ttcol>

          <ttcol>Next-Hop</ttcol>

          <c/>

          <c/>

          <c/>

          <c/>

          <c/>

          <c>1</c>

          <c>Destination- based</c>

          <c>2001:db8::1/128</c>

          <c>Any / Any</c>

          <c>2001:db8::2</c>

          <c/>

          <c/>

          <c/>

          <c/>

          <c/>

          <c>2</c>

          <c>Policy- based</c>

          <c>2001:db8::1/128</c>

          <c>TCP / 80</c>

          <c>2001:db8::3</c>
        </texttable>

        <t>Assume that a router maintains the FIB in <xref target="FIB"/>. The
        first FIB entry is destination-based. It maps a destination prefix
        (2001:db8::1/128) to a next-hop (2001:db8::2). The second FIB entry is
        policy-based. It maps the same destination prefix (2001:db8::1/128)
        and a destination port (TCP / 80) to a different next-hop
        (2001:db8::3). The second entry is more specific than the first.</t>

        <t>When the router receives the first fragment of a packet that is
        destined for TCP port 80 on 2001:db8::1, it interrogates the FIB. Both
        FIB entries satisfy the query. The router selects the second FIB entry
        because it is more specific and forwards the packet to
        2001:db8::3.</t>

        <t>When the router receives the second fragment of the packet, it
        interrogates the FIB again. This time, only the first FIB entry
        satisfies the query, because the second fragment contains no
        indication that the packet is destined for TCP port 80. Therefore, the
        router selects the first FIB entry and forwards the packet to
        2001:db8::2.</t>
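
        <t>The two lookups described above can be reproduced with a toy FIB.
        The data layout, the function name and the "most matched fields wins"
        tie-break are assumptions made for this sketch; real forwarding
        planes implement policy-based lookups very differently.</t>

        <figure>
          <artwork><![CDATA[
```python
import ipaddress

FIB = [
    # (dest prefix, next header, dest port, next hop)
    ("2001:db8::1/128", None, None, "2001:db8::2"),  # entry 1
    ("2001:db8::1/128", "TCP", 80, "2001:db8::3"),   # entry 2
]


def lookup(dst, next_hdr=None, dst_port=None):
    """Return the next hop of the most specific matching entry.

    dst_port is None when the transport-layer header is not
    present, as in every fragment except the first.
    """
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, hdr, port, next_hop in FIB:
        if addr not in ipaddress.ip_network(prefix):
            continue
        if hdr is not None and hdr != next_hdr:
            continue
        if port is not None and port != dst_port:
            continue
        specificity = (hdr is not None) + (port is not None)
        if best is None or specificity > best[0]:
            best = (specificity, next_hop)
    return best[1] if best else None


print(lookup("2001:db8::1", "TCP", 80))    # first fragment
print(lookup("2001:db8::1", "TCP", None))  # trailing fragment
```
]]></artwork>
        </figure>

        <t>In the sketch, the first fragment matches the policy-based entry
        while the trailing fragment can match only the destination-based
        entry, so fragments of a single packet leave through different
        next-hops.</t>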

        <t>Policy-based routing is also known as filter-based forwarding.</t>
      </section>

      <section title="Network Address Translation (NAT)">
        <t>IP fragmentation causes problems for Network Address Translation
        (NAT) devices. When a NAT device detects a new, outbound flow, it maps
        that flow's source port and IP address to another source port and IP
        address. Having created that mapping, the NAT device translates:</t>

        <t><list style="symbols">
            <t>The Source IP Address and Source Port on each outbound
            packet.</t>

            <t>The Destination IP Address and Destination Port on each inbound
            packet.</t>
          </list></t>

        <t><xref target="RFC6346">A+P</xref> and <xref
        target="RFC6888">Carrier Grade NAT (CGN)</xref> are two common NAT
        strategies. In both approaches, the NAT device must virtually
        reassemble fragmented packets in order to translate and forward each
        fragment.</t>

        <t>Virtual reassembly in the network is problematic, because it is
        computationally expensive and because it is prone to <xref
        target="at">attacks</xref>.</t>
      </section>

      <section anchor="icf" title="Stateless Firewalls">
        <t>IP fragmentation causes problems for stateless firewalls whose
        rules include TCP and UDP ports. Because port information is not
        available in the trailing fragments, the firewall is limited to the
        following options:</t>

        <t><list style="symbols">
            <t>Accept all trailing fragments, possibly admitting certain
            classes of attack.</t>

            <t>Block all trailing fragments, possibly blocking legitimate
            traffic.</t>
          </list>Neither option is attractive.</t>

        <t>This problem does not occur in stateful firewalls or Network
        Address Translation (NAT) devices. Such devices maintain state so that
        they can afford identical treatment to each fragment that belongs to a
        packet.</t>
      </section>

      <section anchor="loadblalnce"
               title="Equal Cost Multipath, Link Aggregate Groups and Stateless Load-Balancers">
        <t>IP fragmentation causes problems for Equal Cost Multipath (ECMP),
        Link Aggregate Groups (LAG) and other stateless load-balancing
        technologies. In order to assign a packet or packet fragment to a
        link, an intermediate node executes a hash (i.e., load-balancing)
        algorithm. The following paragraphs describe a commonly deployed hash
        algorithm.</t>

        <t>If the packet or packet fragment contains a transport-layer header,
        the algorithm accepts the following 5-tuple as input:</t>

        <t><list style="symbols">
            <t>IP Source Address.</t>

            <t>IP Destination Address.</t>

            <t>IPv4 Protocol or IPv6 Next Header.</t>

            <t>transport-layer source port.</t>

            <t>transport-layer destination port.</t>
          </list>If the packet or packet fragment does not contain a
        transport-layer header, the algorithm accepts only the following
        3-tuple as input:</t>

        <t><list style="symbols">
            <t>IP Source Address.</t>

            <t>IP Destination Address.</t>

            <t>IPv4 Protocol or IPv6 Next Header.</t>
          </list></t>

        <t>Therefore, non-fragmented packets belonging to a flow can be
        assigned to one link while fragmented packets belonging to the same
        flow can be divided between that link and another. This can cause
        suboptimal load-balancing.</t>

        <t><xref target="RFC6438"/> offers a partial solution to this problem
        for IPv6 devices only. According to <xref target="RFC6438"/>:</t>

        <t>"At intermediate routers that perform load distribution, the hash
        algorithm used to determine the outgoing component-link in an ECMP
        and/or LAG toward the next hop MUST minimally include the 3-tuple
        {dest addr, source addr, flow label} and MAY also include the
        remaining components of the 5-tuple."</t>

        <t>If the algorithm includes only the 3-tuple {dest addr, source addr,
        flow label}, it will assign all fragments belonging to a packet to the
        same link.</t>
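
        <t>The difference between the two hash inputs can be illustrated with
        a short sketch. Everything here is hypothetical: the packet
        dictionaries, the function names and the use of SHA-256 (chosen only
        because it is deterministic and readily available) are assumptions,
        and deployed equipment uses much cheaper hash functions.</t>

        <figure>
          <artwork><![CDATA[
```python
import hashlib


def pick_link(fields, num_links):
    """Map a tuple of header fields to a link index."""
    digest = hashlib.sha256(repr(fields).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_links


def hash_input(pkt, use_flow_label):
    """Return the fields that the load balancer hashes."""
    if use_flow_label:  # 3-tuple in the style of RFC 6438
        return (pkt["dst"], pkt["src"], pkt["flow_label"])
    base = (pkt["src"], pkt["dst"], pkt["proto"])
    if "sport" in pkt:  # transport-layer header is visible
        return base + (pkt["sport"], pkt["dport"])
    return base         # trailing fragment: ports unavailable


first = {
    "src": "2001:db8::a", "dst": "2001:db8::1", "proto": 6,
    "flow_label": 0x12345, "sport": 49152, "dport": 80,
}
# A trailing fragment carries the same IP-layer fields but no
# transport-layer header.
trailing = {k: v for k, v in first.items()
            if k not in ("sport", "dport")}

same_link = (pick_link(hash_input(first, True), 4)
             == pick_link(hash_input(trailing, True), 4))
print(same_link)
```
]]></artwork>
        </figure>

        <t>With the flow-label 3-tuple, both fragments present identical hash
        input and therefore select the same link. With the port-based input,
        the two fragments present different tuples, so nothing ties them to
        the same link.</t>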
      </section>

      <section title="IPv4 Reassembly Errors at High Data Rates">
        <t>IPv4 fragmentation is not sufficiently robust for use under some
        conditions in today's Internet. At high data rates, the 16-bit IP
        identification field is not large enough to prevent IP fragments from
        frequently being reassembled incorrectly, and the TCP and UDP checksums are
        insufficient to prevent the resulting corrupted datagrams from being
        delivered to higher protocol layers. <xref target="RFC4963"/>
        describes some easily reproduced experiments demonstrating the
        problem, and discusses some of the operational implications of these
        observations.</t>

        <t>These reassembly issues are not easily reproducible in IPv6 because
        the IPv6 identification field is 32 bits long.</t>
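
        <t>A back-of-the-envelope calculation shows why the field width
        matters. The sketch below is an illustration under stated
        assumptions: a single source sends back-to-back packets of a fixed
        size at line rate to one destination, and each packet consumes one
        identification value. The function name and parameters are
        hypothetical.</t>

        <figure>
          <artwork><![CDATA[
```python
def id_wrap_seconds(bits_per_second, packet_bytes, id_bits=16):
    """Seconds until the identification field repeats.

    Assumes back-to-back packets of `packet_bytes` at line
    rate, each using one identification value.
    """
    packets_per_second = bits_per_second / (packet_bytes * 8)
    return 2 ** id_bits / packets_per_second


ipv4_wrap = id_wrap_seconds(1e9, 1500)              # 16-bit field
ipv6_wrap = id_wrap_seconds(1e9, 1500, id_bits=32)  # 32-bit field
print(ipv4_wrap, ipv6_wrap)
```
]]></artwork>
        </figure>

        <t>Under these assumptions, at 1 Gbit/s with 1500-byte packets the
        16-bit field wraps in well under one second, whereas the 32-bit field
        takes on the order of fourteen hours. A wrap time shorter than the
        receiver's reassembly timeout is what makes mis-association of
        fragments possible.</t>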
      </section>

      <section anchor="at" title="Security Vulnerabilities">
        <t>Security researchers have documented several attacks that exploit
        IP fragmentation. The following are examples:</t>

        <t><list style="symbols">
            <t>Overlapping fragment attacks <xref target="RFC1858"/><xref
            target="RFC3128"/><xref target="RFC5722"/></t>

            <t>Resource exhaustion attacks (such as the Rose Attack)</t>

            <t>Attacks based on predictable fragment identification values
            <xref target="RFC7739"/></t>

            <t>Evasion of Network Intrusion Detection Systems (NIDS) <xref
            target="Ptacek1998"/></t>
          </list>In the overlapping fragment attack, an attacker constructs a
        series of packet fragments. The first fragment contains an IP header,
        a transport-layer header, and some transport-layer payload. This
        fragment complies with local security policy and is allowed to pass
        through a stateless firewall. A second fragment, having a non-zero
        offset, overlaps with the first fragment. The second fragment also
        passes through the stateless firewall. When the packet is reassembled,
        the transport-layer header from the first fragment is overwritten by
        data from the second fragment. The reassembled packet does not comply
        with local security policy. Had it traversed the firewall in one
        piece, the firewall would have rejected it.</t>

        <t>A stateless firewall cannot protect against the overlapping
        fragment attack. However, destination nodes can protect against the
        overlapping fragment attack by implementing the procedures described
        in RFC 1858, RFC 3128 and RFC 8200. These reassembly procedures detect
        the overlap and discard the packet.</t>
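
        <t>The overlap check at the destination can be sketched as follows.
        This is a minimal illustration with hypothetical function names, not
        the text of any of the cited procedures. As a simplification, it
        treats exact duplicate fragments as overlaps, which a real
        implementation may prefer to tolerate.</t>

        <figure>
          <artwork><![CDATA[
```python
def has_overlap(fragments):
    """Return True if any two fragments cover the same byte.

    fragments is a list of (offset, length) pairs in bytes.
    """
    end_of_previous = 0
    for offset, length in sorted(fragments):
        if offset < end_of_previous:
            return True
        end_of_previous = max(end_of_previous, offset + length)
    return False


def reassembly_allowed(fragments):
    """Discard the whole packet when fragments overlap."""
    return not has_overlap(fragments)


benign = [(0, 1480), (1480, 520)]
attack = [(0, 1480), (8, 100)]  # rewrites bytes 8..107
print(reassembly_allowed(benign), reassembly_allowed(attack))
```
]]></artwork>
        </figure>

        <t>In the second example, the fragment at offset 8 would overwrite
        part of the transport-layer header carried in the first fragment, so
        the sketch refuses to reassemble the packet.</t>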

        <t>The fragment reassembly algorithm is a stateful procedure for an
        otherwise stateless protocol. Therefore, it can be exploited by
        resource exhaustion attacks. An attacker can construct a series of
        fragmented packets, with one fragment missing from each packet so that
        the reassembly is impossible. Thus, this attack causes resource
        exhaustion on the destination node, possibly denying reassembly
        services to other flows. This type of attack can be mitigated by
        flushing fragment reassembly buffers when necessary, at the expense of
        possibly dropping legitimate fragments.</t>

        <t>Each IP fragment contains an "Identification" field that
        destination nodes use to reassemble fragmented packets. Many
        implementations set the Identification field to a predictable value,
        thus making it easy for an attacker to forge malicious IP fragments
        that would cause the reassembly procedure for legitimate packets to
        fail.</t>

        <t>NIDS aims at identifying malicious activity by analyzing network
        traffic. Ambiguity in the possible result of the fragment reassembly
        process may allow an attacker to evade these systems. Many of these
        systems try to mitigate some of these evasion techniques (e.g., by
        computing all possible outcomes of the fragment reassembly process, at
        the expense of increased processing requirements).</t>
      </section>

      <section anchor="PTB" title="PMTU Blackholing Due to ICMP Loss">
        <t>As mentioned in <xref target="upper"/>, upper-layer protocols can
        be configured to rely on PMTUD. Because PMTUD relies upon the network
        to deliver ICMP PTB messages, those protocols also rely on the
        network to deliver ICMP PTB messages.</t>

        <t>According to <xref target="RFC4890"/>, ICMP PTB messages must not
        be filtered. However, ICMP PTB delivery is not reliable. It is subject
        to both transient and persistent loss.</t>

        <t>Transient loss of ICMP PTB messages can cause transient PMTU black
        holes. When the conditions contributing to transient loss abate, the
        network regains its ability to deliver ICMP PTB messages and
        connectivity between the source and destination nodes is restored.
        <xref target="transLoss"/> of this document describes conditions that
        lead to transient loss of ICMP PTB messages.</t>

        <t>Persistent loss of ICMP PTB messages can cause persistent black
        holes. <xref target="CPE"/> and <xref target="Anycast"/> of this
        document describe conditions that lead to persistent loss of ICMP PTB
        messages.</t>

        <t>The problem described in this section is specific to PMTUD. It does
        not occur when the upper-layer protocol obtains its PMTU estimate from
        PLPMTUD or from any other source.</t>

        <section anchor="transLoss" title="Transient Loss">
          <t>The following factors can contribute to transient loss of ICMP
          PTB messages:</t>

          <t><list style="symbols">
              <t>Network congestion.</t>

              <t>Packet corruption.</t>

              <t>Transient routing loops.</t>

              <t>ICMP rate limiting.</t>
            </list></t>

          <t>The effect of rate limiting may be severe, as RFC 4443 recommends
          strict rate limiting of ICMPv6 traffic.</t>
        </section>

        <section anchor="CPE"
                 title="Incorrect Implementation of Security Policy">
          <t>Incorrect implementation of security policy can cause persistent
          loss of ICMP PTB messages.</t>

          <t>Assume that a Customer Premises Equipment (CPE) router implements
          the following zone-based security policy:</t>

          <t><list style="symbols">
              <t>Allow any traffic to flow from the inside zone to the outside
              zone.</t>

              <t>Do not allow any traffic to flow from the outside zone to the
              inside zone unless it is part of an existing flow (i.e., it was
              elicited by an outbound packet).</t>
            </list>When a correct implementation of the above-mentioned
          security policy receives an ICMP PTB message, it examines the ICMP
          PTB payload in order to determine whether the original packet (i.e.,
          the packet that elicited the ICMP PTB message) belonged to an
          existing flow. If the original packet belonged to an existing flow,
          the implementation allows the ICMP PTB to flow from the outside zone
          to the inside zone. If not, the implementation discards the ICMP PTB
          message.</t>

          <t>When an incorrect implementation of the above-mentioned security
          policy receives an ICMP PTB message, it discards the packet because
          its source address is not associated with an existing flow.</t>
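
          <t>The difference between the two behaviors can be sketched as
          follows. The flow table layout, the dictionary keys and the
          function names are assumptions made for illustration, and the
          addresses are taken from documentation ranges.</t>

          <figure>
            <artwork><![CDATA[
```python
# Flows opened from the inside zone:
# (src, dst, protocol, src port, dst port)
flows = {("192.0.2.10", "198.51.100.5", "TCP", 49152, 443)}


def accept_ptb_correct(ptb):
    """Match on the packet quoted inside the ICMP PTB payload."""
    return ptb["quoted_packet"] in flows


def accept_ptb_incorrect(ptb):
    """Match on the outer source address of the PTB itself."""
    return any(ptb["outer_src"] == flow[1] for flow in flows)


# A router in the middle of the path (203.0.113.1) reports
# that a packet belonging to the existing flow was too big.
ptb = {
    "outer_src": "203.0.113.1",
    "quoted_packet": ("192.0.2.10", "198.51.100.5",
                      "TCP", 49152, 443),
}
print(accept_ptb_correct(ptb), accept_ptb_incorrect(ptb))
```
]]></artwork>
          </figure>

          <t>The PTB message originates at an intermediate router, not at the
          flow's destination, so a check against the outer source address
          rejects it even though the quoted packet belongs to an existing
          flow.</t>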

          <t>The security policy described above is implemented incorrectly on
          many consumer CPE routers.</t>
        </section>

        <section anchor="Anycast" title="Persistent Loss Caused By Anycast">
          <t>Anycast can cause persistent loss of ICMP PTB messages. Consider
          the example below:</t>

          <t>A DNS client sends a request to an anycast address. The network
          routes that DNS request to the nearest instance of that anycast
          address (i.e., a DNS server). The DNS server generates a response
          and sends it back to the DNS client. While the response does not
          exceed the DNS server's PMTU estimate, it does exceed the actual
          PMTU.</t>

          <t>A downstream router drops the packet and sends an ICMP PTB
          message to the packet's source (i.e., the anycast address). The network
          routes the ICMP PTB message to the anycast instance closest to the
          downstream router. That anycast instance may not be the DNS server
          that originated the DNS response. It may be another DNS server with
          the same anycast address. The DNS server that originated the
          response may never receive the ICMP PTB message and may never update
          its PMTU estimate.</t>
        </section>
      </section>

      <section title="Blackholing Due To Filtering or Loss">
        <t>In RFC 7872, researchers sampled Internet paths to determine
        whether they would convey packets that contain IPv6 extension headers.
        Sampled paths terminated at popular Internet sites (e.g., popular web,
        mail and DNS servers).</t>

        <t>The study revealed that at least 28% of the sampled paths did not
        convey packets containing the IPv6 Fragment extension header. In most
        cases, fragments were dropped in the destination autonomous system. In
        other cases, the fragments were dropped in transit autonomous
        systems.</t>

        <t>Another <xref target="Huston">recent study</xref> confirmed this
        finding. It reported that 37% of sampled endpoints used IPv6-capable
        DNS resolvers that were incapable of receiving a fragmented IPv6
        response.</t>

        <t>It is difficult to determine why network operators drop fragments.
        Possible causes follow:</t>

        <t><list style="symbols">
            <t>Hardware inability to process fragmented packets.</t>

            <t>Failure to change vendor defaults.</t>

            <t>Unintentional misconfiguration.</t>

            <t>Intentional configuration (e.g., network operators consciously
            choose to drop IPv6 fragments in order to address the issues
            raised in <xref target="mb"/> through <xref target="PTB"/>,
            above).</t>
          </list></t>
      </section>
    </section>

    <section title="Alternatives to IP Fragmentation">
      <t/>

      <section title="Transport Layer Solutions">
        <t>The <xref target="RFC0793">Transmission Control Protocol (TCP)</xref>
        can be operated in a mode that does not require IP fragmentation.</t>

        <t>Applications submit a stream of data to TCP. TCP divides that
        stream of data into segments, with no segment exceeding the TCP
        Maximum Segment Size (MSS). Each segment is encapsulated in a TCP
        header and submitted to the underlying IP module. The underlying IP
        module prepends an IP header and forwards the resulting packet.</t>

        <t>If the TCP MSS is sufficiently small, the underlying IP module
        never produces a packet whose length is greater than the actual PMTU.
        Therefore, IP fragmentation is not required.</t>
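        <t>As an illustration only, the relationship between the MSS and the
        MTU can be sketched as follows. The sketch assumes header sizes with
        no IP options, no IPv6 extension headers and no TCP options; the
        function and constant names are hypothetical and are not taken from
        any implementation.</t>

        <figure>
          <artwork><![CDATA[
```python
# Assumed header sizes in bytes (no IP options, no IPv6 extension
# headers, no TCP options). These are illustrative constants.
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20


def max_mss(mtu: int, ipv6: bool = False) -> int:
    """Return the largest MSS whose packets fit in `mtu` unfragmented."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    mss = mtu - ip_header - TCP_HEADER
    if mss <= 0:
        raise ValueError("MTU too small to carry a TCP segment")
    return mss


print(max_mss(1500))             # 1460
print(max_mss(1280, ipv6=True))  # 1220
```
]]></artwork>
        </figure>

        <t>If options or extension headers are present, the usable segment
        size is correspondingly smaller than the values shown.</t>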

        <t>TCP offers the following mechanisms for MSS management:</t>

        <t><list style="symbols">
            <t>Manual configuration</t>

            <t>PMTUD</t>

            <t>PLPMTUD</t>
          </list></t>

        <t>Manual configuration is always applicable. If the MSS is configured
        to a sufficiently low value, the IP layer will never produce a packet
        whose length is greater than the protocol minimum link MTU. However,
        manual configuration prevents TCP from taking advantage of larger link
        MTUs.</t>

        <t>Upper-layer protocols can implement PMTUD in order to discover and
        take advantage of larger path MTUs. However, as mentioned in <xref
        target="pmtu"/>, PMTUD relies upon the network to deliver ICMP PTB
        messages. Therefore, PMTUD is applicable only in environments where
        the risk of ICMP PTB loss is acceptable.</t>

        <t>By contrast, PLPMTUD does not rely upon the network's ability to
        deliver ICMP PTB messages. However, in many loss-based TCP congestion
        control algorithms, the loss of a packet may cause the sender to
        reduce its congestion window, or even to restart the entire slow
        start process. For high-capacity, long round-trip time, large-volume
        TCP streams, deliberate probing with large packets and the
        consequent packet loss may impose too harsh a penalty on total TCP
        throughput for PLPMTUD to be a viable approach. <xref
        target="RFC4821"/> defines PLPMTUD procedures for TCP.</t>

        <t>While TCP will never cause the underlying IP module to emit a
        packet that is larger than the PMTU estimate, it can cause the
        underlying IP module to emit a packet that is larger than the actual
        PMTU. If this occurs, the packet is dropped, the PMTU estimate is
        updated, the segment is divided into smaller segments and each smaller
        segment is submitted to the underlying IP module.</t>

        <t>The <xref target="RFC4340">Datagram Congestion Control Protocol
        (DCCP)</xref> and the <xref target="RFC4960">Stream Control
        Transmission Protocol (SCTP)</xref> also can be operated in a mode
        that does not require IP fragmentation. They both accept data from an
        application and divide that data into segments, with no segment
        exceeding a maximum size. Both DCCP and SCTP offer manual
        configuration, PMTUD and PLPMTUD as mechanisms for managing that
        maximum size. <xref target="I-D.ietf-tsvwg-datagram-plpmtud"/>
        proposes PLPMTUD procedures for DCCP and SCTP.</t>

        <t>Currently, the <xref target="RFC0768">User Datagram Protocol (UDP)</xref>
        lacks a fragmentation mechanism of its own and relies on IP
        fragmentation. However, <xref target="I-D.ietf-tsvwg-udp-options"/>
        proposes a fragmentation mechanism for UDP.</t>
      </section>

      <section title="Application Layer Solutions">
        <t><xref target="RFC8085"/> recognizes that IP fragmentation reduces
        the reliability of Internet communication. It also recognizes that UDP
        lacks a fragmentation mechanism of its own and relies on IP
        fragmentation. Therefore, <xref target="RFC8085"/> offers the
        following advice regarding applications that run over UDP:</t>

        <t>"An application SHOULD NOT send UDP datagrams that result in IP
        packets that exceed the Maximum Transmission Unit (MTU) along the path
        to the destination. Consequently, an application SHOULD either use the
        path MTU information provided by the IP layer or implement Path MTU
        Discovery (PMTUD) itself to determine whether the path to a
        destination will support its desired message size without
        fragmentation."</t>

        <t>RFC 8085 continues:</t>

        <t>"Applications that do not follow the recommendation to do
        PMTU/PLPMTUD discovery SHOULD still avoid sending UDP datagrams that
        would result in IP packets that exceed the path MTU. Because the
        actual path MTU is unknown, such applications SHOULD fall back to
        sending messages that are shorter than the default effective MTU for
        sending (EMTU_S in <xref target="RFC1122"/>). For IPv4, EMTU_S is the
        smaller of 576 bytes and the first-hop MTU. For IPv6, EMTU_S is 1280
        bytes. The effective PMTU for a directly connected destination (with
        no routers on the path) is the configured interface MTU, which could
        be less than the maximum link payload size. Transmission of
        minimum-sized UDP datagrams is inefficient over paths that support a
        larger PMTU, which is a second reason to implement PMTU
        discovery."</t>

        <t>RFC 8085 assumes that for IPv4, an EMTU_S of 576 is sufficiently
        small, even though the IPv4 minimum link MTU is 68 bytes.</t>

        <t>This advice applies equally to applications that run directly over
        IP.</t>
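        <t>The fallback described in the quoted RFC 8085 text can be sketched
        as follows. This is a minimal illustration, assuming an IP header
        without options or extension headers; the function names are
        hypothetical. It computes an upper bound on the UDP payload such that
        the resulting IP packet does not exceed the default EMTU_S.</t>

        <figure>
          <artwork><![CDATA[
```python
# Assumed header sizes in bytes (no IP options or extension headers).
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8


def default_emtu_s(first_hop_mtu: int, ipv6: bool = False) -> int:
    """Default EMTU_S as summarized in the RFC 8085 text quoted above."""
    if ipv6:
        return 1280
    return min(576, first_hop_mtu)


def max_udp_payload(first_hop_mtu: int, ipv6: bool = False) -> int:
    """Upper bound on UDP payload that keeps the packet within EMTU_S."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return default_emtu_s(first_hop_mtu, ipv6) - ip_header - UDP_HEADER


print(max_udp_payload(1500))             # 548
print(max_udp_payload(1500, ipv6=True))  # 1232
print(max_udp_payload(296))              # 268
```
]]></artwork>
        </figure>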
      </section>
    </section>

    <section anchor="rely"
             title="Applications That Rely on IPv6 Fragmentation">
      <t>The following applications rely on IPv6 fragmentation:</t>

      <t><list style="symbols">
          <t><xref target="RFC1035">DNS</xref></t>

          <t><xref target="RFC2328">OSPFv2</xref> and <xref
          target="RFC5340">OSPFv3</xref></t>

          <t>Packet-in-packet encapsulations</t>
        </list>Each of these applications relies on IPv6 fragmentation to a
      varying degree. In some cases, that reliance is essential, and cannot be
      broken without fundamentally changing the protocol. In other cases, that
      reliance is incidental, and most implementations already take
      appropriate steps to avoid fragmentation.</t>

      <t>This list is not comprehensive, and other protocols that rely on IP
      fragmentation may exist. They are not specifically considered in the
      context of this document.</t>

      <section title="DNS">
        <t>DNS relies on UDP for efficiency, and the consequence is the use of
        IP fragmentation for large responses, as permitted by the DNS EDNS(0)
        options in the query. It is possible to mitigate the issue of
        fragmentation-based packet loss by having queries use smaller EDNS(0)
        UDP buffer sizes, or by having the DNS server limit the size of its
        UDP responses to some self-imposed maximum packet size that may be
        less than the preferred EDNS(0) UDP Buffer Size. In both cases, large
        responses are truncated in the DNS, signalling to the client to
        re-query using TCP to obtain the complete response. However, the
        operational issue of the partial level of support for DNS over TCP,
        particularly in the case where IPv6 transport is being used, becomes a
        limiting factor of the efficacy of this approach <xref
        target="Damas"/>.</t>
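        <t>The truncation decision described above might be sketched as
        follows. This is an illustration under stated assumptions, not a
        description of any particular DNS server: the function name and the
        server_max_udp parameter are hypothetical, and the 512-byte floor
        reflects the classic DNS UDP message limit that applies when no
        EDNS(0) buffer size is advertised.</t>

        <figure>
          <artwork><![CDATA[
```python
from typing import Optional

CLASSIC_DNS_UDP_LIMIT = 512  # bytes; applies when EDNS(0) is absent


def must_truncate(response_len: int,
                  client_edns_size: Optional[int],
                  server_max_udp: int) -> bool:
    """Return True if the UDP response should be truncated (TC bit set).

    client_edns_size is the EDNS(0) UDP buffer size from the query, or
    None if the query carried no EDNS(0) option. server_max_udp is a
    hypothetical, self-imposed server limit on UDP response size.
    """
    if client_edns_size is None:
        limit = CLASSIC_DNS_UDP_LIMIT
    else:
        # Advertised sizes below 512 are treated as 512.
        limit = max(CLASSIC_DNS_UDP_LIMIT, client_edns_size)
    limit = min(limit, server_max_udp)
    return response_len > limit


# A 1400-byte response, client advertises 4096, server caps at 1232:
print(must_truncate(1400, 4096, 1232))  # True
```
]]></artwork>
        </figure>

        <t>In this sketch, a truncated response signals the client to retry
        over TCP, which is subject to the deployment caveat noted above.</t>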

        <t>Larger DNS responses can normally be avoided by aggressively
        pruning the Additional section of DNS responses. One scenario where
        such pruning is ineffective is in the use of DNSSEC, where large key
        sizes act to increase the response size to certain DNS queries. There
        is no effective response to this situation within the DNS other than
        using smaller cryptographic keys and adoption of DNSSEC administrative
        practices that attempt to keep DNS responses as short as possible.</t>
      </section>

      <section title="OSPF">
        <t>OSPF implementations can emit messages large enough to cause
        fragmentation. However, in order to optimize performance, most OSPF
        implementations restrict their maximum message size to a value that
        will not cause fragmentation.</t>
      </section>

      <section title="Packet-in-Packet Encapsulations">
        <t>In this document, packet-in-packet encapsulations include <xref
        target="RFC2003">IP-in-IP</xref>, <xref target="RFC2784">Generic
        Routing Encapsulation (GRE)</xref>, <xref
        target="RFC8086">GRE-in-UDP</xref> and <xref target="RFC2473">Generic
        Packet Tunneling in IPv6</xref>. <xref target="RFC4459"/> describes
        fragmentation issues associated with all of the above-mentioned
        encapsulations.</t>

        <t>The fragmentation strategy described for GRE in <xref
        target="RFC7588"/> has been deployed for all of the above-mentioned
        encapsulations. This strategy does not rely on IP fragmentation except
        in one corner case (see Section 3.3.2.2 of RFC 7588 and Section 7.1
        of RFC 2473). Section 3.3 of <xref target="RFC7676"/> further
        describes this corner case.</t>

        <t>See <xref target="I-D.ietf-intarea-tunnels"/> for further
        discussion.</t>
      </section>

      <section title="Licklider Transmission Protocol (LTP)">
        <t>Some UDP applications rely on IP fragmentation to achieve
        acceptable levels of performance. These applications use UDP datagram
        sizes that are larger than the path MTU so that more data can be
        conveyed between the application and the kernel in a single system
        call.</t>

        <t>For example, the <xref target="RFC5326">Licklider Transmission
        Protocol (LTP)</xref>, which is in current use on the International
        Space Station (ISS), uses UDP datagram sizes larger than the path MTU
        to achieve acceptable levels of performance, even though this invokes
        IP fragmentation.</t>

        <t/>
      </section>
    </section>

    <section title="Recommendations">
      <t/>

      <section title="For Application and Protocol Developers">
        <t>Developers SHOULD NOT develop new protocols or applications that
        rely on IP fragmentation. When a new protocol or application is
        deployed in an environment that does not fully support IP
        fragmentation, it SHOULD operate correctly, either in its default
        configuration or in a specified alternative configuration.</t>

        <t>Developers MAY develop new protocols or applications that rely on
        IP fragmentation if the protocol or application is to be run only in
        environments where IP fragmentation is known to be supported.</t>

        <t>Legacy protocols that depend upon IP fragmentation SHOULD be
        updated to break that dependency. However, in some cases, there may be
        no viable alternative to IP fragmentation (e.g., IPsec tunnel mode,
        IP-in-IP encapsulation). In these cases, the protocol will continue to
        rely on IP fragmentation but should only be used in environments where
        IP fragmentation is known to be supported.</t>

        <t>Protocols may be able to avoid IP fragmentation by using a
        sufficiently small MTU (e.g., the protocol minimum link MTU), disabling
        IP fragmentation, and ensuring that the transport protocol in use
        adapts its segment size to the MTU. Other protocols may deploy a
        sufficiently reliable PMTU discovery mechanism (e.g., PLPMTUD).</t>
      </section>

      <section title="For System Developers">
        <t>Software libraries SHOULD include provision for PLPMTUD for each
        supported transport protocol.</t>
      </section>

      <section title="For Middle Box Developers">
        <t>Middle boxes SHOULD process IP fragments in a manner that is
        consistent with <xref target="RFC0791"/> and <xref target="RFC8200"/>.
        In many cases, middle boxes must maintain state in order to achieve
        this goal.</t>

        <t>Price and performance considerations frequently motivate network
        operators to deploy stateless middle boxes. These stateless middle
        boxes may perform sub-optimally, process IP fragments in a manner that
        is not compliant with RFC 791 or RFC 8200, or even discard IP
        fragments completely. Such behaviors are NOT RECOMMENDED. If a
        middle box implements non-standard behavior with respect to IP
        fragmentation, then that behavior MUST be clearly documented.</t>
      </section>

      <section title="For ECMP, LAG and Load-Balancer Developers And Operators">
        <t>In their default configuration, when the IPv6 Flow Label is not
        equal to zero, IPv6 devices that implement ECMP, LAG or other
        load-balancing technologies SHOULD accept only the following fields as
        input to their hash algorithm:</t>

        <t><list style="symbols">
            <t>IP Source Address.</t>

            <t>IP Destination Address.</t>

            <t>Flow Label.</t>
          </list>Operators SHOULD deploy these devices in their default
        configuration.</t>
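        <t>The following sketch illustrates why restricting the hash input
        to these three fields keeps all fragments of a packet on the same
        path: every fragment carries the same source address, destination
        address and flow label, so the selected path is identical. The
        function name is hypothetical, and SHA-256 is used only as a
        convenient stand-in; it is an assumption of this example and not a
        hash that forwarding hardware is expected to use.</t>

        <figure>
          <artwork><![CDATA[
```python
import hashlib
import ipaddress


def select_path(src: str, dst: str, flow_label: int,
                num_paths: int) -> int:
    """Pick a path index from (source, destination, flow label) only."""
    # The recommendation applies when the 20-bit flow label is non-zero.
    if not 0 < flow_label <= 0xFFFFF:
        raise ValueError("flow label must be a non-zero 20-bit value")
    key = (ipaddress.IPv6Address(src).packed
           + ipaddress.IPv6Address(dst).packed
           + flow_label.to_bytes(3, "big"))
    # Stand-in hash for illustration; real devices use their own.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


# No transport-layer ports are consulted, so a first fragment and a
# later fragment of the same packet yield the same path index.
first = select_path("2001:db8::1", "2001:db8::2", 0x12345, 4)
later = select_path("2001:db8::1", "2001:db8::2", 0x12345, 4)
print(first == later)  # True
```
]]></artwork>
        </figure>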
      </section>

      <section title="For Network Operators">
        <t>Operators MUST ensure proper PMTUD operation in their network,
        including making sure the network generates PTB messages when it
        drops packets that are too large for the outgoing interface MTU.</t>

        <t>As per RFC 4890, network operators MUST NOT filter ICMPv6 PTB
        messages unless they are known to be forged or otherwise illegitimate.
        As stated in <xref target="PTB"/>, filtering ICMPv6 PTB packets causes
        PMTUD to fail. Many upper-layer protocols rely on PMTUD.</t>

        <t>As per RFC 8200, network operators MUST NOT deploy IPv6 links whose
        MTU is less than 1280 bytes.</t>

        <t>Network operators SHOULD NOT filter IP fragments if they originated
        at a domain name server or are destined for a domain name server.</t>
      </section>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>This document makes no request of IANA.</t>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>This document mitigates some of the security considerations
      associated with IP fragmentation by discouraging its use. It does not
      introduce any new security vulnerabilities, because it does not
      introduce any new alternatives to IP fragmentation. Instead, it
      recommends well-understood alternatives.</t>
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>Thanks to Mikael Abrahamsson, Brian Carpenter, Silambu Chelvan,
      Lorenzo Colitti, Mike Heard, Tom Herbert, Tatuya Jinmei, Jen Linkova,
      Paolo Lucente, Manoj Nayak, Eric Nygren, and Joe Touch for their
      comments.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>

      <?rfc include='reference.RFC.8174'?>

      <?rfc include='reference.RFC.8085'?>

      <?rfc include='reference.RFC.8200'?>

      <?rfc include='reference.RFC.0791'?>

      <?rfc include='reference.RFC.8201'?>

      <?rfc include='reference.RFC.4821'?>

      <?rfc include='reference.RFC.1191'?>

      <?rfc include='reference.RFC.0792'?>

      <?rfc include='reference.RFC.0793'?>

      <?rfc include='reference.RFC.0768'?>

      <?rfc include='reference.RFC.1035'?>

      <?rfc include='reference.RFC.4443'?>
    </references>

    <references title="Informative References">
      <?rfc include='reference.RFC.7872'?>

      <?rfc include='reference.RFC.1122'?>

      <?rfc include='reference.RFC.6438'?>

      <?rfc include='reference.RFC.1858'?>

      <?rfc include='reference.RFC.2473'?>

      <?rfc include='reference.RFC.4960'?>

      <?rfc include='reference.RFC.5927'?>

      <?rfc include='reference.RFC.6346'?>

      <?rfc include='reference.RFC.4340'?>

      <?rfc include='reference.RFC.2003'?>

      <?rfc include='reference.RFC.5340'?>

      <?rfc include='reference.RFC.4890'?>

      <?rfc include='reference.RFC.2784'?>

      <?rfc include='reference.RFC.7676'?>

      <?rfc include='reference.RFC.5722'?>

      <?rfc include='reference.RFC.7739'?>

      <?rfc include='reference.RFC.7588'?>

      <?rfc include='reference.RFC.8086'?>

      <?rfc include='reference.RFC.4459'?>

      <?rfc include='reference.RFC.6888'?>

      <?rfc include='reference.RFC.4963'?>

      <?rfc include='reference.RFC.2328'?>

      <?rfc include='reference.RFC.5326'?>

      <?rfc include='reference.I-D.ietf-tsvwg-datagram-plpmtud'?>

      <?rfc include='reference.I-D.ietf-tsvwg-udp-options'?>

      <?rfc include='reference.I-D.ietf-intarea-tunnels'?>

      <?rfc include='reference.RFC.3128'?>

      <reference anchor="Huston"
                 target="http://www.potaroo.net/ispcol/2017-08/xtn-hdrs.html">
        <front>
          <title>IPv6, Large UDP Packets and the DNS</title>

          <author fullname="Geoff Huston" initials="G." surname="Huston">
            <organization/>
          </author>

          <date month="August" year="2017"/>
        </front>
      </reference>

      <reference anchor="Kent"
                 target="http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-87-3.pdf">
        <front>
          <title>"Fragmentation Considered Harmful", In Proc. SIGCOMM '87
          Workshop on Frontiers in Computer Communications Technology, DOI
          10.1145/55483.55524</title>

          <author fullname="Kent" initials="C." surname="Kent">
            <organization/>
          </author>

          <author fullname="Mogul" initials="J." surname="Mogul">
            <organization/>
          </author>

          <date month="August" year="1987"/>
        </front>
      </reference>

      <reference anchor="Damas"
                 target="http://www.potaroo.net/ispcol/2018-04/atr.html">
        <front>
          <title>Measuring ATR</title>

          <author fullname="Joao Damas" initials="J." surname="Damas">
            <organization/>
          </author>

          <author fullname="Geoff Huston" initials="G." surname="Huston">
            <organization/>
          </author>

          <date month="April" year="2018"/>
        </front>
      </reference>

      <reference anchor="Ptacek1998"
                 target="http://www.aciri.org/vern/Ptacek-Newsham-Evasion-98.ps">
        <front>
          <title>Insertion, Evasion and Denial of Service: Eluding Network
          Intrusion Detection</title>

          <author fullname="T. H. Ptacek" initials="T. H." surname="Ptacek">
            <organization>Secure Networks, Inc.</organization>
          </author>

          <author fullname="T. N. Newsham" initials="T. N." surname="Newsham">
            <organization>Secure Networks, Inc.</organization>
          </author>

          <date year="1998"/>
        </front>
      </reference>
    </references>

  </back>
</rfc>
