<?xml version="1.0"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY RFC0792 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.0792.xml" >
<!ENTITY RFC1812 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.1812.xml" >
<!ENTITY RFC1933 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.1933.xml" >
<!ENTITY RFC2119 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml" >
<!ENTITY RFC2473 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.2473.xml" >
<!ENTITY RFC2983 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.2983.xml" >
<!ENTITY RFC3270 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.3270.xml">
<!ENTITY RFC3443 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.3443.xml">
<!ENTITY RFC4379 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.4379.xml">
<!ENTITY RFC4884 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.4884.xml">
<!ENTITY RFC4950 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.4950.xml">
<!ENTITY RFC7348 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.7348.xml">
<!ENTITY RFC7365 SYSTEM "http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.7365.xml">
<!ENTITY NVGRE PUBLIC '' 
'http://xml2rfc.ietf.org/public/rfc/bibxml3/reference.I-D.sridharan-virtualization-nvgre.xml'>
<!ENTITY LIME PUBLIC '' 
'http://xml2rfc.ietf.org/public/rfc/bibxml3/reference.I-D.tissa-lime-yang-oam-model.xml'>
<!ENTITY NVO-SEC PUBLIC '' 
'http://xml2rfc.ietf.org/public/rfc/bibxml3/reference.I-D.ietf-nvo3-security-requirements.xml'>
<!ENTITY GENEVE PUBLIC '' 
'http://xml2rfc.ietf.org/public/rfc/bibxml3/reference.I-D.gross-geneve.xml'>
<!ENTITY GUE PUBLIC '' 
'http://xml2rfc.ietf.org/public/rfc/bibxml3/reference.I-D.herbert-gue.xml'>
]>


<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc compact='yes'?>
<?rfc subcompact="no"?>
<?rfc iprnotified="no" ?>
<?rfc strict="no" ?>

<rfc category="std" ipr="trust200902"
     docName="draft-nordmark-nvo3-transcending-traceroute-03">

  <front>
    <title abbrev="LTTON">
      Layer-Transcending Traceroute for Overlay Networks like VXLAN
    </title>

    <author initials="E" surname="Nordmark" fullname="Erik Nordmark">
      <organization>Arista Networks</organization>
      <address>
        <postal>
          <street></street>
          <city>Santa Clara, CA</city>
          <country>USA</country>
        </postal>
        <email>nordmark@arista.com</email>
      </address>
    </author>

    <author initials="C" surname="Appanna" fullname="Chandra Appanna">
      <organization>Arista Networks</organization>
      <address>
        <postal>
          <street></street>
          <city>Santa Clara, CA</city>
          <country>USA</country>
        </postal>
        <email>achandra@arista.com</email>
      </address>
    </author>

    <author initials="A" surname="Lo" fullname="Alton Lo">
      <organization>Arista Networks</organization>
      <address>
        <postal>
          <street></street>
          <city>Santa Clara, CA</city>
          <country>USA</country>
        </postal>
        <email>altonlo@arista.com</email>
      </address>
    </author>

    <author initials="S" surname="Boutros" fullname="Sami Boutros">
      <organization>VMware</organization>
      <address>
        <email>sboutros@vmware.com</email>
      </address>
    </author>

    <author initials="A" surname="Dubey" fullname="Ankur Dubey">
      <organization>VMware</organization>
      <address>
        <email>adubey@vmware.com</email>
      </address>
    </author>

    <date month="Jul" year="2016"/>
    <area>Internet</area>
    <workgroup>NVO3 WG</workgroup>
    <keyword>NVO3</keyword>
    <keyword>VXLAN</keyword>

    <abstract>
      <t>Tools like traceroute have been very valuable for the operation of the Internet. Part of that value comes from being able to display information about routers and paths over which the user of the tool has no control, but the traceroute output can be passed along to someone else that can further investigate or fix the problem.</t>

      <t>In overlay networks such as VXLAN and NVGRE the prevailing view is that since the overlay network has no control of the underlay there need to be special tools and agreements to enable extracting traces from the underlay. We argue that enabling visibility into the underlay and using existing tools like traceroute has been overlooked and would add value in many deployments of overlay networks.</t>

      <t>This document specifies an approach that can be used to make traceroute transcend layers of encapsulation including details for how to apply this to VXLAN. The technique can be applied to other encapsulations used for overlay networks. It can also be implemented using current commercial silicon.</t>
    </abstract>
  </front>

  <middle>
    <section title="Introduction">
      <t>Tools like traceroute have been very valuable for the operation of the Internet. Part of that value comes from being able to display information about routers and paths over which the user of the tool has no control, but the traceroute output can be passed along to someone else that can further investigate or fix the problem. The output of traceroute can be included in an email or a trouble ticket to report the problem. This provides a lot more information than the mere indication that A can't communicate with B, in particular when the failures are transient. The ping tool provides some of the same benefits in being able to return ICMP errors such as host unreachable messages.</t>

      <t>This document shows how those tools can be used to gather information for both the overlay and underlay parts of an end-to-end path by providing the option to have some packets use a uniform time-to-live (ttl) model for the tunnels, and associated ICMP error handling. These changes are limited to the tunnel ingress and egress points.</t>

      <t>The desire to make traceroute provide useful information for overlay networks is not an argument against also using a layered approach for OAM as specified in e.g., <xref target="I-D.tissa-lime-yang-oam-model"/>. Such approaches are quite appropriate for continuous monitoring at different layers and across different domains. A layer-transcending traceroute complements the ability to do layered and/or continuous monitoring.</t>

      <t>The traceroute tool relies on receiving ICMP errors <xref target="RFC0792"/> in combination with using different IP time-to-live values. That results in the packet making it further and further towards the destination with ICMP ttl exceeded errors being received from each hop. That provides the user the working path even if the packets are black holed eventually, and also provides any errors like ICMP host unreachable. The fundamental assumption is that the ttl is decremented for each hop and that the resulting ICMP ttl exceeded errors are delivered back to the host.</t>

      <t>When some encapsulation is used to tunnel packets there is an architectural question of how those tunnels should be viewed from the rest of the network. Different models were described first for diffserv in <xref target="RFC2983"/>, then applied to MPLS in <xref target="RFC3270"/> and expanded to MPLS ttl handling in <xref target="RFC3443"/>. Those models apply to other forms of direct or indirect IP in IP tunnels. Those RFCs define two models for ttl that are of interest to us:
      <list style="symbols">
        <t>A pipe model, where the tunnel is invisible to the rest of the network in that it looks like a direct connection between the tunnel ingress and egress.</t>
        <t>A uniform model, where the ttl decrements uniformly for hops outside and inside the tunnel.</t>
      </list>
      </t>

      <t>The tunneling mechanisms discussed in NVO3 (such as VXLAN <xref target="RFC7348"/>, NVGRE <xref target="I-D.sridharan-virtualization-nvgre"/>, GENEVE <xref target="I-D.gross-geneve"/>, and GUE <xref target="I-D.herbert-gue"/>) have either been specified to provide the pipe model of a tunnel or are silent on the setting of the outer ttl. Those protocols can be extended to have an optional uniform tunnel model when the payload is IP, following the same model as in <xref target="RFC3443"/>. Note that these encapsulations carry Ethernet frames and hence are not even aware that the payload is IP. However, IP is the bulk of what is carried over such tunnels and the ingress NVE can inspect the IP part of the Ethernet frame.</t>

      <t>However, for general application traffic the pipe model is fine and might even be expected by some applications. In general, when the source and destination IP are in the same IP subnet the ttl should not be decremented. Thus it makes sense to have a way to selectively enable the uniform model perhaps based on some method to identify packets associated with traceroute or some marker in the packet itself that the traceroute tool can set.</t>
    </section>

    <section title="Solution Overview">
      <t>The pieces needed to accomplish this are:
      <list style="symbols">
        <t>One or more ways to select the uniform model packets at the tunnel ingress.</t>
        <t>Tunnel ingress copying out the original ttl from a selected packet to the outer IP header, and then doing a check and decrement of that ttl.</t>
        <t>If that ttl check results in ttl expiry at the tunnel ingress, then deliver an ICMP ttl exceeded packet back to the host.</t>
        <t>A mechanism by which the tunnel egress knows which packets should have uniform model, for instance a bit in the encapsulation header.</t>
        <t>The tunnel egress copying in the ttl (for identified packets) from the outer header to the inner IP header, then doing a check and decrement of that ttl.</t>
        <t>If that ttl check results in ttl expiry at the tunnel egress, then deliver an ICMP error back to the original host (or, perhaps better, to the tunnel ingress the same way as underlay routers do).</t>
        <t>IP routers in the underlay will deliver any ICMP errors to the source IP address of the packet. For tunneled packets that will be the tunnel ingress. Hence the tunnel ingress needs to be able to take such ICMP errors and form corresponding ICMP errors that are sent back to the host. The requirement in <xref target="RFC1812"/> ensures that the ICMP errors will contain enough headers to form such an ICMP error. It has been noted that there are routers in the Internet which decades later fail to conform to that aspect of <xref target="RFC1812"/>.</t>
      </list>
      </t>

      <t>The idea to reflect (some) ICMP errors from inside a tunnel back to the original source goes back to IPv6 in IPv4 encapsulation as specified in <xref target="RFC1933"/> and <xref target="RFC2473"/>. However, those specifications did not advocate using a uniform ttl model for the tunnels but did handle ICMP packet too big and other unreachable messages. They specify how to reflect ICMP errors received from underlay routers to ICMP errors sent to the original host. The addition of handling ICMP ttl exceeded errors for the uniform tunnel model is straightforward.</t>

      <t>The information carried in the ICMP errors is quite limited: the original packet plus an ICMP type and code. However, there are extension mechanisms specified in <xref target="RFC4884"/> and used for MPLS in <xref target="RFC4950"/> which include TLVs with additional information. If there is additional information to include for overlay networks, that information could be added by defining new ICMP Extension Objects based on <xref target="RFC4884"/>. An example of such an extension for ECMP information is included in this document.</t>

</section>

<section title="Goals and Requirements">
  <t>The following goals and requirements apply:
  <list style="symbols">
    <t>No changes needed in the underlay.</t>
    <t>Optional changes on the decapsulating end.</t>
    <t>ECMP friendly. If the underlay employs equal cost multipath routing then one should be able to use this mechanism to trace the same path as a given TCP or UDP flow is using. In addition, one should be able to explore different ECMP paths by varying the IP addresses and port numbers in the packets originated by traceroute on the host.</t>
    <t>Provide output which makes it possible to compare a regular overlay traceroute with the layer-transcending output.</t>
  </list>
  </t>

</section>

<section title="Definition Of Terms">
  <t>The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
  NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
  "OPTIONAL" in this document are to be interpreted as described
  in <xref target="RFC2119"/>.
  </t>
  <t>Terms such as NVE and TS are used as specified in <xref target="RFC7365"/>:
  <list style="symbols">
    <t>Network Virtualization Edge (NVE): An NVE is the network entity that
   sits at the edge of an underlay network and implements L2 and/or L3
   network virtualization functions.</t>
   <t>Tenant System (TS): A physical or virtual system that can play the role of
   a host or a forwarding element such as a router, switch, firewall,
   etc.</t>
   <t>Virtual Access Points (VAPs): A logical connection point on the NVE
   for connecting a Tenant System to a virtual network.</t>
   <t>Virtual Network (VN): A VN is a logical abstraction of a physical
   network that provides L2 or L3 network services to a set of Tenant
   Systems.</t>
   <t>Virtual Network Context (VN Context) Identifier: Field in an overlay
   encapsulation header that identifies the specific VN the packet
   belongs to.</t>

  </list>
  </t>
  <t>We use the VTEP term in <xref target="RFC7348"/> as synonymous with NVE, and VNI as synonymous to VN Context Identifier.</t>

</section>

<section title="Example Topologies">
  <t>The following example topologies illustrate different cases where we want a tracing capability. The examples are for overlay technologies such as VXLAN which provide a layer 2 overlay on IP. The cases for layer 3 overlay on top of IP are simpler and not shown in this document.</t>

  <t>The VXLAN term VTEP is used as synonymous to NVO3's NVE term.</t>

  <?rfc needLines="12" ?>
<figure title="Simple L2 overlay">
    <artwork><![CDATA[
-----------                -----------   
|    H1   |                |    H2   |
| 1.0.1.1 |                | 1.0.1.2 |
|         |                |         |
-----------                -----------   
     |                          |
     |                          |
-----------   -----------  -----------      
|  VtepA  |   |    R1   |  |  VtepB  |
| 2.0.1.1 | --| 2.0.1.2 |  | 2.0.2.1 |
|         |   | 2.0.2.2 |--|         |
-----------   -----------  -----------    
]]></artwork>
  </figure>

<t>The figure above shows two hosts connected using an underlay which provides a layer 2 service. H1 and H2 are thus in the same subnet and unaware of the existence of the underlay, so a normal ping or traceroute cannot provide any information about the nature of a failure; either packets get through or they do not. When the packets get through, traceroute would output something like:
<figure>
<artwork>
traceroute to 1.0.1.2 (1.0.1.2), 30 hops max, 60 byte packets
 1  1.0.1.2 (1.0.1.2)  1.104 ms  1.235 ms  1.729 ms
</artwork>
</figure>
</t>

<t>In this case it would be desirable to be able to traceroute from H1 to H2 (and vice versa) and observe VtepA, R1, VtepB and H2. Thus, when packets get through, traceroute would output:
<figure>
<artwork>
traceroute to 1.0.1.2 (1.0.1.2), 30 hops max, 60 byte packets
 1  2.0.1.1 (2.0.1.1)  1.104 ms  1.235 ms  1.729 ms
 2  2.0.1.2 (2.0.1.2)  2.106 ms  2.007 ms  2.156 ms
 3  2.0.2.1 (2.0.2.1)  35.034 ms  24.490 ms  21.626 ms
 4  1.0.1.2 (1.0.1.2)  40.830 ms  44.694 ms  75.620 ms
</artwork>
</figure>
</t>
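<t>As a rough illustration of why the uniform model yields the four-hop output above, the following Python sketch walks a probe along the path of the first figure. It assumes that every node (VtepA, R1 and VtepB) checks the ttl on receipt and then decrements it; the names PATH, DEST and replier are hypothetical and exist only for this example.</t>
<figure>
<artwork><![CDATA[
# Hypothetical model of the path H1 -> VtepA -> R1 -> VtepB -> H2.
PATH = ["2.0.1.1", "2.0.1.2", "2.0.2.1"]  # VtepA, R1, VtepB
DEST = "1.0.1.2"                          # H2


def replier(probe_ttl, path=PATH, dest=DEST):
    """Return the address that answers a probe sent with probe_ttl."""
    ttl = probe_ttl
    for hop in path:
        # Uniform model: each node checks the ttl, then decrements.
        if ttl <= 1:
            return hop
        ttl -= 1
    return dest


# One probe per ttl value, as traceroute would send them.
trace = [replier(ttl) for ttl in range(1, 5)]
]]></artwork>
</figure>
<t>Under these assumptions the list in trace holds the addresses of VtepA, R1, VtepB and H2, in that order.</t>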

<t>Note that the underlay and overlay might exist in completely separate addressing domains. Thus H1 might not be able to reach any of the underlay addresses, and the underlay IP addresses might overlap the overlay IP addresses. For example, it would be completely valid to see VtepA having the same IP address as H1. The user of this tool needs to understand that the utility of the traceroute output is to get information to determine whether the issue is in the underlay or overlay, and to be able to pass the underlay information to the operator of the underlay.</t>

<t>In overlay networks without any ARP/ND optimizations, ARP/ND packets would be flooded between the tunnel endpoints. Thus if there is some communication failure between H1 and H2, then H1 above might not have an ARP entry for H2. This results in traceroute not being able to output any data. This implies that in order to use traceroute to troubleshoot the issue one would need some workaround, such as installing some temporary ARP entries on the hosts.</t>

  <?rfc needLines="12" ?>
  <figure title="L2 overlay as part of larger network">
    <artwork><![CDATA[
-----------                -----------  -----------  -----------   
|    H1   |                |    R2   |  |    R3   |  |    H4   |
| 1.0.1.1 |                | 1.0.2.2 |--| 1.0.2.3 |  |         |
|         |                | 1.0.1.2 |  | 1.0.3.3 |--| 1.0.3.4 |
-----------                -----------  -----------  -----------   
     |                          |
     |                          |
-----------   -----------  -----------      
|  VtepA  |   |    R1   |  |  VtepB  |
| 2.0.1.1 | --| 2.0.1.2 |  | 2.0.2.1 |
|         |   | 2.0.2.2 |--|         |
-----------   -----------  -----------    
]]></artwork>
  </figure>

<t>The figure above has an overlay router as the next hop seen by H1. In this case a normal overlay traceroute would be able to display the overlay path, i.e.,
<figure>
<artwork>
traceroute to H4, 30 hops max, 60 byte packets
 1  R2
 2  R3
 3  H4
</artwork>
</figure>
</t>

<t>The layer-transcending traceroute would show the combination of the underlay and overlay paths i.e.,
<figure>
<artwork>
traceroute to H4, 30 hops max, 60 byte packets
 1  VtepA
 2  R1
 3  VtepB
 4  R2
 5  R3
 6  H4
</artwork>
</figure>
</t>

  <?rfc needLines="12" ?>
  <figure title="Multiple L2 overlays in path">
    <artwork><![CDATA[
-----------             -------------------             -----------
|    H1   |             |       R5        |             |    H6   |
| 1.0.1.1 |             |                 |             |         |
|         |             | 1.0.1.2 1.0.5.5 |             | 1.0.5.6 |
-----------             |-----------------|             ----------- 
     |                  |    |       |    |                  |
     |                  |    |       |    |                  |
----------- ----------- |-----------------| ----------- -----------
| VtepA   | |   R1    | |  VtepB    VtepC | |   R6    | |  VtepD  |
| 2.0.1.1 |-| 2.0.1.2 | | 2.0.2.1 3.0.1.1 |-| 3.0.1.2 | |         |
|         | | 2.0.2.2 |-|                 | | 3.0.2.2 |-| 3.0.3.1 |
----------- ----------- ------------------- ----------- -----------
]]></artwork>
  </figure>

<t>The figure above has multiple overlay network segments that are connected in one router, which provides the tunnel endpoints for both overlay segments plus routing for the overlay. A more general picture would be to have an overlay routed path between the two NVEs, e.g., VtepB and VtepC connected to different routers in the overlay. However, such a drawing in ASCII art doesn't fit on the page.</t>

<t>A normal overlay traceroute in the above topology would show the overlay router, i.e.,
<figure>
<artwork>
traceroute to H6, 30 hops max, 60 byte packets
 1  R5
 2  H6
</artwork>
</figure>
</t>

<t>The layer-transcending traceroute would show the combination of the underlay and overlay paths i.e.,
<figure>
<artwork>
traceroute to H6, 30 hops max, 60 byte packets
 1  VtepA
 2  R1
 3  VtepB
 4  R5
 5  VtepC
 6  R6
 7  VtepD
 8  H6
</artwork>
</figure>
</t>

<t>Note that the R5 device, which includes VtepB and VtepC, appears as three hops in the traceroute output. That is needed to be able to correlate the output with the overlay output, which has R5. That correlation would be hard if the R5 device only appeared as VtepB in the LTTON output. The three-hop representation also stays invariant whether the NVEs and overlay router are implemented by a single device or by multiple devices.</t>

</section>

<section title="Controlling and selecting ttl behavior">

  <t>The network admin needs to be able to control who can use the layer-transcending traceroute, since the operator might not want to disclose the underlay topology to all its users all the time. There are different approaches for this, such as designating particular ports (Virtual Access Points in NVO3 terminology) on an NVE to have the uniform ttl tunnel model. We have found it useful to be able to enable this capability on a per port and/or virtual network basis, in addition to having a global setting per NVE.</t>

  <t>When enabled on the NVEs, the user on the TS needs to be able to control which traffic is subject to which tunnel model. Normal traffic would use the pipe ttl tunnel model, and only explicit trace applications are likely to want the uniform ttl tunnel model. Hence it makes sense to use some marker in the packets sent by the TS to select those packets for the uniform model on the NVE. Such a mechanism should be usable so that the user can perform both a regular traceroute and an LTTON.</t>

  <t>Potentially different fields in the packets originated by traceroute on the TS can be used to mark the packets for uniform ttl tunnel model. However, many of those fields such as source and destination port numbers and protocol might be used in hashing for ECMP. The marking that can be used without impacting ECMP is the DSCP field in the packet. That field can be set with an option (--tos) in at least some existing traceroute implementations.</t>

  <t>Note that when DSCP is used for such marking it is a configured choice subject to agreement between the operator of the TS and NVE. The matching on the NVE should ignore the ECN bits so as not to interfere with ECN.</t>

  <t>However, the DSCP value used in the overlay might have an impact on the forwarding of the packets. In such a case one can use an alternative selector such as the UDP source port number. That has the downside of affecting the hash values used for ECMP and link aggregation port selection.</t>
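  <t>As a minimal sketch of the DSCP-based selection described above, the following Python fragment shows how an NVE might match on a configured DSCP while ignoring the ECN bits. The value 0x2B and the names TRACE_DSCP, dscp_of and selects_uniform_model are hypothetical; the actual DSCP is a configured choice agreed between the operators of the TS and the NVE.</t>
  <figure>
    <artwork><![CDATA[
# Hypothetical selector: match a configured DSCP, ignore ECN bits.
TRACE_DSCP = 0x2B  # assumed, operator-configured value


def dscp_of(tos_byte):
    """Return the 6-bit DSCP from an IPv4 TOS / IPv6 Traffic Class."""
    return (tos_byte & 0xFF) >> 2  # the low 2 bits are ECN


def selects_uniform_model(tos_byte, trace_dscp=TRACE_DSCP):
    """True if the packet is marked for the uniform ttl model."""
    return dscp_of(tos_byte) == trace_dscp


# The same DSCP matches regardless of the ECN codepoint.
tos = TRACE_DSCP << 2
assert all(selects_uniform_model(tos | ecn) for ecn in range(4))
assert not selects_uniform_model(0x00)
]]></artwork>
  </figure>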
</section>

<section title="Introducing a ttl copyin flag in the encapsulation header">

  <t>When this approach is applied to VXLAN <xref target="RFC7348"/> the decapsulating NVE has to be able to identify packets that have to be processed in the uniform ttl tunnel model way. For that purpose we define a new flag which is sent by the encapsulating NVE on selected packets, and is used by the decapsulating NVE to perform the ttl copyin, decrement and check.</t>

<t>In addition to the one I-flag defined in <xref target="RFC7348"/> we define a new T-flag to capture the trace behavior at the decapsulating tunnel endpoint.</t>
  <?rfc needLines="7" ?>
  <figure>
    <artwork><![CDATA[
   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R|R|R|R|I|R|R|T|            Reserved                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                VXLAN Network Identifier (VNI) |   Reserved    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
  </figure>
  <t>New fields:
  <list style='hanging' hangIndent='15'>
    <t hangText="T-flag:"> When set, indicates that the decapsulator should take the outer ttl and copy it to the inner ttl, and then check and decrement the resulting ttl.</t>
  </list>
  </t>
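  <t>The header layout above can be sketched in Python as follows. This is only an illustration under the assumption that the T-flag occupies the last bit of the first octet, as drawn in the figure; the names I_FLAG, T_FLAG, pack_vxlan and unpack_vxlan are hypothetical.</t>
  <figure>
    <artwork><![CDATA[
import struct

I_FLAG = 0x08  # VNI-valid flag from RFC 7348 (bit 4)
T_FLAG = 0x01  # T-flag as drawn above (bit 7 of the first octet)


def pack_vxlan(vni, trace=False):
    """Build the 8-octet VXLAN header, optionally with the T-flag."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI must fit in 24 bits")
    flags = I_FLAG | (T_FLAG if trace else 0)
    # First word: flags octet followed by 24 reserved bits.
    # Second word: 24-bit VNI followed by 8 reserved bits.
    return struct.pack("!II", flags << 24, vni << 8)


def unpack_vxlan(header):
    """Return (vni, t_flag_set) from an 8-octet VXLAN header."""
    word0, word1 = struct.unpack("!II", header[:8])
    flags = word0 >> 24
    if not flags & I_FLAG:
        raise ValueError("I-flag not set")
    return word1 >> 8, bool(flags & T_FLAG)
]]></artwork>
  </figure>
  <t>For instance, unpack_vxlan(pack_vxlan(100, trace=True)) returns the VNI 100 together with an indication that the T-flag is set.</t>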
</section>

<section title="Encapsulation Behavior">
  <t>If the uniform ttl model is enabled for the input, and the received unencapsulated packet matches the selector, then the ingress NVE will perform these additional operations as part of encapsulating an IPv4 or IPv6 packet:
  <list style="symbols">
    <t>Examine the IPv4 TTL (or IPv6 hop limit, respectively) on receipt and if it is 1 or less, then drop the packet and send an ICMPv4 (or ICMPv6) ttl exceeded back to the original host. Since the NVE is operating on an L2 packet, it might not have any layer 3 interfaces or routes for the originating host. Thus it sends the ICMP error to the source L2 address of the original packet, out the ingress port, without any IP address lookup.</t>
    <t>If ttl did not expire, then decrement the above ttl/hop limit and place it in the outer IP header. Encapsulate and send the packet as normal.</t>
    <t>If some other errors prevent sending the packet (such as unknown VN Context Id, no flood list configured), then the NVE SHOULD send an ICMP host unreachable back to the host.</t>
  </list>
  </t>
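  <t>The ttl handling in the list above can be sketched as follows. This is a simplified model rather than a definitive implementation: the Result type, the ingress function and the pipe-model outer ttl of 64 are assumptions made for the example, and the other error cases (such as an unknown VN Context Id) are not modelled.</t>
  <figure>
    <artwork><![CDATA[
from dataclasses import dataclass


@dataclass
class Result:
    action: str            # "encapsulate" or "icmp_ttl_exceeded"
    outer_ttl: int = 0     # meaningful only when encapsulating
    t_flag: bool = False   # ask the egress to copy the ttl in


PIPE_OUTER_TTL = 64  # assumed default for the pipe model


def ingress(inner_ttl, uniform_enabled, matches_selector):
    """Decide how the ingress NVE treats one IPv4/IPv6 packet."""
    if not (uniform_enabled and matches_selector):
        # Pipe model: outer ttl is independent of the inner ttl.
        return Result("encapsulate", PIPE_OUTER_TTL, False)
    if inner_ttl <= 1:
        # Expired at the ingress: reply to the source L2 address
        # out the ingress port, without any IP address lookup.
        return Result("icmp_ttl_exceeded")
    # Uniform model: decrement and copy out to the outer header.
    return Result("encapsulate", inner_ttl - 1, True)
]]></artwork>
  </figure>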

  <t>The ingress NVE will receive ICMP errors from underlay routers and the egress NVE, whether due to ttl exceeded or underlay issues such as host unreachable or packet too big errors. The NVE should take such errors and, in addition to any local syslog etc., generate an ICMP error sent back to the host. The principle for this is specified in <xref target="RFC1933"/> and <xref target="RFC2473"/>. Just like in those specifications, the inner and outer IP headers could be of different versions. A common case of that might be an IPv6 overlay with an IPv4 underlay. That case requires some changes in the ICMP type and code values in addition to recreating the packets. The place where LTTON differs from those specifications is that there is an NVO3 header and (for L2 over L3) an L2 header in the packet.</t>

  <t>The figures below show an example of ICMP header re-generation at VtepA for the case of IPv6 overlay with IPv4 underlay. The case of IPv4 over IPv4 is similar and simpler since the ICMP header is the same for both overlay and underlay. The example uses VXLAN encapsulation to provide the concrete details, but the approach applies to other NVO3 proposals.</t>

  <?rfc needLines="32" ?>
  <figure title="ICMPv4 Error Message Returned to Encapsulating Node">
    <artwork><![CDATA[
             +--------------+
             | IPv4 Header  |
             | src = R1     |
             | dst = VtepA  |
             +--------------+
             |    ICMPv4    |
             |    Header    |
             |   type = X   |
             |   code = Y   |
      - -    +--------------+
             | IPv4 Header  |
             | src = VtepA  |
     IPv4    | dst = VtepB  |
             +--------------+
    Packet   |     UDP      |
             | dst = VXLAN  |
      in     +--------------+
             |   Ethernet   |
    Error    | DA = H2 mac  |
             | SA = H1 mac  |
             +--------------+   - -
             |    IPv6      |
             | src = H1 ipv6|   
             | dst = H2 ipv6|   Original IPv6
             +--------------+   Packet.
             |  Transport   |   Used to
             |    Header    |   generate an
             +--------------+   ICMPv6
             |              |   error message
             ~     Data     ~   back to the source.
             |              |
      - -    +--------------+   - -
]]></artwork>
  </figure>

  <t>The above underlay ICMPv4 is used to form an overlay ICMPv6 packet by extracting the Ethernet DA from the inner Ethernet SA, and forming an IPv6 header where the source address is based on the source address of the ICMPv4 error. The ICMPv6 type and code values are set based on the ICMPv4 type and code values.</t>

  <?rfc needLines="25" ?>
  <figure title="Generated ICMPv6 Error Message for Overlay Source">
    <artwork><![CDATA[
             +--------------+
             |   Ethernet   |
             | DA = H1 mac  |   From ICMPv4 packet
             | SA = VtepA   |   in error
             +--------------+
             | IPv6 Header  |
             | src = ::R1   |   96 zeros + IPv4 address 
             | dst = H1 ipv6|
             +--------------+
             |    ICMPv6    |
             |    Header    |
             |   type = X'  |   Type and code mapped
             |   code = Y'  |   from v4 to v6 values
      - -    +--------------+   - -
             |    IPv6      |
     IPv6    | src = H1 ipv6|   
             | dst = H2 ipv6|   Unmodified from
    Packet   +--------------+   ICMPv4 error
             |  Transport   |   
      in     |    Header    |   
             +--------------+   
    Error    |              |   
             ~     Data     ~   
             |              |
      - -    +--------------+   - -
]]></artwork>
  </figure>

  <t>In the case of IPv6 over IPv4 the above example setting of the IPv6 source address results in this type of traceroute output:
  <figure>
    <artwork>
traceroute to 2000:0:0:40::2, 30 hops max, 80 byte packets
 1  ::2.0.1.1 (::2.0.1.1)  1.231 ms  1.004 ms  1.126 ms
 2  ::2.0.1.2 (::2.0.1.2)  1.994 ms  2.301 ms  2.016 ms
 3  ::2.0.2.1 (::2.0.2.1)  18.846 ms  30.582 ms  19.776 ms
 4  2000:0:0:40::2 (2000:0:0:40::2)  48.964 ms  60.131 ms  53.895 ms
    </artwork>
  </figure>
  </t>
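  <t>The address and type/code mapping used in this example can be sketched in Python as follows. The IPv6 source address is formed as 96 zero bits followed by the underlay IPv4 address, as in the figure above. The small mapping table is an assumed subset chosen for illustration and is not specified by this document; any adjustment of the MTU reported in a packet too big message is not modelled.</t>
  <figure>
    <artwork><![CDATA[
import ipaddress

# Assumed subset of a v4-to-v6 mapping of (type, code) pairs.
ICMP_V4_TO_V6 = {
    (11, 0): (3, 0),  # ttl exceeded -> hop limit exceeded
    (3, 4): (2, 0),   # fragmentation needed -> packet too big
}


def overlay_source(underlay_v4):
    """Form ::a.b.c.d (96 zero bits followed by the IPv4 address)."""
    v4 = ipaddress.IPv4Address(underlay_v4)
    return ipaddress.IPv6Address(int(v4))


def map_icmp(v4_type, v4_code):
    """Map an ICMPv4 (type, code) to ICMPv6, or None if unmapped."""
    return ICMP_V4_TO_V6.get((v4_type, v4_code))
]]></artwork>
  </figure>
  <t>With these assumptions, an ICMPv4 ttl exceeded error from R1 (2.0.1.2) would be reported to H1 as an ICMPv6 hop limit exceeded error whose source is the IPv6 address with 2.0.1.2 in its low 32 bits.</t>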

</section>

<section title="Decapsulating Behavior">
  <t>If this uniform ttl model is enabled on the decapsulating NVE, and the overlay header indicates that the uniform ttl model applies (the T-flag in the case of VXLAN), then the NVE will perform these additional operations as part of decapsulating a packet where the inner packet is an IPv4 or IPv6 packet:
  <list style="symbols">
    <t>Examine the outer IPv4 TTL (or outer IPv6 hop limit, respectively) on receipt and if it is 1 or less, then drop the packet and send an outer ICMPv4 (or ICMPv6) ttl exceeded back to the source of the outer packet, i.e., the ingress NVE. This ICMP packet should look the same as an ICMP error generated by an underlay router, and the requirement in <xref target="RFC1812"/> on the size of the packet in error applies.</t>
    <t>If ttl did not expire, then decrement the above ttl/hop limit and place it in the inner IP header. If the inner IP header is IPv4 then update the IPv4 header checksum. Then decapsulate and send the packet as for other decapsulated packets.</t>
    <t>If some other errors prevent sending the packet (such as unknown VN Context Id), then the NVE SHOULD send an ICMP host unreachable instead of a ttl exceeded error.</t>
  </list>
  </t>
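  <t>The decapsulation steps above can be sketched in the same simplified style. The egress function and its return values are hypothetical names used only for illustration; the IPv4 header checksum update and the other error cases are not modelled.</t>
  <figure>
    <artwork><![CDATA[
def egress(outer_ttl, inner_ttl, uniform_enabled, t_flag):
    """Return (action, inner_ttl) for one decapsulated packet."""
    if not (uniform_enabled and t_flag):
        # Pipe model: the inner ttl is left untouched.
        return ("forward", inner_ttl)
    if outer_ttl <= 1:
        # Expired at the egress: send an outer ICMP ttl exceeded
        # to the source of the outer packet (the ingress NVE).
        return ("outer_icmp_ttl_exceeded", None)
    # Copy in the decremented outer ttl; for IPv4 the inner header
    # checksum must then be updated (not modelled here).
    return ("forward", outer_ttl - 1)
]]></artwork>
  </figure>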
</section>

<section title="Other ICMP errors">
  <t>The technique for selecting ttl behavior specified in this draft can also be used to trigger other ICMPv4 and ICMPv6 errors. For example, <xref target="RFC1933"/> specifies how ICMP packet too big errors from underlay routers can be used to report overlay ICMP packet too big errors to the original source. Other errors that are more specific to the overlay protocol might also be useful, such as not being able to find a VNI for the incoming port and VLAN, or not being able to flood the packet if the packet is a Broadcast, Unknown unicast, or Multicast packet.</t>
</section>

<section title="Downstream Egress Paths Object">
  <t>
The Downstream Egress Paths Object MAY be appended to the ICMP Time Exceeded and Destination Unreachable messages. A single instance of the Downstream Egress Paths Object represents the egress paths at the router that sends the ICMP message. The Downstream Egress Paths Object must be preceded by an ICMP Extension Structure Header and an ICMP Object Header.  Both are defined in <xref target="RFC4884"/>. The format follows closely <xref target="RFC4379"/> with some generalizations for Multipath types.
    <list style="empty">
      <t>Class-Num = TBA by IANA,  Downstream Egress Paths Class</t>
      <t>C-Type = 1.</t>
    </list>
  </t>
  <t>
If the replying router is the destination of the echo request, then a Downstream Egress Paths Object SHOULD NOT be included in the ICMP Error message. Otherwise the replying router MAY append a Downstream Egress Paths Object for all interfaces over which the echo request packet could be forwarded.
  </t>
  <t>
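The Object Length is the sum of K + M over the N egress paths, where K is the length of the fixed part of a path entry and M is the Multipath Length of that path. When every path has the same K and M this reduces to K*N + M*N, but M may differ between paths, as may K when paths use different Address Types. Values for K are found in the description of Address Type below.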
  </t>
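  <t>As a non-normative illustration, the Object Length can be computed as a per-path sum of K and M. In the Python sketch below, the K values are those listed for each Address Type in this section. The function name and the representation of a path as an (Address Type, Multipath Length) pair are assumptions of the sketch. It counts the path entries only and does not add the ICMP Object Header.</t>
  <figure>
    <artwork><![CDATA[
```python
# K Octets per Address Type, as listed in the Address Type table.
K_OCTETS = {
    1: 16,  # IPv4 Numbered
    2: 16,  # IPv4 Unnumbered
    3: 40,  # IPv6 Numbered
    4: 28,  # IPv6 Unnumbered
}


def object_length(paths):
    """Sum K + M over all egress paths.

    paths: iterable of (address_type, multipath_length) pairs.
    """
    return sum(K_OCTETS[addr_type] + mp_len
               for addr_type, mp_len in paths)
```
]]></artwork>
  </figure>
  <t>For example, one IPv4 Numbered path carrying 4 octets of Multipath Information and one IPv6 Numbered path carrying 8 octets give (16 + 4) + (40 + 8) = 68 octets.</t>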
  <t>
The Downstream Egress Paths Object has the following format:
  <?rfc needLines="32" ?>
<figure title="Downstream Egress Paths Object">
    <artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Path-1        MTU             | Address Type  | Reserved      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             Downstream IP Address (4 or 16 octets)            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Downstream Interface Address (4 or 16 octets)         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| MultipathType |       Multipath Length        | Reserved      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
.                     (Multipath Information)                   .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~                                                               ~
~                                                               ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Path-N        MTU             | Address Type  | Reserved      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             Downstream IP Address (4 or 16 octets)            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Downstream Interface Address (4 or 16 octets)         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| MultipathType |       Multipath Length        | Reserved      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
.                     (Multipath Information)                   .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
  </figure>
  </t>
  <t>
    <list style="hanging">
      <t hangText="Maximum Transmission Unit (MTU):"><vspace />
The MTU is the size in octets of the largest IP frame that fits on the downstream interface.</t>
      <t hangText="Address Type:"><vspace />
The Address Type indicates whether the interface is numbered or unnumbered. It also determines the length of the Downstream IP Address and Downstream Interface Address fields. The resulting total length of the fixed part of one path entry in the Downstream Egress Paths Object is listed in the table below as "K Octets".
      </t>
      <t>
The Address Type is set to one of the following values:
<figure>
  <artwork>
     Type #        Address Type           K Octets
     ------        ------------           --------
       1           IPv4 Numbered             16
       2           IPv4 Unnumbered           16
       3           IPv6 Numbered             40
       4           IPv6 Unnumbered           28
  </artwork>
</figure>
      </t>
      <t hangText="Downstream IP Address and Downstream Interface Address:"><vspace />
   IPv4 addresses and interface indices are encoded in 4 octets; IPv6
   addresses are encoded in 16 octets.
      </t>
      <t>
   If the interface to the downstream router has a unique IP address (e.g., it is numbered and not a LAG), then the
   Address Type MUST be set to IPv4 Numbered or IPv6 Numbered, the Downstream IP Address
   MUST be set to either the downstream router's Router ID or the
   interface address of the downstream router, and the Downstream
   Interface Address MUST be set to the downstream router's interface
   address.
      </t>
      <t>
   If the interface to the downstream router does not have a unique IP address (e.g., it is unnumbered or a LAG), the Address
   Type MUST be IPv4 Unnumbered or IPv6 Unnumbered, the Downstream IP
   Address MUST be the downstream router's Router ID or the
   interface address of the downstream router, and the Downstream
   Interface Address MUST be set to the index assigned by the upstream
   router to the interface.
      </t>
      <t hangText="Multipath Type:"><vspace />
   The following Multipath Types are defined:
<figure>
  <artwork>
     Key   Type                Multipath Information
     ---   ----------------    ---------------------
      0    no multipath        Empty (Multipath Length = 0)
      1    MAC SA/DA           Inner MAC in tunnel payload
      2    IP Src/Dest         Inner IP src/dest in tunnel payload
      3    L4 src port         L4 src ports in tunnel payload
      4    L4 src port range   low/high L4 src port pairs

  </artwork>
</figure>
   Type 0 indicates that all packets will be forwarded out this one
   interface.
      </t>
      <t>
   Types 1 through 4 specify that the supplied Multipath Information will
   serve to exercise this path.
      </t>
      <t hangText="Multipath Length:"><vspace />
   The length in octets of the Multipath Information.
      </t>
      <t hangText="Multipath Information:"><vspace />
   The Multipath Information encodes the header field values (for
   example, L4 source ports) that will exercise this path. Its contents
   depend on the Multipath Type and are shown in the table above. For
   Type 4, ranges indicated by L4 source port pairs MUST NOT overlap and
   MUST be in ascending sequence.
      </t>
    </list>
  </t>
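  <t>As a non-normative illustration, one path entry of the Downstream Egress Paths Object can be encoded as follows in Python. The field widths follow the figure in this section. The function names and parameters are hypothetical, and the encoding of each Type 4 low/high pair as two 16-bit values in network byte order is an assumption of this sketch, since this document does not define the wire format of the Multipath Information.</t>
  <figure>
    <artwork><![CDATA[
```python
import ipaddress
import struct


def encode_port_ranges(ranges):
    """Multipath Type 4 information: low/high L4 source port
    pairs, ascending and non-overlapping.
    Assumption: each port is a 16-bit big-endian value."""
    out = b""
    prev_high = -1
    for low, high in ranges:
        if low > high or low <= prev_high:
            raise ValueError("ranges must ascend without overlap")
        out += struct.pack("!HH", low, high)
        prev_high = high
    return out


def encode_path(mtu, addr_type, downstream_ip, interface,
                mp_type=0, mp_info=b""):
    """Encode one path entry (K octets plus Multipath Information).

    interface is an address string for numbered types (1, 3) and
    an integer interface index for unnumbered types (2, 4).
    """
    if addr_type in (1, 2):
        addr_cls = ipaddress.IPv4Address
    elif addr_type in (3, 4):
        addr_cls = ipaddress.IPv6Address
    else:
        raise ValueError("unknown Address Type")
    ip = addr_cls(downstream_ip).packed
    if addr_type in (2, 4):
        if_field = struct.pack("!I", interface)  # 4-octet index
    else:
        if_field = addr_cls(interface).packed
    # MTU (16 bits), Address Type (8), Reserved (8)
    head = struct.pack("!HBB", mtu, addr_type, 0)
    # Multipath Type (8), Multipath Length (16), Reserved (8)
    mp_head = struct.pack("!BHB", mp_type, len(mp_info), 0)
    return head + ip + if_field + mp_head + mp_info
```
]]></artwork>
  </figure>
  <t>With these helpers, an IPv4 Numbered entry without multipath information is 16 octets long and an IPv6 Unnumbered entry is 28 octets long, matching the K Octets values listed for those Address Types.</t>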
</section>

<section title="Security Considerations">
  <t>The considerations in <xref target="I-D.ietf-nvo3-security-requirements"/> apply.</t>

  <t>In addition, the use of the uniform ttl tunnel model will result in ICMP errors being generated by underlay routers and consumed by NVEs. That presents an attack vector which does not exist in a pipe ttl tunnel model. However, ICMP errors should be rate limited <xref target="RFC1812"/>. Implementations should also take appropriate measures to rate limit the input of ICMP errors that are processed by limited CPU resources.</t>

  <t>Some implementations might handle the trace packets (with the uniform ttl model) in software while the pipe ttl model packets can be handled in hardware. In such a case the implementation should have mechanisms to avoid starvation of limited CPU resources due to these packets.</t>
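  <t>As a non-normative illustration, a token bucket is one common way to limit the rate at which ICMP errors or trace packets reach a software path. This document does not mandate any particular mechanism. The class name, parameters, and the use of a caller-supplied timestamp in the Python sketch below are assumptions of the sketch.</t>
  <figure>
    <artwork><![CDATA[
```python
class TokenBucket:
    """Hypothetical limiter: at most `rate` packets per second on
    average, with bursts of up to `burst` packets."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)   # start with a full bucket
        self.last = 0.0              # time of the previous call

    def allow(self, now: float) -> bool:
        """Return True if a packet arriving at time `now`
        (in seconds) may be passed to the CPU."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst,
                          self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```
]]></artwork>
  </figure>
  <t>Packets for which allow() returns False would be dropped before they consume CPU resources.</t>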

</section>

<section title="IANA Considerations">
  <t>TBD</t>
</section>

<section title="Acknowledgements">
  <t>The authors acknowledge the helpful comments from David Black and Diego García del Rio.</t>
</section>

</middle>

<back>
<references title="Normative References">
&RFC0792;
&RFC1812;
&RFC2119;
&RFC7348;
&RFC7365;
</references>

<references title="Informative References">
&RFC1933;
&RFC2473;
&RFC2983;
&RFC3270;
&RFC3443;
&RFC4379;
&RFC4884;
&RFC4950;
&NVGRE;
&LIME;
&NVO-SEC;
&GENEVE;
&GUE;
</references>

</back>
</rfc>


