Layer-Transcending Traceroute for Overlay Networks like VXLAN

Tools like traceroute have been very valuable for the operation of the Internet. Part of that value comes from being able to display information about routers and paths over which the user of the tool has no control, but the traceroute output can be passed along to someone else that can further investigate or fix the problem. The output of traceroute can be included in an email or a trouble ticket to report the problem. This provide a lot more information than the mere indication that A can't communicate with B, in particular when the failures are transient. The ping tool provides some of the same benefits in being able to return ICMP errors such as host unreachable messages. This document shows how those tools can be used to gather information for both the overlay and underlay parts of an end-to-end path by providing the option to have some packets use a uniform time-to-live (ttl) model for the tunnels, and associated ICMP error handling. These changes are limited to the tunnel ingress and egress points. The desire to make traceroute provide useful information for overlay network is not an argument against also using a layered approach for OAM as specified in e.g., . Such approaches are quite appropriate for continuos monitoring at different layers and across different domains. A layer transcending traceroute complements the ability to do layered and/or continuos monitoring. The traceroute tool relies on receiving ICMP errors in combination with using different IP time-to-live values. That results in the packet making it further and further towards the destination with ICMP ttl exceeded errors being received from each hop. That provides the user the working path even if the packets are black holed eventually, and also provides any errors like ICMP host unreachable. The fundamental assumption is that the ttl is decremented for each hop and that the resulting ICMP ttl exceeded errors are delivered back to the host. When some encapsulation is used to tunnel packets there is an architectural question how those tunnels should be viewed from the rest of the network. Different models were described first for diffserv in and then applied to MPLS in and expanded to MPLS ttl handling in and those models apply to other forms of direct or indirect IP in IP tunnels. Those RFCs define two models for ttl that are of interest to us: A pipe model, where the tunnel is invisible to the rest of the network in that it looks like a direct connection between the tunnel ingress and egress. A uniform model, where the ttl decrements uniformly for hops outside and inside the tunnel. The tunneling mechanisms discussed in NVO3 (such as VXLAN , NVGRE , GENEVE , and GUE ), have either been specified to provide the pipe model of a tunnel or are silent on the setting of the outer ttl. Those protocols can be extended to have an optional uniform tunnel model when the payload is IP, following the same model as in . Note that these encapsulations carry Ethernet frames hence are not even aware that the payload is IP. However, IP is the bulk of what is carried over such tunnels and the ingress NVE can inspect the IP part of the Ethernet frame. However, for general application traffic the pipe model is fine and might even be expected by some applications. In general, when the source and destination IP are in the same IP subnet the ttl should not be decremented. Thus it makes sense to have a way to selectively enable the uniform model perhaps based on some method to identify packets associated with traceroute or some marker in the packet itself that the traceroute tool can set.

The pieces needed to accomplish this are: One or more ways to select the uniform model packets at the tunnel ingress. Tunnel ingress copying out the original ttl from a selected packet to the outer IP header, and then doing a check and decrement of that ttl. If that ttl check results in ttl expiry at the tunnel ingress, then deliver an ICMP ttl exceeded packet back to the host. A mechanism by which the tunnel egress knows which packets should have uniform model, for instance a bit in the encapsulation header. The tunnel egress copying in the ttl (for identified packets) from the outer header to the inner IP header, then doing a check and decrement of that ttl. If ttl check results in ttl expiry at the tunnel egress, then deliver an ICMP error back to the original host (or, perhaps better, to tunnel ingress the same way as underlay routers do). IP routers in the underlay will deliver any ICMP errors to the source IP address of the packet. For tunneled packets that will be the tunnel ingress. Hence the tunnel ingress needs to be able to take such ICMP errors and form corresponding ICMP errors that are sent back to the host. The requirement in ensures that the ICMP errors will contain enough headers to form such an ICMP error. It has been noted that there are routers in the Internet which decades later fail to conform to that aspect of . The idea to reflect (some) ICMP errors from inside a tunnel back to the original source goes back to IPv6 in IPv4 encapsulation as specified in and . However, those drafts did not advocate using a uniform ttl model for the tunnels but did handle ICMP packet too big and other unreachable messages. Those drafts specify how to reflect ICMP errors received from underlay routers to ICMP errors sent to the original host. The addition of handling ICMP ttl exceeded errors for uniform tunnel model is straight forward. The information carried in the ICMP errors are quite limited - the original packet plus an ICMP type and code. However, there are extension mechanisms specified in and used for MPLS in which include TLVs with additional information. If there are additional information to include for overlay networks that information could be added by defining new ICMP Extensions Objects based on . An example of such an extension for ECMP information is included in this document.

The following goals and requirements apply: No changes needed in the underlay. Optional changes on the decapsulating end. ECMP friendly. If the underlay employs equal cost multipath routing then one should be able to use this mechanism to trace the same path as a given TCP or UDP flow is using. In addition, one should be able to explore different ECMP paths by varying the IP addresses and port numbers in the packets originated by traceroute on the host. Provide output which makes it possible to compare a regular overlay traceroute with the layer-transcending output.

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in . The terminology such as NVE, and TS are used as specified in : Network Virtualization Edge (NVE): An NVE is the network entity that sits at the edge of an underlay network and implements L2 and/or L3 network virtualization functions. Tenant System (TS): A physical or virtual system that can play the role of a host or a forwarding element such as a router, switch, firewall, etc. Virtual Access Points (VAPs): A logical connection point on the NVE for connecting a Tenant System to a virtual network. Virtual Network (VN): A VN is a logical abstraction of a physical network that provides L2 or L3 network services to a set of Tenant Systems. Virtual Network Context (VN Context) Identifier: Field in an overlay encapsulation header that identifies the specific VN the packet belongs to. We use the VTEP term in as synonymous with NVE, and VNI as synonymous to VN Context Identifier.

The following example topologies illustrate different cases where we want a tracing capability. The examples are for overlay technologies such as VXLAN which provide a layer 2 overlay on IP. The cases for layer 3 overlay on top of IP are simpler and not shown in this document. The VXLAN term VTEP is used as synonymous to NVO3's NVE term.

The figure above shows two hosts connected using an underlay which provides a layer two service. Thus H1 and H2 are in the same subnet and unaware of the existence of the underlay. Thus a normal ping or traceroute would not be able to provide any information about the nature of a failure; either packets get through or they do not. When the packets get through traceroute would output something like:

traceroute to 1.0.1.2 (1.0.1.2), 30 hops max, 60 byte packets 1 1.0.2.1 (1.0.2.1) 1.104 ms 1.235 ms 1.729 ms In this case it would be desirable to be able to traceroute from H1 to H2 (and vice versa) and observe VtepA, R1, VtepB and H2. Thus in the case of packets getting through traceroute would output:

traceroute to 1.0.1.2 (1.0.1.2), 30 hops max, 60 byte packets 1 2.0.1.1 (2.0.1.1) 1.104 ms 1.235 ms 1.729 ms 2 2.0.1.2 (2.0.1.2) 2.106 ms 2.007 ms 2.156 ms 3 2.0.2.1 (2.0.2.1) 35.034 ms 24.490 ms 21.626 ms 4 1.0.1.2 (1.0.1.2) 40.830 ms 44.694 ms 75.620 ms Note that the underlay and overlay might exist in completely separate addressing domains. Thus H1 might not be able to reach any of the underlay addresses. And the underlay IP addresses might overlap the overlay IP addresses. For example, it would be completely valid to see e.g. VtepA having the same IP address as H1. The user of this tool need to understand that the utility of the traceroute output is to get information to determine whether the issue is in the underlay or overlay, and be able to pass the underlay information to the operator of the underlay. In overlay networks without any ARP/ND optimizations ARP/ND packets would be flooded between the tunnel endpoints. Thus if there is some communication failure between H1 and H2, then H1 above might not have an ARP entry for H2. This results in traceroute not being able to output any data. This implies that in order to use traceroute to trouble shoot the issue one would need some workaround, such as installing some temporary ARP entries on the hosts.

The figure above has a overlay router the nexthop as seen by H1. In this case a normal overlay traceroute would be able to display the overlay path i.e.

traceroute to H4, 30 hops max, 60 byte packets 1 R2 2 R3 3 H4 The layer-transcending traceroute would show the combination of the underlay and overlay paths i.e.,

traceroute to H4, 30 hops max, 60 byte packets 1 VtepA 2 R1 3 VtepB 4 R2 5 R3 6 H4

The figure above has multiple overlay network segments, that are connected in one router which provides the tunnel endpoints for both overlay segments plus routing for the overlay. A more general picture would be to have an overlay routed path between the two NVEs e.g., VtepB and VtepC connected to different routers in the overlay. However, such a drawing in ASCII art doesn't fit on the page. An normal overlay traceroute in the above topology would show the overlay router i.e.,

traceroute to H6, 30 hops max, 60 byte packets 1 R5 2 H6 The layer-transcending traceroute would show the combination of the underlay and overlay paths i.e.,

traceroute to H6, 30 hops max, 60 byte packets 1 VtepA 2 R1 3 VtepB 4 R5 5 VtepC 6 R6 7 VtepD 8 H6 Note that the R3 device, which include VtepB and VtepC, appears as three hops in the traceroute output. That is needed to be able to correlate the output with the overlay output which has R3. That correlation would be hard if the R3 device only appeared as VtepB in the LTTON output. The three-hop representation also stays invariant whether or not the NVEs and overlay router are implemented by a single device or multiple devices.

The network admin needs to be able to control who can use the layer transcending traceroute, since the operator might not want to disclose the underlay topology to all its users all the time. There are different approaches for this such as designating particular ports (Virtual Access Points in NVO3 terminology) on a NVE to have uniform ttl tunnel model. We have found it useful to be able to enable this capability on a per port and/or virtual network basis, in addition to having a global setting per NVE. When enabled on the NVEs the user on the TS needs to be able to control which traffic is subject to which tunnel mode. The normal traffic would use the pipe ttl tunnel model and only explicit trace applications are likely to want to use the uniform ttl tunnel model. Hence it makes sense to use some marker in the packets sent by the TS to select those packets for uniform model on the NVE. Such a mechanism should usable so that the user can perform both a regular traceroute and a LTTON. Potentially different fields in the packets originated by traceroute on the TS can be used to mark the packets for uniform ttl tunnel model. However, many of those fields such as source and destination port numbers and protocol might be used in hashing for ECMP. The marking that can be used without impacting ECMP is the DSCP field in the packet. That field can be set with an option (--tos) in at least some existing traceroute implementations. Note that when DSCP is used for such marking it is a configured choice subject to agreement between the operator of the TS and NVE. The matching on the NVE should ignore the ECN bits as to not interfere with ECN. However, the DSCP value used in the overlay might have an impact on the forwarding of the packets. In such a case one can use an alternative selector such as the UDP source port number. That has the downside of affecting the has values used for ECMP and link aggregation port selection.

When this approach is applied to VXLAN the decapsulating NVE has to be able to identify packets that have to be processed in the uniform ttl tunnel model way. For that purpose we define a new flag which is sent by the encapsulating NVE on selected packets, and is used by the decapsulating NVE to perform the ttl copyin, decrement and check. In addition to the one I-flag defined in we define a new T-flag to capture this the trace behavior at the decapsulating tunnel endpoint.

New fields: When set indicates that decapsulator should take the outer ttl and copy it to the inner ttl, and then check and decrement the resulting ttl.

If the uniform ttl model is enabled for the input, and the received naked packet matches the selector, then the ingress NVE will perform these additional operations as part of encapsulating an IPv4 or IPv6 packet: Examine the IPv4 TTL (or IPv6 hopcount, respectively) on receipt and if 1 or less, then drop the packet and send an ICMPv4 (or ICMPv6) ttl exceeded back to the original host. Since the NVE is operating on a L2 packet, it might not have any layer 3 interfaces or routes for the originating host. Thus it sends the packet back to the source L2 address of the packet back out the ingress port - without any IP address lookup. If ttl did not expire, then decrement the above ttl/hopcount and place it in the outer IP header. Encapsulate and send the packet as normal. If some other errors prevent sending the packet (such as unknown VN Context Id, no flood list configured), then the NVE SHOULD send an ICMP host unreachable back to the host. The ingress NVE will receive ICMP errors from underlay routers and the egress NVE; whether due to ttl exceeded or underlay issues such as host unreachable, or packet too big errors. The NVE should take such errors, and in addition to any local syslog etc, generate an ICMP error sent back to the host. The principle for this is specified in and . Just like in those specifications, for the inner and outer IP header could be off different version. A common case of that might be an IPv6 overlay with an IPv4 underlay. That case requires some changes in the ICMP type and code values in addition to recreating the packets. The place where LTTON differs from those specifications is that there is an NVO3 header and (for L2 over L3) and L2 header in the packet. The figures below show an example of ICMP header re-generation at VtepA for the case of IPv6 overlay with IPv4 underlay. The case of IPv4 over IPv4 is similar and simpler since the ICMP header is the same for both overlay and underlay. The example uses VXLAN encapsulation to provide the concrete details, but the approach applies to other NVO3 proposals.

The above underlay ICMPv4 is used to form an overlay ICMPv6 packet by extracting the Ethernet DA from the inner Ethernet SA, and forming an IPv6 header where the source address is based on the source address of the ICMPv4 error. The ICMPv6 type and code values are set based on the ICMPv4 type and code values.

In the case of IPv6 over IPv4 the above example setting of the IPv6 source address results in this type of traceroute output:

traceroute to 2000:0:0:40::2, 30 hops max, 80 byte packets 1 ::2.0.1.1 (::2.0.1.1) 1.231 ms 1.004 ms 1.126 ms 2 ::2.0.1.2 (::2.0.1.2) 1.994 ms 2.301 ms 2.016 ms 3 ::2.0.2.1 (::2.0.2.1) 18.846 ms 30.582 ms 19.776 ms 4 2000:0:0:40::2 (2000:0:0:40::2) 48.964 ms 60.131 ms 53.895 ms

If this uniform ttl model is enabled on the decapsulating NVE, and the overlay header indicates that uniform ttl model applies (the T-bit in the case of VXLAN), then the NVE will perform these additional operations as part of decapsulating a packet where the inner packet is an IPv4 or IPv6 packet: Examine the outer IPv4 TTL (or outer IPv6 hopcount, respectively) on receipt and if 1 or less, then drop the packet and send an outer ICMPv4 (or ICMPv6) ttl exceeded back to the source of the outer packet i.e., the ingress NVE. This ICMP packet should look the same as an ICMP error generated by an underlay router, and the requirement in on the size of the packet in error applies. If ttl did not expire, then decrement the above ttl/hopcount and place it in the inner IP header. If the inner IP header is IPv4 then update the IPv4 header checksum. Then decapsulate and send the packet as for other decapsulated packets. If some other errors prevent sending the packet (such as unknown VN Context Id), then the NVE SHOULD send an ICMP host unreachable instead of a ttl exceeded error.

The technique for selecting ttl behavior specified in this draft can also be used to trigger other ICMPv4 and ICMPv6 errors. For example, specifies how ICMP packet too big from underlay routers can be used to report over ICMP packet too big errors to the original source. Other errors that are more specific to the overlay protocol might also be useful, such as not being able to find a VNI ID for the incoming port,vlan, or not being able to flood the packet if the packet is a Broadcast, Unknown unicast, or Multicast packet.

The Downstream Egress Paths Object MAY be appended to the ICMP Time Exceeded and Destination Unreachable messages. A single instance of the Downstream Egress Paths Object represents the egress paths at the router that sends the ICMP message. The Downstream Egress Paths Object must be preceded by an ICMP Extension Structure Header and an ICMP Object Header. Both are defined in . The format follows closely with some generalizations for Multipath types. Class-Num = TBA by IANA, Downstream Egress Paths Class C-Type = 1. If the replying router is the destination of the echo request, then a Downstream Egress Paths Object SHOULD NOT be included in the ICMP Error message. Otherwise the replying router MAY append a Downstream Egress Paths Object for all interfaces over which the echo request packet could be forwarded. The Object Length is K*N + M*N, where M is the Multipath Length for each egress path, M may not be the same for different paths. Values for K are found in the description of Address Type below. The Downstream Egress Paths Object has the following format:

The MTU is the size in octets of the largest IP frame that fits on the downstream interface. The Address Type indicates if the interface is numbered or unnumbered. It also determines the length of the Downstream IP Address and Downstream Interface fields. The resulting total for the initial part of the one path of the downstream Egress Paths Object is listed in the table below as "K Octets". The Address Type is set to one of the following values:

Type # Address Type K Octets ------ ------------ -------- 1 IPv4 Numbered 16 2 IPv4 Unnumbered 16 3 IPv6 Numbered 40 4 IPv6 Unnumbered 28 IPv4 addresses and interface indices are encoded in 4 octets; IPv6 addresses are encoded in 16 octets. If the interface to the downstream router has a unique IP address (e.g., it is numbered and not a LAG), then the Address Type MUST be set to IPv4 or IPv6, the Downstream IP Address MUST be set to either the downstream router's Router ID or the interface address of the downstream router, and the Downstream Interface Address MUST be set to the downstream router's interface address. If the interface to the downstream router does not have a unique IP address (e.g., it is is unnumbered or a LAG), the Address Type MUST be IPv4 Unnumbered or IPv6 Unnumbered, the Downstream IP Address MUST be the downstream router's Router ID or the interface address of the downstream router, and the Downstream Interface Address MUST be set to the index assigned by the upstream router to the interface. The following Multipath Types are defined:

Key Type Multipath Information --- ---------------- --------------------- 0 no multipath Empty (Multipath Length = 0) 1 MAC SA/DA Inner MAC in tunnel payload 2 IP Src/Dest Inner IP src/dest in tunnel payload 3 L4 src port L4 src ports in tunnel payload 4 L4 src port range low/high L4 src port pairs Type 0 indicates that all packets will be forwarded out this one interface. Types 1 through 4 specify that the supplied Multipath Information will serve to exercise this path. The length in octets of the Multipath Information. The Multipath Information encodes L4 source ports that will exercise this path. The Multipath Information depends on the Multipath Type. The contents of the field are shown in the table above. For Type 4, ranges indicated by L4 source port pairs MUST NOT overlap and MUST be in ascending sequence.

The considerations in apply. In addition, the use of the uniform ttl tunnel model will result in ICMP errors being generated by underlay routers and consumed by NVEs. That resents an attack vector which does not exist in a pipe ttl tunnel model. However, ICMP errors should be rate limited . Implementations should also take appropriate measures in rate limiting the input rate for ICMP errors that are processed by limited CPU resources. Some implementations might handle the trace packets (with uniform ttl model) in software while the pipe ttl model packets can be handled in hardware. In such a case the implementation should have mechanisms to avoid starvation of limited CPU resources due to these packets.

TBD

The authors acknowledge the helpful comments from David Black and Diego García del Rio.