<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: http://xml.resource.org. -->

<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629.xml">
<!ENTITY RFC3277 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3277.xml">
<!ENTITY RFC3719 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3719.xml">
<!ENTITY RFC4271 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4271.xml">
<!ENTITY RFC5120 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5120.xml">
<!ENTITY RFC5301 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5301.xml">
<!ENTITY RFC5303 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5303.xml">
<!ENTITY RFC5304 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5304.xml">
<!ENTITY RFC5305 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5305.xml">
<!ENTITY RFC5308 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5308.xml">
<!ENTITY RFC5309 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5309.xml">
<!ENTITY RFC5311 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5311.xml">
<!ENTITY RFC5316 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5316.xml">
<!ENTITY RFC5440 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5440.xml">
<!ENTITY RFC5449 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5449.xml">
<!ENTITY RFC5614 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5614.xml">
<!ENTITY RFC5837 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5837.xml">
<!ENTITY RFC5820 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5820.xml">
<!ENTITY RFC6232 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6232.xml">
<!ENTITY RFC7182 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7182.xml">
<!ENTITY RFC7356 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7356.xml">
<!ENTITY RFC7921 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7921.xml">
<!ENTITY RFC7981 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7981.xml">
<!ENTITY RFC8174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8174.xml">
]>

<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc strict="yes" ?>
<?rfc toc="yes"?>
<?rfc tocdepth="4"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>
<rfc category="info" docName="draft-white-distoptflood-02" ipr="trust200902">

<!-- ***** FRONT MATTER ***** -->

<front>

<title>IS-IS Optimal Distributed Flooding for Dense Topologies</title>

<author initials='R.' surname='White' fullname='Russ White'>
<organization>Juniper Networks</organization>
<address>
<email>russ@riw.us</email>
</address>
</author>

<author initials='S.' surname='Hegde' fullname='Shraddha Hegde'>
<organization>Juniper Networks</organization>
<address>
<email>shraddha@juniper.net</email>
</address>
</author>

<author initials='S.' surname='Zandi' fullname='Shawn Zandi'>
<organization>LinkedIn</organization>
<address>
<email>szandi@linkedin.com</email>
</address>
</author>

<date/>

<abstract>
<t>In dense topologies, such as data center fabrics based on the Clos and butterfly fabric topologies, flooding mechanisms designed for sparse topologies can "overflood," or carry too many copies of topology and reachability information to fabric devices. This results in slower convergence times and higher resource utilization. The modifications to the flooding mechanism in the Intermediate System to Intermediate System (IS-IS) link state protocol described in this document reduce resource utilization to a minimum, while increasing convergence performance in dense topologies.</t>

<t>Note that a Clos fabric is used as the primary example of a dense flooding topology throughout this document. However, the flooding optimizations described in this document apply to any dense topology.</t>

</abstract>

</front>

<middle>

<!-- 1 -->
<section title="Introduction" toc="default">

<!-- 2 -->
<section title="Goals" toc="default">

<t>The goal of this draft is to solve one specific set of problems involved in operating a link state protocol in a dense mesh topology. The problem with such topologies is the connectivity density, which causes too much information to be flooded (or too much repeated state to be flooded). Analysis and experiment show, for instance, that in a butterfly fabric of around 2500 intermediate systems, each intermediate system will receive 40+ copies of any changed LSP fragment. This not only wastes bandwidth and processor time, but also dramatically slows convergence.</t>

<t>This document describes a set of modifications to existing IS-IS flooding mechanisms which minimize the number of LSP fragments received by individual intermediate systems, potentially to one copy per intermediate system. The mechanisms described in this document are similar to those implemented in OSPF to support mobile ad-hoc networks, as described in <xref target="RFC5449" />, <xref target="RFC5614" />, and <xref target="RFC7182" />. These mechanisms have been widely deployed and tested.</t>

</section> <!-- end of goals -->

<!-- 2 -->
<section title="Contributors" toc="default">

<t>The following people have contributed to this draft: Nikos Triantafillis, Ivan Pepelnjak, Christian Franke, Hannes Gredler, Les Ginsberg, Naiming Shen, Uma Chunduri, Nick Russo, and Rodny Molina.</t>

</section> <!-- end of contributors -->

<section title="Experience" toc="default">

<t>The modifications described in this draft have been implemented in the FRRouting open source routing stack, and hence are available for testing and modification. The implementation is part of the openfabric daemon (openfabricd), which can be conditionally compiled from isisd. Note that openfabricd has further modifications that are not described in this document.</t>

<t>Lab testing shows these modifications reduce flooding in a large scale emulated butterfly network topology. Without these modifications, intermediate systems receive, on average, 40 copies of any changed LSP fragment. With these modifications, intermediate systems receive, on average, two copies of any changed LSP fragment. In many cases, each intermediate system receives one copy of each changed LSP. In terms of performance, the modifications described here reduce convergence times by around 50%. A network that converges in about 30-40 seconds without these modifications converges in 15-20 seconds with these modifications. Processor load was not measured, as this was an emulated environment.</t>

</section> <!--end of experience -->

<!-- 2 -->
<section title="Sample Network" toc="default">

<t>The following spine and leaf fabric will be used to describe these modifications.</t>

<figure align="center" anchor="is-model">
<artwork align="left"><![CDATA[
+----+ +----+ +----+ +----+ +----+ +----+
| 1A | | 1B | | 1C | | 1D | | 1E | | 1F | (T0)
+----+ +----+ +----+ +----+ +----+ +----+

+----+ +----+ +----+ +----+ +----+ +----+
| 2A | | 2B | | 2C | | 2D | | 2E | | 2F | (T1)
+----+ +----+ +----+ +----+ +----+ +----+

+----+ +----+ +----+ +----+ +----+ +----+
| 3A | | 3B | | 3C | | 3D | | 3E | | 3F | (T2)
+----+ +----+ +----+ +----+ +----+ +----+

+----+ +----+ +----+ +----+ +----+ +----+
| 4A | | 4B | | 4C | | 4D | | 4E | | 4F | (T1)
+----+ +----+ +----+ +----+ +----+ +----+

+----+ +----+ +----+ +----+ +----+ +----+
| 5A | | 5B | | 5C | | 5D | | 5E | | 5F | (T0)
+----+ +----+ +----+ +----+ +----+ +----+
]]></artwork>
</figure>

<t>To reduce confusion (spine and leaf fabrics are difficult to draw in plain text art), this diagram does not contain the connections between devices. The reader should assume that each device in a given layer is connected to every device in the layer above it. For instance:</t>

<t>
<list style="symbols">
<t>5A is connected to 4A, 4B, 4C, 4D, 4E, and 4F</t>
<t>5B is connected to 4A, 4B, 4C, 4D, 4E, and 4F</t>
<t>4A is connected to 3A, 3B, 3C, 3D, 3E, 3F, 5A, 5B, 5C, 5D, 5E, and 5F</t>
<t>4B is connected to 3A, 3B, 3C, 3D, 3E, 3F, 5A, 5B, 5C, 5D, 5E, and 5F</t>
<t>etc.</t>
</list>
</t>

<t>The tiers or stages of the fabric are also marked for easier reference. The T0 intermediate systems are assumed to be connected to application servers; that is, they are Top of Rack (ToR) intermediate systems. The remaining tiers, T1 and T2, are connected only to the fabric itself.</t>

</section> <!-- end of sample network -->

</section> <!-- End of the introduction section -->

<!-- 1 -->
<section title="Flooding Modifications" toc="default">

<t>Flooding is perhaps the most challenging scaling issue for a link state protocol running on a dense, large scale fabric. This section describes modifications to the IS-IS flooding process to reduce flooding load on a dense or mesh topology.</t>

<!-- 2 -->
<section title="Optimizing Flooding" toc="default">

<t>To reduce the flooding of link state information in the form of Link State Protocol Data Units (LSPs), the following tables are required to compute a set of reflooders:</t>

<t>
<list style="symbols">
<t>Neighbor List (NL): The set of neighbors</t>
<t>Neighbors' Neighbors (NN) list: The set of neighbors' neighbors; this can be calculated by running SPF truncated to two hops</t>
<t>Do Not Reflood (DNR) list: The set of neighbors that should receive LSPs (or fragments) but should not reflood them</t>
<t>Reflood (RF) list: The set of neighbors that should flood LSPs (or fragments) to their adjacent neighbors to ensure synchronization</t>
</list>
</t>

<t>NL is set to contain all neighbors, and is sorted deterministically (for instance, from the highest IS identifier to the lowest). All intermediate systems within a single fabric SHOULD use the same mechanism for sorting the NL list. NN is set to contain all neighbors' neighbors, that is, all intermediate systems that are two hops away, as determined by performing a truncated SPF. The DNR and RF lists are initially empty. To begin, the following step is taken to reduce the size of NN and NL:</t>

<t>
<list style="symbols">
<t>Remove all intermediate systems from NL and NN that are on the shortest path to the IS that originated the LSP</t>
</list>
</t>

<t>Then, for every IS in NL:</t>

<t>
<list style="symbols">
<t>If the current entry in NL is connected to any entries in NN:
<list style="symbols">
<t>Move the IS to RF</t>
<t>Remove the intermediate systems connected to the IS from NN</t>
</list></t>
<t>Else move the IS to DNR</t>
</list>
</t>

<t>The calculation terminates when the NL is empty.</t>
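
<t>The steps above can be sketched in Python. This is an illustrative sketch only, not taken from any implementation. The function name select_reflooders, the adjacency map adj, and the on_path parameter are hypothetical names introduced here; the set of systems on the shortest path to the originator is assumed to be supplied by a separate SPF computation, and system identifiers are assumed to sort as plain strings.</t>

<figure align="center">
<artwork align="left"><![CDATA[
```python
from collections import defaultdict


def select_reflooders(adj, local, on_path=()):
    """Return (rf, dnr) for the IS named local.

    adj maps each system id to the set of its adjacent systems.
    on_path holds the systems on the shortest path to the LSP
    originator; it is assumed to come from a separate SPF run.
    """
    skip = set(on_path)
    # NL: neighbors, sorted from highest to lowest identifier.
    nl = sorted(adj[local] - skip, reverse=True)
    # NN: systems exactly two hops away from local.
    nn = set()
    for nbr in adj[local]:
        nn |= adj[nbr]
    nn -= adj[local] | skip | {local}

    rf, dnr = [], []
    for nbr in nl:
        covered = adj[nbr] & nn
        if covered:
            rf.append(nbr)      # nbr refloods to reach these
            nn -= covered
        else:
            dnr.append(nbr)     # nothing new behind nbr
    return rf, dnr


# Tiers 3 to 5 of the sample fabric, fully meshed between tiers.
adj = defaultdict(set)
for x in "ABCDEF":
    for y in "ABCDEF":
        for lo, hi in (("5" + x, "4" + y), ("4" + x, "3" + y)):
            adj[lo].add(hi)
            adj[hi].add(lo)

rf, dnr = select_reflooders(adj, "5A")
print(rf)   # ['4F']
print(dnr)  # ['4E', '4D', '4C', '4B', '4A']
```
]]></artwork>
</figure>

<t>In this sketch, 5A originates the LSP, so no systems are removed for being on the path to the originator. Because 4F is adjacent to every IS in NN, it alone lands on the RF list, and the remaining five neighbors land on the DNR list.</t>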

<t>When flooding, LSPs transmitted to adjacent neighbors on the RF list will be transmitted normally. Adjacent intermediate systems on this list will reflood received LSPs into the next stage of the topology, ensuring database synchronization. LSPs transmitted to adjacent neighbors on the DNR list, however, MUST be transmitted using a circuit scope PDU as described in <xref target="RFC7356" />.</t>

</section> <!-- end of flooding optimizations -->

<!-- 2 -->
<section title="Flooding Failures" toc="default">

<t>It is possible in some failure modes for flooding to be incomplete because of the flooding optimizations outlined. Specifically, if a reflooder fails, or is somehow disconnected from all the links across which it should be reflooding, it is possible an LSP is only partially flooded through the fabric. To prevent such situations, any IS receiving an LSP transmitted using DNR SHOULD:</t>

<t>
<list style="symbols">
<t>Set a short timer; the default should be less than one second</t>
<t>When the timer expires, send a Complete Sequence Number Packet (CSNP) to all neighbors</t>
<t>Process any Partial Sequence Number Packets (PSNPs) as required to resynchronize</t>
<t>If a resynchronization is required, notify the network operator through a network management system</t>
</list>
</t>
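
<t>One way to picture the timer behavior listed above is the following Python sketch. It is a simplified model under stated assumptions, not a description of any implementation: the class name ResyncTimer, the send_csnp callback, and the 0.5 second default are hypothetical, and the choice to arm the timer once rather than restart it on each received LSP is an assumption this document does not specify. PSNP processing and operator notification are omitted.</t>

<figure align="center">
<artwork align="left"><![CDATA[
```python
class ResyncTimer:
    """One-shot timer armed when an LSP arrives via DNR."""

    def __init__(self, send_csnp, delay=0.5):
        self.send_csnp = send_csnp  # hypothetical callback
        self.delay = delay          # assumed default, under 1 s
        self.deadline = None

    def on_dnr_lsp(self, now):
        # Arm once; later LSPs share the pending timer.
        if self.deadline is None:
            self.deadline = now + self.delay

    def tick(self, now, neighbors):
        # On expiry, send a CSNP to every neighbor and disarm.
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None
            for nbr in neighbors:
                self.send_csnp(nbr)


sent = []
timer = ResyncTimer(sent.append)
timer.on_dnr_lsp(now=10.0)
timer.tick(now=10.2, neighbors=["4A", "4B"])  # too early
timer.tick(now=10.6, neighbors=["4A", "4B"])  # timer expired
print(sent)  # ['4A', '4B']
```
]]></artwork>
</figure>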

</section> <!-- end of flooding failures -->

</section> <!-- end of optimizing flooding -->

<!-- 1 -->
<section title="Use of Flooding Leaders and Flooding Mechanism Advertisements" toc="default">

<t><xref target="I-D.ietf-lsr-dynamic-flooding" />, section 5.1.1, describes the election of a flooding domain leader, which can advertise the kind of flooding reduction mechanism used in the flooding domain. Implementations of this draft MAY implement the election and advertisement of a flooding domain leader as described in section 5.1.1 of <xref target="I-D.ietf-lsr-dynamic-flooding" />. If the election of a flooding domain leader is implemented, implementations SHOULD also advertise the flooding mechanism using the IS-IS Dynamic Flooding Sub-TLV described in section 5.1.2 of <xref target="I-D.ietf-lsr-dynamic-flooding" />, using Algorithm number (TBD).</t>

</section> <!-- end of integration with draft-ietf-lsr-dynamic-flooding -->

<!-- 1 -->
<section title="Security Considerations" toc="default">

<t>This document outlines modifications to the IS-IS protocol for operation on high density network topologies. Implementations SHOULD implement IS-IS cryptographic authentication, as described in <xref target="RFC5304" />, and should enable other security measures in accordance with best common practices for the IS-IS protocol.</t>

</section> <!-- end of security considerations -->

</middle>

<back>

<references title="Normative References">

&RFC2119;
&RFC2629;
&RFC5120;
&RFC5301;
&RFC5303;
&RFC5305;
&RFC5308;
&RFC5309;
&RFC5311;
&RFC5316;
&RFC7356;
&RFC7981;
&RFC8174;

<?rfc include="reference.I-D.ietf-lsr-dynamic-flooding.xml"?>

<reference anchor="ISO10589">
  <front>
    <title>Intermediate system to Intermediate system intra-domain
           routeing information exchange protocol for use in conjunction with
           the protocol for providing the connectionless-mode Network Service
           (ISO 8473)</title>

    <author>
      <organization abbrev="ISO">International Organization for Standardization</organization>
    </author>

    <date month="Nov" year="2002"/>
  </front>

  <seriesInfo name="ISO/IEC" value="10589:2002, Second Edition"/>
</reference>

</references> <!-- end of normative references -->

<references title="Informative References">

&RFC3277;
&RFC3719;
&RFC4271;
&RFC5304;
&RFC5440;
&RFC5449;
&RFC5614;
&RFC5820;
&RFC5837;
&RFC6232;
&RFC7182;
&RFC7921;

<?rfc include="reference.I-D.ietf-isis-segment-routing-extensions.xml"?>

</references> <!-- end of informative references -->

<!-- 1 -->
<section title="Flooding Optimization Operation" toc="default">

<t>Recent testing has shown that flooding is largely a "non-issue" in terms of scaling when using high speed links connecting intermediate systems with reasonable processing power and memory. However, testing has also shown that flooding will impact convergence speed even in such environments, and flooding optimization has a major impact on the performance of a link state protocol in resource constrained environments. Some thoughts on flooding optimization in general, and the flooding optimization contained in this document, follow.</t>

<t>There are two general classes of flooding optimization available for link state protocols. The first class of optimization relies on a centralized service or server to gather the link state information and redistribute it back into the intermediate systems making up the fabric. Such solutions are attractive in many, but not all, environments; hence these systems complement, rather than compete with, the system described here. Systems relying on a service or server necessarily also rely on connectivity to that service or server, either through an out-of-band network or connectivity through the fabric itself. Because of this, these mechanisms do not apply to all deployments; some deployments require underlying reachability regardless of connectivity to an outside service or server.</t>

<t>The second possibility is to create a fully distributed system that floods the minimal amount of information possible to every intermediate system. The system described in this draft is an example of such a system. Again, there are many ways to accomplish this goal, but simplicity is a primary goal of the system described in this draft.</t>

<t>The system described here divides the work into two different parts: forward and reverse optimization. The forward optimization begins by finding the set of intermediate systems two hops away from the flooding device, and choosing a subset of connected neighbors that will successfully reach this entire set of intermediate systems, as shown in the diagram below.</t>

<figure align="center" anchor="two-hop-n">
<artwork align="left"><![CDATA[
G
|
A     B    C--+
|     |    |  |
+--D--+    E  H
   |       |  |
   +----F--+--+
]]></artwork>
</figure>

<t>If F is flooding some piece of information, it will find the entire set of intermediate systems within two hops by discovering its neighbors and their neighbors from the local LSDB. This set will include A, B, C, D, E, and H, but not G, which is three hops away. From this set, F can determine that D can reach A and B, while a single flood to either E or H will reach C. Hence F can flood to D and to either E or H to reach every IS in the two hop set. Suppose F chooses to flood to D and E normally. Because H still needs to receive this new LSP (or fragment), but does not need to reflood to C, F can send the LSP to H using link local signaling. In this case, H will receive and process the new LSP, but will not reflood it.</t>
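
<t>The two hop discovery described above can be sketched as a breadth-first search that stops expanding at a hop limit. This is a minimal illustration that assumes every link has the same metric, so hop count stands in for the truncated SPF; the function name within_hops and the link list encoding of the figure are hypothetical.</t>

<figure align="center">
<artwork align="left"><![CDATA[
```python
from collections import deque


def within_hops(adj, root, limit=2):
    """Map each system within limit hops of root to its distance."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if dist[node] == limit:
            continue  # do not expand beyond the hop limit
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    del dist[root]
    return dist


# Links read from the figure above, one two-letter pair per link.
links = ["GA", "AD", "BD", "DF", "CE", "CH", "EF", "HF"]
adj = {}
for a, b in links:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

hops = within_hops(adj, "F")
neighbors = sorted(n for n, d in hops.items() if d == 1)
two_hop = sorted(n for n, d in hops.items() if d == 2)
print(neighbors)    # ['D', 'E', 'H']
print(two_hop)      # ['A', 'B', 'C']
print("G" in hops)  # False
```
]]></artwork>
</figure>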

<t>Rather than carrying the information necessary through hello extensions, as is done in <xref target="RFC5820" />, the neighbors are allowed to complete initial synchronization, and then a truncated shortest path tree is built to determine the "two hop neighborhood." This has the advantage of using mechanisms already used in IS-IS, rather than adding new processes. The risk with this process is any LSPs flooded through the network before this initial calculation takes place will be suboptimal. This "two hop neighborhood" process has been used in OSPF deployments for a number of years, and has proven stable in practice.</t>

<t>Rather than setting a timer for reflooding, the implementation described here uses IS-IS' ability to describe the entire database using a CSNP to ensure flooding is successful. This adds some small amount of overhead, so there is some balance between optimal flooding and ensuring flooding is complete.</t>

<t>The reverse optimization is simpler. It relies on the observation that any intermediate system between the local IS and the origin of the LSP, other than in the case of floods removing an LSP from the shared LSDB, should have already received a copy of the LSP. For instance, if F originates an LSP in the figure above, and E refloods the LSP to C, C does not need to reflood to H if H is on its shortest path tree towards F. This is clearly not a "perfect" optimization. A perfect optimization would block flooding back along a directed acyclic graph towards the originator. Using the SPT, however, is a quick way to reduce flooding without performing more calculations.</t>

<t>The combination of these two optimizations has been seen, in testing, to reduce the number of copies any IS receives from the tens to precisely one.</t>

</section>
<!-- end of flooding optimization operation appendix -->

</back>

</rfc>
