Internet DRAFT - draft-ietf-ipngwg-gseaddr


Network Working Group                                        Mike O'Dell
Internet-Draft                                        UUNET Technologies
                                                  1997/02/24 01:32:32GMT

          GSE - An Alternate Addressing Architecture for IPv6


1. Status of this Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as ``work in progress.''

   To learn the current status of any Internet-Draft, please check the
   1id-abstracts.txt listing contained in the Internet-Drafts Shadow
   Directories on (Africa) , (Europe), (Pacific Rim), (US East Coast ), or (US West Coast).

2. Abstract

   This document presents an alternative addressing architecture for
   IPv6 which controls global routing growth by very aggressive
   topological aggregation. It includes support for scalable multi-
   homing as a distinguished service.  It provides for future
   independent evolution of routing and forwarding models with
   essentially no impact on end systems.  Finally, it frees sites and
   service resellers from the tyranny of CIDR-based aggregation by
   providing transparent re-homing of both.

3. Introduction

   This alternative IPv6 addressing architecture addresses several
   scalability issues with the current IPv6 addressing proposals.

           Scaling of the global route computation

           Ease of re-homing (both leaf Sites and upstream Resellers)

           Economic scalability of of Multi-homing

O'Dell v3.7                                                     [Page 1]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

   The current IPv6 addressing proposals address route and topology
   aggregation by continuing to rely on CIDR-style "Provider-based
   Addressing" coupled with a powerful new dynamic address assignment
   mechanism which is intended to make renumbering more palatable.

   However, CIDR-style provider-based aggregation breaks down in the
   face of the accelerating growth of multi-homed sites (leaf sites or
   regional networks).  Worse, renumbering an entire Site to accomplish
   a simple topological re-homing such as changing ISPs is a problem
   whose magnitude can only grow over time. It will remain increasingly
   difficult to explain this renumbering requirement to customers with
   the spectre of a complete failure of this aggregation approach a
   distinct possibility.

   While the large IPv6 addresses provide for a huge increase in the
   number of end systems which can be accommodated, it also portends a
   huge increase in the number of routes required to reach them. Even if
   CIDR aggregation were to continue at current levels (maintaining
   current efficiency is relatively unlikely), this still presents a
   serious problem for the growth of the the global route computations.

   This document presents a new proposal for using the 16 byte IPv6
   address which mitigates the route scaling problem and with it a
   number of collateral issues.  This model provides for aggressive
   topological aggregation while controlling the complexity of flat-
   routed regions.  It exploits and supports the dynamic address
   assignment machinery in IPv6 but makes the exact role of that
   machinery a decision local to a Site.  It is therefore subject to
   engineering cost and benefit analysis rather than being mandatory for
   simple Site re-homing situations.

   This new model also identifies the special work done by the global
   Internet infrastructure on behalf of multi-homed sites. Rather than
   continuing the current "Tragedy of the Commons", the multi-homing is
   isolated into a specific mechanism which is then traceable to and
   incurred by only those sites wishing to subscribe to this capability.
   Again, this makes it possible for sites to make informed cost-benefit
   decisions about multi-homing.

4. Central Concepts of the Architecture

   The architecture is based upon a few central concepts.

           A strong distinction between Public and Private Topology

           A strong distinction between system identity and location

           GSE - Global, Site, and End-system address elements

O'Dell v3.7                                                     [Page 2]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

           The deep similarity of Re-homing and Multi-homing

           Rewriting address prefixes at Site boundaries

           Very aggressive hierarchical network topology aggregation

           Optimizing actual forwarding paths by limited-scope

   This model draws a strong distinction between the Public Topology
   which forms the transit infrastructure of the Global Internet and a
   "Site" which can contain a rich but strictly private local network
   topology which cannot "leak" into the global routing machinery.  The
   Site is the fundamental unit of attachment to the Global Internet and
   is therefore strictly a leaf, even if possibly multi-homed.

   This model also draws a very strong distinction between the identity
   of a computer system and where it attaches to the the Public
   Topology.  In IPv4 and current IPv6 models, these notions of identity
   and location are deeply co-mingled and this is the fundamental reason
   why simple topology changes have such wide-ranging impact on address
   assignment (if aggregation is to be maintained at all).

   The 16 byte IPv6 address is split into 3 pieces:

             0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
           |  Routing Goop    | STP| End System Designator |
                  6+ bytes   ~2 bytes       8 bytes

   Routing Goop signifies where the Site attaches to the Global
   Internet.  The Site Topology Partition (STP) is Site-private "LAN
   segment" information.  The End System Designator (ESD) specifies an
   interface on an end-system.

   One surprising notion is that re-homing and multi-homing are very
   deeply related. Multi-homing can be viewed as rather like several
   simultaneous re-homings happening at once.  Achieving both painless
   re-homing and scalable multi-homing rely on the same set of
   fundamental mechanisms, each with a few distinct details.

   Rewriting IPv6 addresses by Site Border Routers is by far the most
   controversial, but also most critical part of this proposal.  To
   control the complexity of routing information which must be managed
   within a Site and to isolate end systems and interior routers from
   external topology changes, the RG of some addresses is modified by
   Site Border Routers.  Packets exiting a site have the RG for the Site

O'Dell v3.7                                                     [Page 3]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

   egress point inserted into source addresses, while packets entering a
   Site have the RG in all destination addresses replaced with a
   canonical prefix signifying "within this Site" (the "Site-local

   One immediate result is that upper-layer protocols must use only the
   ESD for purposes such as pseudo-header checksums and the like.  The
   ESD is the invariant token, the RG is possibly transient topology
   information subject to change.

   Topology aggregation is accomplished by partitioning the Global
   Internet into a set of tree-shaped regions anchored by "Large
   Structures".  The Routing Goop in an address specifies a path from
   the root of the tree (the Large Structure) to a point in the
   topology; in the terminal case this is a Site.  Large Structures are
   chosen by their ability to aggregate topology and no particular
   advantage flows from "being one"; actually quite the contrary. Large
   Structures are responsible for subdividing the space under them and
   managing that delegation.  Large Structures provide a "forwarding
   token of last resort" which can always be used for selecting a valid
   next-hop when no other information is available.  This significantly
   limits the minimally-sufficient information required for a "default-
   free" router.  Any additional route information kept is the result of
   path optimizations from cut-throughs.

   While it is useful to think of the Large Structures as trees, the
   collection is actually a DAG (Directed Acyclic Graph) because the
   trees can touch each other via cut-throughs.  By cross-propagating
   selected details via a cut-through, a locally-controlled region can
   learn of alternative paths to some destinations.  The distance this
   optimization information is propagated and the radius of the
   optimization region advertised are the business of the collaborating

5. The Structure of End System Designators - the ESD

   End System Designators denote every computer system in the GSE
   Internet regardless of whether it is a host, router, or other network
   element.  While a given system can have more than one ESD, each ESD
   is globally unique.  This is critical for their utility to the
   upper-level protocols.  This uniqueness can be induced several ways
   as will be seen.

   A crucial design decision is whether an ESD identifies a system,
   invariant of its interfaces as in the XNS architecture, or an
   interface on a system as in the existing IPv4 and IPv6 architecture.

           An ESD designates an interface on a computer system and that

O'Dell v3.7                                                     [Page 4]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

           interface can be either physical or virtual.

   When processing a GSE address, a computer system need only examine
   the ESD portion of the address to determine whether a packet is
   destined for that system.

   There are circumstances when it is quite useful to have "an address"
   for a computer system which is independent of any particular physical
   interface on that system. It has become commonplace in IPv4 practice
   to use a distinguished virtual interface to provide a system with
   such an "interface independent identity".  This technique affords the
   same architectural utility of XNS while still allowing the
   flexibility of the IPv4 "addressed interface" model. This model
   retains the successful IPv4/IPv6 model.

   NOTE: We remain intentionally vague about exactly what constitutes an
   "interface" and a "computer system".  The malleability of those
   notions in IPv4 has proven manifestly useful in practice.

   To summarize the ESD uniqueness characteristics:

           (1) an ESD is globally unique
           (2) an ESD designates an "interface" on "a computer system"
           (3) an Interface may have more than one ESD
               (current IPv6 already requires implementations to support
               multiple addresses per interface)
           (4) an ESD may not necessarily designate a particular
               physical computer (Neighbor Discovery continues to provide
               a level of virtual address translation and considerable
               cleverness can be disguised therein)

   There are two forms of ESD, both 8 bytes long, one a subcase of the

   It is clear that with the impending onslaught of the IEEE-1394
   technology that 8-byte IEEE MAC addresses are simply fait accompli
   and many devices will be provided with a unique identity in that
   format at the time of manufacture.  The 8-byte IEEE MAC Address
   format includes the current 6-byte MAC Addresses as a proper
   subspace.  Using the 8-byte IEEE MAC address will be very convenient
   for many network builders.

   There are at least two issues with using *only* the IEEE 8-byte MAC
   addresses as ESDs:  There are point-to-point link interfaces which
   have no IEEE MAC address assigned for them, and the 8-byte IEEE MAC
   addresses assigned to the interfaces of a system are essentially
   random.  For some, there is also the issue of whether the IEEE MAC
   address is "unique enough" for the purposes at hand.

O'Dell v3.7                                                     [Page 5]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

   We clearly need a space for generating ESDs for interfaces which
   don't come equipped with one.  Some have also suggested there might
   be great utility in enabling inverse lookups on just the ESD part of
   an address.  Assigning ESDs in semantic clusters (like current IPv4
   addresses) would be a signficant aid to this end. Finally if a
   network designer decides not to trust the uniqueness of the IEEE MAC
   addresses, he could always use the Dynamic Numbering machinery of
   IPv6 to assign ESDs.

   We propose that the IETF seek a large (7 bytes or greater) subspace
   of the IEEE 8-byte MAC space for allocation as IETF-NodeIDs in
   semantic clusters to provide a pool of addresses which can be used
   for any of the above reasons, as required.  However, it is expected
   that most network builders will exploit the intrinsic IEEE MAC
   addresses present in many network interfaces whenever possible.

   The IETF-NodeID space should be partitioned into two regions - one
   exactly isomorphic to the existing IPv4 address space to provide
   instant grandfathering of IPv4 addresses, and another space which is
   simply larger but allocated in a similar manner.

   A few comments on "global uniqueness" are in order because in
   previous discussions, some have asserted that unless "uniqueness" can
   be accomplished with absolute and complete mathematical perfection,
   any scheme using the concept is unworkable.  This extreme view
   inconsistent with mass-market experience.

   IEEE MAC addresses are globally unique by nature of the delegation
   process where they are assigned to interfaces by the manufacturers.
   Both XNS and IPX rely on this uniqueness and it works very well in
   practice.  IETF-NodeID values will be globally unique by nature of
   the same kind of assignment mechanism.  IPv4 addresses must be
   globally unique for the Internet to function, and it does, mostly, by
   nature of exactly the same kind of assignment mechanism.

   While accidents and manufacturing defects do occasionally violate the
   uniqueness of IEEE MAC address assignment, humans routinely make
   errors in assigning IPv4 addresses to systems with equally mystifying
   results.  Given the reliance of IEEE-1394 Firewire interconnects on
   these unique MAC addresses, it is likely that the frequency of these
   occurence (relative to the total number of objects with assigned
   addresses) will only decrease. The economic pressure to insure this
   will be intense.

6. The Structure of a Site

   The GSE global routing architecture ultimately views a Site as a leaf

O'Dell v3.7                                                     [Page 6]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

   of the topology and doesn't concern itself with the interior of this
   private topology.  However, the internal topology of a Site is
   extremely important to the management and operation of the Site so
   the GSE address architecture provides for a rich set of
   organizational alternatives with different cost-benefit tradeoffs.

   The GSE address structure provides for 16384 distinct Site Topology
   Partitions (STPs) within a Site.  This is the number of SEGMENTS in
   the internal topology, not hosts.  The number of attached hosts is
   limited strictly by available local network technology, and the
   Site's ability to buy enough machines to exhaust the available IEEE
   8-byte MAC address space, or the available 7-byte IETF-NodeID space.

   Using this structure, a single Site can develop an internal topology
   which is a very significant fraction of the total CIDR routes in the
   IPv4 Global Internet.

   An organization is not constrained to being structured as a single
   Site.  The trade-off is that the inter-Site topology must then be
   part of the Public Topology. While the individual Sites can retain
   considerable independence in topological structure and attachment to
   the Global Internet, they must be aware of changes between the
   constituent Sites and that re-homing of constituent Sites will
   potentially impact long-running sessions. That is the cost of
   exploiting the routing machinery available to the Public Topology.

   Given the generous flexibility available for organizing a Site, it is
   worthwhile to examine a few examples.  Note that none of these
   organizational approaches is exclusive.  A large Site might well mix
   these approaches to good effect and indeed the goal is to provide the
   designer of private Site topology with a broad spectrum of design

   The simplest structure to imagine is a Site using all IEEE MAC
   Addresses with all the systems connected in a single Private Topology
   Partition (i.e., all the GSE addresses carry the same STP value which
   is assigned by the local network administration).  Given the
   sophistication of current LAN-switching technology, a Site like this
   could be both large and internally complex yet have simple IPv6
   addressing.  The complexity is absorbed into the LAN infrastructure
   and it appears to be only one partition from the GSE Site Topology
   view.  This structure has one very significant advantage:   long-
   running TCP sessions will will survive arbitrary changes in the local
   topology.  This works, of course, because the single STP is a virtual
   topology with the real topology hidden by the LAN Switching

   The second Site model is like the one just described, except it would

O'Dell v3.7                                                     [Page 7]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

   have multiple STPs with routers moving traffic between the segments.
   This is very close to the common IPv4 structure of a CIDR block being
   subnetted to assign a prefix to each STP.  This approach has the
   advantage of familiarity, but it has the disadvantage that long-lived
   TCP connections don't necessarily survive arbitrary changes to the
   private topology. This arises because even though the ESD is
   invariant, reachability will fail because a change in the STP of one
   of the system doesn't get injected into the protocol stack of the
   communicating systems when they move.  The existing IPv6 dynamic
   address assignment machinery will serve to make such internal changes
   much less painful than with IPv4, however.

   One point worth noting is that even with multiple STPs routed within
   a Site, a "Private Topology Partition" need not correspond to a
   "physical" LAN cable.  The STP values could be used to label larger
   organizational structures like "Engineering" or "Finance".  This
   could reduce the likelihood that common internal topology changes
   break long-lived connections.

   The third Site model uses IETF-NodeID ESDs based on existing IPv4
   address assignments.  In this case, all the IPv4-style ESDs could be
   placed in a single STP and then routed internally on the IPv4 address
   in the lowest 4 bytes of the ESD.  It must be emphasized that the
   IPv4 addresses used in IPv4-style ESD must be an officially-
   registered, public-use IPv4 address and NOT an RFC-1918 private-use
   address.  Using an RFC-1918 private-use address violates the global
   uniqueness properties required of an ESD.

   In all of the multi-segment cases, an IETF-NodeID ESD could be used
   to designate any point-to-point link endpoint, the loopback addresses
   in routers, or any other IP-accessible network elements which don't
   naturally have IEEE MAC address for forming an ESD.  And in all of
   the cases, an IETF-NodeID ESDs could be used universally, although it
   is more appropriate to use IEEE ESD form whenever possible.

   In all of the cases where the real topology is not completely
   virtualized by the LAN technology, there will be "Internal
   Renumbering" events caused by moving systems between infrastructure
   segments (STPs).  This will have the effect of killing long-running
   off-Site connections unless provisions are made to allow the systems
   (and the routing infrastructure) to carry the previous ESDs as
   synonyms for a while.  Given that most significant topology moves
   involve powering off the end system in question, this is hardly a
   hardship.  However, the powerful renumbering support already
   developed for IPv6 can make those other moves considerably less

   Most importantly, external re-homing of a Site to the global

O'Dell v3.7                                                     [Page 8]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

   infrastructure can be made completely transparent.

7. Dynamic Address Re-writing by Site Border Routers

   A critical component of this architecture is the modification of
   addresses when packets leave or enter a Site.  Re-writing source
   addresses to insert appropriate Routing Goop at the Site egress point
   was part of the 8+8 proposal, but this proposal extends this to re-
   writing destination addresses when inbound packets arrive at a Site
   Border Router.

   The reasons for both re-writings are the same: to insulate the
   interior of the Site from external topology changes and egress policy

   When a Site Border Router inserts the correct RG in the source
   address of outbound packets, it frees the end-systems in the Site
   from having to know the RG for the Site. This is especially important
   if the site is Multi-homed and the Site implements a complex egress
   selection policy.

   In the case of inbound packets, if the destination address were not
   converted to a canonical form, the Site interior routers would have
   to be aware of all the different RG which could be used to reach the
   site, essentially creating aliasing of the destination addresses.  In
   the singly-homed case, this doesn't seem like a significant issue,
   but in complex Multi-homing scenarios there could be a significant
   problem managing this information.

   This symmetric re-writing essentially isolates the Site from the
   Global Internet just as the hard boundary between RG and STP
   components insulates the Global Internet from the Site topology.

8. The Structure of Routing Goop

   Routing Goop, or "RG" is the upper 6+ bytes of a GSE address.  This
   somewhat non-technical term was chosen because all the other
   alternatives seem to have various degrees of conceptual baggage which
   would be as much work to neutralize as the new notions are to explain
   in the first place.

   Fundamentally, RG is a Locator.  It encodes the topological
   connectivity of the Site containing the computer system identified by
   the ESD in the lower 8 bytes.  In the case of a singly-homed Site,
   re-homing to a new attachment to the Public Topology will change ONLY
   the RG in full GSE addresses for computer systems at that Site.  One
   example of such a re-homing would be a change of the Site's Internet
   Service Provider.  This change-over can be made essentially

O'Dell v3.7                                                     [Page 9]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

   completely transparent to users both inside and outside the Site,
   although it does involve a practical limit on the transition duration
   relating to how long the departing ISP is willing to extend
   transitional courtesies.  During a changeover, though, all new
   connections will be initiated via the new ISP connection.

   This brings up the deep structure of the topology information carried
   in RG and how it is encoded.  More specifically, RG is a hierarchical
   locator which is a rooted path-expression of flat-routed regions
   which are tangent. Each element in the path-expression includes only
   enough detail to negotiate the flat-routed region.

   It has been observed before that the graph of the Global Internet is
   not obviously a hierarchy so how can this work?

   We start with the observation that every connected graph has at least
   one labeling which forms a spanning tree covering the nodes. The
   hierarchy is induced by a labeling function which partitions the
   global graph into regions and recursively into subregions.  This
   function is only globally visible at the top-level where an initial
   partitioning of the graph is used to form the first level of what
   will become the hierarchy.  Within each partition there is a local
   sub-partition function which assigns labels, and we proceed
   recursively. The nested recursions directly induce the hierarchy.

   This decomposition of the Global Internet produces a recursive graph
   where each level is composed of a set of subgraphs which are
   explicitly connected (i.e., explicitly routed between the subgraphs)
   while the structure within each subgraph is assumed to be flat-routed
   (at least as seen at that level).

   From an abstract viewpoint, a hierarchical partitioning can be
   induced with an arbitrary choice of labeling function (as long as the
   function produces the minimally-required partitioning). However, we
   desire the partitions to have several important properties which
   effects the choice of labeling function.

   The general goal is to produce a global labeling which represents the
   topology as compactly as possible, yet allows rich connectivity while
   bounding the complexity of the discrete regions which are flat-

   The top level objects in the GSE graph hierarchy are called "Large
   Structures".  These are objects chosen for their ability to naturally
   represent significant topological aggregation of substructure (not
   geographical, political, or geometric).  The number of Large
   Structures is explicitly limited to bound the complexity at the top
   level of the aggregation graph.

O'Dell v3.7                                                    [Page 10]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

   Within Large Structures, the (sub-)partition function is a trade-off
   between the flat-routing complexity within a region and minimizing
   total depth of the substructure.  This is driven by the internal
   topology of a Large Structure and the choices in different Large
   Structures will not necessarily be the same. This is why Routing Goop
   only has one hard bit boundary; Large Structures are free to
   internally subdivide as they chose. They are only required to
   encapsulate a significant portion of the Public Topology.

   One obvious candidate for Large Structures is large networks which
   already represent considerable aggregation based on existing CIDR
   deployment.  Another good candidate might be "Exchange Points".  The
   GSE model can accommodate both of these simultaneously, allowing
   IPv6-style "Network-anchored Prefixes" and "Exchange-anchored
   Prefixes" like that proposed by some to coexist and be subsumed into
   a unified notion of "Aggregator-anchored Prefixes."  Of course, these
   aren't prefixes strictly in the IPv4 CIDR sense, but the left-
   anchored substrings of the Routing Goop are intuitively quite

   Large Structures are assigned a Large Structure Identifier, known as
   an LSID.  The total number of LSIDs is intentionally limited as we
   assume the paths between Large Structures are only flat-routed.

   Two consenting Large Structures remain free to share a tangency below
   the top level and exchange routes so as to provide for improved
   routing between the two of them (formalizing cut-throughs in the
   natural hierarchy).  The goal is to provide for manageable complexity
   of the ultimate default-free zone (the top level of the global
   hierarchy) while allowing for controlled circumvention of the natural
   hierarchical paths.

O'Dell v3.7                                                    [Page 11]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

   Bit-level structure of Routing Goop:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   | xxx | 13 Bits of LSID         |      Upper 16 bits of Goop    |

    3               4                   5                   6
    2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
   | Bottom 18 bits of Routing Goop    | 14 bits of Site Topology  |

   NOTE: The Routing Goop structure above assumes that the GSE  proposal
   is  designated  by a 3-bit type of IPv6 address.  If a GSE address is
   identified by two upper bits, the LSID would expand to 14  bits.   If
   identified  by  one bit, the LSID would stay at 14 bits and the Upper
   16 bits of Goop would expand to 17 bits.

   Routing between two interior points of two different Large Structures
   is always possible based solely on the LSID. This provides a
   "forwarding strategy of last resort" for a router running "default-
   free".  From one point of view, the LSID partitions the Global
   Internet into a set of regions such that an interior router only need
   carry a "per-LSID default" pointing at an appropriate boundary router
   which knows how to to handle traffic bound outside the containing
   Large Structure for a point in the other Large Structure.

   If two Large Structures share a tangency somewhere below the top
   level, then some interior routers of both Large Structures will share
   routes to exploit the tangency for optimizing paths.  How this cut-
   through information is distributed within the two Large Structures is
   not revealed elsewhere in the global topology. The exact "shape" of
   the optimization region is controlled by the decisions about which
   routes to advertise across the cut-through.  These decisions are made
   by the collaborators and the optimized region need not be symmetric
   with respect to the cut-through.  The size of the optimization area
   is controlled by how far routes learned via the cut-through are
   propagated within the sub-graphs tangent via the cut-through. Again,
   this is a matter of engineering choices made by the collaborators
   operating the cut-through.

   While the LSID is may appear similar to the Autonomous System Number
   currently used in IPv4 policy-based routing machinery, the LSID is
   quite distinct from the AS number and the two identifiers play very
   different roles.  AS Numbers will continue be used for policy routing
   information exchange and must remain distinct from the LSID space.

O'Dell v3.7                                                    [Page 12]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

9. The "Flow" of Routing Goop

   It is intuitively useful to think about Routing Goop as "flowing
   downhill" through the hierarchy from the topmost Large Structures,
   through the intermediate levels of the Public Topology, and
   ultimately down to the Site.  As the RG propagates downward, the
   prefix extends to the right, just like in IPv4 CIDR, with each
   extension navigating the nested flat-routed subgraphs, eventually
   terminating at the Site, which then descends invisibly into the
   Private Topology of that Site.

   The nested flat-routed areas correspond to transit subnetworks of the
   Large Structure.  One very important example of such subnets is the
   "reseller" or "wholesale transit customer" of a Large Structure.
   (Note that whether the Large Structure is a network or an exchange
   point doesn't matter.)  The reseller network provides transit for
   Sites, so must be part of the Public Topology and appears as a
   substring within the Routing Goop, usually the right-most extension
   unless the reseller has further reseller customers.  In that case,
   the next level reseller will have his own extension to record his
   place in the Public Topology and to provide for navigating through it
   as well.

   The overall picture can now be drawn as a forest of trees
   distributing Routing Goop down to the Sites, with each tree being a
   Large Structure and the Large Structures connected arbitrarily at the
   top level. This structure will be mirrored by the actual machinery
   for distributing Routing Goop to the Sites as will be discussed a bit
   later, but this mental image of the prefixes "flowing" from the
   anchoring Large Structures is critical to understanding fundamental
   self-organizing abilities in the GSE model.

   While the GSE machinery is intended to be adequate for almost
   completely automated self-organization with respect to the
   construction and propagation of Routing Goop on an Internet-wide
   basis, we proceed for now closely following current practice
   (admitting manual configuration of certain information like Routing
   Goop) because of the additional complexity of the self-organization
   functions.  Initial deployment following current practice would not
   preclude eventual deployment of a fully self-organizing Global

10. The Distribution of Routing Goop

   There are two cases to consider for how Routing Goop gets
   distributed: source addresses and destination addresses.  In both
   cases RG is part of the address, one way or another, so we show how a
   full 16-byte address with the right RG gets created in these two

O'Dell v3.7                                                    [Page 13]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT


10.1 RG for Source Addresses

   The initial RG of a source address is almost always the Site-local
   prefix.  If the destination address is not within the Site, the
   packet will leave the Site via one of possibly several Site Boundary
   Routers.  The egress Site Border Router will insert the correct RG in
   the source address based on the path the destination should use to
   return a packet to the sender.  Except in unusual circumstances this
   will be the RG which corresponds to the attachment path of that
   egress Site Boundary Router to the Global Internet.

   If the Site is multi-homed via just one Site Boundary Router, then
   the router is free to apply whatever local policy suits. It simply
   must fill in a valid RG path which leads back to a Site Boundary
   Router for that Site.  If the Site is multi-homed via more than one
   Site Boundary Router, which router provides egress is purely local
   policy and which RG gets applied is likewise local policy.

   The dynamic insertion of RG upon Site egress accomplishes a number of

   (1) It means that for most purposes, a computer system at a Site need
   not concern itself with egress policy matters which can be
   particularly tricky in Multi-homed Sites.

   (2) It means that computer systems are essentially not impacted at
   all by topological re-homing of the Site.

   (3) It means that more complex multi-homing scenarios with multiple
   Site Boundary Routers each with multiple connections to the Global
   Internet can execute arbitrarily complex path recovery policy without
   concern for how it might impact a computer system doing source
   address selection.

   (4) It means that while a computer systems might forge the ESD in a
   source address, it CANNOT forge the point of injection into the
   Public Topology.  This is not strong authentication down to the
   particular computer system, but it is probably a strong deterrent to
   certain obnoxious activities due to the dramatically improved
   traceability.  We also note that the first-hop attachment router in
   the Public Topology is free to insert or override the RG if somehow
   an errant packet escapes a Site carrying invalid RG, thereby
   enforcing traceability. Of course, the Public first-hop router could
   always just drop a packet carrying inappropriate source RG as well.
   But to make it very clear, we put the burden of inserting correct RG
   in exiting source addresses squarely and solely on the Site and the

O'Dell v3.7                                                    [Page 14]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

   Site Border Router. Any other location of the task has bad
   performance scaling.

   The Site Border Router acquires the necessary RG from the first-hop
   attachment router in the Public Topology.  Alternately, as an initial
   mechanism the RG could be statically configured, but the real goal is
   completely automated propagation down the tree so that an entire
   complex subtree can be rehomed without human intervention or service

10.2 RG for Destination Addresses

   Currently, an IPv6 address lookup for a DNS name returns the
   information in a "AAAA" record which is the full 16 bytes of the IPv6

   The GSE design proposes synthesizing the 16 bytes of information in a
   query response from two different sources: an "AAA" record and an
   "RG" record.  The "AAA" record carries the 8-byte ESD + ~2 byte STP
   for the DNS name in question and the "RG" record carries 6+ bytes of
   the appropriate Routing Goop.

   One interesting question is how the AAA record gets paired with an RG
   record in a given nameserver.  One simpleminded implementation would
   be to pair an RG record with a zone, but that has the problem of
   requiring all the systems in that zone to use the same Routing Goop
   and hence be in the same Site.

   A better scheme is to carry an "RG Name" in the "AAA" record which
   would allow a nameserver to concatenate an arbitrary RG prefix to the
   ESD+STP producing the full 16 byte response.  The "RG Name" would be
   a full DNS name which could be recursively translated (and the result
   cached).  Structured as an "upward delegation" with an appropriate
   Time-to-Live, a Site could import the Routing Goop information from
   their service provider completely automatically.  This capability
   will be used to great advantage in the discussions of re-homing which
   follows. [Interactions between RG TTL and zone TTL is an issue to be
   explored more.]

   Alternately, one special case for an RG record could be a delegation
   to a Site Border Router which could supply the correct RG
   automatically, at least in single-homed cases, and possibly in
   multi-homed cases.

   The result of this structure is that individual zone entries for
   individual nodes (AAA records) do NOT change when a Site rehomes.
   The only thing which changes (logically) is the RG information which
   is composed with the node's AAA record to produce a full 16-byte

O'Dell v3.7                                                    [Page 15]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

   response.  This means the general Dynamic DNS machinery is NOT
   required to support Site re-homing.

   One implication of the special Site-local Prefix RG for intra-Site
   traffic is that Sites will have to provide at least two "faces" on
   their nameservice - one that returns Site-local as the RG for queries
   from inside the site, and another that returns full RG responses for
   requests originating outside the Site.  This can be readily
   accomplished by inspecting the source address - if the source address
   contains the Site-local Prefix as RG, then return the same.
   Otherwise, return a fully-general RG-based response (possibly based
   on egress-path selection policy).

10. Re-homing A Site

   When a Site changes its point of attachment to the Global Internet,
   it is said to "rehome". One of the significant criticisms of IPv4
   CIDR and IPv6 "Provider-based Addressing" is the requirement to
   "renumber" a Site when it rehomes.  One of the explicit goals of the
   GSE architecture is to eliminate, or at least mitigate, the impact of

   It is important to reiterate the notion that the Routing Goop of a
   GSE address is not just a Locator, but that it encodes a PATH from
   the top level of the global hierarchy down to the Site.  Changing
   that path is what makes Re-homing and Multi-homing essentially
   equivalent operations.  We proceed with the simple case first.

   When a Site wishes to rehome, it must establish a new attachment
   point to the Global Internet, and hence establish a new access path.
   Then it must start using that new path before the old path is
   removed.  The procedure is as follows:

   A Site establishes a connection with a new ISP and it becomes able to
   carry the traffic.  At that point, the Site alters the upward
   delegation of the DNS RG records.  Henceforth, all new connections
   made with the new translations will follow the new path to the Site.
   The new connection path is then made the preferred egress path and
   source addresses in packets exiting the Site immediately start being
   marked with the new return path.  The old connection should be
   maintained for some administratively determined grace period to allow
   DNS timeouts to transition new sessions to the new path and for
   long-running sessions to terminate.

   At first blush, it might appear that when the egress path for the
   Site switches over to the new path and the Site Border Router starts
   marking packets with the new RG, the return path for long-running
   sessions would automatically switch over to the new path.  Alas, this

O'Dell v3.7                                                    [Page 16]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

   is not so because a long-running session will be using destination
   address containing the old RG acquired when the session first

   Consideration was given to providing some kind of "path redirect"
   which would allow the other end to deal with "flying cutovers" of a
   running session, but the security implications of this mechanism are
   too far-reaching to consider as part of initial deployment.  If at
   some later point it becomes clear how to accomplish this safely, then
   it could be added. But the complexity, security risks, and the
   magnitude of the added value do not seem worthwhile at present
   (although the author would love to be convinced otherwise).

   Alternately, the Site could request a "Re-homing Courtesy" from their
   old ISP which would effectively make it a multi-homed Site for some
   period of time.  After multi-homing was established, the old
   connection could be taken down and the long-running sessions would
   continue to survive as long as the Site was multi-homed by way of the
   Re-homing Courtesy.

   Note that at no time did the re-homing effect anything internal to
   the Site's Private Topology.  The only change was the attachment to
   the Public Topology and the Routing Goop which records that
   attachment location.

11. Multi-homing a Site

   One of the curiosities of IPv4 is that the network does a lot more
   work for a multi-homed site but it is very hard to pin it down so
   that the instigator of the effort can compensate the workers.

   In the GSE model, Multi-homing is an explicit service which is
   performed for a Site by the agents of the Public Topology which
   provide the access for the Site.  This mechanism can be made more
   sophisticated, but the notion is most readily explained by
   considering a Site which is dual homed to two different ISPs and
   hence has two distinct access paths represented by two distinct blobs
   of Routing Goop.

   The Site is attached to each ISP via some link and we postulate some
   kind of keep-alive protocol which determines when reachability to the
   Site's border router is lost. The ISP routers serving the dual-homed
   Site are identified to each other (via static configuration
   information in the simplest case or a dynamic protocol in the more
   general case), and when a link to the Site is lost, the ISP router
   anchoring the dead link simply tunnels any traffic destined for the
   Site via the other ISP router.

O'Dell v3.7                                                    [Page 17]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

   This approach clearly requires coordination between the two serving
   ISPs. This is not a new constraint - multi-homing already requires
   considerable coordination between the Site and is providers.  Of
   course, creating a protocol for dynamically creating a "homing group"
   is probably a very worthwhile investment but it is not absolutely
   necessary at the outset.

   It should be obvious now that the "Re-homing Courtesy" in the
   previous section is simply doing the router-pair coordination with
   the new ISP for some period of time.

   [Note: Yakov and Bates are working on a draft for a Site-side
   implementation of aggregation-efficient multi-homing which may
   simplify this even further.]

12. Re-homing a Reseller

   Re-homing a Reseller is a slightly more general case of re-homing a
   Site, primarily characterized by more lead time, a longer grace
   period, and some necessary coordination with customer Sites to insure
   that the Routing Goop propagates correctly.

   The Reseller will establish a new connection which will not only
   result in a new path for the Reseller's topology, but for that of his
   customer Sites. When the Reseller alters his upward delegation of
   Routing Goop, it will ripple downward to his customer Sites by nature
   of their upward delegations.  The downward ripple of Routing Goop via
   the upward delegations should cause the Site zone TTLs to be reduced
   appropriately to insure caches expire well within the dual-homed
   transition grace period for the Reseller.

   This essentially rehomes all the Reseller's customer Sites all at the
   same time the Reseller's infrastructure is re-homing and should be
   completely transparent except for long-lived sessions which do not
   terminate by the end of the grace period.

13. Multi-homing a Reseller

   There are two parts to multi-homing a Reseller - one part similar to
   the multi-homed Site case above, and one part which is quite

   For this discussion, assume a Reseller which is dual-homed and hence
   has two different Routing Goop prefixes (remember that each path to
   the top level of the hierarchy has a distinct prefix). The reseller
   can solicit multi-homed tunneling services from his two access point
   routers to provide alternate path service just like a multi-homed
   Site.  Why traffic is coming to any particular router, though, is

O'Dell v3.7                                                    [Page 18]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

   influenced entirely by what routes are advertised out that particular
   connection via BGP5 (or IDRP).  This is rather different from the
   multi-homed Site case where the ESD is the object of interest and the
   RG simply gets the traffic to the Site boundary.

   The question arises, however, as to which prefix gets used for
   extending downward to his customer Sites.  The answer in the simplest
   case is to pick one and use it, making the Sites "natural" in the
   chosen prefix.  The alternate prefix can, of course, be advertised
   out the alternate path if desired.  But this work can be ascribed to
   the instigator and the superior attachment points can charge for this
   service.  (This is somewhat akin to charging for routes, but only
   routes which create a discontinuity in the routing space.)

15. A Comment on NAT Boxes

   In discussions about requiring destination address re-writing for
   inbound packets, Brian Carpenter remarked that with the advent of
   symmetric re-writing (both inbound and outbound), the GSE
   architecture is essentially "NAT that works."  To some, this would be
   the ultimate insult, but I think it is essentially correct.  NAT
   Boxes provide for isolating a Site from topology changes but severely
   compromise the end-to-end model.  GSE affords very similar
   operational topological isolation but without violating the end-to-
   end model, at least not nearly as much.  If a Site wishes the
   additional isolation afforded by NAT Boxes, a firewalls will
   accomplish that task.

15. General Comments

   While some of GSE is a radical departure from IPv6 as we currently
   know it, in general it relies deeply on all the IPv6 underpinnings
   which contribute so much to the attractiveness of IPv6: Neighbor
   Discover, all the dynamic configuration machinery designed to make
   renumbering palatable even using "provider-based addressing", and the
   flexibility of the "salami headers" which make tunneling and security
   attractive.  The general forwarding operations based on longest-
   match-under-prefix-mask and the policy-based routing machinery of
   BGP5/IDRP are also simply assumed.

16. Closing Comments and Acknowledgments

   This document presents a revision of the "8+8" addressing model which
   has been under construction by the author since before Fall of 1995,
   at least.  Conversations with a great many people have contributed to
   the design presented in this document.  A skeletal version of this
   proposal first appeared in some email from Dave Clark of MIT who
   planted the seed and provided the original monicker "8+8". A great

O'Dell v3.7                                                    [Page 19]
Internet-Draft                GSE for IPv6        1997/02/24 01:32:32GMT

   many others have contributed ideas and observations, all of which
   went into the stew pot for the synthesis contained here.

   The original "8+8" draft cited the following individuals for a
   special thank-you: Vadim Antonov, Ran Atkinson, Scott Bradner, Brian
   Carpenter, Noel Chiappa, Steve Deering, Sean Doran, Joel Halpern,
   Christian Huitema, Tony Li, Peter Lothberg, Louis Mamakos, Radia
   Perlman, Yakov Rekhter, Paul Traina.

   This draft has benefited greatly from conversations with Masataka
   Ohta, who convinced the author of the importance of the IETF-NodeID
   in addition to the 8-byte IEEE MAC addresses, as well as Brian
   Carpenter, Scott Brander, Ran Atkinson, all the people who so
   graciously provided invaluable comments on the original "8+8" draft,
   and of course Steve Deering, Bob Hinden, and the IPng Working Group.

17. Security Considerations

   More than can be imagined.

18. Author's Address

   Mike O'Dell
   UUNET Technologies, Inc.
   3060 Williams Drive
   Fairfax, VA 22031
   voice: 703-206-5890
   fax:   703-206-5471

O'Dell v3.7                                                    [Page 20]