Network Working Group                                         J. Klensin
Internet-Draft                                          January 18, 2007
Expires: July 22, 2007


                  ASCII Escaping of Unicode Characters
                 draft-klensin-unicode-escapes-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on July 22, 2007.

Copyright Notice

   Copyright (C) The IETF Trust (2007).

Abstract

   There are a number of circumstances in which an escape mechanism is
   needed in conjunction with a protocol to encode characters that
   cannot be represented or transmitted directly.  With ASCII coding the
   traditional escape has been either the decimal or hexadecimal offset
   of the character, written in a variety of different ways.  The move
   to Unicode, where characters occupy two or more octets and may be
   coded in several different forms, has further complicated the
   question of escapes.  This document discusses some of the options now
   in use and makes a proposal for general use in IETF protocols.


Klensin                   Expires July 22, 2007                 [Page 1]

Internet-Draft               Unicode Escapes                January 2007


Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . . . 3
     1.1.  Context and Background  . . . . . . . . . . . . . . . . . . 3
     1.2.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . 3
     1.3.  Discussion List . . . . . . . . . . . . . . . . . . . . . . 3
   2.  Proposal for a Standard Form  . . . . . . . . . . . . . . . . . 4
   3.  Rationale and Other Alternatives  . . . . . . . . . . . . . . . 4
     3.1.  Unicode Table Position versus UTF-8 Octets  . . . . . . . . 4
     3.2.  Presentation Variants for Unicode Table Position  . . . . . 5
   4.  Security Considerations . . . . . . . . . . . . . . . . . . . . 6
   5.  Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . 6
   6.  References  . . . . . . . . . . . . . . . . . . . . . . . . . . 6
     6.1.  Normative References  . . . . . . . . . . . . . . . . . . . 6
     6.2.  Informative References  . . . . . . . . . . . . . . . . . . 7
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . . . 7
   Intellectual Property and Copyright Statements  . . . . . . . . . . 8


1.  Introduction

1.1.  Context and Background

   There are a number of circumstances in which an escape mechanism is
   needed in conjunction with a protocol to encode characters that
   cannot be represented or transmitted directly.  With ASCII [ASCII]
   coding the traditional escape has been either the decimal or
   hexadecimal offset of the character, written in a variety of
   different ways.  For example, in different contexts, we have seen
   %dNN or %NN for the decimal form and %NN, %xNN, X'NN', and %X'NN' for
   the hexadecimal form.  "%NN" has become popular in recent years to
   represent a hexadecimal value without further qualification, perhaps
   as a consequence of its use in URLs and their prevalence.  There are
   even some applications around in which octal forms are used and,
   while they do not generalize well, the MIME Quoted-Printable and
   Encoded-word forms can be thought of as yet another set of escapes.
   So, even for the fairly simple cases of ASCII and extended ASCII, we
   have been living with several different escaping forms, each the
   result of some history.

   When one moves to Unicode [Unicode] [ISO10646], where characters
   occupy two or more octets and may be coded in several different
   forms, the question of escapes becomes even more complicated.  In
   particular, we have seen fairly extensive use of both hexadecimal
   representations of the UTF-8 encoding [RFC3629] of a character and
   variations on the U+NNNN[N[N]] notation commonly used in conjunction
   with the Unicode Standard.  This document proposes that a specific
   variation on the latter SHOULD be used in protocols unless other
   considerations apply and explains that choice.

   In addition to the protocol contexts addressed in this specification,
   escapes to represent Unicode characters also appear in presentations
   to users, i.e., in user interfaces (UI).  The formats specified in,
   and the reasoning of, this document may be applicable in UI contexts
   as well, but this is not a proposal to standardize UI or presentation
   forms.

1.2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

1.3.  Discussion List

   Discussion of this document should be addressed to the
   discuss@apps.ietf.org mailing list.


2.  Proposal for a Standard Form

   For the reasons discussed in the next section, the forms
   \UNNNNNNNN (for any Unicode character) and
   \uNNNN (for Unicode characters in plane 0)
   are generally preferred for use when an ASCII escape for embedded
   Unicode characters is needed in protocols.  Specifically, in ABNF
   [RFC4234],
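   ; ABNF literal strings are case-insensitive, so the "\u" and "\U"
   ; prefixes are written as octet values to keep them distinct.
   EmbeddedUnicodeChar =  BMP-form / Full-form
   Hex-quad =  4*4HexDigit
   BMP-form =  %x5C.75 Hex-quad       ; "\u", lower-case "u" only
   Full-form =  %x5C.55 2*2Hex-quad   ; "\U", upper-case "U" only
   HexDigit =  "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" /
      "9" / "A" / "B" / "C" / "D" / "E" / "F"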

   This form SHOULD be used in IETF protocols that require Unicode
   character escaping unless there are substantial reasons for using
   something else.  For the convenience of the reader, it is generally
   preferred for documentation in IETF-related running text as well
   (e.g., in RFCs) although the U+NNNN form MAY be used when Unicode
   character encoding is clearly expected.
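
   As an illustration only, the following Python sketch shows one way
   the two forms might be generated and parsed.  The function names
   (escape_non_ascii, unescape) are hypothetical and not part of this
   proposal.  The sketch assumes that only non-ASCII characters are
   escaped, accepts hexadecimal digits in either case, and leaves the
   treatment of a literal backslash in the input to the protocol that
   embeds the escapes.

```python
import re

# Hypothetical helpers; the draft defines only the syntax of the escapes.
# "\u" takes exactly four hex digits, "\U" exactly eight. The prefix
# letter is matched case-sensitively, the hex digits in either case.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def escape_non_ascii(text):
    # Replace each non-ASCII character with its code point escape.
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)               # ASCII passes through unchanged
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)   # Plane 0: four hex digits
        else:
            out.append("\\U%08X" % cp)   # other planes: eight hex digits
    return "".join(out)


def unescape(text):
    # Expand both escape forms back into characters.
    def expand(match):
        cp = int(match.group(1) or match.group(2), 16)
        # Assumption of this sketch: surrogate values and values above
        # U+10FFFF are rejected rather than passed through.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a Unicode scalar value: %X" % cp)
        return chr(cp)
    return _ESCAPE.sub(expand, text)
```

   For example, escape_non_ascii("caf\u00e9") yields the ASCII string
   caf\u00E9, and unescape applied to that result returns the original
   four-character string.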


3.  Rationale and Other Alternatives

   There are many different ways to designate, encode, or call out a
   Unicode character.  Given adequate decoding facilities, all of these
   other than the formal character name are equivalent.  However, when
   information about characters is to be processed by people,
   information about the Unicode code point is preferable to a further
   encoding of the encoded form of the character and it is desirable to
   reduce confusion by designating one form as preferable.  These issues
   are discussed in the following subsections.

3.1.  Unicode Table Position versus UTF-8 Octets

   There are two major families of ways to represent Unicode characters.
   One uses the code point position in the table in some representation
   (see the next section), the other encodes the octets of the UTF-8
   encoding.  Some other options are possible, but they have been rare
   in practice.  This specification recommends that, in the absence of
   compelling reasons to do otherwise, the Unicode code point forms be
   used rather than the UTF-8 ones.  There are several reasons for this,
   including:
   o  One reason for the success of many IETF protocols is that they use
      human-interpretable text forms to communicate, rather than
      encodings that generally require computer programs (or hand
      simulation of algorithms) to decode.  This suggests that the
      presentation form should reference the Unicode tables for
      characters and do so as simply as possible.
   o  The nature of UTF-8 implies that a decimal or hexadecimal numeral
      representation of UTF-8 octets must first be converted back to
      the UTF-8 form and then from the UTF-8 form to a Unicode code
      point before the character can be looked up in a table.  That
      may be appropriate in some cases where the goal is really to
      represent the UTF-8 form but, in general, it just obscures desired
      information and makes errors more likely and debugging harder.
   o  Except for characters in the ASCII subset of Unicode (U+0000
      through U+007F), the character code position form is generally
      more compact than forms based on coding UTF-8 octets, sometimes
      much more compact.

   The same considerations that apply to encoding of UTF-8 octets also
   apply to more compact ACE encodings such as the "bootstring" encoding
   [RFC3492] with or without its "Punycode" profile.
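
   The trade-offs listed above can be made concrete with a small Python
   sketch.  It is illustrative only: the function names are
   hypothetical, the UTF-8 octet form is assumed to be the URL-style
   "%NN" per octet, and the hand decoder omits checks (such as
   rejecting overlong sequences) that a real decoder would need.

```python
def code_point_escape(ch):
    # Code point form: \uNNNN in Plane 0, \UNNNNNNNN elsewhere.
    cp = ord(ch)
    return "\\u%04X" % cp if cp <= 0xFFFF else "\\U%08X" % cp


def utf8_octet_escape(ch):
    # Octet form (assumed style): one %NN per octet of the UTF-8 encoding.
    return "".join("%%%02X" % b for b in ch.encode("utf-8"))


def utf8_octets_to_code_point(octets):
    # The extra step a reader must perform before a table lookup:
    # reassemble the code point from the payload bits of each octet.
    first = octets[0]
    if first < 0x80:
        cp, trailing = first, 0
    elif first >> 5 == 0b110:
        cp, trailing = first & 0x1F, 1
    elif first >> 4 == 0b1110:
        cp, trailing = first & 0x0F, 2
    elif first >> 3 == 0b11110:
        cp, trailing = first & 0x07, 3
    else:
        raise ValueError("invalid UTF-8 lead octet")
    if len(octets) != trailing + 1:
        raise ValueError("wrong number of octets for this lead octet")
    for b in octets[1:]:
        if b >> 6 != 0b10:
            raise ValueError("invalid UTF-8 continuation octet")
        cp = (cp << 6) | (b & 0x3F)   # append six payload bits
    return cp


for ch in ("\u00e9", "\u20ac", "\U0001F600"):
    print(code_point_escape(ch), utf8_octet_escape(ch))
```

   For U+20AC the code point form is \u20AC (6 characters) while the
   octet form is %E2%82%AC (9 characters), and recovering 20AC from
   E2 82 AC takes the bit manipulation shown in
   utf8_octets_to_code_point.  For U+00E9 the two forms happen to be
   the same length (6 characters each).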

3.2.  Presentation Variants for Unicode Table Position

   There are a number of different ways to represent a Unicode code
   point position.  The forms suggested here -- "\U" followed by eight
   hexadecimal digits for general use and, optionally, "\u" followed by
   four hexadecimal digits for references to Unicode Plane 0 (the "BMP")
   -- were chosen because of their use in several programming languages,
   notably the "new character" extensions to ISO Standard C
   [ISO-C-Chars].

   Other forms that were considered, and that may sometimes be
   encountered and justified, include:
   o  Perl uses the form \x{NNN...}.  The advantage of this form is that
      there are explicit delimiters, resolving the issue of having
      variable-length strings or using the case-change mechanism of the
      proposed form to distinguish between Plane 0 and more general
      forms.  Some other programming languages would tend to favor
      X'NNN...' forms for hexadecimal strings and perhaps U'NNNN...' for
      Unicode-specific strings, but those forms do not seem to be in use
      around the IETF.
   o  Java uses the form \uNNNN, but can represent characters outside
      Plane 0 (i.e., above U+FFFF) only by the use of surrogate pairs.
      Decoding (or de-mapping) surrogates raises some of the same issues
      as the use of UTF-8 octets discussed above.  Codings that depend
      on surrogates SHOULD NOT be used.  For characters in Plane 0, the
      Java form is identical to the Plane 0-only form recommended
      above.
   o  HTML and XML use the form &#xNNNN;.  Like the Perl form, this form
      has a clear terminator, reducing ambiguity.  However, it is
      generally considered ugly and awkward outside of its native HTML,
      XML, and similar contexts.
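
   The extra decoding step that surrogate-based codings impose can be
   sketched in Python as follows.  The function names are hypothetical;
   the arithmetic is the standard UTF-16 surrogate mapping, shown here
   only to illustrate why a reader cannot look such an escape up in the
   Unicode tables directly.

```python
def to_surrogate_pair(cp):
    # Split a code point above U+FFFF into a high and a low surrogate.
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside Plane 0 have pairs")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogate_pair(high, low):
    # Recombine a surrogate pair into the code point it stands for.
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = 0x1F600
high, low = to_surrogate_pair(cp)
surrogate_form = "\\u%04X\\u%04X" % (high, low)   # Java-style escape
full_form = "\\U%08X" % cp                         # form proposed here
print(surrogate_form, full_form)
```

   For U+1F600 the surrogate-based escape is \uD83D\uDE00, from which
   the code point is recovered only by the arithmetic in
   from_surrogate_pair, whereas the proposed form \U0001F600 carries
   the code point itself.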

   There is one significant disadvantage of the recommended form.  The
   use of a case variation (between "u" for the four-digit form and "U"
   for the eight-digit form) may not seem natural in environments in
   which upper and lower case characters are generally considered
   equivalent and might be confusing to people who are not very familiar
   with Latin-based alphabets (although those people might have even
   more trouble reading relevant English text and explanations).  There
   appears to be consensus that existing standards and wide current use
   outweigh that objection.


4.  Security Considerations

   This document proposes a specific mechanism for encoding Unicode
   characters when other considerations do not apply.  Since the
   encoding is unambiguous and normalization issues are not involved, it
   should not introduce any security issues that are not present as a
   result of simple use of non-ASCII characters, no matter how they are
   encoded.  The mechanism suggested should slightly lower the risks of
   confusing users with encoded characters by making the identity of the
   characters being used somewhat more obvious than some of the
   alternatives.


5.  Acknowledgments

   This document was produced in response to a series of discussions
   within the IETF Applications Area and as part of work on email
   internationalization and internationalized domain name updates.  It
   is a synthesis of a large number of discussions; the comments of
   the participants in those discussions are gratefully acknowledged.
   The help of Mark Davis in constructing a list of alternative
   presentations and selecting among them was especially important.


6.  References

6.1.  Normative References

   [ISO10646]
              International Organization for Standardization,
              "Information Technology - Universal Multiple-Octet Coded
              Character Set (UCS)", ISO/IEC 10646:2003, December 2003.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.


   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [RFC4234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", RFC 4234, October 2005.

   [Unicode]  The Unicode Consortium, "The Unicode Standard, Version
              5.0", 2006.

              (Addison-Wesley, 2006.  ISBN 0-321-48091-0).

6.2.  Informative References

   [ASCII]    American National Standards Institute (formerly United
              States of America Standards Institute), "USA Code for
              Information Interchange", ANSI X3.4-1968, 1968.

              ANSI X3.4-1968 has been replaced by newer versions with
              slight modifications, but the 1968 version remains
              definitive for the Internet.

   [ISO-C-Chars]
              International Organization for Standardization,
              "Information technology -- Programming languages, their
              environments and system software interfaces -- Extensions
              for the programming language C to support new character
              data types", ISO/IEC TR 19769:2004, July 2004.

   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
              for Internationalized Domain Names in Applications
              (IDNA)", RFC 3492, March 2003.


Author's Address

   John C Klensin
   1770 Massachusetts Ave, #322
   Cambridge, MA  02140
   USA

   Phone: +1 617 245 1457
   Email: john-ietf@jck.com


Full Copyright Statement

   Copyright (C) The IETF Trust (2007).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.


Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).

