Network Working Group J. Klensin Internet-Draft September 30, 2007 Intended status: Best Current Practice Expires: April 2, 2008 ASCII Escaping of Unicode Characters draft-klensin-unicode-escapes-04.txt Status of this Memo By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on April 2, 2008. Copyright Notice Copyright (C) The IETF Trust (2007). Abstract There are a number of circumstances in which an escape mechanism is needed in conjunction with a protocol to encode characters that cannot be represented or transmitted directly. With ASCII coding the traditional escape has been either the decimal or hexadecimal offset of the character, written in a variety of different ways. The move to Unicode, where characters occupy two or more octets and may be coded in several different forms, has further complicated the Klensin Expires April 2, 2008 [Page 1] Internet-Draft Unicode Escapes September 2007 question of escapes. This document discusses some options now in use and discusses considerations for selecting one for use in new IETF protocols and protocols that are now being internationalized. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1. Context and Background . . . . . . . . . . . . . . . . . . 3 1.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 4 1.3. Discussion List . . . . . . . . . . . . . . . . . . . . . 4 2. Encodings that Represent Unicode Code Points: Table Position versus UTF-8 or UTF-16 Octets . . . . . . . . . . . . 4 3. Referring to Unicode Characters . . . . . . . . . . . . . . . 5 4. Syntax for Code Point Escapes . . . . . . . . . . . . . . . . 5 5. Recommended Presentation Variants for Unicode Code Point Excapes . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 5.1. Backslash-U with Delimiters . . . . . . . . . . . . . . . 7 5.2. XML and HTML . . . . . . . . . . . . . . . . . . . . . . . 7 6. Forms that are Normally Not Recommended . . . . . . . . . . . 7 6.1. The C Programming Language: Backslash-U . . . . . . . . . 7 6.2. Perl: A Hexadecimal String . . . . . . . . . . . . . . . . 8 6.3. Java: Escaped UTF-16 . . . . . . . . . . . . . . . . . . . 9 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9 8. Security Considerations . . . . . . . . . . . . . . . . . . . 9 9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 9 10. Change log . . . . . . . . . . . . . . . . . . . . . . . . . . 10 10.1. Changes in -01 . . . . . . . . . . . . . . . . . . . . . . 10 10.2. Major Changes in -02 . . . . . . . . . . . . . . . . . . . 10 10.3. Major Changes in -03 . . . . . . . . . . . . . . . . . . . 10 10.4. Changes in -04 . . . . . . . . . . . . . . . . . . . . . . 10 11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 11 11.1. Normative References . . . . . . . . . . . . . . . . . . . 11 11.2. Informative References . . . . . . . . . . . . . . . . . . 11 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 12 Intellectual Property and Copyright Statements . . . . . . . . . . 13 Klensin Expires April 2, 2008 [Page 2] Internet-Draft Unicode Escapes September 2007 1. Introduction 1.1. Context and Background There are a number of circumstances in which an escape mechanism is needed in conjunction with a protocol to encode characters that cannot be represented or transmitted directly. With ASCII [ASCII] coding the traditional escape has been either the decimal or hexadecimal offset of the character, written in a variety of different ways. For example, in different contexts, we have seen %dNN or %NN for the decimal form, %NN, %xNN, X'nn', and %X'NN' for the hexadecimal form. "%NN" has become popular in recent years to represent a hexadecimal value without further qualification, perhaps as a consequence of its use in URLs and their prevalence. There are even some applications around in which octal forms are used and, while they do not generalize well, the MIME Quoted-Printable and Encoded-word forms can be thought of as yet another set of escapes. So, even for the fairly simple cases of ASCII and standard built by extending ASCII, such as the ISO 8859 family, we have been living with several different escaping forms, each the result of some history. When one moves to Unicode [Unicode] [ISO10646], where characters occupy two or more octets and may be coded in several different forms, the question of escapes becomes even more complicated. In particular, we have seen fairly extensive use of both hexadecimal representations of the UTF-8 encoding [RFC3629] of a character and variations on the U+NNNN[N[N]] notation (i.e., "U+" and four to six hexadecimal digits) commonly used in conjunction with the Unicode Standard. In accordance with existing best-practices recommendations [RFC2277], new protocols that are required to carry textual content for human use SHOULD be designed in such a way that the full repertoire of Unicode characters may be represented in that text. This document proposes that existing protocols being internationalized, and that need an escape mechanism, SHOULD use some contextually-appropriate variation on references to code points as described in Section 2 unless other considerations outweigh those described here. This recommendation is not applicable to protocols that already accept native UTF-8 or some other encoding of Unicode. In general, when protocols are internationalized, it is preferable to accept those forms rather than using escapes. This recommendation applies to cases, including transition arrangements, in which that is not practical. Klensin Expires April 2, 2008 [Page 3] Internet-Draft Unicode Escapes September 2007 In addition to the protocol contexts addressed in this specification, escapes to represent Unicode characters also appear in presentations to users, i.e., in user interfaces (UI). The formats specified in, and the reasoning of, this document may be applicable in UI contexts as well, but this is not a proposal to standardize UI or presentation forms. This document does not make general recommendations for processing Unicode strings or for their contents. It assumes that the strings that one might want to escape are valid and reasonable and that the definition of "valid and reasonable" is the province of other documents. Recommendations about general treatment of Unicode strings may be found in many places, including the Unicode Standard itself and the W3C Character Model [W3C-CharMod] as well as specific rules in individual protocols. 1.2. Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. 1.3. Discussion List Discussion of this document should be addressed to the discuss@apps.ietf.org mailing list. 2. Encodings that Represent Unicode Code Points: Table Position versus UTF-8 or UTF-16 Octets There are many different ways to designate, encode, or call out a Unicode character. Given adequate decoding facilities, all of these other than the formal character name are equivalent. However, when information about characters is to be processed by people, information about the Unicode code point is preferable to a further encoding of the encoded form of the character. It is also desirable to use hexadecimal references to code points because the Unicode Standard is organized on a hexadecimal basis. There are two major families of ways to represent Unicode characters. One uses the code point position in the table in some representation (see the next section), the other encodes the octets of the UTF-8 encoding or some other short-form encoding. Some other options are possible, but they have been rare in practice. This specification recommends that, in the absence of compelling reasons to do otherwise, the Unicode code point forms SHOULD be used rather than the UTF-8 (or UTF-16) ones. There are several reasons for this, Klensin Expires April 2, 2008 [Page 4] Internet-Draft Unicode Escapes September 2007 including: o One reason for the success of many IETF protocols is that they use human-interpretable text forms to communicate, rather than encodings that generally require computer programs (or hand simulation of algorithms) to decode. This suggests that the presentation form should reference the Unicode tables for characters and to do so as simply as possible. o The nature of UTF-8 implies that a decimal or hexadecimal numeral representation of UTF-8 requires conversion to the UTF-8 form, then conversion from the UTF-8 form to a Unicode character position form in order to look the character up in a table. That may be appropriate in some cases where the goal is really to represent the UTF-8 form but, in general, it just obscures desired information and makes errors more likely and debugging harder. o Except for characters in the ASCII subset of Unicode (U+0000 through U+007F), the character code position form is generally more compact than forms based on coding UTF-8 octets, sometimes much more compact. The same considerations that apply to encoding of UTF-8 octets also apply to more compact ACE encodings such as the "bootstring" encoding [RFC3492] with or without its "Punycode" profile. Similar considerations apply to UTF-16 encoding, such as the \uNNNN form used in Java (See Section 6.3). While those forms are equivalent to code point references for the Basic Multilingual Plane (BMP, Plane 0), a two-stage decoding process is needed to handle surrogates to access higher planes. 3. Referring to Unicode Characters Regardless of what decisions are made about escapes for Unicode characters in protocol or similar contexts, text referring to a Unicode code point SHOULD use the U+NNNN[N[N]] syntax, as specified in the Unicode Standard, where the NNNN... string consists of hexadecimal numbers. Text actually containing a Unicode character SHOULD use a syntax more suitable for automated processing. 4. Syntax for Code Point Escapes There are many options for code point escapes, some of which are summarized below. All are equivalent in content and semantics -- the differences lie in syntax. The best choice of syntax for a Klensin Expires April 2, 2008 [Page 5] Internet-Draft Unicode Escapes September 2007 particular protocol or other application depends on that application: one form may simply "fit" better in a given context than others. It is clear, however, that hexadecimal values are preferable to other alternatives: Systems based on decimal or octal offsets SHOULD NOT be used. Since this specification does not recommend one specific syntax, protocols specifications that use escapes MUST define the syntax they are using, including any necessary escapes to permit the escape sequence to be used literally. The application designer selecting a format should consider at least the following factors: o If similar or related protocols already use one form, it may be best to select that form for consistency and predictability. o A Unicode code point can fall in the range from U+0000 to U+10FFFF. Different escape systems may use four, five, six, or eight hexadecimal digits. To avoid clever syntax tricks and the consequent risk of confusion and errors, forms that use explicit string delimiters are generally preferred over other alternatives. In many contexts, symmetric paired delimiters are easier to recognize and understand than visually-unrelated ones. o Syntax forms starting in "\u", without explicit delimiters, have been used in several different escape systems, including the four or eight digit syntax of C Section 6.1, the UTF-16 encoding of Java Section 6.3, and some arrangements that may follow the "\u" with four, five, or six digits. The possible confusion about which option is actually being used may argue against use of any of these forms. o Forms that require decoding surrogate pairs share most of the problems that appear with encoding of UTF-8 octets. Internet protocols SHOULD NOT use surrogate pairs. 5. Recommended Presentation Variants for Unicode Code Point Excapes There are a number of different ways to represent a Unicode code point position. No one of them appears to be "best" for all contexts. In addition, when an escape is needed for the escape mechanism itself, the optimal one of those might differ from one context to another. Some forms that are in popular use and that might reasonably be considered for use in a given protocol are described below and Klensin Expires April 2, 2008 [Page 6] Internet-Draft Unicode Escapes September 2007 identified with a current-use context when feasible. The two in this section are recommended for use in Internet Protocols. Other popular ones appear in Section 6 with some discussion of their disadvantages. 5.1. Backslash-U with Delimiters One of the recommended forms is a variation of the many forms that start in "\u" (See Section 6.1, below>), but uses explicit delimiters for the reasons discussed elsewhere. Specifically, in ABNF [RFC4234], EmbeddedUnicodeChar = %x5C.75.27 4*6HEXDIG ; starting with lower case "\u and "'" ; Note that the encodings are considered to be abstractions ; for the relevant characters, not designations of ; specific octets. 5.2. XML and HTML The other recommended form is the one used in XML. It uses the form &#xNNNN;. Like the Perl form (Section 6.2), this form has a clear ending delimiter, reducing ambiguity. HTML uses a similar form, but the semicolon may be omitted in some cases. If that is done, the advantages of the the delimiter disappear so the HTML form without the semicolon SHOULD NOT be used. However, this format is often considered ugly and awkward outside of its native HTML, XML, and similar contexts. In ABNF: EmbeddedUnicodeChar = %x26.23.78 2*6HEXDIG ";" ; starts with "&#x" Note that a literal "&" can be expressed by "&" when using this style. 6. Forms that are Normally Not Recommended 6.1. The C Programming Language: Backslash-U The forms \UNNNNNNNN (for any Unicode character) and \uNNNN (for Unicode characters in plane 0) Klensin Expires April 2, 2008 [Page 7] Internet-Draft Unicode Escapes September 2007 are utilized in the C Programming Language [ISO-C] when an ASCII escape for embedded Unicode characters is needed. Specifically, in ABNF [RFC4234], EmbeddedUnicodeChar = BMP-form / Full-form BMP-form = %x5C.75 4HEXDIG ; starting with lower case "\u" ; In both this case and the one above, note that the ; encodings are considered to be abstractions for the ; relevant characters, not designations of specific octets. Full-form = %x5C.55 8HEXDIG ; starting with upper case "\U" HEXDIG = "0" / "1" / "2"/ "3"/ "4"/ "5"/ "6"/ "7"/ "8"/ "9"/ "A"/ "B" / "C"/ "D"/ "E"/ "F" ; effectively identical with definition in RFC 4234. There are disadvantages of this form which may be significant. First, the use of a case variation (between "u" for the four digit form and "U" for the eight digit form) may not seem natural in environments in which upper and lower case characters are generally considered equivalent and might be confusing to people who are not very familiar with Latin-based alphabets (although those people might have even more trouble reading relevant English text and explanations). Second, as discussed in Section 4 the very fact that there are several different conventions that start in \u or \U may become a source of confusion as people make incorrect assumptions about what they are looking at. 6.2. Perl: A Hexadecimal String Perl uses the form \x{NNNN...}. The advantage of this form is that there are explicit delimiters, resolving the issue of having variable-length strings or using the case-change mechanism of the proposed form to distinguish between Plane 0 and more general forms. Some other programming languages would tend to favor X'NNNN...' forms for hexadecimal strings and perhaps U'NNNN...' for Unicode-specific strings, but those forms do not seem to be in use around the IETF. In ABNF: EmbeddedUnicodeChar = %x5C.78 "{" 2*6HEXDIG "}" ; starts with "\x" Note that there is a possible ambiguity in how two-character or low- numbered sequences in this notation are understood, i.e., that octets in the range \x(00) through \x(FF) may be construed as being in the local character set, not as Unicode code points. Because of this Klensin Expires April 2, 2008 [Page 8] Internet-Draft Unicode Escapes September 2007 apparent ambiguity, and because IETF documents do not contain provision for pragmas (see [PERLIntro] for more information about the "encoding" pragma in Perl and other details) the Perl forms should be used with extreme caution if at all. 6.3. Java: Escaped UTF-16 Java [Java] uses the form \uNNNN, but as a reference to UTF-16 values, not Unicode code points. While it uses a syntax similar to that described in Section 6.1, this relationship to UTF-16 makes it, in many respects, more similar to the encodings of UTF-8 discussed above than to an escape that designates Unicode code points. Note that the UTF-16 form, and hence the Java escape notation, can represent characters outside Plane 0 (i.e., above U+FFFF) only by the use of surrogate pairs, raising some of the same issues as the use of UTF-8 octets discussed above. For characters in Plane 0, the Java form is indistinguishable from the Plane 0-only form described in Section 6.1. If only for that reason, it SHOULD NOT be used as an escape except in those Java contexts in which it is natural. 7. IANA Considerations This document specifies no actions for IANA". 8. Security Considerations This document proposes a set of rules for encoding Unicode characters when other considerations do not apply. Since all of the recommended encodings are unambiguous and normalization issues are not involved, it should not introduce any security issues that are not present as a result of simple use of non-ASCII characters, no matter how they are encoded. The mechanisms suggested should slightly lower the risks of confusing users with encoded characters by making the identity of the characters being used somewhat more obvious than some of the alternatives. An escape mechanism such as the one specified in this document can allow characters to be represented in more than one way. Where software interprets the escaped form, there is a risk that security checks, and any necessary checks for, e.g., minimal or normalized forms, are done at the wrong point. 9. Acknowledgments This document was produced in response to a series of discussions Klensin Expires April 2, 2008 [Page 9] Internet-Draft Unicode Escapes September 2007 within the IETF Applications Area and as part of work on email internationalization and internationalized domain name updates. It is a synthesis of a large number of discussions, the comments of the participants in which are gratefully acknowledged. The help of Mark Davis in constructing a list of alternative presentations and selecting among them was especially important. Tim Bray, Stephane Bortzmeyer, Chris Newman, Frank Ellermann, Clive D.W. Feather, Philip Guenther, Bjoern Hoehrmann, Simon Josefsson, Bill McQuillan, der Mouse, Phil Pennock, and Julian Reschke provided careful reading and some corrections and suggestions on early drafts. Taken together, their suggestions motivated the significant revision of this document and its recommendations between version -00 and version -01 and further improvements in the subsequent versions. 10. Change log [[anchor10: RFC Editor: Please remove this section before publication.]] 10.1. Changes in -01 o Corrected ABNF syntax for Hex-quad and Full-form. 10.2. Major Changes in -02 This version removes the recommendation of a particular format, discussing several of them and indicating considerations in making a choice. 10.3. Major Changes in -03 This version improves the ABNF and adds it for more of the escape techniques. It also contains several editorial and contextual changes. 10.4. Changes in -04 o Updated this section to reflect the changes in -02 and -03. o Modified the structure of the document to explicitly recommend the "\u'[N[N]]NNNN'" and XML forms (still trying to make a recommendation, not just a list). o Clarified the description of the Perl form, added a reference, and warned about the ambiguity with single octets. Klensin Expires April 2, 2008 [Page 10] Internet-Draft Unicode Escapes September 2007 o Some additional editorial changes for clarity. 11. References 11.1. Normative References [ISO10646] International Organization for Standardization, "Information Technology - Universal Multiple- Octet Coded Character Set (UCS)"", ISO/IEC 10646:2003, December 2003. [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD 63, RFC 3629, November 2003. [RFC4234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", RFC 4234, October 2005. [Unicode] The Unicode Consortium, "The Unicode Standard, Version 5.0", 2006. (Addison-Wesley, 2006. ISBN 0-321-48091-0). 11.2. Informative References [ASCII] American National Standards Institute (formerly United States of America Standards Institute), "USA Code for Information Interchange", ANSI X3.4-1968, 1968. ANSI X3.4-1968 has been replaced by newer versions with slight modifications, but the 1968 version remains definitive for the Internet. [ISO-C] International Organization for Standardization, "Information technology -- Programming languages -- C", ISO/IEC 9899:1999, 1999. [Java] Sun Microsystems, Inc., "Java Language Specification, Third Edition", 2005, . [PERLIntro] Hietaniemi, J., "perluniintro", Perl documentation 5.8.8, 2002, . Klensin Expires April 2, 2008 [Page 11] Internet-Draft Unicode Escapes September 2007 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and Languages", BCP 18, RFC 2277, January 1998. [RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)", RFC 3492, March 2003. [W3C-CharMod] Duerst, M., "Character Model for the World Wide Web 1.0", W3C Recommendation, February 2005, . Author's Address John C Klensin 1770 Massachusetts Ave, #322 Cambridge, MA 02140 USA Phone: +1 617 245 1457 Email: john-ietf@jck.com Klensin Expires April 2, 2008 [Page 12] Internet-Draft Unicode Escapes September 2007 Full Copyright Statement Copyright (C) The IETF Trust (2007). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights. This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Intellectual Property The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79. Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr. The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org. Acknowledgment Funding for the RFC Editor function is provided by the IETF Administrative Support Activity (IASA). Klensin Expires April 2, 2008 [Page 13]