Network Working Group P. Resnick
Internet-Draft Qualcomm Incorporated
Intended status: Informational P. Hoffman
Expires: October 21, 2010 VPN Consortium
April 19, 2010
Mapping Characters in IDNA2008
draft-resman-idna2008-mappings-01
Abstract
In the original version of the Internationalized Domain Names in
Applications (IDNA) protocol, any Unicode code points taken from user
input were mapped into a set of Unicode code points that "made
sense", and then encoded and passed to the domain name system (DNS).
The IDNA2008 protocol presumes that the input to the protocol comes
from a set of "permitted" code points, which it then encodes and
passes to the DNS, but does not specify what to do with the result of
user input. This document describes the actions that can be taken by
an implementation between user input and passing permitted code
points to the new IDNA protocol.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on October 21, 2010.
Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
Resnick & Hoffman Expires October 21, 2010 [Page 1]
Internet-Draft IDNA Mapping April 2010
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.
1. Introduction
This document describes the operations that can be applied to user
input in order to get it into a form acceptable to the
Internationalized Domain Names in Applications (IDNA) protocol
[IDNA2008protocol]. It includes a general implementation procedure
for mapping.
It should be noted that this document does not specify the behavior
of a protocol that appears "on the wire". It describes an operation
that is to be applied to user input in order to prepare that user
input for use in an "on the network" protocol. As unusual as this
may be for a document concerning Internet protocols, it is necessary
to describe this operation for implementors who may have designed
around the original IDNA protocol, which conflates this user input
operation into the protocol.
It is very important to note that there are many potential valid
mappings of characters from user input. The mapping described in
this document is the basis for other mappings, and is not likely to
be useful without modification. Any useful mapping will have
features designed to reduce the surprise for users and is likely to
be slightly (or sometimes radically) different depending on the
locale of the user, the type of input being used (such as typing,
copy-and-paste, voice, and so on), the type of application used, etc.
Although most common mappings will probably produce similar results
for the same input, there will be subtle differences between
applications.
1.1. The Dividing Line between User Interface and Protocol
The user interface to applications is much more complicated than most
network implementers think. When we say "the user enters an
internationalized domain name in the application", we are talking
about a very complex process that encompasses everything from the
user formulating the name and deciding which symbols to use to
express that name, to the user entering the symbols into the computer
using some input method (be it a keyboard, a stylus, or even a voice
recognition program), to the computer interpreting that input (be it
keyboard scan codes, a graphical representation, or digitized sounds)
into some representation of those symbols, through finally
normalizing those symbols into a particular character repertoire in
an encoding recognizable to IDNA processes and the domain name
system.
Consideration for user interface for internationalized domain names
involves taking into account culture, context, and locale for any
given user. A simple and well-known example is the lowercasing of
the Latin capital letter I (U+0049) when it is used in Turkish and
other languages. A capital "I" in Turkish is properly
lowercased to a lowercase dotless "i" (U+0131), not to a Latin small
letter i (U+0069). This lowercasing is clearly dependent on the
locale of the system and/or the locale of the user. Using a single
context-free mapping without considering the user interface
properties has the potential of doing exactly the wrong thing for the
user.
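As a rough illustration of the Turkish example above, the following
Python sketch contrasts a context-free lowercasing with a
locale-aware one. The function names and the override table are
hypothetical and exist only for this illustration; Python's built-in
str.lower() is locale-independent, and a real application would more
likely rely on a locale-aware library.

```python
# Hypothetical helpers for illustration; not part of any IDNA specification.

# Turkish-specific overrides applied before generic lowercasing:
#   U+0049 LATIN CAPITAL LETTER I -> U+0131 LATIN SMALL LETTER DOTLESS I
#   U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE -> U+0069 LATIN SMALL LETTER I
_TURKISH_OVERRIDES = str.maketrans({"\u0049": "\u0131", "\u0130": "\u0069"})


def lower_context_free(text: str) -> str:
    """Lowercase without regard to locale, as a single fixed mapping would."""
    return text.lower()


def lower_turkish(text: str) -> str:
    """Lowercase the way a user in a Turkish locale would likely expect."""
    return text.translate(_TURKISH_OVERRIDES).lower()


if __name__ == "__main__":
    sample = "ISI"  # sample input typed in capitals
    print(ascii(lower_context_free(sample)))  # uses U+0069 for each "I"
    print(ascii(lower_turkish(sample)))       # uses U+0131 for each "I"
```

The two functions disagree on the same input, which is the point: the
"right" answer depends on who is typing, not on the code points alone.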
The original version of IDNA conflated user interface processing and
protocol. It took whatever characters the user produced in whatever
encoding the application used, assumed some conversion to Unicode
code points, and then without regard to context, locale, or anything
about the user's intentions, mapped them into a particular set of
other characters, and then re-encoded them in Punycode, in order to have
the entire operation be contained within the protocol. Ignoring
context, locale, and user preference in the IDNA protocol made life
significantly less complicated for the application developer, but at
the expense of violating the principle of "least user surprise" for
consumers and producers of domain names.
In IDNA2008, the dividing line between "user interface" and
"protocol" is clear. The IDNA2008 specification defines the protocol
part of IDNA: it explicitly does not deal with the user interface.
Mappings such as the one described in this document explicitly deal
with the user interface and not the protocol. That is, a mapping is
only to be applied before a string of characters is treated as a
domain name (in the "user interface") and is never to be applied
during domain name processing (in the "protocol").
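The separation described above might be pictured as two functions
that never share responsibilities. The Python sketch below is only
an assumption-laden illustration: both function names are
hypothetical, the lowercase check is a crude stand-in for the much
more extensive validation that the IDNA2008 protocol performs, and
Python's standard "punycode" codec is used for the encoding step.

```python
# Hypothetical two-layer split; names are illustrative only.


def user_interface_map(raw: str) -> str:
    """User interface side: free to apply any mapping that suits the user.

    Here it merely trims whitespace and lowercases; a real mapping
    would take locale and input method into account.
    """
    return raw.strip().lower()


def protocol_encode_label(label: str) -> str:
    """Protocol side: encodes the label as given and never maps it."""
    if label != label.lower():
        # Stand-in for protocol validation: reject rather than map.
        raise ValueError("label contains characters the protocol does not accept")
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


if __name__ == "__main__":
    typed = " B\u00dcCHER "  # what the user entered
    print(protocol_encode_label(user_interface_map(typed)))
```

Passing the unmapped input directly to protocol_encode_label() raises
an error instead of silently lowercasing it, mirroring the rule that
mapping is never applied during domain name processing.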
1.2. The Design of this Mapping
The user interface mapping in this document is a set of expansions to
IDNA2008 that are meant to be sensible and friendly and mostly
obvious to people throughout the world when using typical
applications with domain names that are entered by hand. It is also
designed to let applications be mostly backwards compatible with
IDNA2003. By definition, it cannot meet all of those design goals
for all people, and in fact is known to fail on some of those goals
for quite large populations of people.
A good mapping in the real world might use the "sensible and friendly
and mostly obvious" design goal but come up with a different
algorithm. Many algorithms will have results that are close to what
is described here, but will differ in assumptions about the users'
way of thinking or typing. Having said that, it is likely that some
mappings will be significantly different. For example, a mapping
might apply to a spoken user interface instead of a typed one.
Another example is that a mapping might be different for users typing
than for users using copy-and-paste from different applications. Yet
another example is that a user interface that allows typed input that
is transliterated from Latin characters could have very different
mappings than one that applies to typing in other character sets;
this would be typical in a Pinyin input method for Chinese
characters.
2. The General Procedure
This section defines a general algorithm that applications ought to
implement in order to produce Unicode code points that will be valid
under the IDNA protocol. An application might implement the full
mapping as described below, or it can choose a different mapping. This
mapping is very general and was designed to be acceptable to the
widest possible user community, but as stated above, it does not take into
account any particular context, culture, or locale.
The general algorithm that an application (or the input method
provided by an operating system) ought to use is relatively
straightforward:
1. Upper case characters are mapped to their lower case equivalents
by using the algorithm for mapping case in Unicode characters.
This step was chosen because the output will behave more like
ASCII host names behave.
2. Full-width and half-width characters (those defined with
Decomposition Types <wide> and <narrow>) are mapped to their
decomposition mappings as shown in the Unicode character
database. This step was chosen because many input mechanisms,
particularly in Asia, do not allow you to easily enter characters
in the form used by IDNA2008. Even if they do allow the correct
character form, the user might not know which form they are
entering.
3. All characters are mapped using Unicode Normalization Form C
(NFC). This step was chosen because it maps combinations of
combining characters into canonical composed form. As with the
full-width/half-width mapping, users are not generally aware of
the particular form of characters that they are entering, and
IDNA2008 requires that only the canonical composed forms from NFC
are used.
4. [IDNA2008protocol] is specified such that the protocol acts on
the individual labels of the domain name. If an implementation
of this mapping is also performing the step of separation of the
parts of a domain name into labels by using the FULL STOP
character (U+002E), the IDEOGRAPHIC FULL STOP (U+3002) character
can be mapped to the FULL STOP before label separation occurs.
There are other characters that are used as "full stops" that one
could consider mapping as label separators, but their use as such
has not been investigated thoroughly. This step was chosen
because some input mechanisms do not allow the user to easily
enter proper label separators. Only the IDEOGRAPHIC FULL STOP
(U+3002) character is added in this mapping because the authors
have not fully investigated the applicability of other characters
and the environments where they should and should not be
considered domain name label separators.
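The four steps above can be sketched in Python roughly as follows.
This is a minimal sketch under stated assumptions, not a reference
implementation: the function names are hypothetical, str.lower() is
used as an approximation of the case mapping (it also applies
Python's own context-sensitive handling of final sigma), and the
unicodedata module reflects whatever Unicode version ships with the
interpreter rather than Unicode 5.2 specifically.

```python
import unicodedata
from typing import List


def map_case(text: str) -> str:
    """Step 1: map upper case characters to lower case."""
    return text.lower()


def map_width(text: str) -> str:
    """Step 2: replace <wide> and <narrow> characters by their
    decomposition mappings; leave every other character untouched."""
    mapped = []
    for ch in text:
        fields = unicodedata.decomposition(ch).split()
        if fields and fields[0] in ("<wide>", "<narrow>"):
            mapped.append("".join(chr(int(cp, 16)) for cp in fields[1:]))
        else:
            mapped.append(ch)
    return "".join(mapped)


def map_nfc(text: str) -> str:
    """Step 3: apply Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def split_labels(text: str) -> List[str]:
    """Step 4: treat IDEOGRAPHIC FULL STOP (U+3002) as FULL STOP (U+002E),
    then split into labels. Only relevant if this code also does the
    label separation."""
    return text.replace("\u3002", "\u002e").split("\u002e")


def map_user_input(text: str) -> List[str]:
    """Apply the four steps in the order given in the text."""
    return split_labels(map_nfc(map_width(map_case(text))))


if __name__ == "__main__":
    # Fullwidth capitals separated by an ideographic full stop.
    typed = "\uff25\uff38\uff21\uff2d\uff30\uff2c\uff25\u3002\uff23\uff2f\uff2d"
    print(map_user_input(typed))
```

The output of map_user_input() is a list of labels that would still
need to be checked by the IDNA2008 protocol itself; nothing here
validates that the resulting code points are permitted.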
Note that the steps above are ordered.
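One way to see why the order matters is to swap the width mapping
and the NFC step on halfwidth katakana input. The Python sketch below
is illustrative only; map_width() is a hypothetical helper, and the
results depend on the Unicode tables bundled with the interpreter.

```python
import unicodedata


def map_width(text: str) -> str:
    """Replace <wide> and <narrow> characters by their decomposition mappings."""
    mapped = []
    for ch in text:
        fields = unicodedata.decomposition(ch).split()
        if fields and fields[0] in ("<wide>", "<narrow>"):
            mapped.append("".join(chr(int(cp, 16)) for cp in fields[1:]))
        else:
            mapped.append(ch)
    return "".join(mapped)


# HALFWIDTH KATAKANA LETTER KA followed by HALFWIDTH KATAKANA VOICED SOUND MARK
halfwidth_ga = "\uff76\uff9e"

# Order used in the text: width mapping first, then NFC.
in_order = unicodedata.normalize("NFC", map_width(halfwidth_ga))

# Swapped order: NFC first, then width mapping.
swapped = map_width(unicodedata.normalize("NFC", halfwidth_ga))

if __name__ == "__main__":
    print(ascii(in_order))  # a single composed character
    print(ascii(swapped))   # base character plus combining mark, uncomposed
```

With the order given in the text the result is the single composed
character KATAKANA LETTER GA (U+30AC); with the steps swapped, the
combining mark is left uncomposed, which is not the canonical composed
form that the NFC step is meant to produce.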
Definitions for the rules in this algorithm can be found in
[Unicode52]. Specifically:
o Unicode Normalization Form C can be found in Annex #15 of
[Unicode52].
o In order to map upper case characters to their lower case
equivalents (defined in section 3.13 of [Unicode52]), first map
characters to the "Lowercase_Mapping" property (the "<lower>"
entry in the second column) in
<http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>, if any.
Then, map characters to the "Simple_Lowercase_Mapping" property
(the fourteenth column) in
<http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>, if any.
o In order to map full-width and half-width characters to their
decomposition mappings, map any character whose
"Decomposition_Type" (contained in the first part of the sixth
column) in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>
is either "<wide>" or "<narrow>" to the "Decomposition_Mapping" of
that character (contained in the second part of the sixth column)
in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>.
o The Unicode Character Database [TR44] has useful descriptions of
the contents of these files.
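To make the column references above concrete, the following Python
sketch pulls the relevant fields out of one line from each data
file. The two sample lines are written in the format of
UnicodeData.txt and SpecialCasing.txt and are included only for
illustration; the function names and the returned dictionary keys are
hypothetical. The sketch does not handle range entries or
conditional mappings beyond reporting the condition.

```python
from typing import Dict, Optional, Tuple

# Sample lines in the format of the two data files (illustrative).
UNICODE_DATA_SAMPLE = (
    "FF21;FULLWIDTH LATIN CAPITAL LETTER A;Lu;0;L;<wide> 0041;;;;N;;;;FF41;"
)
SPECIAL_CASING_SAMPLE = (
    "0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE"
)


def _to_text(hex_code_points: str) -> str:
    """Convert space-separated hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in hex_code_points.split())


def parse_unicode_data(line: str) -> Dict[str, Optional[str]]:
    """Extract the sixth and fourteenth columns of a UnicodeData.txt line."""
    fields = line.strip().split(";")
    decomposition = fields[5].split()          # sixth column
    decomposition_type = None
    if decomposition and decomposition[0].startswith("<"):
        decomposition_type = decomposition.pop(0)   # first part, e.g. "<wide>"
    return {
        "char": chr(int(fields[0], 16)),
        "decomposition_type": decomposition_type,
        "decomposition_mapping": _to_text(" ".join(decomposition)) or None,
        "simple_lowercase": _to_text(fields[13]) or None,  # fourteenth column
    }


def parse_special_casing(line: str) -> Tuple[str, str, str]:
    """Return (character, lowercase mapping, condition) from a
    SpecialCasing.txt line; the condition is empty if there is none."""
    data = line.split("#", 1)[0]               # drop the trailing comment
    fields = [field.strip() for field in data.split(";")]
    condition = fields[4] if len(fields) > 4 else ""
    return chr(int(fields[0], 16)), _to_text(fields[1]), condition


if __name__ == "__main__":
    print(parse_unicode_data(UNICODE_DATA_SAMPLE))
    print(parse_special_casing(SPECIAL_CASING_SAMPLE))
```

An implementation following the text would consult the
SpecialCasing.txt mapping first and fall back to the
Simple_Lowercase_Mapping from UnicodeData.txt only when no special
mapping exists for the character.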
If the mappings in this document are applied to versions of Unicode
later than Unicode 5.2, the later versions of the Unicode Standard
should be consulted.
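An implementation that relies on library-provided Unicode tables may
want to know which Unicode version those tables reflect. As a small
sketch, assuming Python's standard unicodedata module is the source
of the tables (the function names and the BASELINE constant are
hypothetical):

```python
import unicodedata
from typing import Tuple

# The Unicode version this document's definitions refer to.
BASELINE = (5, 2, 0)


def runtime_unicode_version() -> Tuple[int, ...]:
    """Return the Unicode version of the interpreter's tables as a tuple."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


def uses_later_unicode() -> bool:
    """True if the tables are newer than Unicode 5.2, in which case the
    later version of the Unicode Standard should be consulted."""
    return runtime_unicode_version() > BASELINE


if __name__ == "__main__":
    print(unicodedata.unidata_version, uses_later_unicode())
```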
These form a minimal set of mappings that an application should
strongly consider doing. Of course, there are many others that might
be done.
3. Implementing This Mapping
If you are implementing a mapping for an application or operating
system by using exactly the four steps in Section 2, the authors of
this document have a request: please don't. We mean it. Section 2
does not describe a universal mapping algorithm because, as we said,
there is no universally-applicable mapping algorithm.
If you read the material in Section 2 without reading Section 1, go
back and carefully read all of Section 1; in many ways, Section 1 is
more important than Section 2. Further, you can probably think of
user interface considerations that we did not list in Section 1. If
you did read Section 1 but somehow decided that the algorithm in
Section 2 is completely correct for the intended users of your
application or operating system, you are probably not thinking hard
enough about your intended users.
4. IANA Considerations
This memo includes no request to IANA.
5. Security Considerations
This document suggests creating mappings that might cause confusion
for some users while alleviating confusion for other users. Such
confusion is not covered in any depth in this document (nor in the
other IDNA-related documents).
6. Acknowledgements
This document is the product of many contributions from numerous
people in the IETF.
7. Normative References
[IDNA2008protocol]
Klensin, J., "Internationalized Domain Names in
Applications (IDNA): Protocol",
draft-ietf-idnabis-protocol (work in progress),
January 2010.
[TR44] The Unicode Consortium, "Unicode Character Database",
Unicode Standard Annex 44, 2009.
[Unicode52]
The Unicode Consortium, "The Unicode Standard, Version
5.2.0", 2009. Defined by: The Unicode Standard, Version 5.0,
Boston, MA, Addison-Wesley, 2007, ISBN 0-321-48091-0, as
amended by Unicode 5.2.0
(<http://www.unicode.org/versions/Unicode5.2.0/>).
Authors' Addresses
Peter W. Resnick
Qualcomm Incorporated
5775 Morehouse Drive
San Diego, CA 92121-1714
US
Phone: +1 858 651 4478
Email: presnick@qualcomm.com
URI: http://www.qualcomm.com/~presnick/
Paul Hoffman
VPN Consortium
127 Segre Place
Santa Cruz, CA 95060
US
Phone: 1-831-426-9827
Email: paul.hoffman@vpnc.org