<?xml version='1.0' encoding='utf-8'?>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" version="3" ipr="trust200902" docName="draft-bray-unichars-15" number="9839" updates="" obsoletes="" xml:lang="en" category="std" consensus="true" submissionType="IETF" tocInclude="true" sortRefs="true" symRefs="true" prepTime="2025-08-22T17:06:32" indexInclude="true" scripts="Common,Latin" tocDepth="3">
  <link href="https://datatracker.ietf.org/doc/draft-bray-unichars-15" rel="prev"/>
  <link href="https://dx.doi.org/10.17487/rfc9839" rel="alternate"/>
  <link href="urn:issn:2070-1721" rel="alternate"/>
  <front>
    <title abbrev="Unicode Subsets">Unicode Character Repertoire Subsets</title>
    <seriesInfo name="RFC" value="9839" stream="IETF"/>
    <author initials="T." surname="Bray" fullname="Tim Bray">
      <organization showOnFrontPage="true">Textuality Services</organization>
      <address>
        <email>tbray@textuality.com</email>
      </address>
    </author>
    <author initials="P." surname="Hoffman" fullname="Paul Hoffman">
      <organization showOnFrontPage="true">ICANN</organization>
      <address>
        <email>paul.hoffman@icann.org</email>
      </address>
    </author>
    <date month="08" year="2025"/>
    <area>ART</area>
    <abstract pn="section-abstract">
      <t indent="0" pn="section-abstract-1">This document discusses subsets of the Unicode character repertoire for use in protocols and data formats and specifies three subsets recommended for use in IETF specifications.</t>
    </abstract>
    <boilerplate>
      <section anchor="status-of-memo" numbered="false" removeInRFC="false" toc="exclude" pn="section-boilerplate.1">
        <name slugifiedName="name-status-of-this-memo">Status of This Memo</name>
        <t indent="0" pn="section-boilerplate.1-1">
            This is an Internet Standards Track document.
        </t>
        <t indent="0" pn="section-boilerplate.1-2">
            This document is a product of the Internet Engineering Task Force
            (IETF).  It represents the consensus of the IETF community.  It has
            received public review and has been approved for publication by
            the Internet Engineering Steering Group (IESG).  Further
            information on Internet Standards is available in Section 2 of 
            RFC 7841.
        </t>
        <t indent="0" pn="section-boilerplate.1-3">
            Information about the current status of this document, any
            errata, and how to provide feedback on it may be obtained at
            <eref target="https://www.rfc-editor.org/info/rfc9839" brackets="none"/>.
        </t>
      </section>
      <section anchor="copyright" numbered="false" removeInRFC="false" toc="exclude" pn="section-boilerplate.2">
        <name slugifiedName="name-copyright-notice">Copyright Notice</name>
        <t indent="0" pn="section-boilerplate.2-1">
            Copyright (c) 2025 IETF Trust and the persons identified as the
            document authors. All rights reserved.
        </t>
        <t indent="0" pn="section-boilerplate.2-2">
            This document is subject to BCP 78 and the IETF Trust's Legal
            Provisions Relating to IETF Documents
            (<eref target="https://trustee.ietf.org/license-info" brackets="none"/>) in effect on the date of
            publication of this document. Please review these documents
            carefully, as they describe your rights and restrictions with
            respect to this document. Code Components extracted from this
            document must include Revised BSD License text as described in
            Section 4.e of the Trust Legal Provisions and are provided without
            warranty as described in the Revised BSD License.
        </t>
      </section>
    </boilerplate>
    <toc>
      <section anchor="toc" numbered="false" removeInRFC="false" toc="exclude" pn="section-toc.1">
        <name slugifiedName="name-table-of-contents">Table of Contents</name>
        <ul bare="true" empty="true" indent="2" spacing="compact" pn="section-toc.1-1">
          <li pn="section-toc.1-1.1">
            <t indent="0" keepWithNext="true" pn="section-toc.1-1.1.1"><xref derivedContent="1" format="counter" sectionFormat="of" target="section-1"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-introduction">Introduction</xref></t>
            <ul bare="true" empty="true" indent="2" spacing="compact" pn="section-toc.1-1.1.2">
              <li pn="section-toc.1-1.1.2.1">
                <t indent="0" keepWithNext="true" pn="section-toc.1-1.1.2.1.1"><xref derivedContent="1.1" format="counter" sectionFormat="of" target="section-1.1"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-notation">Notation</xref></t>
              </li>
            </ul>
          </li>
          <li pn="section-toc.1-1.2">
            <t indent="0" pn="section-toc.1-1.2.1"><xref derivedContent="2" format="counter" sectionFormat="of" target="section-2"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-characters-and-code-points">Characters and Code Points</xref></t>
            <ul bare="true" empty="true" indent="2" spacing="compact" pn="section-toc.1-1.2.2">
              <li pn="section-toc.1-1.2.2.1">
                <t indent="0" keepWithNext="true" pn="section-toc.1-1.2.2.1.1"><xref derivedContent="2.1" format="counter" sectionFormat="of" target="section-2.1"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-encoding-forms">Encoding Forms</xref></t>
              </li>
              <li pn="section-toc.1-1.2.2.2">
                <t indent="0" pn="section-toc.1-1.2.2.2.1"><xref derivedContent="2.2" format="counter" sectionFormat="of" target="section-2.2"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-problematic-code-points">Problematic Code Points</xref></t>
                <ul bare="true" empty="true" indent="2" spacing="compact" pn="section-toc.1-1.2.2.2.2">
                  <li pn="section-toc.1-1.2.2.2.2.1">
                    <t indent="0" pn="section-toc.1-1.2.2.2.2.1.1"><xref derivedContent="2.2.1" format="counter" sectionFormat="of" target="section-2.2.1"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-surrogates">Surrogates</xref></t>
                  </li>
                  <li pn="section-toc.1-1.2.2.2.2.2">
                    <t indent="0" pn="section-toc.1-1.2.2.2.2.2.1"><xref derivedContent="2.2.2" format="counter" sectionFormat="of" target="section-2.2.2"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-control-codes">Control Codes</xref></t>
                  </li>
                  <li pn="section-toc.1-1.2.2.2.2.3">
                    <t indent="0" pn="section-toc.1-1.2.2.2.2.3.1"><xref derivedContent="2.2.3" format="counter" sectionFormat="of" target="section-2.2.3"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-noncharacters">Noncharacters</xref></t>
                  </li>
                </ul>
              </li>
            </ul>
          </li>
          <li pn="section-toc.1-1.3">
            <t indent="0" pn="section-toc.1-1.3.1"><xref derivedContent="3" format="counter" sectionFormat="of" target="section-3"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-dealing-with-problematic-co">Dealing with Problematic Code Points</xref></t>
          </li>
          <li pn="section-toc.1-1.4">
            <t indent="0" pn="section-toc.1-1.4.1"><xref derivedContent="4" format="counter" sectionFormat="of" target="section-4"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-subsets">Subsets</xref></t>
            <ul bare="true" empty="true" indent="2" spacing="compact" pn="section-toc.1-1.4.2">
              <li pn="section-toc.1-1.4.2.1">
                <t indent="0" pn="section-toc.1-1.4.2.1.1"><xref derivedContent="4.1" format="counter" sectionFormat="of" target="section-4.1"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-unicode-scalars">Unicode Scalars</xref></t>
              </li>
              <li pn="section-toc.1-1.4.2.2">
                <t indent="0" pn="section-toc.1-1.4.2.2.1"><xref derivedContent="4.2" format="counter" sectionFormat="of" target="section-4.2"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-xml-characters">XML Characters</xref></t>
              </li>
              <li pn="section-toc.1-1.4.2.3">
                <t indent="0" pn="section-toc.1-1.4.2.3.1"><xref derivedContent="4.3" format="counter" sectionFormat="of" target="section-4.3"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-unicode-assignables">Unicode Assignables</xref></t>
              </li>
            </ul>
          </li>
          <li pn="section-toc.1-1.5">
            <t indent="0" pn="section-toc.1-1.5.1"><xref derivedContent="5" format="counter" sectionFormat="of" target="section-5"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-using-subsets">Using Subsets</xref></t>
          </li>
          <li pn="section-toc.1-1.6">
            <t indent="0" pn="section-toc.1-1.6.1"><xref derivedContent="6" format="counter" sectionFormat="of" target="section-6"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-iana-considerations">IANA Considerations</xref></t>
          </li>
          <li pn="section-toc.1-1.7">
            <t indent="0" pn="section-toc.1-1.7.1"><xref derivedContent="7" format="counter" sectionFormat="of" target="section-7"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-security-considerations">Security Considerations</xref></t>
          </li>
          <li pn="section-toc.1-1.8">
            <t indent="0" pn="section-toc.1-1.8.1"><xref derivedContent="8" format="counter" sectionFormat="of" target="section-8"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-references">References</xref></t>
            <ul bare="true" empty="true" indent="2" spacing="compact" pn="section-toc.1-1.8.2">
              <li pn="section-toc.1-1.8.2.1">
                <t indent="0" pn="section-toc.1-1.8.2.1.1"><xref derivedContent="8.1" format="counter" sectionFormat="of" target="section-8.1"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-normative-references">Normative References</xref></t>
              </li>
              <li pn="section-toc.1-1.8.2.2">
                <t indent="0" pn="section-toc.1-1.8.2.2.1"><xref derivedContent="8.2" format="counter" sectionFormat="of" target="section-8.2"/>.  <xref derivedContent="" format="title" sectionFormat="of" target="name-informative-references">Informative References</xref></t>
              </li>
            </ul>
          </li>
          <li pn="section-toc.1-1.9">
            <t indent="0" pn="section-toc.1-1.9.1"><xref derivedContent="" format="none" sectionFormat="of" target="section-appendix.a"/><xref derivedContent="" format="title" sectionFormat="of" target="name-acknowledgements">Acknowledgements</xref></t>
          </li>
          <li pn="section-toc.1-1.10">
            <t indent="0" pn="section-toc.1-1.10.1"><xref derivedContent="" format="none" sectionFormat="of" target="section-appendix.b"/><xref derivedContent="" format="title" sectionFormat="of" target="name-authors-addresses">Authors' Addresses</xref></t>
          </li>
        </ul>
      </section>
    </toc>
  </front>
  <middle>
    <section anchor="intro" numbered="true" removeInRFC="false" toc="include" pn="section-1">
      <name slugifiedName="name-introduction">Introduction</name>
      <t indent="0" pn="section-1-1">Protocols and data formats frequently contain or are made up of textual data.
Such text is normally composed of Unicode <xref target="UNICODE" format="default" sectionFormat="of" derivedContent="UNICODE"/> characters, to support use by speakers of many languages.
Unicode characters are represented by numeric code points, and the "set of all Unicode code points" is generally not a good choice for use in text fields.
Unicode recognizes different types of code points, not all of which are appropriate in protocols or even associated with characters.
Therefore, even if the desire is to support "all Unicode characters", a subset of the Unicode code point repertoire should be specified.
Subsets such as those discussed in this document are appropriate choices when more-specific limitations do not apply.</t>
      <t indent="0" pn="section-1-2">In this document, "subset" means a subset of the Unicode character repertoire.
This document specifies subsets that exclude some or all of the code points that are "problematic" as defined in <xref target="problematic" format="default" sectionFormat="of" derivedContent="Section 2.2"/>.
Authors should have a way to concisely and exactly reference a stable specification that identifies which subset a protocol or data format accepts.</t>
      <t indent="0" pn="section-1-3">This document discusses issues that apply in choosing subsets, names two subsets that have been popular in practice, and suggests one new subset.
The intended use is to serve as a convenient target for cross-reference from other specifications whose authors wish to exclude problematic code points from the data format or protocol being specified.</t>
      <t indent="0" pn="section-1-4">Note that this document only provides guidance on avoiding the use of code points that cannot be used for interoperable interchange of Unicode textual data.
Dealing with strings, particularly in the context of user interfaces, requires addressing language, text rendering direction, alternate representations of the same abstract character, and so on. 
These issues, among many others, led to efforts by the Unicode Consortium,
efforts by the IETF such as <xref target="IDN" format="default" sectionFormat="of" derivedContent="IDN"/> and <xref target="PRECIS" format="default" sectionFormat="of" derivedContent="PRECIS"/>,
and internationalization efforts by W3C such as <xref target="W3C-CHAR" format="default" sectionFormat="of" derivedContent="W3C-CHAR"/>.
The results of these efforts should be consulted by anyone engaging in such work.</t>
      <section anchor="notation" numbered="true" removeInRFC="false" toc="include" pn="section-1.1">
        <name slugifiedName="name-notation">Notation</name>
        <t indent="0" pn="section-1.1-1">In this document, the numeric values assigned to Unicode characters are provided in hexadecimal.
This document uses Unicode's standard notation of "U+" followed by four or more hexadecimal digits.
For example, "A", decimal 65, is expressed as U+0041, and "🖤" (Black Heart), decimal 128,420, is U+1F5A4.</t>
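        <t indent="0">As an informal illustration of the two notations used in this document, the following Python sketch formats an integer both ways. The helper names "u_plus" and "abnf_hex" are hypothetical; they are not defined by Unicode or by ABNF.</t>
        <sourcecode type="python" markers="false">
def u_plus(code_point: int) -> str:
    """Format an integer code point in Unicode's "U+" notation."""
    if code_point not in range(0x110000):
        raise ValueError("outside the Unicode codespace")
    # At least four hexadecimal digits, more when the value needs them.
    return f"U+{code_point:04X}"


def abnf_hex(code_point: int) -> str:
    """Format the same value as an ABNF hexadecimal terminal."""
    return f"%x{code_point:X}"


print(u_plus(ord("A")))    # U+0041
print(u_plus(128420))      # U+1F5A4
print(abnf_hex(0x1F5A4))   # %x1F5A4
</sourcecode>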
        <t indent="0" pn="section-1.1-2">Groups of numeric values described in <xref target="subsets" format="default" sectionFormat="of" derivedContent="Section 4"/> are given in ABNF <xref target="RFC5234" format="default" sectionFormat="of" derivedContent="RFC5234"/>.
In ABNF, hexadecimal values are preceded by "%x" rather than "U+".</t>
        <t indent="0" pn="section-1.1-3">All the numeric ranges in this document are inclusive.</t>
        <t indent="0" pn="section-1.1-4">The subsets are described in ABNF.</t>
      </section>
    </section>
    <section anchor="char-concepts" numbered="true" removeInRFC="false" toc="include" pn="section-2">
      <name slugifiedName="name-characters-and-code-points">Characters and Code Points</name>
      <t indent="0" pn="section-2-1">Definition D9 in Section 3.4 of <xref target="UNICODE" format="default" sectionFormat="of" derivedContent="UNICODE"/> defines "Unicode codespace" as "a range of integers from 0 to 10FFFF<sub>16</sub>".
Definition D10 defines "code point" as "Any value in the Unicode codespace".</t>
      <t indent="0" pn="section-2-2">The Unicode Standard's definition of "Unicode character" is conceptual.
However, each Unicode character is assigned a code point, which is used to represent the character in computer memory and storage systems and to specify allowed subsets in specifications.</t>
      <t indent="0" pn="section-2-3">There are 1,114,112 (17 * 2<sup>16</sup>) code points; as of Unicode 16.0 (2024), about 155,000 have been assigned to characters.
Since unassigned code points regularly become assigned when new characters are added to Unicode, it is usually not a good practice to specify that unassigned code points should be avoided.</t>
      <section anchor="encoding" numbered="true" removeInRFC="false" toc="include" pn="section-2.1">
        <name slugifiedName="name-encoding-forms">Encoding Forms</name>
        <t indent="0" pn="section-2.1-1">Unicode describes a variety of encoding forms that can be used to marshal code points into byte sequences.
A survey of these is beyond the scope of this document.
However, it is useful to note that "UTF-16" represents each code point with one or two 16-bit chunks, while "UTF-8" uses variable-length byte sequences <xref target="RFC3629" format="default" sectionFormat="of" derivedContent="RFC3629"/>.</t>
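        <t indent="0">The difference in encoded size can be observed with Python's built-in codecs. This is only a sketch; the helper name "encoded_sizes" is an assumption made for illustration.</t>
        <sourcecode type="python" markers="false">
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-16 code units, UTF-8 bytes) for one code point."""
    utf16_units = len(ch.encode("utf-16-be")) // 2   # two bytes per unit
    utf8_bytes = len(ch.encode("utf-8"))
    return utf16_units, utf8_bytes


print(encoded_sizes("A"))            # (1, 1)
print(encoded_sizes("\u00E9"))       # (1, 2)
print(encoded_sizes("\U0001F5A4"))   # (2, 4)
</sourcecode>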
        <t indent="0" pn="section-2.1-2">The "IETF Policy on Character Sets and Languages", BCP 18 <xref target="RFC2277" format="default" sectionFormat="of" derivedContent="RFC2277"/>, says "Protocols <bcp14>MUST</bcp14> be able to use the UTF-8 charset", which becomes a mandate to use UTF-8 for any protocol or data format that specifies a single encoding form.
UTF-8 is widely used for interoperable data formats such as JSON, YAML, CBOR, and XML.</t>
      </section>
      <section anchor="problematic" numbered="true" removeInRFC="false" toc="include" pn="section-2.2">
        <name slugifiedName="name-problematic-code-points">Problematic Code Points</name>
        <t indent="0" pn="section-2.2-1">This section classifies as "problematic" all the code points that can never represent useful text and that, in some cases, can lead to software misbehavior.
This is a low bar; the PRECIS <xref target="RFC8264" format="default" sectionFormat="of" derivedContent="RFC8264"/> framework's "IdentifierClass" and "FreeformClass" exclude many more code points that can cause problems when displayed to humans, in some cases presenting security risks.
Specifications of fields in protocols and data formats whose contents are designed for display to and interactions with humans would benefit from careful consideration of the issues described by PRECIS; its more-restrictive subsets might be better choices than those specified in this document.</t>
        <t indent="0" pn="section-2.2-2">Definition D10a in Section 3.4 of <xref target="UNICODE" format="default" sectionFormat="of" derivedContent="UNICODE"/> defines seven code point types.
Three types of code points are assigned to entities that are not actually characters or whose value as Unicode characters in text fields is questionable: "Surrogate", "Control", and "Noncharacter".
In this document, "problematic" refers to code points whose type is "Surrogate" or "Noncharacter" and to "legacy controls" as defined in <xref target="legacy-controls" format="default" sectionFormat="of" derivedContent="Section 2.2.2.2"/> below.</t>
        <t indent="0" pn="section-2.2-3">Definition D49 in <xref target="UNICODE" format="default" sectionFormat="of" derivedContent="UNICODE"/> concerns the "private-use" type, and Section 3.5.10 states that private-use code points "are considered to be assigned characters".
Section 23.5 further states that these characters' "use may be determined by private agreement among cooperating users".
Because private-use code points may have uses based on private agreements, this document does not classify them as "problematic".</t>
        <section anchor="surrogates" numbered="true" removeInRFC="false" toc="include" pn="section-2.2.1">
          <name slugifiedName="name-surrogates">Surrogates</name>
          <t indent="0" pn="section-2.2.1-1">A total of 2,048 code points, in the range U+D800-U+DFFF, are divided into two blocks called "high surrogates" and "low surrogates"; collectively, the 2,048 code points are referred to as "surrogates".
Section 23.6 of <xref target="UNICODE" format="default" sectionFormat="of" derivedContent="UNICODE"/> specifies how surrogates may be used in Unicode texts encoded in UTF-16,
where a high-surrogate/low-surrogate pair represents a code point greater than U+FFFF.</t>
          <t indent="0" pn="section-2.2.1-2">A surrogate that occurs in text encoded in any encoding form other than UTF-16 has no meaning.
In particular, Section 3.9.3 of <xref target="UNICODE" format="default" sectionFormat="of" derivedContent="UNICODE"/> forbids representing a surrogate in UTF-8.</t>
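          <t indent="0">The arithmetic that relates a surrogate pair to the code point it represents in UTF-16 can be sketched in Python as follows. The function names are hypothetical, and the code is illustrative rather than a reference implementation.</t>
          <sourcecode type="python" markers="false">
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into high and low surrogates."""
    if code_point not in range(0x10000, 0x110000):
        raise ValueError("only code points above U+FFFF use pairs")
    offset = code_point - 0x10000        # 20 significant bits
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset % 0x400)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a high/low surrogate pair back into one code point."""
    if high not in range(0xD800, 0xDC00):
        raise ValueError("first value is not a high surrogate")
    if low not in range(0xDC00, 0xE000):
        raise ValueError("second value is not a low surrogate")
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F5A4)
print(f"{high:04X} {low:04X}")                     # D83D DDA4
print(f"{from_surrogate_pair(high, low):04X}")     # 1F5A4
</sourcecode>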
        </section>
        <section anchor="controls" numbered="true" removeInRFC="false" toc="include" pn="section-2.2.2">
          <name slugifiedName="name-control-codes">Control Codes</name>
          <t indent="0" pn="section-2.2.2-1">Section 23.1 of <xref target="UNICODE" format="default" sectionFormat="of" derivedContent="UNICODE"/> introduces the control codes for compatibility with legacy pre-Unicode standards.
They comprise 65 code points in the ranges U+0000-U+001F ("C0 controls") and U+0080-U+009F ("C1 controls"), plus U+007F, "DEL".</t>
          <section anchor="useful-controls" numbered="true" removeInRFC="false" toc="exclude" pn="section-2.2.2.1">
            <name slugifiedName="name-useful-controls">Useful Controls</name>
            <t indent="0" pn="section-2.2.2.1-1">The C0 controls include newline (U+000A), carriage return (U+000D), and tab (U+0009); this document refers to these three characters as the "useful controls".</t>
          </section>
          <section anchor="legacy-controls" numbered="true" removeInRFC="false" toc="exclude" pn="section-2.2.2.2">
            <name slugifiedName="name-legacy-controls">Legacy Controls</name>
            <t indent="0" pn="section-2.2.2.2-1">Aside from the useful controls, both the C0 and C1 control codes are mostly obsolete and generally lack interoperable semantics.
This document uses the phrase "legacy controls" to describe control codes that are not useful controls.</t>
            <t indent="0" pn="section-2.2.2.2-2">Because the code points for C0 controls include the 32 smallest integers including zero, they are likely to occur in data as a result of programming errors.</t>
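            <t indent="0">A minimal Python sketch of this classification, using the ranges given in this section, is shown below. The constant and function names are assumptions chosen for illustration.</t>
            <sourcecode type="python" markers="false">
C0_CONTROLS = range(0x00, 0x20)     # U+0000-U+001F
C1_CONTROLS = range(0x80, 0xA0)     # U+0080-U+009F
DEL = 0x7F
# Tab, newline, and carriage return are the "useful controls".
USEFUL_CONTROLS = frozenset({0x09, 0x0A, 0x0D})


def is_control(cp: int) -> bool:
    return cp in C0_CONTROLS or cp in C1_CONTROLS or cp == DEL


def is_legacy_control(cp: int) -> bool:
    return is_control(cp) and cp not in USEFUL_CONTROLS


# Every control code lies below U+0100, so a small scan is enough.
print(sum(is_control(cp) for cp in range(0x100)))          # 65
print(sum(is_legacy_control(cp) for cp in range(0x100)))   # 62
</sourcecode>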
          </section>
        </section>
        <section anchor="noncharacters" numbered="true" removeInRFC="false" toc="include" pn="section-2.2.3">
          <name slugifiedName="name-noncharacters">Noncharacters</name>
          <t indent="0" pn="section-2.2.3-1">Certain code points are classified as "noncharacters", and <xref target="UNICODE" format="default" sectionFormat="of" derivedContent="UNICODE"/> asserts repeatedly that they are not designed or used for open interchange.</t>
          <t indent="0" pn="section-2.2.3-2">Code points are organized into 17 "planes", each containing 2<sup>16</sup> code points.
The last two code points in each plane are noncharacters: U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, and so on, up to U+10FFFE, U+10FFFF.</t>
          <t indent="0" pn="section-2.2.3-3">The code points in the range U+FDD0-U+FDEF are noncharacters.</t>
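          <t indent="0">The two rules above can be sketched in Python as a predicate plus an explicit enumeration; together they yield 32 + 34 = 66 noncharacters. The name "is_noncharacter" is hypothetical.</t>
          <sourcecode type="python" markers="false">
def is_noncharacter(cp: int) -> bool:
    if cp not in range(0x110000):
        return False                            # outside the codespace
    if cp in range(0xFDD0, 0xFDF0):             # U+FDD0-U+FDEF
        return True
    return cp % 0x10000 in (0xFFFE, 0xFFFF)     # last two of each plane


noncharacters = list(range(0xFDD0, 0xFDF0)) + [
    plane * 0x10000 + last
    for plane in range(17)
    for last in (0xFFFE, 0xFFFF)
]
print(len(noncharacters))                                  # 66
print(all(is_noncharacter(cp) for cp in noncharacters))    # True
</sourcecode>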
        </section>
      </section>
    </section>
    <section anchor="dealing" numbered="true" removeInRFC="false" toc="include" pn="section-3">
      <name slugifiedName="name-dealing-with-problematic-co">Dealing with Problematic Code Points</name>
      <t indent="0" pn="section-3-1">"Maintaining Robust Protocols" <xref target="RFC9413" format="default" sectionFormat="of" derivedContent="RFC9413"/> provides a thorough discussion of strategies for dealing with issues in input data.</t>
      <t indent="0" pn="section-3-2">Different types of problematic code points cause different issues.
Noncharacters and legacy controls are unlikely to cause software failures, but they cannot usefully be displayed to humans, and they can be used in attacks based on attempting to display text that includes them.</t>
      <t indent="0" pn="section-3-3">The behavior of software that encounters surrogates is unpredictable and differs among programming-language implementations and even between different API calls in the same language.</t>
      <t indent="0" pn="section-3-4">Section 3.9 of <xref target="UNICODE" format="default" sectionFormat="of" derivedContent="UNICODE"/> makes it clear that a UTF-8 byte sequence that would map to a surrogate is ill-formed.
If a specification requires that input data be encoded with UTF-8, and if all input were well-formed, implementors would never have to concern themselves with surrogates.</t>
      <t indent="0" pn="section-3-5">Unfortunately, industry experience teaches that problematic code points, including surrogates, can and do occur in program input where the source of input data is not controlled by the implementor.
In particular, the specification of JSON allows any code point to appear in object member names and string values <xref target="RFC8259" format="default" sectionFormat="of" derivedContent="RFC8259"/>.</t>
      <t indent="0" pn="section-3-6">For example, the following is a conforming JSON text:</t>
      <sourcecode type="json" markers="false" pn="section-3-7">{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}</sourcecode>
      <t indent="0" pn="section-3-8">The value of the "example" field contains the C0 control NUL, the C1 control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired surrogate, and the noncharacter U+7FFFF encoded per JSON rules as two escaped UTF-16 surrogate code points as described in <xref target="RFC8259" section="7" format="default" sectionFormat="of" derivedLink="https://rfc-editor.org/rfc/rfc8259#section-7" derivedContent="RFC8259"/>.
It is unlikely to be useful as the value of a text field.
That value cannot be serialized into well-formed UTF-8, but the behavior of libraries asked to parse the sample is unpredictable; some will silently parse this and generate an ill-formed UTF-8 string.</t>
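      <t indent="0">How one parser treats the sample can be observed with the "json" module in Python's standard library. The sketch below assumes CPython's behavior, which accepts the text without complaint; other parsers may reject it or behave differently.</t>
      <sourcecode type="python" markers="false">
import json

# A raw string keeps the backslash escapes for the JSON parser.
text = r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'
value = json.loads(text)["example"]

# The escaped surrogate pair is combined; the unpaired one is kept.
print([f"U+{ord(ch):04X}" for ch in value])
# ['U+0000', 'U+0089', 'U+DEAD', 'U+7FFFF']

try:
    value.encode("utf-8")
    serializable = True
except UnicodeEncodeError:
    # Python's strict UTF-8 encoder refuses the unpaired surrogate.
    serializable = False
print(serializable)   # False
</sourcecode>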
      <t indent="0" pn="section-3-9">Two reasonable options for dealing with problematic input are either rejecting text containing problematic code points or replacing the problematic code points with placeholders.</t>
      <t indent="0" pn="section-3-10">Silently deleting an ill-formed part of a string is a known security risk.
Responding to that risk, Section 3.2 of <xref target="UNICODE" format="default" sectionFormat="of" derivedContent="UNICODE"/> recommends dealing with ill-formed byte sequences by signaling an error or replacing problematic code points, ideally with "�" (U+FFFD, REPLACEMENT CHARACTER).</t>
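      <t indent="0">Both options can be sketched in Python for text that has already been decoded into code points. The function names are hypothetical, and the sketch does not address ill-formed byte sequences, which a decoder would have to handle before this step.</t>
      <sourcecode type="python" markers="false">
REPLACEMENT_CHARACTER = "\uFFFD"


def is_problematic(cp: int) -> bool:
    if cp in range(0xD800, 0xE000):            # surrogates
        return True
    if cp in range(0x00, 0x20):                # C0 controls
        return cp not in (0x09, 0x0A, 0x0D)    # keep useful controls
    if cp in range(0x7F, 0xA0):                # DEL and C1 controls
        return True
    if cp in range(0xFDD0, 0xFDF0):            # noncharacters
        return True
    return cp % 0x10000 in (0xFFFE, 0xFFFF)    # last two of each plane


def reject_problematic(text: str) -> str:
    """Option 1: refuse the whole text."""
    for index, ch in enumerate(text):
        if is_problematic(ord(ch)):
            raise ValueError(
                f"problematic U+{ord(ch):04X} at index {index}"
            )
    return text


def replace_problematic(text: str) -> str:
    """Option 2: substitute U+FFFD; nothing is silently deleted."""
    return "".join(
        REPLACEMENT_CHARACTER if is_problematic(ord(ch)) else ch
        for ch in text
    )
</sourcecode>
      <t indent="0">Replacement keeps the length of the text in code points unchanged, which avoids the risk associated with silent deletion.</t>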
    </section>
    <section anchor="subsets" numbered="true" removeInRFC="false" toc="include" pn="section-4">
      <name slugifiedName="name-subsets">Subsets</name>
      <t indent="0" pn="section-4-1">This section describes three increasingly restrictive subsets that can be used in specifying acceptable content for text fields in protocols and data formats.
Specifications can refer to these subsets by the names "Unicode Scalars", "XML Characters", and "Unicode Assignables".</t>
      <section anchor="scalars" numbered="true" removeInRFC="false" toc="include" pn="section-4.1">
        <name slugifiedName="name-unicode-scalars">Unicode Scalars</name>
        <t indent="0" pn="section-4.1-1">Definition D76 in Section 3.9 of <xref target="UNICODE" format="default" sectionFormat="of" derivedContent="UNICODE"/> defines the term "Unicode scalar value" as "Any Unicode code point except high-surrogate and low-surrogate code points".</t>
        <t indent="0" pn="section-4.1-2">The "Unicode Scalars" subset can be expressed as an ABNF production:</t>
        <sourcecode type="abnf" markers="false" pn="section-4.1-3">
unicode-scalar =
   %x0-D7FF /    ; exclude surrogates
   %xE000-10FFFF
</sourcecode>
        <t indent="0" pn="section-4.1-4">This subset is the default for Concise Binary Object Representation (CBOR) <xref target="RFC8949" format="default" sectionFormat="of" derivedContent="RFC8949"/> and has the advantage of excluding surrogates.
However, it includes legacy controls and noncharacters.</t>
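        <t indent="0">A membership test for this subset is short. The Python sketch below uses hypothetical function names; it matters in Python because a "str" value can hold surrogate code points.</t>
        <sourcecode type="python" markers="false">
def is_unicode_scalar(cp: int) -> bool:
    # %x0-D7FF / %xE000-10FFFF
    return cp in range(0x0, 0xD800) or cp in range(0xE000, 0x110000)


def all_unicode_scalars(text: str) -> bool:
    return all(is_unicode_scalar(ord(ch)) for ch in text)


print(all_unicode_scalars("A\x00\uFFFE"))   # True
print(all_unicode_scalars("A\uDEAD"))       # False
</sourcecode>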
      </section>
      <section anchor="xml" numbered="true" removeInRFC="false" toc="include" pn="section-4.2">
        <name slugifiedName="name-xml-characters">XML Characters</name>
        <t indent="0" pn="section-4.2-1">The XML 1.0 Specification (Fifth Edition) <xref target="XML" format="default" sectionFormat="of" derivedContent="XML"/>, in its grammar production labeled "Char", specifies a subset of Unicode code points that excludes surrogates, legacy C0 controls, and the noncharacters U+FFFE and U+FFFF.</t>
        <t indent="0" pn="section-4.2-2">The "XML Characters" subset can be expressed as an ABNF production:</t>
        <sourcecode type="abnf" markers="false" pn="section-4.2-3">
xml-character =
   %x9 / %xA / %xD /   ; useful controls
   %x20-D7FF /         ; exclude surrogates
   %xE000-FFFD /       ; exclude FFFE and FFFF nonchars
   %x10000-10FFFF
</sourcecode>
        <t indent="0" pn="section-4.2-4">While this subset does not exclude all the problematic code points, the C1 controls are less likely than the C0 controls to appear erroneously in data and have not been observed to be a frequent source of problems.
Also, the noncharacters greater in value than U+FFFF are rarely encountered.</t>
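        <t indent="0">One way to test a whole string against this subset is a regular expression whose character class mirrors the ABNF. The following Python sketch uses the standard "re" module; the names "XML_CHARACTERS" and "is_xml_characters" are assumptions made for illustration.</t>
        <sourcecode type="python" markers="false">
import re

# Character class transcribed from the xml-character production.
XML_CHARACTERS = re.compile(
    r"[\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*"
)


def is_xml_characters(text: str) -> bool:
    return XML_CHARACTERS.fullmatch(text) is not None


print(is_xml_characters("caf\u00E9\n"))    # True
print(is_xml_characters("\x85"))           # True: C1 control allowed
print(is_xml_characters("\U0001FFFE"))     # True: noncharacter allowed
print(is_xml_characters("\x00"))           # False: legacy C0 control
print(is_xml_characters("\uFFFE"))         # False
</sourcecode>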
      </section>
      <section anchor="unicode-assignables" numbered="true" removeInRFC="false" toc="include" pn="section-4.3">
        <name slugifiedName="name-unicode-assignables">Unicode Assignables</name>
        <t indent="0" pn="section-4.3-1">This document defines the "Unicode Assignables" subset as all the Unicode code points that are not problematic.
This subset, which is a proper subset of each of the other two, comprises all code points that are currently assigned, excluding legacy control codes, or that might be assigned in the future.</t>
        <t indent="0" pn="section-4.3-2">Unicode Assignables can be expressed as an ABNF production:</t>
        <sourcecode type="abnf" markers="false" pn="section-4.3-3">
unicode-assignable =
   %x9 / %xA / %xD /               ; useful controls
   %x20-7E /                       ; exclude C1 controls and DEL
   %xA0-D7FF /                     ; exclude surrogates
   %xE000-FDCF /                   ; exclude FDD0 nonchars
   %xFDF0-FFFD /                   ; exclude FFFE and FFFF nonchars
   %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
   %x30000-3FFFD / %x40000-4FFFD /
   %x50000-5FFFD / %x60000-6FFFD /
   %x70000-7FFFD / %x80000-8FFFD /
   %x90000-9FFFD / %xA0000-AFFFD /
   %xB0000-BFFFD / %xC0000-CFFFD /
   %xD0000-DFFFD / %xE0000-EFFFD /
   %xF0000-FFFFD / %x100000-10FFFD
</sourcecode>
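        <t indent="0">To compare the three subsets, the ABNF productions in this section can be transcribed into tables of inclusive ranges. The Python sketch below does so and counts the members of each subset; the names "SUBSETS", "in_subset", and "subset_size" are hypothetical.</t>
        <sourcecode type="python" markers="false">
def plane_ranges():
    # Planes 1 through 16, each ending at xFFFD ("repeat per plane").
    return [(p * 0x10000, p * 0x10000 + 0xFFFD) for p in range(1, 17)]


SUBSETS = {
    "unicode-scalar": [(0x0, 0xD7FF), (0xE000, 0x10FFFF)],
    "xml-character": [
        (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
        (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
    ],
    "unicode-assignable": [
        (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
        (0x20, 0x7E), (0xA0, 0xD7FF),
        (0xE000, 0xFDCF), (0xFDF0, 0xFFFD),
    ] + plane_ranges(),
}


def in_subset(name: str, cp: int) -> bool:
    return any(cp in range(lo, hi + 1) for lo, hi in SUBSETS[name])


def subset_size(name: str) -> int:
    return sum(hi - lo + 1 for lo, hi in SUBSETS[name])


for name in SUBSETS:
    print(name, subset_size(name))
# unicode-scalar 1112064
# xml-character 1112033
# unicode-assignable 1111936
</sourcecode>
        <t indent="0">The final count equals the 1,114,112 code points minus 2,048 surrogates, 62 legacy controls, and 66 noncharacters.</t>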
      </section>
    </section>
    <section anchor="restricting" numbered="true" removeInRFC="false" toc="include" pn="section-5">
      <name slugifiedName="name-using-subsets">Using Subsets</name>
      <t indent="0" pn="section-5-1">Many IETF specifications rely on well-known data formats such as JSON, Internet JSON (I-JSON), CBOR, YAML, and XML.
These formats specify default subsets.
For example, JSON allows object member names and string values to include any Unicode code point, including all the problematic types.</t>
      <t indent="0" pn="section-5-2">A protocol based on JSON can be made more robust and implementor-friendly by restricting the contents of object member names and string values to one of the subsets described in <xref target="subsets" format="default" sectionFormat="of" derivedContent="Section 4"/>.
Equivalent restrictions are possible for other packaging formats such as I-JSON, XML, YAML, and CBOR.</t>
      <t indent="0" pn="section-5-3">Note that escaping techniques such as those in the JSON example in <xref target="dealing" format="default" sectionFormat="of" derivedContent="Section 3"/> cannot be used to circumvent this sort of restriction, which applies to the data content and not to its textual representation in a packaging format.
If a specification restricted a JSON field value to the Unicode Assignables, the example would remain a conforming JSON text, but the data it represents would contain code points that are not Unicode Assignables.</t>
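      <t indent="0">The distinction between textual representation and data content can be seen by decoding an escaped JSON text and inspecting the result. In the Python sketch below, the literal ESCAPED_TEXT is an assumed stand-in that uses escapes of the kind discussed above and is not quoted from this section; the function names are likewise hypothetical.</t>
      <sourcecode type="python" markers="false">
import json

# Hypothetical helper: ranges of the Unicode Assignables below U+10000.
ALLOWED = (
    range(0x9, 0xB), range(0xD, 0xE), range(0x20, 0x7F),
    range(0xA0, 0xD800), range(0xE000, 0xFDD0), range(0xFDF0, 0xFFFE),
)


def is_unicode_assignable(cp: int) -> bool:
    if any(cp in r for r in ALLOWED):
        return True
    upper = cp in range(0x10000, 0x110000)
    return upper and cp % 0x10000 not in (0xFFFE, 0xFFFF)


# Assumed example text: every character of the text itself is printable
# ASCII, but the escapes denote excluded code points.
ESCAPED_TEXT = r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'


def excluded_code_points(json_text: str, member: str) -> list:
    """Decode one string member and list its excluded code points."""
    value = json.loads(json_text)[member]
    return [ord(ch) for ch in value if not is_unicode_assignable(ord(ch))]


if __name__ == "__main__":
    for cp in excluded_code_points(ESCAPED_TEXT, "example"):
        print(f"U+{cp:04X}")
</sourcecode>
      <t indent="0">With Python's json module, the text parses without error, yet the decoded value consists of U+0000, U+0089, the unpaired surrogate U+DEAD, and the noncharacter U+7FFFF, none of which is a Unicode Assignable.</t>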
    </section>
    <section anchor="iana-considerations" numbered="true" removeInRFC="false" toc="include" pn="section-6">
      <name slugifiedName="name-iana-considerations">IANA Considerations</name>
      <t indent="0" pn="section-6-1">This document has no IANA actions.</t>
    </section>
    <section anchor="security-considerations" numbered="true" removeInRFC="false" toc="include" pn="section-7">
      <name slugifiedName="name-security-considerations">Security Considerations</name>
      <t indent="0" pn="section-7-1"><xref target="dealing" format="default" sectionFormat="of" derivedContent="Section 3"/> of this document discusses security issues.</t>
      <t indent="0" pn="section-7-2">Unicode Security Considerations <xref target="TR36" format="default" sectionFormat="of" derivedContent="TR36"/> is a wide-ranging survey of the issues implementors should consider while writing software to process Unicode text.
Unicode Source Code Handling <xref target="TR55" format="default" sectionFormat="of" derivedContent="TR55"/> discusses use of Unicode in programming languages, with a focus on security issues.
Many of the attacks they discuss are aimed at deceiving human readers, but vulnerabilities involving issues such as surrogates and noncharacters are also covered and, in fact, can contribute to human-deceiving exploits.</t>
      <t indent="0" pn="section-7-3">The security considerations in <xref target="RFC8264" section="12" format="default" sectionFormat="of" derivedLink="https://rfc-editor.org/rfc/rfc8264#section-12" derivedContent="RFC8264"/> generally apply to this document as well.</t>
      <t indent="0" pn="section-7-4">Note that the Unicode-character subsets specified in this document are increasingly restrictive, omitting more and more problematic code points, and thus should be less and less susceptible to many of these exploits.
The subset in <xref target="unicode-assignables" format="default" sectionFormat="of" derivedContent="Section 4.3"/>, "Unicode Assignables", excludes all of these code points.</t>
    </section>
  </middle>
  <back>
    <references pn="section-8">
      <name slugifiedName="name-references">References</name>
      <references pn="section-8.1">
        <name slugifiedName="name-normative-references">Normative References</name>
        <reference anchor="RFC5234" target="https://www.rfc-editor.org/info/rfc5234" quoteTitle="true" derivedAnchor="RFC5234">
          <front>
            <title>Augmented BNF for Syntax Specifications: ABNF</title>
            <author fullname="D. Crocker" initials="D." role="editor" surname="Crocker"/>
            <author fullname="P. Overell" initials="P." surname="Overell"/>
            <date month="January" year="2008"/>
            <abstract>
              <t indent="0">Internet technical specifications often need to define a formal syntax. Over the years, a modified version of Backus-Naur Form (BNF), called Augmented BNF (ABNF), has been popular among many Internet specifications. The current specification documents ABNF. It balances compactness and simplicity with reasonable representational power. The differences between standard BNF and ABNF involve naming rules, repetition, alternatives, order-independence, and value ranges. This specification also supplies additional rule definitions and encoding for a core lexical analyzer of the type common to several Internet specifications. [STANDARDS-TRACK]</t>
            </abstract>
          </front>
          <seriesInfo name="STD" value="68"/>
          <seriesInfo name="RFC" value="5234"/>
          <seriesInfo name="DOI" value="10.17487/RFC5234"/>
        </reference>
        <reference anchor="TR36" target="https://www.unicode.org/reports/tr36/" quoteTitle="true" derivedAnchor="TR36">
          <front>
            <title abbrev="Unicode Security Considerations">Unicode Security Considerations</title>
            <author fullname="Mark Davis" role="editor"/>
            <author fullname="Michel Suignard" role="editor"/>
          </front>
        </reference>
        <reference anchor="TR55" target="https://www.unicode.org/reports/tr55/" quoteTitle="true" derivedAnchor="TR55">
          <front>
            <title abbrev="Unicode Source Code Handling">Unicode Source Code Handling</title>
            <author fullname="Robin Leroy" role="editor"/>
            <author fullname="Mark Davis" role="editor"/>
          </front>
        </reference>
        <reference anchor="UNICODE" target="http://www.unicode.org/versions/latest/" quoteTitle="true" derivedAnchor="UNICODE">
          <front>
            <title abbrev="Unicode">The Unicode Standard</title>
            <author>
              <organization showOnFrontPage="true">The Unicode Consortium</organization>
              <address/>
            </author>
          </front>
          <annotation>Note that this reference is to the latest version of
Unicode, rather than to a specific release. It is not expected that
future changes in the Unicode Standard will affect the referenced
definitions.</annotation>
        </reference>
      </references>
      <references pn="section-8.2">
        <name slugifiedName="name-informative-references">Informative References</name>
        <reference anchor="IDN" target="https://datatracker.ietf.org/group/idn/" quoteTitle="true" derivedAnchor="IDN">
          <front>
            <title>Internationalized Domain Name Working Group</title>
            <author>
              <organization showOnFrontPage="true"/>
            </author>
            <date/>
          </front>
        </reference>
        <reference anchor="PRECIS" target="https://datatracker.ietf.org/group/precis/" quoteTitle="true" derivedAnchor="PRECIS">
          <front>
            <title>PRECIS Working Group</title>
            <author>
              <organization showOnFrontPage="true"/>
            </author>
            <date/>
          </front>
        </reference>
        <reference anchor="RFC2277" target="https://www.rfc-editor.org/info/rfc2277" quoteTitle="true" derivedAnchor="RFC2277">
          <front>
            <title>IETF Policy on Character Sets and Languages</title>
            <author fullname="H. Alvestrand" initials="H." surname="Alvestrand"/>
            <date month="January" year="1998"/>
            <abstract>
              <t indent="0">This document is the current policies being applied by the Internet Engineering Steering Group (IESG) towards the standardization efforts in the Internet Engineering Task Force (IETF) in order to help Internet protocols fulfill these requirements. This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements.</t>
            </abstract>
          </front>
          <seriesInfo name="BCP" value="18"/>
          <seriesInfo name="RFC" value="2277"/>
          <seriesInfo name="DOI" value="10.17487/RFC2277"/>
        </reference>
        <reference anchor="RFC3629" target="https://www.rfc-editor.org/info/rfc3629" quoteTitle="true" derivedAnchor="RFC3629">
          <front>
            <title>UTF-8, a transformation format of ISO 10646</title>
            <author fullname="F. Yergeau" initials="F." surname="Yergeau"/>
            <date month="November" year="2003"/>
            <abstract>
              <t indent="0">ISO/IEC 10646-1 defines a large character set called the Universal Character Set (UCS) which encompasses most of the world's writing systems. The originally proposed encodings of the UCS, however, were not compatible with many current applications and protocols, and this has led to the development of UTF-8, the object of this memo. UTF-8 has the characteristic of preserving the full US-ASCII range, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. This memo obsoletes and replaces RFC 2279.</t>
            </abstract>
          </front>
          <seriesInfo name="STD" value="63"/>
          <seriesInfo name="RFC" value="3629"/>
          <seriesInfo name="DOI" value="10.17487/RFC3629"/>
        </reference>
        <reference anchor="RFC8259" target="https://www.rfc-editor.org/info/rfc8259" quoteTitle="true" derivedAnchor="RFC8259">
          <front>
            <title>The JavaScript Object Notation (JSON) Data Interchange Format</title>
            <author fullname="T. Bray" initials="T." role="editor" surname="Bray"/>
            <date month="December" year="2017"/>
            <abstract>
              <t indent="0">JavaScript Object Notation (JSON) is a lightweight, text-based, language-independent data interchange format. It was derived from the ECMAScript Programming Language Standard. JSON defines a small set of formatting rules for the portable representation of structured data.</t>
              <t indent="0">This document removes inconsistencies with other specifications of JSON, repairs specification errors, and offers experience-based interoperability guidance.</t>
            </abstract>
          </front>
          <seriesInfo name="STD" value="90"/>
          <seriesInfo name="RFC" value="8259"/>
          <seriesInfo name="DOI" value="10.17487/RFC8259"/>
        </reference>
        <reference anchor="RFC8264" target="https://www.rfc-editor.org/info/rfc8264" quoteTitle="true" derivedAnchor="RFC8264">
          <front>
            <title>PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols</title>
            <author fullname="P. Saint-Andre" initials="P." surname="Saint-Andre"/>
            <author fullname="M. Blanchet" initials="M." surname="Blanchet"/>
            <date month="October" year="2017"/>
            <abstract>
              <t indent="0">Application protocols using Unicode code points in protocol strings need to properly handle such strings in order to enforce internationalization rules for strings placed in various protocol slots (such as addresses and identifiers) and to perform valid comparison operations (e.g., for purposes of authentication or authorization). This document defines a framework enabling application protocols to perform the preparation, enforcement, and comparison of internationalized strings ("PRECIS") in a way that depends on the properties of Unicode code points and thus is more agile with respect to versions of Unicode. As a result, this framework provides a more sustainable approach to the handling of internationalized strings than the previous framework, known as Stringprep (RFC 3454). This document obsoletes RFC 7564.</t>
            </abstract>
          </front>
          <seriesInfo name="RFC" value="8264"/>
          <seriesInfo name="DOI" value="10.17487/RFC8264"/>
        </reference>
        <reference anchor="RFC8949" target="https://www.rfc-editor.org/info/rfc8949" quoteTitle="true" derivedAnchor="RFC8949">
          <front>
            <title>Concise Binary Object Representation (CBOR)</title>
            <author fullname="C. Bormann" initials="C." surname="Bormann"/>
            <author fullname="P. Hoffman" initials="P." surname="Hoffman"/>
            <date month="December" year="2020"/>
            <abstract>
              <t indent="0">The Concise Binary Object Representation (CBOR) is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation. These design goals make it different from earlier binary serializations such as ASN.1 and MessagePack.</t>
              <t indent="0">This document obsoletes RFC 7049, providing editorial improvements, new details, and errata fixes while keeping full compatibility with the interchange format of RFC 7049. It does not create a new version of the format.</t>
            </abstract>
          </front>
          <seriesInfo name="STD" value="94"/>
          <seriesInfo name="RFC" value="8949"/>
          <seriesInfo name="DOI" value="10.17487/RFC8949"/>
        </reference>
        <reference anchor="RFC9413" target="https://www.rfc-editor.org/info/rfc9413" quoteTitle="true" derivedAnchor="RFC9413">
          <front>
            <title>Maintaining Robust Protocols</title>
            <author fullname="M. Thomson" initials="M." surname="Thomson"/>
            <author fullname="D. Schinazi" initials="D." surname="Schinazi"/>
            <date month="June" year="2023"/>
            <abstract>
              <t indent="0">The main goal of the networking standards process is to enable the long-term interoperability of protocols. This document describes active protocol maintenance, a means to accomplish that goal. By evolving specifications and implementations, it is possible to reduce ambiguity over time and create a healthy ecosystem.</t>
              <t indent="0">The robustness principle, often phrased as "be conservative in what you send, and liberal in what you accept", has long guided the design and implementation of Internet protocols. However, it has been interpreted in a variety of ways. While some interpretations help ensure the health of the Internet, others can negatively affect interoperability over time. When a protocol is actively maintained, protocol designers and implementers can avoid these pitfalls.</t>
            </abstract>
          </front>
          <seriesInfo name="RFC" value="9413"/>
          <seriesInfo name="DOI" value="10.17487/RFC9413"/>
        </reference>
        <reference anchor="W3C-CHAR" target="https://www.w3.org/International/articles/definitions-characters/" quoteTitle="true" derivedAnchor="W3C-CHAR">
          <front>
            <title>Character encodings: Essential concepts</title>
            <author>
              <organization showOnFrontPage="true">W3C</organization>
            </author>
            <date/>
          </front>
        </reference>
        <reference anchor="XML" target="http://www.w3.org/TR/2008/REC-xml-20081126/" quoteTitle="true" derivedAnchor="XML">
          <front>
            <title abbrev="XML 1.0">Extensible Markup Language (XML) 1.0 (Fifth Edition)</title>
            <author fullname="Tim Bray" surname="Bray" role="editor">
              <organization showOnFrontPage="true">Textuality and Netscape</organization>
            </author>
            <author fullname="Jean Paoli" surname="Paoli" role="editor">
              <organization showOnFrontPage="true">Microsoft</organization>
            </author>
            <author fullname="C.M. Sperberg-McQueen" initials="C.M." surname="Sperberg-McQueen" role="editor">
              <organization showOnFrontPage="true">W3C</organization>
            </author>
            <author fullname="Eve Maler" surname="Maler" role="editor">
              <organization showOnFrontPage="true">Sun Microsystems, Inc.</organization>
            </author>
            <author fullname="François Yergeau" surname="Yergeau" role="editor"/>
            <date year="2008" month="November" day="26"/>
          </front>
          <refcontent>W3C Recommendation</refcontent>
        </reference>
      </references>
    </references>
    <section numbered="false" anchor="acknowledgements" removeInRFC="false" toc="include" pn="section-appendix.a">
      <name slugifiedName="name-acknowledgements">Acknowledgements</name>
      <t indent="0" pn="section-appendix.a-1">Thanks are due to <contact fullname="Guillaume Fortin-Debigaré"/>, who
filed an errata report against RFC 8259, "The JavaScript Object Notation (JSON) Data Interchange Format",
noting frequent references to "Unicode characters", when in fact the RFC
formally specifies the use of Unicode code points.</t>
      <t indent="0" pn="section-appendix.a-2">Thanks also to <contact fullname="Asmus Freytag"/> for careful review and
many constructive suggestions aimed at making the language more consistent
with the structure of the Unicode Standard.</t>
      <t indent="0" pn="section-appendix.a-3">Thanks also to <contact fullname="James Manger"/> for the correctness of
the ABNF and JSON samples.</t>
      <t indent="0" pn="section-appendix.a-4">Thanks also to <contact fullname="Addison Phillips"/> and the W3C
Internationalization Working Group for helpful suggestions on language and
references.</t>
      <t indent="0" pn="section-appendix.a-5">Thoughtful comments during the many draft versions of this document, which helped
tighten up wording and make difficult points clearer, were contributed by
<contact fullname="Harald Alvestrand"/>, <contact fullname="Martin J. Dürst"/>,
<contact fullname="Donald E. Eastlake"/>, <contact fullname="John Klensin"/>,
<contact fullname="Barry Leiba"/>, <contact fullname="Glyn Normington"/>,
<contact fullname="Peter Saint-Andre"/>, and <contact fullname="Rob Sayre"/>.</t>
    </section>
    <section anchor="authors-addresses" numbered="false" removeInRFC="false" toc="include" pn="section-appendix.b">
      <name slugifiedName="name-authors-addresses">Authors' Addresses</name>
      <author initials="T." surname="Bray" fullname="Tim Bray">
        <organization showOnFrontPage="true">Textuality Services</organization>
        <address>
          <email>tbray@textuality.com</email>
        </address>
      </author>
      <author initials="P." surname="Hoffman" fullname="Paul Hoffman">
        <organization showOnFrontPage="true">ICANN</organization>
        <address>
          <email>paul.hoffman@icann.org</email>
        </address>
      </author>
    </section>
  </back>
</rfc>
