<?xml version="1.0" encoding="US-ASCII"?>

<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
  <!ENTITY RFC1945 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.1945.xml">
  <!ENTITY RFC2046 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2046.xml">
  <!ENTITY RFC2119 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml">
  <!ENTITY RFC2616 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2616.xml">
  <!ENTITY RFC3629 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.3629.xml">
  <!ENTITY RFC3986 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.3986.xml">
  <!ENTITY RFC5234 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5234.xml">
  <!ENTITY RFC8174 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml">
  <!ENTITY RFC8288 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8288.xml">
]>

<rfc ipr="trust200902" category="info" docName="draft-koster-rep-07" >

<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>

<?rfc toc="yes" ?>
<?rfc tocdepth="4" ?>
<?rfc symrefs="yes" ?>
<?rfc sortrefs="yes"?>
<?rfc compact="yes" ?>
<?rfc subcompact="no"?>

  <front>
    <title abbrev="REP">Robots Exclusion Protocol</title>

    <author initials="M." surname="Koster" fullname="Martijn Koster" role="editor">
      <organization>Stalworthy Computing, Ltd.</organization>
      <address>
        <postal>
          <street>Suton Lane</street>
          <city>Wymondham, Norfolk</city>
          <code>NR18 9JG</code>
          <country>United Kingdom</country>
        </postal>
        <email>m.koster@greenhills.co.uk</email>
      </address>
    </author>
    <author initials="G." surname="Illyes" fullname="Gary Illyes" role="editor">
      <organization>Google LLC.</organization>
      <address>
        <postal>
          <street>Brandschenkestrasse 110</street>
          <city>Zurich</city>
          <code>8002</code>
          <country>Switzerland</country>
        </postal>
        <email>garyillyes@google.com</email>
      </address>
    </author>
    <author initials="H." surname="Zeller" fullname="Henner Zeller" role="editor">
      <organization>Google LLC.</organization>
      <address>
        <postal>
          <street>1600 Amphitheatre Pkwy</street>
          <city>Mountain View, CA</city>
          <code>94043</code>
          <country>USA</country>
        </postal>
        <email>henner@google.com</email>
      </address>
    </author>
    <author initials="L." surname="Sassman" fullname="Lizzi Sassman" role="editor">
      <organization>Google LLC.</organization>
      <address>
        <postal>
          <street>Brandschenkestrasse 110</street>
          <city>Zurich</city>
          <code>8002</code>
          <country>Switzerland</country>
        </postal>
        <email>lizzi@google.com</email>
      </address>
    </author>

    <date year="2022" month="May" day="5"/>

    <area>General</area>

    <keyword>internet-drafts</keyword>

    <abstract>
      <t> This document specifies and extends the &quot;Robots Exclusion Protocol&quot;
          method originally defined by Martijn Koster in 1994 for service owners
          to control how content served by their services may be accessed, if at
          all, by automatic clients known as crawlers. </t>
    </abstract>
  </front>

  <middle>
    <section anchor="introduction" title="Introduction">
      <t> This document applies to services that provide resources that clients
          can access through URIs as defined in <xref target="RFC3986"/>. For example,
          in the context of HTTP, a browser is a client that displays the content of a
          web page. </t>

      <t> Crawlers are automated clients. Search engines, for instance, have crawlers to
          recursively traverse links for indexing as defined in
          <xref target="RFC8288"/>. </t>

      <t> It may be inconvenient for service owners if crawlers visit the entirety of
          their URI space. This document specifies the rules originally defined by
          the &quot;Robots Exclusion Protocol&quot; <xref target="ROBOTSTXT"/> that crawlers
          are expected to obey when accessing URIs. </t>

      <t> These rules are not a form of access authorization. </t>

      <section anchor="requirements-language" title="Requirements Language">
        <t> The key words &quot;<bcp14>MUST</bcp14>&quot;,
            &quot;<bcp14>MUST NOT</bcp14>&quot;, &quot;<bcp14>REQUIRED</bcp14>&quot;,
            &quot;<bcp14>SHALL</bcp14>&quot;, &quot;<bcp14>SHALL NOT</bcp14>&quot;,
            &quot;<bcp14>SHOULD</bcp14>&quot;, &quot;<bcp14>SHOULD NOT</bcp14>&quot;,
            &quot;<bcp14>RECOMMENDED</bcp14>&quot;,
            &quot;<bcp14>NOT RECOMMENDED</bcp14>&quot;, &quot;<bcp14>MAY</bcp14>&quot;,
            and &quot;<bcp14>OPTIONAL</bcp14>&quot; in this document are to be
            interpreted as described in
            BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only
            when, they appear in all capitals, as shown here. </t>
      </section>
    </section>
    <section anchor="specification" title="Specification">
      <section anchor="protocol-definition" title="Protocol Definition">
        <t> The protocol language consists of rule(s) and group(s) that the service
            makes available in a file named &#39;robots.txt&#39; as described in
            section 2.3: </t>
        <t>
          <list style="symbols">
            <t> Rule: A line with a key-value pair that defines how a
                crawler may access URIs. See section 2.2.2. </t>
            <t> Group: One or more user-agent lines that are followed by
                one or more rules. The group is terminated by a user-agent line
                or end of file. See section 2.2.1. The last group may have no rules,
                which means it implicitly allows everything. </t>
          </list> </t>
      </section>
      <section anchor="formal-syntax" title="Formal Syntax">
        <t> Below is an Augmented Backus-Naur Form (ABNF) description, as described
            in <xref target="RFC5234"/>. </t>

        <figure><artwork>
  <![CDATA[
    robotstxt = *(group / emptyline)
    group = startgroupline                ; We start with a user-agent
           *(startgroupline / emptyline)  ; ... and possibly more
                                          ; user-agents
           *(rule / emptyline)            ; followed by rules relevant
                                          ; for UAs

    startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

    rule = *WS ("allow" / "disallow") *WS ":"
          *WS (path-pattern / empty-pattern) EOL

    ; parser implementors: add additional lines you need (for
    ; example, sitemaps), and be lenient when reading lines that don't
    ; conform. Apply Postel's law.

    product-token = identifier / "*"
    path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern
    empty-pattern = *WS

    identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A)
    comment = "#" *(UTF8-char-noctl / WS / "#")
    emptyline = EOL
    EOL = *WS [comment] NL ; end-of-line may have
                           ; optional trailing comment
    NL = %x0D / %x0A / %x0D.0A
    WS = %x20 / %x09

    ; UTF8 derived from RFC3629, but excluding control characters

    UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
    UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#'
    UTF8-2 = %xC2-DF UTF8-tail
    UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
             %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
    UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
             %xF4 %x80-8F 2UTF8-tail

    UTF8-tail = %x80-BF
    ]]>
        </artwork></figure>
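        <t> As a non-normative illustration, the grammar above could be
            approximated by a line-oriented parser such as the following Python
            sketch. The names LINE_RE and parse_robots_txt are hypothetical and
            not part of this specification. The sketch is deliberately lenient:
            it skips lines it cannot parse instead of rejecting the file, and it
            does not validate product tokens against the identifier rule. </t>
        <figure><artwork><![CDATA[
```python
import re

# One logical line: optional whitespace, a key, ":", a value without
# whitespace or "#", and an optional trailing "#" comment.
LINE_RE = re.compile(
    r"^[ \t]*([A-Za-z-]+)[ \t]*:[ \t]*([^#\x00-\x20]*)[ \t]*(?:#.*)?$"
)


def parse_robots_txt(text):
    """Return a list of (user_agents, rules) tuples, one per group.

    user_agents is a list of lower-cased product tokens; rules is a
    list of (kind, pattern) pairs, kind being "allow" or "disallow".
    """
    groups = []
    agents, rules = [], []
    # NL in the grammar is CR, LF, or CRLF.
    for raw in re.split(r"\r\n|\r|\n", text):
        match = LINE_RE.match(raw)
        if not match:
            continue  # empty line, comment, or unparseable line
        key, value = match.group(1).lower(), match.group(2)
        if key == "user-agent":
            if rules:
                # A user-agent line after rules terminates the group.
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif key in ("allow", "disallow") and agents:
            # Rules that precede the first user-agent line are ignored.
            rules.append((key, value))
    if agents:
        groups.append((agents, rules))  # last group may have no rules
    return groups
```
]]></artwork></figure>
        <t> Records with other keys (for example, a sitemap line) fall through
            both branches and are ignored by this sketch, so they do not
            interfere with the parsing of user-agent, allow, and disallow
            lines. </t>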
        <section anchor="the-user-agent-line" title="The User-Agent Line">
          <t> Crawlers set a product token to find relevant groups. The product token
              <bcp14>MUST</bcp14> contain only &quot;a-zA-Z_-&quot; characters. The
              product token <bcp14>SHOULD</bcp14> be part of the identification string
              that the crawler sends to the service (for example, in the case of HTTP,
              the product name <bcp14>SHOULD</bcp14> be in the user-agent header).
              The identification string <bcp14>SHOULD</bcp14> describe the purpose of
              the crawler. Here&#39;s an example of an HTTP header with a link
              pointing to a page describing the purpose of the ExampleBot crawler,
              which appears both in the HTTP header and as a product token: </t>

          <texttable title="Example of a user-agent header and user-agent robots.txt token for ExampleBot">
            <ttcol align="left">HTTP header</ttcol>
            <ttcol align="left">robots.txt user-agent line</ttcol>
            <c>user-agent: Mozilla/5.0 (compatible; ExampleBot/0.1; https://www.example.com/bot.html)</c>
            <c>user-agent: ExampleBot</c>
          </texttable>

          <t> Crawlers <bcp14>MUST</bcp14> find the group that matches the product token exactly,
              and then obey the rules of the group. If there is more than one
              group matching the user-agent, the matching groups' rules <bcp14>MUST</bcp14> be
              combined into one group. The matching <bcp14>MUST</bcp14> be case-insensitive. If
              no matching group exists, crawlers <bcp14>MUST</bcp14> obey the first group with a
              user-agent line with a "*" value, if present. If no group satisfies
              either condition, or no groups are present at all, no rules apply. </t>
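          <t> The selection steps in the preceding paragraph might be sketched
              as follows. This is an illustration only; the function name
              select_rules and the representation of a group as a
              (user_agents, rules) pair are assumptions made for the
              example. </t>
          <figure><artwork><![CDATA[
```python
def select_rules(groups, product_token):
    """Return the rules that apply to product_token.

    groups is a list of (user_agents, rules) pairs in file order.
    """
    token = product_token.lower()  # matching is case-insensitive
    matched = [
        rules
        for agents, rules in groups
        if token in (agent.lower() for agent in agents)
    ]
    if matched:
        # Combine the rules of every group naming this product token.
        return [rule for rules in matched for rule in rules]
    for agents, rules in groups:
        if "*" in agents:
            return list(rules)  # first group with a "*" user-agent
    return []  # no group applies, so no rules apply
```
]]></artwork></figure>
          <t> Note that a group that names the product token but contains no
              rules still counts as a match in this sketch, so the "*" group is
              not consulted in that case. </t>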
        </section>
        <section anchor="the-allow-and-disallow-lines" title="The Allow and Disallow Lines">
          <t> These lines indicate whether accessing a URI that matches the
              corresponding path is allowed or disallowed. </t>
          <t> To evaluate if access to a URI is allowed, a robot <bcp14>MUST</bcp14>
              match the paths in allow and disallow rules against the URI.
              The matching <bcp14>SHOULD</bcp14> be case sensitive. The most
              specific match found <bcp14>MUST</bcp14> be used. The most specific
              match is the match that has the most octets. If an allow rule and a
              disallow rule are equivalent, the allow rule <bcp14>SHOULD</bcp14> be used. If no
              match is found amongst the rules in a group for a matching user-agent,
              or there are no rules in the group, the URI is allowed. The
              /robots.txt URI is implicitly allowed. </t>
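          <t> A minimal, non-normative sketch of this evaluation is shown
              below. It uses plain prefix comparison only and handles neither
              special characters nor percent-encoding. The function name
              is_allowed is hypothetical, and skipping empty patterns is an
              assumption of the sketch, since the text above does not define
              their effect. </t>
          <figure><artwork><![CDATA[
```python
def is_allowed(rules, path):
    """Decide whether path may be accessed under rules.

    rules is a list of (kind, pattern) pairs for the matching group;
    kind is "allow" or "disallow".
    """
    if path == "/robots.txt":
        return True  # implicitly allowed
    best_length, allowed = -1, True  # no match means allowed
    for kind, pattern in rules:
        if not pattern or not path.startswith(pattern):
            continue  # assumption: an empty pattern matches nothing
        length = len(pattern.encode("utf-8"))  # specificity in octets
        is_allow = kind == "allow"
        # Longer match wins; on a tie the allow rule is preferred.
        if length > best_length or (length == best_length and is_allow):
            best_length, allowed = length, is_allow
    return allowed
```
]]></artwork></figure>
          <t> For instance, with the rules ("allow", "/example/page/") and
              ("disallow", "/example/page/disallowed.gif"), the path
              /example/page/disallowed.gif is disallowed because the disallow
              pattern has more octets, while /example/page/other.gif remains
              allowed. </t>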

          <t> Octets in the URI and robots.txt paths outside the range of the
              US-ASCII coded character set, and those in the reserved range defined
              by <xref target="RFC3986"/>, <bcp14>MUST</bcp14> be percent-encoded as
              defined by <xref target="RFC3986"></xref> prior to comparison. </t>

          <t> If a percent-encoded US-ASCII octet is encountered in the URI, it
              <bcp14>MUST</bcp14> be unencoded prior to comparison, unless it is a
              reserved character in the URI as defined by <xref target="RFC3986"/>
              or the character is outside the unreserved character range. The match
              evaluates positively if and only if the end of the path from the rule
              is reached before a difference in octets is encountered. </t>

          <t> For example: </t>

          <texttable title="Examples of matching percent-encoded URI components">
            <ttcol align='left'>Path</ttcol>
            <ttcol align='left'>Encoded Path</ttcol>
            <ttcol align='left'>Path to Match</ttcol>
            <c>/foo/bar?baz=quz</c>
            <c>/foo/bar?baz=quz</c>
            <c>/foo/bar?baz=quz</c>
            <c>/foo/bar?baz=http<br />://foo.bar</c>
            <c>/foo/bar?baz=http%3A<br />%2F%2Ffoo.bar</c>
            <c>/foo/bar?baz=http%3A<br />%2F%2Ffoo.bar</c>
            <c>/foo/bar/U+30C4</c>
            <c>/foo/bar/%E3%83%84</c>
            <c>/foo/bar/%E3%83%84</c>
            <c>/foo/bar/%E3%83%84</c>
            <c>/foo/bar/%E3%83%84</c>
            <c>/foo/bar/%E3%83%84</c>
            <c>/foo/bar/%62%61%7A</c>
            <c>/foo/bar/%62%61%7A</c>
            <c>/foo/bar/baz</c>
          </texttable>
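          <t> The normalization illustrated by the table could be approximated
              as in the following sketch. It is an illustration under stated
              assumptions: the name normalize_path is hypothetical, the sketch
              encodes only octets outside US-ASCII and decodes only
              percent-encoded unreserved characters, and it does not attempt to
              percent-encode reserved characters (as in the second row of the
              table), because deciding which occurrences are delimiters depends
              on context. Normalizing the remaining hex digits to upper case is
              also a choice of the sketch. </t>
          <figure><artwork><![CDATA[
```python
import re

# Unreserved characters of RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~".
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def normalize_path(path):
    """Return path in a form suitable for octet-by-octet comparison."""
    # Percent-encode every octet outside the US-ASCII range.
    encoded = "".join(
        chr(octet) if octet < 0x80 else "%{:02X}".format(octet)
        for octet in path.encode("utf-8")
    )

    def decode_unreserved(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                # "%62" becomes "b"
        return match.group(0).upper()  # stays encoded, upper-case hex

    return re.sub(r"%([0-9A-Fa-f]{2})", decode_unreserved, encoded)
```
]]></artwork></figure>
          <t> Applying the same function to both the URI path and the path
              from the rule before comparing them keeps the two sides
              consistent. </t>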

          <t> The crawler <bcp14>SHOULD</bcp14> ignore &quot;disallow&quot; and
              &quot;allow&quot; rules that are not in any group (for example, any
              rule that precedes the first user-agent line). </t>

          <t> Implementers <bcp14>MAY</bcp14> bridge encoding mismatches if they
              detect that the robots.txt file is not UTF-8 encoded. </t>
        </section>
        <section anchor="special-characters" title="Special Characters">
          <t> Crawlers <bcp14>SHOULD</bcp14> allow the following special characters: </t>

          <texttable title="List of special characters in robots.txt files">
            <ttcol align='left'>Character</ttcol>
            <ttcol align='left'>Description</ttcol>
            <ttcol align='left'>Example</ttcol>
            <c>&quot;#&quot;</c>
            <c>Designates an end of line comment.</c>
            <c>&quot;allow: / # comment in line&quot;<br /><br />&quot;# comment on its own line&quot;</c>
            <c>&quot;$&quot;</c>
            <c>Designates the end of the match pattern.</c>
            <c>&quot;allow: /this/path/exactly$&quot;</c>
            <c>&quot;*&quot;</c>
            <c>Designates 0 or more instances of any character.</c>
            <c>&quot;allow: /this/*/exactly&quot;</c>
          </texttable>
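          <t> One possible way to evaluate "*" and "$" is to translate a path
              pattern into a regular expression, as in the non-normative sketch
              below. The names pattern_to_regex and pattern_matches are
              hypothetical. The sketch treats "$" as special only at the very
              end of the pattern, which is an assumption; comment handling is
              expected to have happened during parsing. </t>
          <figure><artwork><![CDATA[
```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # "*" stands for zero or more characters; the rest is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without "$" the pattern is a prefix match; with it, an exact end.
    return re.compile(body + (r"\Z" if anchored else ""))


def pattern_matches(pattern, path):
    """Return True if path matches pattern from its first character."""
    return pattern_to_regex(pattern).match(path) is not None
```
]]></artwork></figure>
          <t> With this sketch, "/this/path/exactly$" matches
              /this/path/exactly but not /this/path/exactly/more, and
              "/this/*/exactly" matches /this/path/exactly as well as
              /this/a/b/exactly. </t>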

          <t> If crawlers match special characters verbatim in the URI, crawlers
              <bcp14>SHOULD</bcp14> use &quot;%&quot; encoding. For example: </t>

          <texttable title="Example of percent-encoding">
            <ttcol align='left'>Percent-encoded Pattern</ttcol>
            <ttcol align='left'>URI</ttcol>
            <c>/path/file-with-a-%2A.html</c>
            <c>https://www.example.com/path/file-with-a-*.html</c>
            <c>/path/foo-%24</c>
            <c>https://www.example.com/path/foo-$</c>
          </texttable>
        </section>
        <section anchor="other-records" title="Other Records">
          <t> Clients <bcp14>MAY</bcp14> interpret other records that are not
              part of the robots.txt protocol, for example &#39;sitemap&#39;
              <xref target="SITEMAPS"/>. Parsing of other records
              <bcp14>MUST NOT</bcp14> interfere with the parsing of explicitly
              defined records in section 2. </t>
        </section>
      </section>
      <section anchor="access-method" title="Access Method">
      <t> The rules <bcp14>MUST</bcp14> be accessible in a file named
          &quot;/robots.txt&quot; (all lower case) in the top level path of
          the service. The file <bcp14>MUST</bcp14> be UTF-8 encoded (as
          defined in <xref target="RFC3629"/>) and Internet Media Type
          &quot;text/plain&quot;
          (as defined in <xref target="RFC2046"/>). </t>
      <t> As per <xref target="RFC3986"/>, the URI of the robots.txt is: </t>
      <t> &quot;scheme:[//authority]/robots.txt&quot; </t>
      <t> For example, in the context of HTTP or FTP, the URIs are: </t>

      <figure>
        <artwork><![CDATA[
          http://www.example.com/robots.txt

          https://www.example.com/robots.txt

          ftp://ftp.example.com/robots.txt
          ]]></artwork>
      </figure>
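      <t> As an illustration, the robots.txt URI for an arbitrary resource
          URI can be derived by keeping the scheme and authority and
          replacing everything else. The sketch below uses Python's
          urllib.parse module; the function name robots_txt_uri is
          hypothetical. </t>
      <figure><artwork><![CDATA[
```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_uri(uri):
    """Return the robots.txt URI that governs the given resource URI."""
    parts = urlsplit(uri)
    # Keep scheme and authority; drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```
]]></artwork></figure>
      <t> The authority is kept as is, including any port, so resources
          served on different ports of the same host map to different
          robots.txt URIs in this sketch. </t>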

      <section anchor="access-results" title="Access Results">
        <section anchor="successful-access" title="Successful Access">
          <t> If the crawler successfully downloads the robots.txt, the
              crawler <bcp14>MUST</bcp14> follow the parseable rules. </t>
        </section>
        <section anchor="redirects" title="Redirects">
          <t> The server may respond to a robots.txt fetch request with a
              redirect, such as HTTP 301 and HTTP 302. The crawlers
              <bcp14>SHOULD</bcp14> follow at least five consecutive
              redirects, even across authorities (for example, hosts in case
              of HTTP), as defined in <xref target="RFC1945"/>. </t>
          <t> If a robots.txt file is reached within five consecutive
              redirects, the robots.txt file <bcp14>MUST</bcp14> be fetched,
              parsed, and its rules followed in the context of the initial
              authority. </t>
          <t> If there are more than five consecutive redirects, crawlers
              <bcp14>MAY</bcp14> assume that the robots.txt is
              unavailable. </t>
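          <t> The redirect handling described in this section might look
              like the following sketch. It is not a definitive
              implementation: the fetch callable, which is assumed to return
              a (status, location, body) tuple for a URI, is hypothetical, as
              are the other names. Treating 303, 307, and 308 as redirects in
              addition to 301 and 302 is also an assumption of the
              sketch. </t>
          <figure><artwork><![CDATA[
```python
from urllib.parse import urljoin

MAX_REDIRECTS = 5
REDIRECT_CODES = {301, 302, 303, 307, 308}  # assumption, see lead-in


def fetch_following_redirects(uri, fetch):
    """Fetch uri, following at most MAX_REDIRECTS consecutive redirects.

    fetch(uri) is a hypothetical callable returning a tuple
    (status, location, body). Returns (status, body) of the final
    response, or (None, None) if the redirect limit is exceeded.
    """
    # One initial request plus up to MAX_REDIRECTS followed redirects.
    for _ in range(MAX_REDIRECTS + 1):
        status, location, body = fetch(uri)
        if status not in REDIRECT_CODES or not location:
            return status, body
        # Redirect targets may be relative or on another authority.
        uri = urljoin(uri, location)
    return None, None  # caller may treat robots.txt as unavailable
```
]]></artwork></figure>
          <t> The caller still applies the returned rules to the authority it
              started from, not to the authority that finally served the
              file. </t>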
        </section>
        <section anchor="unavailable-status" title="Unavailable Status">
          <t> Unavailable means the crawler tries to fetch the robots.txt,
              and the server responds with unavailable status codes. For
              example, in the context of HTTP, unavailable status codes are
              in the 400-499 range. </t>

          <t> If a server status code indicates that the robots.txt file is
              unavailable to the client, then crawlers <bcp14>MAY</bcp14> access any
              resources on the server. </t>
        </section>
        <section anchor="unreachable-status" title="Unreachable Status">
          <t> If the robots.txt is unreachable due to server or network
              errors, this means the robots.txt is undefined and the crawler
              <bcp14>MUST</bcp14> assume complete disallow. For example, in
              the context of HTTP, an unreachable robots.txt has a response
              code in the 500-599 range. For other undefined status codes,
              the crawler <bcp14>MUST</bcp14> assume the robots.txt is
              unreachable. </t>
          <t> If the robots.txt is undefined for a reasonably long period of
              time (for example, 30 days), clients <bcp14>MAY</bcp14> assume
              the robots.txt is unavailable or continue to use a cached
              copy. </t>
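          <t> For HTTP, the access results described so far can be summarized
              as a mapping from the final status code to a crawler behavior.
              The sketch below is illustrative only; the function name, the
              returned labels, the use of None for a network error, and the
              treatment of the 200-299 range as success are assumptions.
              Redirects are expected to have been resolved before this
              function is called. </t>
          <figure><artwork><![CDATA[
```python
def classify_fetch(status):
    """Map the final HTTP status of a robots.txt fetch to a behavior.

    status is an int, or None when the server could not be reached.
    """
    if status is not None and 200 <= status <= 299:
        return "parse"         # successful access: follow the rules
    if status is not None and 400 <= status <= 499:
        return "allow_all"     # unavailable: any resource may be accessed
    # 500-599, network errors, and other undefined codes: unreachable.
    return "disallow_all"
```
]]></artwork></figure>
          <t> A crawler that has seen "disallow_all" for a reasonably long
              period could then switch to treating the file as unavailable or
              fall back to a cached copy, as described above. </t>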
        </section>
        <section anchor="parsing-errors" title="Parsing Errors">
          <t> Crawlers <bcp14>SHOULD</bcp14> try to parse each line of the
              robots.txt file. Crawlers <bcp14>MUST</bcp14> use the parseable
              rules. </t>

        </section>
      </section>
    </section>
      <section anchor="caching" title="Caching">
      <t> Crawlers <bcp14>MAY</bcp14> cache the fetched robots.txt file&#39;s
          contents. Crawlers <bcp14>MAY</bcp14> use standard cache control as
          defined in <xref target="RFC2616"/>. Crawlers
          <bcp14>SHOULD NOT</bcp14> use the cached version for more than 24
          hours, unless the robots.txt is unreachable. </t>
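      <t> The 24-hour guideline could be checked with a small helper such as
          the one below. This is a sketch under assumptions: the names are
          hypothetical, timestamps are seconds since the epoch, and HTTP
          cache-control directives are not taken into account. </t>
      <figure><artwork><![CDATA[
```python
import time

MAX_CACHE_AGE = 24 * 60 * 60  # 24 hours, in seconds


def may_use_cached_copy(fetched_at, now=None, currently_unreachable=False):
    """Return True if a cached robots.txt may still be used.

    fetched_at and now are timestamps in seconds since the epoch.
    """
    if now is None:
        now = time.time()
    # The age limit is lifted while the robots.txt is unreachable.
    return currently_unreachable or (now - fetched_at) <= MAX_CACHE_AGE
```
]]></artwork></figure>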
    </section>
      <section anchor="limits" title="Limits">
      <t> Crawlers <bcp14>MAY</bcp14> impose a parsing limit that
      <bcp14>MUST</bcp14> be at least 500 kibibytes (KiB). </t>
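      <t> For illustration, 500 kibibytes is 512,000 octets. The sketch below
          shows one way a crawler might apply such a limit before parsing.
          The names are hypothetical, and cutting back to the last complete
          line so that no partial rule is parsed is a design choice of the
          sketch, not a requirement of this document. </t>
      <figure><artwork><![CDATA[
```python
MIN_PARSE_LIMIT = 500 * 1024  # 500 KiB = 512,000 octets


def truncate_for_parsing(body, limit=MIN_PARSE_LIMIT):
    """Return at most limit octets of body, ending on a line boundary."""
    if limit < MIN_PARSE_LIMIT:
        raise ValueError("parsing limit must be at least 500 KiB")
    if len(body) <= limit:
        return body
    head = body[:limit]
    # Drop a trailing partial line, if a line ending exists at all.
    cut = max(head.rfind(b"\n"), head.rfind(b"\r"))
    return head[:cut + 1] if cut >= 0 else head
```
]]></artwork></figure>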
    </section>
    </section>
    <section anchor="security" title="Security Considerations">
      <t> The Robots Exclusion Protocol is not a substitute for valid
          content security measures. Listing URIs in the robots.txt file
          exposes them publicly and thus makes the URIs discoverable. </t>
    </section>
    <section anchor="IANA" title="IANA Considerations">
      <t> This document has no actions for IANA. </t>
    </section>
    <section anchor="examples" title="Examples">
      <section anchor="simple-example" title="Simple Example">
      <t> The following example shows: </t>
      <t>
        <list style="symbols">
          <t> foobot: A regular case. A single user-agent token followed
              by rules. </t>
          <t> barbot and bazbot: A group that&#39;s relevant for more
              than one user-agent. </t>
          <t> quxbot: An empty group at end of the file. </t>
        </list>
      </t>
      <figure>
        <artwork><![CDATA[
          User-Agent : foobot
          Disallow : /example/page.html
          Disallow : /example/disallowed.gif

          User-Agent : barbot
          User-Agent : bazbot
          Allow : /example/page.html
          Disallow : /example/disallowed.gif

          User-Agent: quxbot

          EOF
        ]]></artwork>
      </figure>
    </section>
      <section anchor="longest-match" title="Longest Match">
      <t> The following example shows that in the case of two matching rules, the
          longest one is used for matching. In the following case, the rule
          /example/page/disallowed.gif <bcp14>MUST</bcp14> be used for
          the URI example.com/example/page/disallowed.gif. </t>
      <figure>
        <artwork><![CDATA[
          User-Agent : foobot
          Allow : /example/page/
          Disallow : /example/page/disallowed.gif
        ]]></artwork>
      </figure>
    </section>
    </section>
  </middle>

  <back>
    <references title='Normative References'>
      &RFC1945;
      &RFC2046;
      &RFC2119;
      &RFC2616;
      &RFC3629;
      &RFC3986;
      &RFC5234;
      &RFC8174;
      &RFC8288;
    </references>

    <references title='Informative References'>
      <reference anchor="ROBOTSTXT" target="http://www.robotstxt.org/">
        <front>
          <title>Robots Exclusion Protocol</title>
          <author >
            <organization></organization>
          </author>
          <date year="n.d."/>
        </front>
      </reference>
      <reference anchor="SITEMAPS" target="https://www.sitemaps.org/index.html">
        <front>
          <title>Sitemaps Protocol</title>
          <author >
            <organization></organization>
          </author>
          <date year="n.d."/>
        </front>
      </reference>
    </references>
  </back>
</rfc>
