Robots Exclusion Protocol

Abstract

This document specifies and extends the "Robots Exclusion Protocol"
method originally defined by Martijn Koster in 1994 for service owners
to control how content served by their services may be accessed, if at
all, by automatic clients known as crawlers. Specifically, it adds
definition language for the protocol, instructions for handling
errors, and instructions for caching.

Status of This Memo
This is an Internet Standards Track document.
This document is a product of the Internet Engineering Task Force
(IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by
the Internet Engineering Steering Group (IESG). Further
information on Internet Standards is available in Section 2 of
RFC 7841.
Information about the current status of this document, any
errata, and how to provide feedback on it may be obtained at
https://www.rfc-editor.org/info/rfc9309.
Copyright Notice
Copyright (c) 2022 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with
respect to this document. Code Components extracted from this
document must include Revised BSD License text as described in
Section 4.e of the Trust Legal Provisions and are provided without
warranty as described in the Revised BSD License.
Table of Contents
1.  Introduction
  1.1.  Requirements Language
2.  Specification
  2.1.  Protocol Definition
  2.2.  Formal Syntax
    2.2.1.  The User-Agent Line
    2.2.2.  The "Allow" and "Disallow" Lines
    2.2.3.  Special Characters
    2.2.4.  Other Records
  2.3.  Access Method
    2.3.1.  Access Results
      2.3.1.1.  Successful Access
      2.3.1.2.  Redirects
      2.3.1.3.  "Unavailable" Status
      2.3.1.4.  "Unreachable" Status
      2.3.1.5.  Parsing Errors
  2.4.  Caching
  2.5.  Limits
3.  Security Considerations
4.  IANA Considerations
5.  Examples
  5.1.  Simple Example
  5.2.  Longest Match
6.  References
  6.1.  Normative References
  6.2.  Informative References
Authors' Addresses
1.  Introduction

This document applies to services that provide resources that clients
can access through URIs as defined in [RFC3986].  For example,
in the context of HTTP, a browser is a client that displays the content of a
web page. Crawlers are automated clients. Search engines, for instance, have crawlers to
recursively traverse links for indexing as defined in [RFC8288].
It may be inconvenient for service owners if crawlers visit the entirety of
their URI space. This document specifies the rules originally defined by
the "Robots Exclusion Protocol" that crawlers
are requested to honor when accessing URIs. These rules are not a form of access authorization. Requirements LanguageThe key words "MUST", "MUST NOT",
"REQUIRED", "SHALL",
"SHALL NOT", "SHOULD",
"SHOULD NOT",
"RECOMMENDED", "NOT RECOMMENDED",
"MAY", and "OPTIONAL" in this document
are to be interpreted as described in BCP 14 [RFC2119] [RFC8174]
when, and only when, they appear in all capitals, as shown here.

2.  Specification

2.1.  Protocol Definition

The protocol language consists of rule(s) and group(s) that the service
makes available in a file named "robots.txt" as described in
Section 2.3:
Rule:
A line with a key-value pair that defines how a
crawler may access URIs. See
Section 2.2.2.
Group:
One or more user-agent lines that are followed by
one or more rules. The group is terminated by a user-agent line
or end of file.  See Section 2.2.1.
The last group may have no rules, which means it implicitly
allows everything.
2.2.  Formal Syntax

Below is an Augmented Backus-Naur Form (ABNF) description, as described
in [RFC5234].
robotstxt = *(group / emptyline)
group = startgroupline ; We start with a user-agent
; line
*(startgroupline / emptyline) ; ... and possibly more
; user-agent lines
*(rule / emptyline) ; followed by rules relevant
; for the preceding
; user-agent lines
startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL
rule = *WS ("allow" / "disallow") *WS ":"
*WS (path-pattern / empty-pattern) EOL
; parser implementors: define additional lines you need (for
; example, Sitemaps).
product-token = identifier / "*"
path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern
empty-pattern = *WS
identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A)
comment = "#" *(UTF8-char-noctl / WS / "#")
emptyline = EOL
EOL = *WS [comment] NL ; end-of-line may have
; optional trailing comment
NL = %x0D / %x0A / %x0D.0A
WS = %x20 / %x09
; UTF8 derived from RFC 3629, but excluding control characters
UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, "#"
UTF8-2 = %xC2-DF UTF8-tail
UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
%xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
%xF4 %x80-8F 2UTF8-tail
UTF8-tail = %x80-BF
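The grammar above maps naturally onto a small line-oriented parser.  The
following non-normative sketch (in Python; the Group structure and the
function name are illustrative, not part of the protocol) collects
user-agent lines and their rules into groups, strips comments, and skips
unparseable lines:

   import re
   from dataclasses import dataclass, field

   @dataclass
   class Group:
       user_agents: list = field(default_factory=list)  # lowercased product tokens
       rules: list = field(default_factory=list)        # (allow: bool, path_pattern: str)

   LINE_RE = re.compile(r"^\s*([A-Za-z-]+)\s*:\s*(.*)$")

   def parse_robotstxt(text):
       groups, current, expecting_rules = [], None, False
       for raw_line in text.splitlines():
           line = raw_line.split("#", 1)[0].strip()   # drop trailing comment
           if not line:
               continue                               # emptyline
           m = LINE_RE.match(line)
           if not m:
               continue                               # unparseable line: skip it
           key, value = m.group(1).lower(), m.group(2).strip()
           if key == "user-agent":
               if current is None or expecting_rules:
                   current = Group()                  # a startgroupline begins a new group
                   groups.append(current)
                   expecting_rules = False
               current.user_agents.append(value.lower())
           elif key in ("allow", "disallow"):
               if current is not None:                # rules before the first user-agent line are ignored
                   current.rules.append((key == "allow", value))
                   expecting_rules = True
           # other records (e.g., "sitemap") could be handled here; they
           # MUST NOT terminate the group
       return groups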
2.2.1.  The User-Agent Line

Crawlers set their own name, which is called a product token, to find
relevant groups. The product token MUST contain only
uppercase and lowercase letters ("a-z" and "A-Z"),
underscores ("_"), and hyphens ("-").
The product token SHOULD
be a substring of the identification string that the crawler sends to
the service. For example, in the case of HTTP
[RFC9110], the product token
SHOULD be a substring in the User-Agent header.
The identification string SHOULD describe the purpose of
the crawler. Here's an example of a User-Agent HTTP request header
with a link pointing to a page describing the purpose of the
ExampleBot crawler, which appears as a substring in the User-Agent HTTP
header and as a product token in the robots.txt user-agent line:
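
   User-Agent: Mozilla/5.0 (compatible; ExampleBot/0.1;
                            https://www.example.com/bot.html)

Note that the product token (ExampleBot) is a substring of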
the User-Agent HTTP header. Crawlers MUST use case-insensitive matching
to find the group that matches the product token and then
obey the rules of the group. If there is more than one
group matching the user-agent, the matching groups' rules
MUST be combined into one group and parsed
according to
Section 2.2.2.  If no matching group exists, crawlers MUST obey the group
with a user-agent line with the "*" value, if present.

If no group matches the product token and there is no group with a
user-agent line with the "*" value, or no groups are present at all,
no rules apply.
2.2.2.  The "Allow" and "Disallow" Lines

These lines indicate whether accessing a URI that matches the
corresponding path is allowed or disallowed.

To evaluate if access to a URI is allowed, a crawler MUST
match the paths in "allow" and "disallow" rules against the URI.
The matching SHOULD be case sensitive. The matching
MUST start with the first octet of the path. The most
specific match found MUST be used. The most specific
match is the match that has the most octets. Duplicate rules in a
group MAY be deduplicated. If an "allow" rule and a "disallow"
rule are equivalent, then the "allow" rule SHOULD be used. If no
match is found amongst the rules in a group for a matching user-agent
or there are no rules in the group, the URI is allowed. The
/robots.txt URI is implicitly allowed. Octets in the URI and robots.txt paths outside the range of the
ASCII coded character set, and those in the reserved range defined
by [RFC3986], MUST be percent-encoded as defined by [RFC3986] prior to
comparison.

If a percent-encoded ASCII octet is encountered in the URI, it
MUST be unencoded prior to comparison, unless it is a
reserved character in the URI as defined by
[RFC3986] or the character is outside the unreserved character range.  The match
evaluates positively if and only if the end of the path from the rule
is reached before a difference in octets is encountered.  For example,
the rule path "/foo" matches the URI paths "/foo", "/foobar", and
"/foo/bar.html", but it does not match "/bar/foo".

The crawler SHOULD ignore "disallow" and
"allow" rules that are not in any group (for example, any
rule that precedes the first user-agent line). Implementors MAY bridge encoding mismatches if they
detect that the robots.txt file is not UTF-8 encoded.

2.2.3.  Special Characters

Crawlers MUST support the following special characters:

   "#"  Designates an end-of-line comment (see "comment" in the formal
        syntax), for example, "allow: / # comment in line".

   "$"  Designates the end of the match pattern, for example,
        "allow: /this/path/exactly$".

   "*"  Designates 0 or more instances of any character, for example,
        "allow: /this/*/exactly".

If crawlers match special characters verbatim in the URI, crawlers
SHOULD use "%" encoding.  For example, a pattern intended to match a
literal "*" or "$" in the URI would be written as
"/path/file-with-a-%2A.html" or "/path/foo-%24".
2.2.4.  Other Records

Crawlers MAY interpret other records that are not
part of the robots.txt protocol -- for example, "Sitemaps"
[SITEMAPS].  Crawlers MAY be lenient when
interpreting other records. For example, crawlers may accept
common misspellings of the record. Parsing of other records
MUST NOT interfere with the parsing of explicitly
defined records in Section 2.
For example, a "Sitemaps" record MUST NOT terminate a
group.

2.3.  Access Method

The rules MUST be accessible in a file named
"/robots.txt" (all lowercase) in the top-level path of
the service. The file MUST be UTF-8 encoded (as
defined in [RFC3629]) and Internet Media Type
"text/plain"
(as defined in [RFC2046]).

As per [RFC3986], the URI of the robots.txt file is:

   "scheme:[//authority]/robots.txt"

For example, in the context of HTTP or FTP, the URI is:
https://www.example.com/robots.txt
ftp://ftp.example.com/robots.txt
2.3.1.  Access Results

2.3.1.1.  Successful Access

If the crawler successfully downloads the robots.txt file, the
crawler MUST follow the parseable rules.

2.3.1.2.  Redirects

It's possible that a server responds to a robots.txt fetch
request with a redirect, such as HTTP 301 or HTTP 302 in the
case of HTTP. The crawlers SHOULD follow at
least five consecutive redirects, even across authorities
(for example, hosts in the case of HTTP). If a robots.txt file is reached within five consecutive
redirects, the robots.txt file MUST be fetched,
parsed, and its rules followed in the context of the initial
authority. If there are more than five consecutive redirects, crawlers
MAY assume that the robots.txt file is
unavailable. "Unavailable" Status "Unavailable" means the crawler tries to fetch the robots.txt file
and the server responds with status codes indicating that the resource in question is unavailable. For
example, in the context of HTTP, such status codes are
in the 400-499 range. If a server status code indicates that the robots.txt file is
unavailable to the crawler, then the crawler MAY access any
resources on the server.

2.3.1.4.  "Unreachable" Status

If the robots.txt file is unreachable due to server or network
errors, this means the robots.txt file is undefined and the crawler
MUST assume complete disallow. For example, in
the context of HTTP, server errors are identified by status codes
in the 500-599 range. If the robots.txt file is undefined for a reasonably long period of
time (for example, 30 days), crawlers MAY assume that
the robots.txt file is unavailable as defined in
Section 2.3.1.3 or continue to use a cached
copy.
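Taken together, the access results above suggest fetch handling along
these lines.  The sketch below uses the third-party Python "requests"
library purely for illustration; the "https" scheme, the function name,
and the return conventions (rules text, empty string for "unavailable",
None for "unreachable") are assumptions, and the 30-day cached-copy
fallback is omitted:

   import requests  # third-party HTTP client, used here only for illustration

   def fetch_robotstxt(authority, timeout=10):
       """Return the rules text, "" (everything allowed) when the file is
       unavailable (4xx), or None (complete disallow) when unreachable."""
       session = requests.Session()
       session.max_redirects = 5          # follow at most five consecutive redirects
       try:
           response = session.get(f"https://{authority}/robots.txt",
                                  timeout=timeout, allow_redirects=True)
       except requests.TooManyRedirects:
           return ""                      # MAY assume unavailable: allow everything
       except requests.RequestException:
           return None                    # unreachable: assume complete disallow
       if response.status_code >= 500:
           return None                    # server error: unreachable
       if response.status_code >= 400:
           return ""                      # unavailable: MAY access any resource
       return response.text               # successful access: parse and follow rules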
2.3.1.5.  Parsing Errors

Crawlers MUST try to parse each line of the
robots.txt file.  Crawlers MUST use the parseable
rules.

2.4.  Caching

Crawlers MAY cache the fetched robots.txt file's
contents. Crawlers MAY use standard cache control as
defined in [RFC9111].  Crawlers
SHOULD NOT use the cached version for more than 24
hours, unless the robots.txt file is unreachable.

2.5.  Limits

Crawlers SHOULD impose a parsing limit to protect their systems;
see Section 3.  The parsing limit MUST be at least 500 kibibytes [KiB].

3.  Security Considerations

The Robots Exclusion Protocol is not a substitute for valid
content security measures. Listing paths in the robots.txt file
exposes them publicly and thus makes the paths discoverable. To
control access to the URI paths in a robots.txt file, users of
the protocol should employ a valid security measure relevant to
the application layer on which the robots.txt file is served --
for example, in the case of HTTP, HTTP Authentication as defined in
[RFC9110].

To protect against attacks against their system, implementors
of robots.txt parsing and matching logic should take the
following considerations into account:
Memory management:
Section 2.5 defines the lower
limit of bytes that must be processed, which inherently also
protects the parser from out-of-memory scenarios.
Invalid characters:
Section 2.2 defines
a set of characters that parsers and matchers can expect in
robots.txt files. Out-of-bound characters should be rejected
as invalid, which limits the available attack vectors that
attempt to compromise the system.
Untrusted content:
Implementors should treat the content of
a robots.txt file as untrusted content, as defined by the
specification of the application layer used. For example,
in the context of HTTP, implementors should follow the
Security Considerations section of
[RFC9110].
4.  IANA Considerations

This document has no IANA actions.

5.  Examples

5.1.  Simple Example

The following example shows:
*:
A group that's relevant to all user agents that
don't have an explicitly defined matching group. It allows
access to the URLs with the /publications/ path prefix, and it
restricts access to the URLs with the /example/ path prefix
and to all URLs with a .gif suffix. The "*" character designates
any character, including the otherwise-required forward
slash; see Section 2.2.3.
foobot:
A regular case. A single user agent followed
by rules. The crawler only has access to two URL path
prefixes on the site -- /example/page.html and
/example/allowed.gif. The rules of the group are missing
the optional space character, which is acceptable as
defined in Section 2.2.
barbot and bazbot:
A group that's relevant for more
than one user agent. The crawlers are not allowed to access
the URLs with the /example/page.html path prefix but
otherwise have unrestricted access to the rest of the URLs
on the site.
quxbot:
An empty group at the end of the file. The crawler has
unrestricted access to the URLs on the site.
User-Agent: *
Disallow: *.gif$
Disallow: /example/
Allow: /publications/

User-Agent: foobot
Disallow:/
Allow:/example/page.html
Allow:/example/allowed.gif

User-Agent: barbot
User-Agent: bazbot
Disallow: /example/page.html

User-Agent: quxbot

EOF
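Assuming the illustrative parse_robotstxt, select_rules, and is_allowed
sketches shown earlier (hypothetical helpers, not part of the protocol),
a crawler identifying itself as "foobot" would evaluate the file above
like this:

   groups = parse_robotstxt(example_text)           # example_text holds the file above
   rules = select_rules(groups, "FooBot")           # case-insensitive product token match
   print(is_allowed(rules, "/example/page.html"))   # True: explicitly allowed
   print(is_allowed(rules, "/example/other.html"))  # False: "Disallow:/" is the longest match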
5.2.  Longest Match

The following example shows that in the case of two rules, the
longest one is used for matching. In the following case,
/example/page/disallowed.gif MUST be used for
the URI example.com/example/page/disallowed.gif.
User-Agent: foobot
Allow: /example/page/
Disallow: /example/page/disallowed.gif
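With the illustrative is_allowed sketch shown earlier, both rules match
this URI path and the longer "disallow" rule prevails:

   rules = [(True, "/example/page/"), (False, "/example/page/disallowed.gif")]
   print(is_allowed(rules, "/example/page/disallowed.gif"))  # False: 28-octet rule wins
   print(is_allowed(rules, "/example/page/other.gif"))       # True: only the 14-octet allow rule matches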
6.  References

6.1.  Normative References

   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part Two: Media Types", RFC 2046,
              November 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, January 2005.

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for
              Syntax Specifications: ABNF", STD 68, RFC 5234,
              January 2008.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in
              RFC 2119 Key Words", BCP 14, RFC 8174, May 2017.

   [RFC8288]  Nottingham, M., "Web Linking", RFC 8288, October 2017.

   [RFC9110]  Fielding, R., Ed., Nottingham, M., Ed., and J. Reschke,
              Ed., "HTTP Semantics", STD 97, RFC 9110, June 2022.

   [RFC9111]  Fielding, R., Ed., Nottingham, M., Ed., and J. Reschke,
              Ed., "HTTP Caching", STD 98, RFC 9111, June 2022.

6.2.  Informative References

   [KiB]      "Kibibyte", Simple English Wikipedia, the free
              encyclopedia.

   [ROBOTSTXT]
              "The Web Robots Pages (including /robots.txt)", 2007.

   [SITEMAPS] "What are Sitemaps? (Sitemap protocol)", April 2020.

Authors' Addresses

   Martijn Koster
   Stalworthy Manor Farm
   Suton Lane
   Wymondham, Norfolk
   NR18 9JG
   United Kingdom
   Email: m.koster@greenhills.co.uk

   Gary Illyes
   Google LLC
   Brandschenkestrasse 110
   8002 Zürich
   Switzerland
   Email: garyillyes@google.com

   Henner Zeller
   Google LLC
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   United States of America
   Email: henner@google.com

   Lizzi Sassman
   Google LLC
   Brandschenkestrasse 110
   8002 Zürich
   Switzerland
   Email: lizzi@google.com