What is the robots.txt file and how does it work?

Author: HOSTTEST Editorial   | 6 Oct 2021

Major search engines like Google and Microsoft Bing, as well as smaller providers such as DuckDuckGo, permanently and automatically crawl the World Wide Web (WWW) and parts of the connected Internet with special programs (crawlers), indexing and evaluating the content they find. There are several ways to control their behaviour - one of the most important and versatile alongside .htaccess is the robots.txt file, which allows you to define precise instructions. These instructions can either apply to all clients or address individual search engines with their own settings. Due to its function and the available options, the robots.txt file plays an important role in SEO and can also be used, for example, to separate parts of a website from others or to hide certain files from search engines.

What does the robots.txt file consist of?
What task does the robots.txt file perform?
How does a robots.txt file work and what is its impact?
What should be considered when creating a robots.txt file?
What does a robots.txt file look like?

What does the robots.txt file consist of?

The robots.txt is a simple text file containing instructions in a readable form. It can therefore be created easily with a simple text editor such as gedit or Mousepad on Linux or Notepad on Microsoft Windows. The content consists of multiple lines, which can either refer to a single crawler like the Googlebot or apply universally to all visitors. Each entry contains at least two pieces of information on separate lines: the first line defines which crawlers the following instructions apply to, and the lines after it specify how those crawlers should crawl and index the website.
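A minimal entry of this kind might look as follows - a hedged sketch in which the path /internal/ is only a placeholder:

User-agent: *
# The following rules apply to all crawlers
Disallow: /internal/
# No crawler should retrieve the /internal/ directory or anything below it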

What task does the robots.txt file perform?

Generally, the robots.txt file offers four different options that can be combined:

  • Allow: Permission to browse specified parts of a website
  • Disallow: Blocking access to specific paths or files
  • Sitemap: Reference to an external XML file listing the pages of a website that should be crawled
  • Crawl-delay: Minimum delay between requests to individual subpages (only supported by some crawlers)

The purpose of a robots.txt file is to let the owner of a website control the traffic generated by search engine crawlers. This is particularly useful for large sites or sites with a highly branched structure, but it can also have a positive effect on a small web hosting package or a low-powered virtual server. Furthermore, the robots.txt file is useful for specifically excluding large files such as videos or other multimedia content from being retrieved by search engines, reducing bandwidth use and keeping traffic to a minimum. Since crawlers such as the Googlebot operate indiscriminately - that is, they neither prioritise nor exclude content on their own - the robots.txt file provides a convenient way to steer them. In addition, a sitemap can assign priorities to individual subpages so that, for example, frequently changing content is crawled and indexed more often and more quickly than static information.
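As an illustration, the following hedged sketch keeps a hypothetical media directory out of the crawl and points to a sitemap - the paths and the domain are placeholders:

User-agent: *
# Applies to all crawlers
Disallow: /videos/
# Keeps bandwidth-heavy multimedia content from being retrieved
Sitemap: https://www.example.org/sitemap.xml
# Refers crawlers to an XML sitemap listing the subpages of the site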

How does a robots.txt file work and what is its impact?

When a search engine calls up a website through one of its crawlers, the crawler automatically follows every identifiable link and retrieves the underlying content in order to analyse and evaluate it according to its own algorithm. The first file it tries to retrieve is a robots.txt in the root directory - that is, directly at the top level of a domain - to obtain information about the desired behaviour. For this reason, it must be saved directly under the website address and can be found, for example, at www.example.org/robots.txt or example.com/robots.txt.

If a web hosting package does not give the user access to this area - for example, because it uses a structure such as https://provider.com/customer - a robots.txt unfortunately cannot be used. However, it is possible to register a separate domain and point it to this web space as a redirection. In that case the settings only affect that presence: if example.com points to http://provider.com/example, the robots.txt applies to the first address but not to the second.

Furthermore, it is important to note that robots.txt is not an official, binding standard, but an independently developed Robots Exclusion Standard that international corporations such as Google, Microsoft, and Yahoo agreed to support in the summer of 2008. Compliance with the specified rules is purely voluntary, although nowadays all major companies respect them. It must therefore be stated clearly that a robots.txt file is not an effective barrier against all search engines, and certainly not against external access with malicious intent. In addition, each crawler, such as the Googlebot or the Bingbot used by Microsoft, is programmed differently and does not necessarily support every directive beyond Disallow. For example, Crawl-delay is not evaluated by the Googlebot, and some search engines such as the Russian Yandex or the Chinese Baidu and Sogou ignore Allow rules and only interpret Disallow.
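To illustrate these differences, a hedged sketch: a Crawl-delay is placed in a separate block for the Bingbot, which evaluates the directive, while the Googlebot simply ignores it (the path is a placeholder):

User-agent: Bingbot
Crawl-delay: 10
# Bingbot should wait about 10 seconds between two requests
Disallow: /temp/

User-agent: Googlebot
Disallow: /temp/
# The Googlebot does not evaluate Crawl-delay, so only the Disallow rule takes effect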

What should be considered when creating a robots.txt file?

While a robots.txt file can be created with any text editor, for maximum compatibility it is advisable to use the Linux convention, which differs from that of Microsoft Windows mainly in the character used for a line break and is supported by free editors such as Notepad++.

The file itself consists of one or more paragraphs, separated by a blank line, containing instructions for specific crawlers. Each paragraph starts with the User-agent: statement, which defines precisely which bot the instructions are intended for. The most common legitimate crawlers active on the Internet and the World Wide Web are:

  • *: This placeholder (wildcard) stands for all crawlers
  • Googlebot: the most common and active crawler
  • Bingbot: the crawler used by Microsoft since 2010 instead of msnbot
  • Slurp: crawler used by Yahoo mainly for mobile search indexing
  • DuckDuckBot: crawler of the privacy-focused search engine DuckDuckGo
  • Baiduspider: crawler of the largest Chinese search engine Baidu
  • YandexBot: used by the Russian search engine Yandex
  • FaceBot: crawler of Facebook, only active when following links outside the platform
  • ia_archiver: from Amazon Alexa, mainly collects statistical information

In addition to these "official" crawlers, which adhere to the robots.txt guidelines (as far as they support them), there are also operators who do not care about such rules or deliberately ignore them. Examples include PetalBot or DotBot - to block them reliably, the detour via a .htaccess file is necessary, which rejects or redirects crawlers based on their user-agent identification.
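On an Apache web server with mod_rewrite enabled - an assumption, as other servers use different mechanisms - such a block might look roughly like the following sketch, using the bot names mentioned above:

RewriteEngine On
# Match the user-agent string of unwanted bots (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} (PetalBot|DotBot) [NC]
# Answer their requests with "403 Forbidden" instead of delivering content
RewriteRule .* - [F,L]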

What does a robots.txt look like?

Each robots.txt consists of one or more blocks, which a crawler reads from top to bottom, applying the rules addressed to it. As in many programming languages, the hash symbol # allows comments to be inserted that are not interpreted as instructions. Examples of a robots.txt look like this:

Example 1:

User-agent: *
# The following rules apply to all crawlers
Disallow: /private/
# Prohibits access to the /private directory and all subdirectories
Allow: /website/
# Explicitly allows access to /website and all subdirectories

Example 2:

User-agent: Googlebot
User-agent: Bingbot
# This block applies only to the Googlebot and the Bingbot
Allow: /website/
# Google and Bing may crawl /website/ and all its subdirectories
Disallow: /website/private
# Blocks all directories or files under /website/ whose names start with private

User-agent: *
# Excludes all other bots
Disallow: /
# Prohibits access to the entire domain

There are some specific rules that must be followed in a robots.txt:

  • The robots.txt file must be located in the root directory of a domain
  • Directive names such as Disallow are not case-sensitive, but the paths they refer to are
  • Bots must be named exactly and explicitly
  • Spaces in a line are only allowed after the :
  • Each robots.txt file may contain a maximum of one block for all crawlers (*)
  • The first entry applicable to a crawler will be evaluated
  • A Disallow: without further specification allows everything
  • Wildcards such as * are supported by some, but not all crawlers
  • The entry /private/ refers to a directory, while /private refers to all directories and files whose names start with private, as the example after this list shows
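A hedged sketch of this difference and of a wildcard rule - the paths are placeholders, and not every crawler evaluates the * and $ characters:

User-agent: *
Disallow: /private/
# Matches only the directory /private/ and everything below it
Disallow: /intern
# Also matches /internal.html, /intern-2021/ and similar paths
Disallow: /*.pdf$
# Wildcard rule: blocks all PDF files (only understood by some crawlers)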

As it is a small and manageable file, a robots.txt can easily be created or edited directly on a VPS via SSH access. Alternatively, it can be created locally and uploaded to the root directory via FTP or a web interface.

Photo: Free-Photos on Pixabay
