What is the robots.txt file and how does it work?

Author: HOSTTEST Editorial   | 6 Oct 2021

Major search engines like Google and Microsoft Bing, as well as smaller providers such as DuckDuckGo, permanently and automatically crawl the World Wide Web (WWW) and parts of the connected Internet with special programs (crawlers), indexing and evaluating the content they find. There are several ways to control their behaviour - one of the most important and versatile alongside .htaccess is the robots.txt file, which allows you to define precise instructions. These instructions can either apply to all clients or address individual search engines with their own settings. Due to its function and the available options, the robots.txt file plays an important role in SEO and can also be used, for example, to separate parts of a website from others or to hide certain files from search engines.

What does the robots.txt file consist of?
What task does the robots.txt file perform?
How does a robots.txt file work and what is its impact?
What should be considered when creating a robots.txt file?
What does a robots.txt file look like?

What does the robots.txt file consist of?

The robots.txt is a simple text file containing instructions in a readable form. It can therefore be created easily with a simple text editor such as gedit or Mousepad on Linux or Notepad on Microsoft Windows. The content consists of multiple lines, which can either refer to a single crawler like the Googlebot or apply universally to all visitors. Each entry contains at least two pieces of information on separate lines: the first line defines which crawlers the following instructions apply to, and the lines after it specify how those crawlers should crawl and index the website.
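A minimal entry of this kind might look as follows - a hedged sketch in which the path /internal/ is only a placeholder:

User-agent: *
# The following rules apply to all crawlers
Disallow: /internal/
# No crawler should retrieve the /internal/ directory or anything below it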

What task does the robots.txt file perform?

Generally, the robots.txt file offers four different options that can be combined:

  • Allow: Permission to browse specified parts of a website
  • Disallow: Blocking access to specific paths or files
  • Sitemap: Reference to an external XML file listing the pages of a website that should be crawled
  • Crawl-delay: Minimum delay between requests to individual subpages (only supported by some crawlers)

The purpose of a robots.txt file is to let the owner of a website control the traffic generated by search engine crawlers. This is particularly useful for large sites or sites with a highly branched structure, but it can also have a positive effect on a small web hosting package or a low-powered virtual server. Furthermore, the robots.txt file is useful for specifically excluding large files such as videos or other multimedia content from being retrieved by search engines, reducing bandwidth use and keeping traffic to a minimum. Since crawlers such as the Googlebot operate indiscriminately - that is, they neither prioritise nor exclude content on their own - the robots.txt file provides a convenient way to steer them. In addition, a sitemap can assign priorities to individual subpages so that, for example, frequently changing content is crawled and indexed more often and more quickly than static information.
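As an illustration, the following hedged sketch keeps a hypothetical media directory out of the crawl and points to a sitemap - the paths and the domain are placeholders:

User-agent: *
# Applies to all crawlers
Disallow: /videos/
# Keeps bandwidth-heavy multimedia content from being retrieved
Sitemap: https://www.example.org/sitemap.xml
# Refers crawlers to an XML sitemap listing the subpages of the site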

How does a robots.txt file work and what is its impact?

When a search engine calls up a website through one of its crawlers, the crawler automatically follows every identifiable link and retrieves the underlying content in order to analyse and evaluate it according to its own algorithm. The first file it tries to retrieve is a robots.txt in the root directory - that is, directly at the top level of a domain - to obtain information about the desired behaviour. For this reason, it must be saved directly under the website address and can be found, for example, at www.example.org/robots.txt or example.com/robots.txt.

If a web hosting package does not give the user access to this area - for example, because it uses a structure such as https://provider.com/customer - a robots.txt unfortunately cannot be used. However, it is possible to register a separate domain and point it to this web space as a redirection. In that case the settings only affect that presence: if example.com points to http://provider.com/example, the robots.txt applies to the first address but not to the second.

Furthermore, it is important to note that robots.txt is not an official, binding standard, but an independently developed Robots Exclusion Standard that international corporations such as Google, Microsoft, and Yahoo agreed to support in the summer of 2008. Compliance with the specified rules is purely voluntary, although nowadays all major companies respect them. It must therefore be stated clearly that a robots.txt file is not an effective barrier against all search engines, and certainly not against external access with malicious intent. In addition, each crawler, such as the Googlebot or the Bingbot used by Microsoft, is programmed differently and does not necessarily support every directive beyond Disallow. For example, Crawl-delay is not evaluated by the Googlebot, and some search engines such as the Russian Yandex or the Chinese Baidu and Sogou ignore Allow rules and only interpret Disallow.
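To illustrate these differences, a hedged sketch: a Crawl-delay is placed in a separate block for the Bingbot, which evaluates the directive, while the Googlebot simply ignores it (the path is a placeholder):

User-agent: Bingbot
Crawl-delay: 10
# Bingbot should wait about 10 seconds between two requests
Disallow: /temp/

User-agent: Googlebot
Disallow: /temp/
# The Googlebot does not evaluate Crawl-delay, so only the Disallow rule takes effect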

What should be considered when creating a robots.txt file?

While a robots.txt file can be created with any text editor, for maximum compatibility it is advisable to use the Linux convention, which differs from that of Microsoft Windows mainly in the character used for a line break and is supported by free editors such as Notepad++.

The file itself consists of one or more paragraphs, separated by a blank line, containing instructions for specific crawlers. Each paragraph starts with the User-agent: statement, which defines precisely which bot the instructions are intended for. The most common legitimate crawlers active on the Internet and the World Wide Web are:

  • *: This placeholder (wildcard) stands for all crawlers
  • Googlebot: the most common and active crawler
  • Bingbot: the crawler used by Microsoft since 2010 instead of msnbot
  • Slurp: crawler used by Yahoo mainly for mobile search indexing
  • DuckDuckBot: crawler of the privacy-focused search engine DuckDuckGo
  • Baiduspider: crawler of the largest Chinese search engine Baidu
  • YandexBot: used by the Russian search engine Yandex
  • FaceBot: crawler of Facebook, only active when following links outside the platform
  • ia_archiver: from Amazon Alexa, mainly collects statistical information

In addition to these "official" crawlers, which adhere to the robots.txt guidelines (as far as they support them), there are also operators who do not care about such rules or deliberately ignore them. Examples include PetalBot or DotBot - to block them reliably, the detour via a .htaccess file is necessary, which rejects or redirects crawlers based on their user-agent identification.
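On an Apache web server with mod_rewrite enabled - an assumption, as other servers use different mechanisms - such a block might look roughly like the following sketch, using the bot names mentioned above:

RewriteEngine On
# Match the user-agent string of unwanted bots (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} (PetalBot|DotBot) [NC]
# Answer their requests with "403 Forbidden" instead of delivering content
RewriteRule .* - [F,L]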

What does a robots.txt look like?

Each robots.txt consists of one or more blocks, which a crawler reads from top to bottom, applying the rules addressed to it. As in many programming languages, the hash symbol # allows comments to be inserted that are not interpreted as instructions. Examples of a robots.txt look like this:

Example 1:

User-agent: *
# The following rules apply to all crawlers
Disallow: /private/
# Prohibits access to the /private directory and all subdirectories
Allow: /website/
# Explicitly allows access to /website and all subdirectories

Example 2:

User-agent: Googlebot
User-agent: Bingbot
# This block applies only to the Googlebot and the Bingbot
Allow: /website/
# Google and Bing may crawl /website/ and all its subdirectories
Disallow: /website/private
# Blocks all directories or files under /website/ whose names start with private

User-agent: *
# Excludes all other bots
Disallow: /
# Prohibits access to the entire domain

There are some specific rules that must be followed in a robots.txt:

  • The robots.txt file must be located in the root directory of a domain
  • Directive names such as Disallow are not case-sensitive, but the paths they refer to are
  • Bots must be named exactly and explicitly
  • Spaces in a line are only allowed after the :
  • Each robots.txt file may contain a maximum of one block for all crawlers (*)
  • The first entry applicable to a crawler will be evaluated
  • A Disallow: without further specification allows everything
  • Wildcards such as * are supported by some, but not all crawlers
  • The entry /private/ refers to a directory, while /private refers to all directories and files whose names start with private, as the example after this list shows
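A hedged sketch of this difference and of a wildcard rule - the paths are placeholders, and not every crawler evaluates the * and $ characters:

User-agent: *
Disallow: /private/
# Matches only the directory /private/ and everything below it
Disallow: /intern
# Also matches /internal.html, /intern-2021/ and similar paths
Disallow: /*.pdf$
# Wildcard rule: blocks all PDF files (only understood by some crawlers)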

As it is a small and manageable file, a robots.txt can easily be created or edited directly on a VPS via SSH access. Alternatively, it can be created locally and uploaded to the root directory via FTP or a web interface.

Photo: Free-Photos on Pixabay
