What is it?
The robots.txt file is a plain text file that must comply with the Robots Exclusion Standard.
You can create the file with Windows Notepad and save it under the name robots.txt.
The file consists of one or more rules, and each rule blocks or allows access by a particular crawler to a particular file path on the website.
The robots.txt file is used to manage crawler traffic to your site.
A properly configured robots.txt file keeps crawler requests from overloading your website, so that the speed of the site, or even of the Cloud itself, is not negatively affected when several of these indexers visit at the same time.
What do we block?
The crawler, also known as a spider, robot or bot, is a program that analyzes website documents. Search engines use very powerful crawlers that browse and analyze websites, building a database from the information they collect.
What elements make up the robots.txt?
When generating the robots.txt file, you must take into account specific commands and rules.
Commands
User-agent: This command specifies which search engine robot/spider the rule applies to.
The syntax of this command is: User-agent: (name of the robot)
(Each rule must contain at least one Disallow or Allow entry.)
Disallow: Indicates a directory or a page of the root domain that you do not want the user-agent to crawl.
Allow: Specifies the directories or pages in the root domain that the user-agent named in the group may crawl. It is used to override the Disallow directive and allow a specific subdirectory or page of a blocked directory to be crawled.
One option is to use an asterisk as the robot name, which means that the rule applies to all crawlers.
User-agent: *
Disallow
The following command tells search engines not to crawl or access a specific part of the website, such as the wp-admin folder.
Disallow: /wp-admin/
Allow
With the following command you indicate the opposite: you tell search engines what they can crawl. This example allows only one file from a specific folder.
Allow: /wp-admin/admin-ajax.php
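A rule pair like this can be checked locally before it is published. The following is a minimal sketch, assuming Python's standard urllib.robotparser module; the bot name ExampleBot and the tested paths are made up for illustration. Python's parser applies the first matching line within a group, so the Allow line is listed before the Disallow line here. Crawlers that pick the most specific matching rule, as Google documents for Googlebot, should reach the same result in either order.

```python
from urllib.robotparser import RobotFileParser

# Rules from the text above; Allow comes first because Python's parser
# uses the first line that matches, not the most specific one.
RULES = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# "ExampleBot" is a hypothetical crawler name used only for this check.
print(parser.can_fetch("ExampleBot", "/wp-admin/admin-ajax.php"))  # True
print(parser.can_fetch("ExampleBot", "/wp-admin/options.php"))     # False
print(parser.can_fetch("ExampleBot", "/blog/"))                    # True
```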
Other elements to take into account.
When blocking a directory, you must place a slash (/) at the beginning and at the end of its path. The rules can be simplified with two special characters:
*. The asterisk matches any sequence of characters.
$. The dollar sign marks the end of the URL and is used when you want to block URLs with a specific ending.
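To illustrate how these two characters behave, the sketch below translates a robots.txt path pattern into a regular expression. It is only an illustration under simple assumptions, not the matcher of any particular search engine; the helper names pattern_to_regex and matches are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    # A trailing "$" anchors the pattern to the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Each "*" stands for any sequence of characters; the rest is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def matches(pattern, path):
    """Return True if the URL path is covered by the pattern."""
    # re.match anchors at the start, so patterns behave as prefixes.
    return pattern_to_regex(pattern).match(path) is not None

print(matches("/wp-admin/", "/wp-admin/options.php"))    # True
print(matches("/*.jpeg$", "/img/photo.jpeg"))            # True
print(matches("/*.jpeg$", "/img/photo.jpeg?size=2"))     # False
print(matches("/wp-admin/", "/blog/wp-admin/"))          # False
```

Without the dollar sign, a pattern acts as a prefix, which is why /wp-admin/ also covers every file below that folder.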
Examples of commands used in robots.txt.
Exclude all robots from the server:
User-agent: *
Disallow: /
Allow all robots to crawl everything:
User-agent: *
Disallow:
Exclude only one bot, in this case BadBot:
User-agent: BadBot
Disallow: /
Allow only one bot, in this case Googlebot:
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
Exclude a directory for all bots:
User-agent: *
Disallow: /nombre-directorio/
Exclude a specific page:
User-agent: *
Disallow: /url-pagina.html
Block all images on the website for Googlebot-Image:
User-agent: Googlebot-Image
Disallow: /
Block a specific image for one bot only:
User-agent: Googlebot-Image
Disallow: /imagen/bloqueada.jpeg
Exclude a specific file type:
User-agent: Googlebot
Disallow: /*.jpeg$
Exclude URLs with a specific ending:
User-agent: *
Disallow: /*.pdf$
These are usage examples; use the ones that suit your needs or create your own.
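Before uploading, some of the examples above can be tried out locally. The following is a rough sketch, assuming Python's standard urllib.robotparser module; the helper allowed, the bot name ExampleBot and the tested paths are hypothetical. The wildcard examples are left out because that module matches paths by simple prefix, so results for * and $ patterns would not reflect how search engines treat them.

```python
from urllib.robotparser import RobotFileParser

def allowed(rules, agent, path):
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)

# Rule sets copied from the examples above.
BLOCK_ALL = "User-agent: *\nDisallow: /"
ALLOW_ALL = "User-agent: *\nDisallow:"
ONLY_GOOGLEBOT = "User-agent: Googlebot\nDisallow:\n\nUser-agent: *\nDisallow: /"
BLOCK_DIRECTORY = "User-agent: *\nDisallow: /nombre-directorio/"

print(allowed(BLOCK_ALL, "ExampleBot", "/index.html"))                    # False
print(allowed(ALLOW_ALL, "ExampleBot", "/index.html"))                    # True
print(allowed(ONLY_GOOGLEBOT, "Googlebot", "/index.html"))                # True
print(allowed(ONLY_GOOGLEBOT, "ExampleBot", "/index.html"))               # False
print(allowed(BLOCK_DIRECTORY, "ExampleBot", "/nombre-directorio/a.html"))  # False
print(allowed(BLOCK_DIRECTORY, "ExampleBot", "/other/"))                  # True
```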
Once you have created the robots.txt file, upload it via FTP into the /tudomain/data/web/ directory.