How to create and edit a robots.txt file

Explanation of values:

  • User-agent: * - addresses all search engines at once; Yandex - only Yandex.
  • Disallow: lists the folders and files that are prohibited from being indexed.
  • Host - write the name of your site without www.
  • Sitemap: a link to the XML sitemap (see the combined example after this list).
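Put together, a minimal file with these values might look like the sketch below (your_site.ru and the blocked folder are placeholders for this illustration):

User-agent: *
Disallow: /wp-admin/
Host: your_site.ru
Sitemap: http://your_site.ru/sitemap.xml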

Place the file in the root directory of the site using FileZilla or through your hosting control panel. Upload it to the main directory so that it is available at: your_site.ru/robots.txt

This applies only to sites with human-readable URLs (links written as words, not in the form ?p=333). Just go to Settings - Permalinks, select the bottom option and enter /%postname%/ in the field.

Some prefer to create this file themselves:

To get started, create a text file in Notepad on your computer and name it robots (lower case only). When you finish setting it up, its size should not exceed 500 KB.

User-agent - the name of the search robot (Yandex, Googlebot, StackRambler). If you want to address all of them at once, put an asterisk *.

And then specify the pages or folders that cannot be indexed by this robot with Disallow:

Three directories are listed first, followed by a specific file.
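For instance, a sketch of such a file (the directory and file names here are placeholders for illustration):

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /secret-page.html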

To allow everyone to index everything, write:

User-agent: *
Disallow:

Setting up robots.txt for Yandex and Google

For Yandex, be sure to add the Host directive so that duplicate pages do not appear. Only the Yandex bot understands this directive, so write the instructions for it in a separate block.
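A minimal sketch of such a separate Yandex block (the blocked folder and the domain are placeholders):

User-agent: Yandex
Disallow: /wp-admin/
Host: your_site.ru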

For Google there are no extra directives. The only thing you need to know is how to address it. In the User-agent section, write:

  • Googlebot;
  • Googlebot-Image - if you restrict indexing of images;
  • Googlebot-Mobile - for the mobile version of the site (see the sketch after this list).
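For example, a sketch of a block that keeps Google's image robot out of an uploads folder (the folder path is a placeholder):

User-agent: Googlebot-Image
Disallow: /wp-content/uploads/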

How to check the functionality of the robots.txt file

This can be done in the "Webmaster Tools" section of the Google search engine or on the Yandex.Webmaster site, in the "Check robots.txt" section.

If there are errors, correct them and check again. Once you get a good result, don't forget to copy the correct code into robots.txt and upload it to the site.

Now you have an idea of how to create robots.txt for all search engines. For beginners, I recommend using a ready-made file, substituting the name of your site.

Robots.txt is a text file that contains site indexing parameters for the search engine robots.

Recommendations on the content of the file

Yandex supports the following directives:


Directive - What it does
User-agent* - Indicates the robot to which the rules listed in robots.txt apply.
Disallow - Prohibits indexing of site sections or individual pages.
Sitemap - Specifies the path to the Sitemap file posted on the site.
Clean-param - Tells the robot that the page URL contains parameters (such as UTM tags) that should be ignored when indexing it.
Allow - Allows indexing of site sections or individual pages.
Crawl-delay - Specifies the minimum interval (in seconds) for the search robot to wait after loading one page before starting to load the next. We recommend using the crawl speed setting in Yandex.Webmaster instead of this directive.

* Mandatory directive.

You"ll most often need the Disallow, Sitemap, and Clean-param directives. For example:

User-agent: * # specify the robots that the directives apply to
Disallow: /bin/ # disables links from the shopping cart
Disallow: /search/ # disables links to the site's built-in search pages
Disallow: /admin/ # disables links from the admin panel
Sitemap: http://example.com/sitemap # point the robot to the site's sitemap file
Clean-param: ref /some_dir/get_book.pl

Robots from other search engines and services may interpret the directives in a different way.

Note: the robot takes into account the case of substrings (file names or paths, robot names) but ignores the case of directive names.

Using Cyrillic characters

The use of the Cyrillic alphabet is not allowed in the robots.txt file or in server HTTP headers.

For domain names, use Punycode. For page addresses, use the same encoding as that of the current site structure.
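For illustration, a sketch with a made-up Cyrillic section and domain, first written incorrectly and then in the required encoding (Punycode for the domain, percent-encoding for the path):

# Incorrect:
User-agent: Yandex
Disallow: /корзина
Sitemap: сайт.рф/sitemap.xml

# Correct:
User-agent: Yandex
Disallow: /%D0%BA%D0%BE%D1%80%D0%B7%D0%B8%D0%BD%D0%B0
Sitemap: http://xn--80aswg.xn--p1ai/sitemap.xml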

Robots.txt is a text file in the .txt format that restricts search robots' access to content on an HTTP server. By definition, robots.txt is the robots exclusion standard, which was adopted by the W3C on January 30, 1994, and is voluntarily followed by most search engines. The robots.txt file consists of a set of instructions that tell search robots not to index specific files, pages or directories on the site. Let's consider the description of robots.txt for the case where the site does not restrict robots' access.

A simple robots.txt example:

User-agent: *
Allow: /

Here, robots.txt fully allows all robots to index the entire site.

The robots.txt file must be uploaded to the root directory of your website so that it is available at:

Your_site.ru/robots.txt

Placing a robots.txt file at the root of a site usually requires FTP access. However, some content management systems (CMS) let you create robots.txt directly from the site's control panel or through a built-in FTP manager.

If the file is available, then you will see the contents of robots.txt in the browser.

What is robots.txt for?

Robots.txt is an important aspect of a site. Why is robots.txt needed? For example, in SEO robots.txt is needed to exclude from indexing pages that contain no useful content, and much more. How, what, why and what gets excluded has already been described in an earlier article, so we will not dwell on it here. Does every site need a robots.txt file? Yes and no. If using robots.txt implies excluding pages from search, then for small sites with a simple structure and static pages such exclusions may be unnecessary. However, even a small site can benefit from some robots.txt directives, such as Host or Sitemap, but more on that below.

How to create robots.txt

Since robots.txt is a text file, you can use any text editor, such as Notepad, to create it. Once you open a new text document you have already started creating robots.txt; it only remains to compose its content according to your requirements and save it as a text file named robots in the txt format. It's simple, and creating a robots.txt file should not cause problems even for beginners. Below I will show you how to write robots.txt and what to put into it.

Create robots.txt online

An option for the lazy: create robots.txt online and download the finished file. Many services offer online robots.txt generation; the choice is yours. The main thing is to clearly understand what will be prohibited and what is allowed, otherwise creating a robots.txt file online can turn into a tragedy that may be difficult to correct later, especially if something that should have been closed gets into the search. Be careful: check your robots file before uploading it to the site. A custom robots.txt file still reflects the structure of restrictions more accurately than one that was generated automatically and downloaded from another site. Read on to learn what to pay special attention to when editing robots.txt.

Editing robots.txt

Once you have managed to create a robots.txt file online or by hand, you can edit it. You can change its content as you like; the main thing is to follow the rules and syntax of robots.txt. In the course of working on the site, the robots file may change, and if you edit robots.txt, do not forget to upload an up-to-date version of the file with all the changes to the site. Next, let's look at the rules for setting up the file, so you know how to change robots.txt without making a mess of things.

Proper setting of robots.txt

A correct robots.txt setup keeps private information out of the search results of major search engines. However, remember that robots.txt commands are nothing more than a guide to action, not protection. Well-behaved search engine robots like Yandex or Google follow robots.txt instructions, but other robots can easily ignore them. Proper understanding and use of robots.txt is the key to getting results.

To understand how to make a correct robots.txt, you first need to understand the general rules, syntax and directives of the robots.txt file.

A correct robots.txt starts with the User-agent directive, which indicates which robot the specific directives are addressed to.

User-agent examples in robots.txt:

# Specifies directives for all robots at once
User-agent: *

# Specifies directives for all Yandex robots
User-agent: Yandex

# Specifies directives only for the main Yandex indexing robot
User-agent: YandexBot

# Specifies directives for all Google robots
User-agent: Googlebot

Please note that such a robots.txt setup tells the robot to use only those directives that match a User-agent with its name.

Robots.txt example with multiple User-agent entries:

# Will be used by all Yandex robots
User-agent: Yandex
Disallow: /*utm_

# Will be used by all Google robots
User-agent: Googlebot
Disallow: /*utm_

# Will be used by all robots except Yandex and Google
User-agent: *
Allow: /*utm_

The User-agent directive only names a specific robot, and immediately after it there must be a command or commands stating the condition for the selected robot. The example above uses the disallowing "Disallow" directive with the value "/*utm_", thereby closing all URLs containing UTM tags. A correct robots.txt setup avoids empty line breaks between the "User-agent" directive, "Disallow" and the directives that follow "Disallow" within the current "User-agent" block.

An example of an incorrect line feed in robots.txt (the blank lines inside each block are what makes it incorrect):
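User-agent: Yandex

Disallow: /*utm_

Allow: /*id=

User-agent: *

Disallow: /*utm_

Allow: /*id=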

An example of a correct line feed in robots.txt:

User-agent: Yandex
Disallow: /*utm_
Allow: /*id=

User-agent: *
Disallow: /*utm_
Allow: /*id=

As you can see from the example, instructions in robots.txt come in blocks, each of which contains instructions either for a specific robot or for all robots ("*").

It is also important to keep the correct order and sorting of commands in robots.txt when using the "Disallow" and "Allow" directives together. "Allow" is the permissive directive, the opposite of the disallowing "Disallow" command.

An example of using the directives together in robots.txt:

User-agent: *
Allow: /blog/page
Disallow: /blog

This example prohibits all robots from indexing all pages starting with "/blog", but allows indexing pages starting with "/blog/page".

The previous robots.txt example in the correct sort order:

User-agent: *
Disallow: /blog
Allow: /blog/page

First we disable the entire section, then we allow some of its parts.

Another correct robots.txt example with joint directives:

User-agent: *
Allow: /
Disallow: /blog
Allow: /blog/page

Pay attention to the correct sequence of directives in this robots.txt.

The "Allow" and "Disallow" directives can also be specified without parameters, in which case the value will be interpreted inversely to the "/" parameter.

An example of a "Disallow/Allow" directive without parameters:

User-agent: *
Disallow: # equivalent to Allow: /
Disallow: /blog
Allow: /blog/page

Which way you compose a correct robots.txt and how you use the interpretation of the directives is up to you; both options are correct. The main thing is not to get confused.

To compile robots.txt correctly, you need to set the priorities precisely in the directive parameters and state what robots will be prohibited from downloading. We will look at using the "Disallow" and "Allow" directives more fully below; for now, let's look at the robots.txt syntax. Knowing the robots.txt syntax will bring you closer to creating the perfect robots.txt with your own hands.

Robots.txt Syntax

Search engine robots voluntarily follow the commands of robots.txt (the robots exclusion standard), but not all search engines interpret the robots.txt syntax in the same way. The robots.txt file has a strictly defined syntax, yet writing robots.txt is not difficult, since its structure is very simple and easy to understand.

Here is a specific list of simple rules, following which you will avoid common robots.txt errors:

  1. Each directive starts on a new line;
  2. Do not include more than one directive on a single line;
  3. Don't put a space at the beginning of a line;
  4. The directive parameter must be on one line;
  5. You don't need to enclose directive parameters in quotation marks;
  6. Directive parameters do not require closing semicolons;
  7. The command in robots.txt is specified in the format - [directive_name]:[optional space][value][optional space];
  8. Comments are allowed in robots.txt after the pound sign #;
  9. An empty newline can be interpreted as the end of a User-agent directive;
  10. "Disallow:" directive (with empty value) is equivalent to "Allow: /" - allow everything;
  11. The "Allow", "Disallow" directives specify no more than one parameter;
  12. The robots.txt file name must not contain capital letters; spellings such as Robots.txt or ROBOTS.TXT are incorrect;
  13. Writing the names of directives and parameters in capital letters is considered bad form, and although robots.txt is case-insensitive according to the standard, file and directory names are often case-sensitive;
  14. If the directive parameter is a directory, then the directory name is always preceded by a slash "/", for example: Disallow: /category
  15. Too large robots.txt (more than 32 KB) are considered fully permissive, equivalent to "Disallow: ";
  16. Robots.txt that is inaccessible for some reason may be treated as completely permissive;
  17. If robots.txt is empty, then it will be treated as completely permissive;
  18. If several "User-agent" directives are listed without an empty line between them, all "User-agent" directives after the first one may be ignored;
  19. The use of any characters from national alphabets in robots.txt is not allowed.

Since different search engines may interpret the robots.txt syntax differently, some points can be omitted. For example, if you specify several "User-agent" directives without an empty line between them, Yandex will still accept all the "User-agent" directives correctly, since Yandex identifies records by the presence of "User-agent" in the line.

Robots.txt should specify strictly what is needed, and nothing more. Don't think about how to cram everything possible into robots.txt or how to fill it up. The perfect robots.txt is the one with fewer lines but more meaning. "Brevity is the soul of wit" - the expression fits perfectly here.

How to check robots.txt

To check robots.txt for correct syntax and file structure, you can use one of the online services. For example, Yandex and Google offer their own services for webmasters that include robots.txt analysis:

Checking the robots.txt file in Yandex.Webmaster: http://webmaster.yandex.ru/robots.xml

To check robots.txt online, you need to upload robots.txt to the root directory of the site; otherwise the service may report that it failed to load robots.txt. It is recommended to first check that robots.txt is available at the address where the file is located, for example: your_site.ru/robots.txt.

In addition to the verification services from Yandex and Google, there are many other online robots.txt validators.

Robots.txt vs Yandex and Google

There is a subjective opinion that Yandex perceives a separate block of directives "User-agent: Yandex" in robots.txt more positively than a general block of directives with "User-agent: *". The situation with robots.txt and Google is similar. Specifying separate directives for Yandex and Google lets you manage site indexing through robots.txt. Perhaps they are flattered by the personal appeal, especially since for most sites the content of the Yandex, Google and other robots.txt blocks will be the same. With rare exceptions, all "User-agent" blocks will have a default set of robots.txt directives. Also, using different "User-agent" blocks, you can set a prohibition on indexing in robots.txt for Yandex but not, for example, for Google.
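A sketch of what that might look like (the /promo/ path is a made-up placeholder): Yandex is told to skip a section, while Google and all other robots may index it:

User-agent: Yandex
Disallow: /promo/

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: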

Separately, it is worth noting that Yandex takes into account such an important directive as "Host", and the correct robots.txt for Yandex should include this directive to indicate the main mirror of the site. The "Host" directive will be discussed in more detail below.

Disable indexing: robots.txt Disallow

Disallow is a prohibiting directive and the one most often used in the robots.txt file. Disallow prohibits indexing of the site or part of it, depending on the path specified in the parameter of the Disallow directive.

An example of how to disable site indexing in robots.txt:

User-agent: *
Disallow: /

This example closes the entire site from indexing for all robots.

In the parameter of the Disallow directive, you can use the special characters * and $:

* - any number of any characters; for example, the parameter /page* matches /page, /page1, /page-be-cool, /page/kak-skazat, etc. However, there is no need to specify * at the end of each parameter, since, for example, the following directives are interpreted in the same way:

User-agent: Yandex
Disallow: /page

User-agent: Yandex
Disallow: /page*

$ - indicates the exact match of the exception to the parameter value:

User-agent: Googlebot
Disallow: /page$

In this case, the Disallow directive will disallow /page, but will not disallow /page1, /page-be-cool, or /page/kak-skazat from being indexed.
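The two special characters can also be combined. For instance, a sketch (the pattern is made up for illustration) that blocks only URLs ending in .php:

User-agent: Yandex
Disallow: /*.php$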

If you close the site from indexing in robots.txt, search engines may respond with the error "Blocked in robots.txt file" or "URL restricted by robots.txt". If you need to disable indexing of a page, you can use not only robots.txt but also similar HTML tags:

  • <meta name="robots" content="noindex"/> - do not index the content of the page;
  • <meta name="robots" content="nofollow"/> - do not follow links on the page;
  • <meta name="robots" content="none"/> - it is forbidden to index content and follow links on the page;
  • <meta name="robots" content="noindex, nofollow"/> - similar to content="none".

Allow indexing: robots.txt Allow

Allow is a permissive directive and the opposite of the Disallow directive. It has a syntax similar to Disallow.

An example of how to disable site indexing in robots.txt except for some pages:

User-agent: *
Disallow: /
Allow: /page

It is forbidden to index the entire site, except for pages starting with /page.

Disallow and Allow with an empty parameter value

An empty Disallow directive:

User-agent: *
Disallow:

It does not prohibit anything, i.e. it allows indexing of the entire site, and is equivalent to:

User-agent: *
Allow: /

An empty Allow directive:

User-agent: *
Allow:

It allows nothing, i.e. it is a complete prohibition of site indexing, equivalent to:

User-agent: *
Disallow: /

Main site mirror: robots.txt Host

The Host directive is used to indicate the main mirror of your site to the Yandex robot. Of all the popular search engines, the Host directive is recognized only by Yandex robots. The Host directive is useful if your site is available at multiple addresses, for example:

mysite.ru
mysite.com

Or to prioritize between:

mysite.ru
www.mysite.ru

You can tell the Yandex robot which mirror is the main one. The Host directive is specified in the "User-agent: Yandex" directive block, and its parameter is the preferred site address without "http://".

An example of robots.txt indicating the main mirror:

User-agent: Yandex
Disallow: /page
Host: mysite.ru

The main mirror is the domain name mysite.ru, without www. This is the form of the address that will be shown in search results.

User-agent: Yandex
Disallow: /page
Host: www.mysite.ru

The domain name www.mysite.ru is indicated as the main mirror.

The Host directive can be used in the robots.txt file only once; if Host is specified more than once, only the first one is taken into account and the other Host directives are ignored.

If you want to specify the main mirror for the Google robot, use the Google webmaster tools service.

Sitemap: robots.txt sitemap

Using the Sitemap directive, you can specify the location of the sitemap file in robots.txt.

Robots.txt example with sitemap address:

User-agent: *
Disallow: /page
Sitemap: http://www.mysite.ru/sitemap.xml

Specifying the sitemap address via the Sitemap directive in robots.txt lets the search robot learn that a sitemap exists and start indexing it.

Clean-param Directive

The Clean-param directive allows you to exclude pages with dynamic parameters from indexing. Such pages can serve the same content at different page URLs; simply put, the page is available at several different addresses. Our task is to remove all the unnecessary dynamic addresses, of which there can be millions. To do this, we exclude all dynamic parameters using the Clean-param directive in robots.txt.

Syntax of the Clean-param directive:

Clean-param: parm1[&parm2&parm3&parm4&..&parmn] [Path]

Consider the example of a page with the following URL:

www.mysite.ru/page.html?&parm1=1&parm2=2&parm3=3

Example robots.txt Clean-param:

Clean-param: parm1&parm2&parm3 /page.html # page.html only

Clean-param: parm1&parm2&parm3 / # for all

Crawl-delay directive

This directive allows you to reduce the load on the server if robots visit your site too often. It is relevant mainly for sites with a large number of pages.

Example robots.txt Crawl-delay:

User-agent: Yandex
Disallow: /page
Crawl-delay: 3

In this case, we "ask" Yandex robots to download the pages of our site no more than once every three seconds. Some search engines support decimal format as a parameter Crawl-delay robots.txt directives.

The Host directive is a command or rule that tells the search engine which address (with or without www) is to be considered the main one. The Host directive is located in the robots.txt file and is intended exclusively for Yandex.

Often there is a need for the search engine not to index some pages of the site, or its mirrors. For example, a resource is located on one server, but an identical domain name exists on the Internet that is indexed and displayed in search results.

Yandex search robots crawl website pages and add the collected information to the database according to their own schedule. During indexing they decide on their own which page needs to be processed. For example, robots bypass various forums, message boards, directories and other resources where indexing is pointless. They can also identify the main site and its mirrors: the former are indexed, the latter are not. Mistakes often occur in the process. You can influence this by using the Host directive in the robots.txt file.

Why is the Robots.txt file needed?

Robots.txt is a plain text file. It can be created in Notepad, but it is recommended to work with it (open and edit it) in the Notepad++ text editor. The need for this file when optimizing web resources is determined by several factors:

  1. If the robots.txt file is missing, the site will be under constant load from the work of search robots.
  2. There is a risk that extra pages or mirror sites will be indexed.

Indexing will be much slower, and with incorrectly set parameters the site may disappear from Google and Yandex search results altogether.

How to format the Host directive in the Robots.txt file

The Robots file includes a Host directive, which tells the search engine where the main site is and where its mirrors are.

The directive is written in the following form: Host: [optional space][value][optional space]. The rules for writing the directive require the following points to be observed:

  • The HTTPS protocol is included in the Host directive to support encryption; it must be used if the mirror is accessible only over a secure channel.
  • The value is a domain name (not an IP address), with the port number of the web resource if needed.

A properly composed directive allows the webmaster to indicate to search engines where the main mirror is. The rest will be considered secondary and therefore will not be indexed. As a rule, mirrors can be distinguished by the presence or absence of the www prefix. If the user does not specify the main mirror of the web resource via Host, the Yandex search engine will send a corresponding notification to Webmaster. A notification will also be sent if a conflicting Host directive is specified in the robots file.

You can determine where the main mirror of the site is through a search engine. Enter the address of the resource into the search bar and look at the results: the site whose address shows www in front of the domain in the address bar is the main domain.

If the resource is not displayed on the results page, the user can independently designate it as the main mirror by going to the appropriate section in Yandex.Webmaster. If the webmaster wants the site's domain name not to contain www, it should not be specified in Host.

Many webmasters use Cyrillic domains as additional mirrors for their sites. However, Cyrillic is not supported in the Host directive. To handle this, the words must be duplicated in Latin characters so that they can be easily recognized by copying the site address from the address bar.

Host in the Robots.txt file

The main purpose of this directive is to solve the problem of duplicate pages. It is necessary to use Host if the web resource is aimed at a Russian-speaking audience and, accordingly, the site needs to be ranked in Yandex.

Not all search engines support the Host directive. The function is available only in Yandex. At the same time, even here there is no guarantee that the domain will be assigned as the main mirror, but according to Yandex itself, priority always remains with the name specified in Host.

For search engines to read the information correctly when processing the robots.txt file, the Host directive must be added to the appropriate group, after the User-agent line. However, robots will be able to use Host regardless of whether the directive is written according to the rules, since it is cross-sectional.

Greetings, friends and subscribers of my blog. Today robots.txt is on the agenda: everything you wanted to know about it, in short, without unnecessary fluff.

What is Robots.txt and why is it needed

Robots.txt is needed to tell the search engine (Yandex, Google, etc.) how, from your point of view, the site should be indexed correctly: which pages, sections, products and articles need to be indexed, and which, on the contrary, do not.

Robots.txt is a plain text file (with the .txt extension), which was adopted by the W3C on January 30, 1994 and is used by most search engines. It usually looks like this:
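For example, a typical minimal file might look like the following sketch (the paths and domain here are placeholders):

User-agent: *
Disallow: /wp-admin/
Disallow: /search/
Sitemap: http://site.ru/sitemap.xml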

How does it affect the promotion of your site?

For successful site promotion, the Yandex and Google index (database) should contain only the necessary pages of the site. By the right pages I mean the following:

  1. Home;
  2. pages of sections, categories;
  3. Goods;
  4. Articles;
  5. Pages “About the company”, “Contacts”, etc.

By the wrong pages I mean the following:

  1. Duplicate pages;
  2. Print pages;
  3. Search results pages;
  4. System pages, registration, login, logout pages;
  5. Subscription pages (feed);

For example, if the search engine index contains duplicates of the main promoted pages, this will cause problems with content uniqueness within the site and will also negatively affect rankings.
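A sketch of how such unwanted pages might be closed (the paths below are made-up placeholders; the real ones depend on your CMS):

User-agent: *
Disallow: /search/
Disallow: /*?print=
Disallow: /wp-login.php
Disallow: /feed/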

Where is it located?

The file is usually at the root of the public_html folder on your hosting, here:

What You Should Know About the Robots.txt File

  1. The robots.txt instructions are advisory in nature. This means that the settings are guidelines, not direct commands. But as a rule, both Yandex and Google follow the instructions without any problems;
  2. The file can only be hosted on the server;
  3. It must be in the root of the site;
  4. Violating the syntax makes the file invalid, which can negatively affect indexing;
  5. Be sure to check the correct syntax in the Yandex Webmaster panel!

How to close a page, section, file from indexing?

For example, I want to close the page from indexing in Yandex: http://site/page-for-robots/

To do this, I need to use the "Disallow" directive and the URL of the page (section, file). It looks like this:

User-agent: Yandex
Disallow: /page-for-robots/
Host: site

If I want to close a category:
User-agent: Yandex
Disallow: /category/case/
Host: site

If I want to close the entire site from indexing, except for the section http://site/category/case/, then you will need to do this:

User-agent: Yandex
Disallow: /
Allow: /category/case/
Host: site

The "Allow" directive, on the contrary, says which page, section, file, should be indexed.

I think the logic of the construction is clear to you now. Note that these rules will apply only to Yandex, since User-agent: Yandex is specified. Google, on the other hand, will ignore this construct and index the entire site.

If you want to write universal rules for all search engines, use: User-agent: *. Example:

User-agent: *
Disallow: /
Allow: /category/case/
Host: site

User-agent is the name of the robot for which the instruction is intended. The default value is * (an asterisk), which means the instruction is intended for absolutely all search robots.
The most common robot names are:

  • Yandex - all Yandex search engine robots
  • YandexImages - image indexer
  • Googlebot - Google robot
  • BingBot - Bing Robot
  • YaDirectBot - the robot of the Yandex contextual advertising system.

Links to a detailed overview of all the Yandex and Google directives.

What must be in your Robots.txt file

  1. The Host directive is configured. It must spell out the main mirror of your site. Main mirrors: site.ru or www.site.ru. If your site uses https, this must also be specified. The main mirror in Host and in Yandex.Webmaster must match.
  2. Sections and pages of the site that do not carry a payload, as well as pages with duplicate content, print pages, search results and system pages, should be closed from indexing (with the Disallow: directive).
  3. Provide a link to sitemap.xml (a map of your site in xml format).
    Sitemap: http://site.ru/sitemap.xml

Indication of the main mirror

First you need to find out which mirror you have by default. To do this, enter your site's URL into Yandex, hover over the URL in the search results, and at the bottom left of the browser window you will see whether the domain is shown with www or not. In this case, it is without www.

If the domain is specified with https, then https must be specified both in robots.txt and in Yandex.Webmaster. It looks like this:
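A sketch of what the Host line might look like in that case (site.ru is a placeholder domain):

User-agent: Yandex
Disallow:
Host: https://site.ru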