How to create robots.txt

If you ask SEO experts to rate the importance of a properly drawn-up robots.txt, they will give it 5 out of 5.

How to make robots.txt properly

2016-02-15

An incorrectly created robots.txt that does not take into account the specifics of the site can harm its presence in search engines.

In 2011, one of the largest EU mobile operators exposed huge numbers of its clients' SMS texts to search engines and users. This is how Google and other search engines can add to their index the very pages you want to hide. Or, on the contrary, an entire website can be hidden from searchers with a single careless line.

If you are already familiar with the basics of creating a robots.txt file, you can skip straight to section 3, "Creating the robots.txt file".

1. What is robots.txt

Let's define what this file is and why it is so important.

The Yandex Help gives the following definition:

Robots.txt is a text file that contains the site indexing parameters for the search engine robots ...

A search engine robot's session begins with downloading the robots.txt file. If the file does not exist, or in other words the robot receives an HTTP status code other than 200, the robot considers that access to the site's documents (pages) is not restricted.

In short, robots.txt is a set of instructions addressed exclusively to search engine spiders indexing the website: "allow" rules permit indexing of pages, "disallow" rules restrict it.

However, despite the importance of this file, the vast majority of sites on the Internet do not have a properly compiled robots.txt.

2. Directives of robots.txt file

First, the format that every directive follows:

<Directive> <colon> <space> <document to which the directive applies>

Now let's go through the directives that can be used in a robots.txt file.

User-agent specifies the robot to which the list of directives below it applies. It is a mandatory directive for robots.txt and is placed at the beginning of each block.

The main User-agent of the Yandex search engine is Yandex.
The main User-agent of the Google search engine is Googlebot.
To address all possible User-agents at once, simply use *.
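For example, a file that gives Yandex its own set of rules and a shared set for all other robots might begin as follows (the /admin/ path here is just a placeholder, not a recommendation):

User-agent: Yandex
Disallow: /admin/

User-agent: *
Disallow: /admin/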

Disallow is the directive that prohibits indexing. You can specify a directory, a single document by name, or the full path to a document.

When prohibiting indexing of a document, the path is specified from the site root.

To prohibit indexing of documents at the second level and deeper, specify the full path to the document.

When a directory is disallowed, all documents inside that directory are also prohibited from indexing. You can also disallow documents whose URLs contain specific characters.
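A few illustrative Disallow rules (the directory and file names below are hypothetical):

User-agent: *
# Prohibit the /admin/ directory and everything inside it
Disallow: /admin/
# Prohibit a single document by its full path from the site root
Disallow: /docs/old-price-list.pdf
# Prohibit any URL that contains a question mark
Disallow: /*?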

Allow is the directive that permits indexing of documents. It is the default behaviour for every document on the site unless stated otherwise.

You can also allow indexing of documents whose URLs contain specific characters. It is worth paying attention to how the Allow and Disallow directives interact: "Allow and Disallow directives from the corresponding User-agent block are sorted by URL prefix length (from shortest to longest) and applied in that order."
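For example, in the following block (the /catalog/ paths are invented for the illustration) the whole directory is disallowed, but the longer Allow prefix wins for the one page it names, so that page stays open for indexing:

User-agent: *
Disallow: /catalog/
Allow: /catalog/sale.html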

Sitemap is the directive that specifies the path to an XML sitemap. If the site has more than one sitemap file, multiple paths can be listed.

User-agent: *
Sitemap: http://samplesite.com/sitemap1.xml
Sitemap: http://samplesite.com/sitemap2.xml

Special Characters

  •  * matches any sequence of characters. It is appended by default to the end of every rule.
  •  $ cancels the implicit "*" at the end of a rule, anchoring the match to the end of the URL.
  •  # marks a comment. Everything to the right of this sign is ignored by robots.
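The rules below illustrate all three special characters; the paths are made up for the example:

User-agent: *
# * matches any sequence of characters: this blocks e.g. /en/search/ or /blog/search/results
Disallow: /*/search/
# $ anchors the rule to the end of the URL: this blocks URLs ending exactly in .pdf
Disallow: /*.pdf$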

Host is a directive specifying the main mirror of the site. It is taken into account by Yandex only; Google and other search engines simply ignore it.

This directive "glues" together mirrors such as www.site.com and site.com, as well as any other mirrors whose robots.txt points to the same main host.

If the mirror is only available via a secure protocol, the address with the https protocol must be used. In other cases, the protocol is not specified. To set up the primary mirror in the Google search engine, use the "Site Settings" in Google Search Console.
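Two alternative forms of the directive (only one Host line should appear in a real file; samplesite.com is the sample domain used throughout this article):

# Main mirror reachable over plain HTTP: the protocol is omitted
Host: samplesite.com

# Main mirror available only over HTTPS: the protocol is included
Host: https://samplesite.com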

Crawl-delay is the minimum time (in seconds) a robot must wait between page downloads. It keeps search engine spiders from overloading the website. To limit the crawl rate for the Google search engine, use "Site Settings" in Google Search Console.
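For example, to ask robots to wait at least two seconds between page downloads (the value is arbitrary and should be tuned to your server's capacity):

User-agent: *
Crawl-delay: 2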

Clean-param is used to strip parameters from the site's URLs. It is taken into account by Yandex robots only.

It can be used to remove tracking labels, filters, session identifiers, and other parameters.

For Google robots to handle such parameters properly, use "URL Parameters" in Google Search Console.
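A sketch of the directive, assuming a hypothetical session parameter sid used under /catalog/:

User-agent: Yandex
# Treat /catalog/page.html?sid=123 and /catalog/page.html as the same document
Clean-param: sid /catalog/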

Google Search Console (GSC) settings

As mentioned earlier, some settings that can be given to Yandex in robots.txt must be configured for Google bots in Google Search Console instead.

To specify the primary mirror in Google, you have to verify both mirrors (www.samplesite.com and samplesite.com) in GSC. Open Site Settings (the gear icon), select the "Site Configuration" link, choose the main mirror in the "main domain" box, and save the changes.

To limit the crawl rate of Google bots, the site must also be verified in GSC. Open Site Settings (the gear icon), select the "Site Configuration" link, choose "Limit Google's maximum crawl rate" in the "Crawl rate" box, set an acceptable value, and save your changes.

To tell Google how to handle parameters in the site's URLs, verify your site in GSC, open the "Crawl" section, choose "URL Parameters", click "Add parameter", fill in the appropriate fields, and save the changes.

3. Creating the robots.txt file

Having reviewed the basic directives, let's proceed to compiling the robots.txt file itself.

First of all, we do not recommend blindly copying template robots.txt files found on the Internet, because they simply cannot take into account all the specifics of your site.

1. The first step is to add three User-agent blocks to robots.txt, leaving a single blank line between each block:

    User-agent: Yandex
    User-agent: Googlebot
    User-agent: *

Separate User-agent blocks are used because some directives vary depending on the search engine bot; the * block covers all other robots.

2. We recommend disallowing indexing of documents with the following file extensions:
    Disallow: /*.pdf
    Disallow: /*.xls
    Disallow: /*.doc
    Disallow: /*.ppt
    Disallow: /*.txt

These documents are excluded from indexing because they may appear more relevant to a search query than the landing pages specifically optimized for it.

Even if your website currently has no documents in these formats, do not remove these lines; keep them for the future.

3. Add to each User-agent block the directives permitting indexing of JS and CSS files:

    Allow: /*/<folder containing css>/*.css
    Allow: /*/<folder containing js>/*.js

JS and CSS files are explicitly opened for indexing because they often sit in system folders that are otherwise disallowed, yet search engines need them to process pages properly.

4. Add to each User-agent block the directives permitting indexing of the most common image formats:

    Allow: /*/<folder containing media files>/*.jpg
    Allow: /*/<folder containing media files>/*.jpeg
    Allow: /*/<folder containing media files>/*.png
    Allow: /*/<folder containing media files>/*.gif

Images are opened explicitly to avoid their accidental exclusion from indexing.

5. To keep UTM tags and other tracking parameters out of the index, add the Clean-param directive to the Yandex block:

    Clean-param: utm_source&utm_medium&utm_term&utm_content&utm_campaign&yclid&gclid&_openstat&from /

6. The same parameters should be excluded for Google in the GSC "URL Parameters" section.

Attention! If you block Google from crawling tagged URLs with a Disallow directive, you will most likely be unable to run ads pointing to those pages in Google AdWords.

7. Disallow indexing of tracking tags in the * User-agent block:

    Disallow: /*utm
    Disallow: /*clid=
    Disallow: /*openstat
    Disallow: /*from

8. Next, disallow indexing of all system pages and duplicates. Typical examples of such pages (an illustrative set of rules follows this list):

    The site admin panel
    Users' personal accounts and forum profiles
    Shopping carts and checkout steps
    Filters and sorting pages in catalogs
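A sketch of such rules, assuming typical paths like /admin/, /cart/ and a ?sort= parameter; adjust them to your own CMS:

    Disallow: /admin/
    Disallow: /user/
    Disallow: /cart/
    Disallow: /checkout/
    Disallow: /*?sort=
    Disallow: /*?filter=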

9. Host is specified at the end of the Yandex block and is understood by Yandex only:

    Host: samplesite.com

10. Last of all, after all other directives and separated by a blank line, specify the Sitemap directive pointing to the site's XML sitemap, if one is used:

    Sitemap: http://samplesite.com/sitemap.xml

Template robots.txt file

Here is a template that can be used as a basis for compiling your own robots.txt file.

User-agent: Yandex
Disallow: /*.pdf
Disallow: /*.xls
Disallow: /*.doc
Disallow: /*.ppt
Disallow: /*.txt
Allow: /*/<folder containing css>/*.css
Allow: /*/<folder containing js>/*.js
Allow: /*/<folder containing media files>/*.jpg
Allow: /*/<folder containing media files>/*.jpeg
Allow: /*/<folder containing media files>/*.png
Allow: /*/<folder containing media files>/*.gif
Clean-param: utm_source&utm_medium&utm_term&utm_content&utm_campaign&yclid&gclid&_openstat&from
Host: samplesite.com


User-agent: Googlebot
Disallow: /*.pdf
Disallow: /*.xls
Disallow: /*.doc
Disallow: /*.ppt
Disallow: /*.txt
Allow: /*/<folder containing css>/*.css
Allow: /*/<folder containing js>/*.js
Allow: /*/<folder containing media files>/*.jpg
Allow: /*/<folder containing media files>/*.jpeg
Allow: /*/<folder containing media files>/*.png
Allow: /*/<folder containing media files>/*.gif

User-agent: *
Disallow: /*utm
Disallow: /*clid=
Disallow: /*openstat
Disallow: /*from
Disallow: /*.pdf
Disallow: /*.xls
Disallow: /*.doc
Disallow: /*.ppt
Disallow: /*.txt
Allow: /*/<folder containing css>/*.css
Allow: /*/<folder containing js>/*.js
Allow: /*/<folder containing media files>/*.jpg
Allow: /*/<folder containing media files>/*.jpeg
Allow: /*/<folder containing media files>/*.png
Allow: /*/<folder containing media files>/*.gif
Sitemap: http://samplesite.com/sitemap.xml


Conclusion

In addition to the robots.txt file, there are many other ways to manage site indexing. But in our experience, a valid robots.txt helps to promote the website and protect it from many serious errors.

We hope our experience outlined in this article will help you understand the basic principles of how to create a robots.txt file for Google and Yandex.

About The Author

MW covers marketing technology for Websimka.
Have any questions? Write to us at now@websimka.com