Robots.txt – A Comprehensive Guide

What is Robots.txt?

The robots.txt protocol, also known as the Robots Exclusion Standard, is a convention for keeping web robots and crawlers out of all or part of a publicly viewable website. The file is plain text, not HTML, and it tells search bots which pages on your site they should not visit. Obeying robots.txt is not compulsory for search engines, but the major ones generally respect its restrictions. Search engine robots are automated, and before they enter any page of a site they check for a robots.txt file that would keep them out of certain pages. If you want search engines to index your whole site, you do not need a robots.txt file. It is also very important to put robots.txt in the right place.

Location of robots.txt

You must place robots.txt in the main (root) directory of your site; otherwise search engines cannot detect it. Search engines request mydomain.com/robots.txt rather than searching the entire site for a robots.txt file. If they cannot find it in the main directory, they presume that the site has no robots.txt file and start indexing the whole site. So always put the robots.txt file in the right place. The concept and structure of this file were designed a long time ago, but we will go over them briefly.
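
To make the location rule concrete, here is a minimal Python sketch that works out where a crawler would look for robots.txt, given any page URL. The function name `robots_url` and the example URLs are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that would govern the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any subdomain), replace the path
    # with /robots.txt, and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://mydomain.com/blog/post.html?id=7"))
print(robots_url("https://shop.mydomain.com/cart"))
```

Whatever the depth of the page, the result always points at the root of the same host, which is why a robots.txt file stored in a subfolder is never found.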

Structure of a robots.txt file

A robots.txt file can list any number of user agents and disallowed files and directories, yet its structure is very simple. Let us have a look at the syntax of a robots.txt file:

User-agent:
Disallow:

Search engine crawlers are named on the “User-agent” line, and the directories and files that should not be indexed are listed on “Disallow” lines. To write a comment, start the line with the # sign, like this:

# All user agents are disallowed from viewing the /temp/ directory
User-agent: *
Disallow: /temp/
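
One way to see how a crawler reads these lines is to feed them to the robots.txt parser in Python's standard library. This is only a sketch: the crawler name `ExampleBot` and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# All user agents are disallowed from viewing the /temp/ directory
User-agent: *
Disallow: /temp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # the comment line is ignored

# ExampleBot is a hypothetical crawler; it falls under "User-agent: *".
temp_allowed = parser.can_fetch("ExampleBot", "https://mydomain.com/temp/cache.html")
about_allowed = parser.can_fetch("ExampleBot", "https://mydomain.com/about.html")
print(temp_allowed, about_allowed)  # False True
```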

How to create a robots.txt file?

There are a few points to keep in mind while creating a robots.txt file. First, list the directories and files on your server that you want to block from being indexed. Second, decide whether you need extra directives for a specific search engine in addition to the general crawling directives. Third, create the robots.txt file in a text editor and write the directives that block your content. Fourth, you can add a reference to your sitemap file, but this is optional. Fifth, validate your robots.txt file by checking it for errors, and finally upload it to the main directory of your site. There are also certain rules to follow while creating a robots.txt file. Let us look at some examples that make this clear.
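
The steps above can also be scripted instead of typed by hand. The following Python sketch assembles the file text from a list of blocked paths, optional per-crawler rules, and an optional sitemap reference. The function `build_robots_txt`, its parameters, and the paths are all hypothetical.

```python
def build_robots_txt(blocked_paths, sitemap_url=None, per_agent=None):
    """Assemble robots.txt text.

    blocked_paths: paths blocked for every crawler (User-agent: *).
    sitemap_url:   optional sitemap reference.
    per_agent:     optional dict mapping a crawler name to its own blocked paths.
    """
    lines = ["User-agent: *"]
    # An empty Disallow line means that nothing is blocked.
    lines += [f"Disallow: {p}" for p in blocked_paths] or ["Disallow:"]
    for agent, paths in (per_agent or {}).items():
        lines.append("")  # a blank line separates one record from the next
        lines.append(f"User-agent: {agent}")
        lines += [f"Disallow: {p}" for p in paths] or ["Disallow:"]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


text = build_robots_txt(
    ["/temp/", "/private/"],
    sitemap_url="https://mydomain.com/sitemap.xml",
    per_agent={"Googlebot": ["/folder1/"]},
)
print(text)
```

The returned text still has to be saved as robots.txt and uploaded to the main directory of the site.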

Examples of robots.txt format

Allow indexing of everything

User-agent: *
Disallow:

Disallow indexing of everything

User-agent: *
Disallow: /

Disallow indexing of a particular folder

User-agent: *
Disallow: /folder/

Disallow Googlebot from crawling a folder, except for one file in that folder

User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.htm
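
The first three examples can be checked with Python's standard-library parser, as in the sketch below. The helper `allowed`, the crawler name `ExampleBot`, and the URLs are made up for illustration. The Googlebot example is left out on purpose: Python's parser applies the first matching rule in file order, while Google documents that the most specific (longest) rule wins, so the two can disagree when Allow and Disallow overlap.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_FOLDER = "User-agent: *\nDisallow: /folder/\n"

page = "https://mydomain.com/folder/page.html"
home = "https://mydomain.com/index.html"

print(allowed(ALLOW_ALL, "ExampleBot", page))     # True
print(allowed(BLOCK_ALL, "ExampleBot", page))     # False
print(allowed(BLOCK_FOLDER, "ExampleBot", page))  # False
print(allowed(BLOCK_FOLDER, "ExampleBot", home))  # True
```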

Important rules for creating a robots.txt file

  • Directives like “noindex, follow” belong in a meta robots tag, not in robots.txt, when you want to control indexing or crawling page by page
  • Write only one URL path on each Disallow line
  • Each subdomain under a root domain uses its own robots.txt file
  • For pattern matching, Google and Bing accept two special characters: * (any sequence of characters) and $ (end of the URL)
  • Name the file robots.txt, not Robots.TXT, because the file name is case sensitive
  • Never use spaces to separate query parameters, as robots.txt does not accept them
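
To illustrate the two pattern characters, here is a small Python sketch of a matcher. It assumes the precedence that Google documents (the longest matching pattern wins, and Allow wins a tie); other crawlers may behave differently. The names `pattern_to_regex` and `is_allowed` and the sample paths are made up for illustration.

```python
import re


def pattern_to_regex(pattern: str):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything, then turn each escaped * back into "match anything".
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path: str) -> bool:
    """rules: list of (directive, pattern) pairs for one user agent."""
    best_len, best_allow = -1, True  # no matching rule means allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty Disallow blocks nothing
        if pattern_to_regex(pattern).match(path):
            allow = directive.lower() == "allow"
            # Assumption: longest pattern wins; on a tie, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow


folder_rules = [("Disallow", "/folder1/"), ("Allow", "/folder1/myfile.htm")]
pdf_rules = [("Disallow", "/*.pdf$")]

print(is_allowed(folder_rules, "/folder1/myfile.htm"))      # True
print(is_allowed(folder_rules, "/folder1/other.htm"))       # False
print(is_allowed(pdf_rules, "/docs/guide.pdf"))             # False
print(is_allowed(pdf_rules, "/docs/guide.pdf?download=1"))  # True
```

The last line shows what $ does: the pattern only matches URLs that end in .pdf, so the same file with a query string appended is no longer blocked.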

There are also tools that help you find and correct mistakes in a robots.txt file.

Test a robots.txt file

With the help of these tools, you can find out whether your robots.txt file is blocking a file on your site or not. A search engine robot that finds a matching rule in the robots.txt file stops crawling those parts of your site.
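
If you prefer a quick local check, a rough test can be written in a few lines of Python. This sketch reports which URLs from a list are blocked for a given crawler. The function `blocked_urls` and the sample rules and URLs are hypothetical, and Python's standard-library parser does plain prefix matching, so it does not understand the * and $ patterns inside paths.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls, agent: str = "*"):
    """Return the URLs that the given robots.txt text blocks for agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(agent, u)]


SAMPLE = """\
User-agent: *
Disallow: /temp/
Disallow: /private/
"""

to_check = [
    "https://mydomain.com/index.html",
    "https://mydomain.com/temp/draft.html",
    "https://mydomain.com/private/report.pdf",
]
report = blocked_urls(SAMPLE, to_check)
print(report)
```

To test a live file instead of pasted text, the same parser can download it with `set_url()` followed by `read()`.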

It is good that search engines visit our site regularly and index our content, but sometimes they index content that we would rather they did not. Some sensitive data should not be visible to the whole world. With the help of a robots.txt file, you can keep search engines from indexing those parts of your site.

Author: Kelly is a writer/blogger. She loves writing, travelling and reading books. She contributes to Bret Clark Microsoft.

About Amit Shaw

Amit Shaw is the founder and CEO of iTechCode. He is a 21-year-old ordinary, simple guy from West Bengal, India. He writes about blogging, technology, gadgets, programming, etc.

Comments

  1. Hi Amit,

    Great Post!
    It is Quite Helpful for SEO and Webmaster Beginners!
    Thanks For Sharing :)

    Mosam

  2. Robots.txt is a major SEO component that no one should miss. But if one is not aware of its usage, then it's a thing to avoid. If it's configured incorrectly, you may stop search engines from accessing your content. Very well written tutorial, Amit Bro.

  3. Great Article.
    It is great that you have posted this one at the right time for me, as I have just got started with blogging and I am confused about robots.txt. Can you mention an ideal robots.txt file to prefer?

    • Thanks Ravi. Glad that I posted this article at the right time :)
      I hope this article will help you solve your issue.
      See, robots.txt depends on your blog and on you: which links do you want indexed or deindexed by search engines?
      Though, here is the one that I mainly prefer:

      Sitemap: http://www.Yourdomainname.com/sitemap.xml

      User-agent: *
      Disallow: /wp-content/
      Allow: /wp-content/uploads/
      Disallow: /wp-content/downloads/
      Disallow: /downloads/
      Disallow: /feed/
      Disallow: /recommends/
      Disallow: /go/
      Disallow: /category/
      Disallow: /tag/
      Disallow: /tag/*
      Disallow: /archives/
      Disallow: /author/
      Disallow: /search?
      Disallow: /cgi-bin/
      Disallow: /wp-admin/
      Disallow: /wp-includes/
      Disallow: /recommended/
      Disallow: /comments/feed/
      Disallow: /index.php
      Disallow: /xmlrpc.php
      Disallow: *?wptheme
      Disallow: ?comments=*
      Disallow: /?p=*
      Disallow: /*.pdf$
      Disallow: /*.php$
      Disallow: /*.js$
      Disallow: /*.cgi$
      Disallow: /*.xhtml$
      Disallow: /*.php*
      Disallow: /*.inc$
      Disallow: /*.css$
      Disallow: /*.txt$
      Disallow: /*?*
      Disallow: */feed/
      Disallow: */trackback/
      Disallow: /images/
      Disallow: /embed.js?pname=wordpress&pver=*
      Disallow: /*?replytocom=*
      Disallow: /stats/
      Disallow: /general/
      Disallow: /date/
      Disallow: /trackback/
      Disallow: /?attachment_id*
      Disallow: /search/?*
      Disallow: */trackback/*
      Disallow: /*.wmv$
      Disallow: /?cat*
      Disallow: /?m*
      Disallow: /*?utm_source*
      Disallow: /index.php/*
      Disallow: /wp-*
      Disallow: /custom-search/

      User-agent: Googlebot-Image
      Allow: /images/post/

      User-agent: Adsbot-Google
      Allow: /*

      User-agent: Mediapartners-Google
      Allow: /*

      Thanks.

  4. One must be careful when configuring robots.txt, as you may accidentally place folders or locations there that you need bots to crawl. So mind the “Allow” and “Disallow” when configuring it.

  5. Hey, it's an awesome post. Robots.txt plays an important role if you want to restrict search engines from indexing unnecessary things on your blog. After reading this article, one should pay attention to robots.txt, especially Blogspot users.
