How to use robots.txt

If you are a webmaster or a site owner, it is sometimes imperative to know how to use robots.txt to restrict search engine crawlers from accessing certain files or pages on your server.

Using robots.txt for SEO allows you to exclude the pages that you don’t want search engines to crawl. You can find numerous examples of how to use a robots.txt file; this is a short tutorial on using it to tell search engine crawlers not to index the pages you want kept out of search results. By the way, robots.txt is the implementation of the Robots Exclusion Protocol.


A robots.txt file restricts access to your site by search engine robots that crawl the web.

The robots.txt file is the first thing a search engine crawler looks for on your site. It also comes in handy when you want to submit your Blogger blog’s sitemap to Google Webmaster Tools.
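For example, you can point crawlers to your sitemap straight from robots.txt with a Sitemap directive (the URL below is only a placeholder for your own sitemap):

#Allow everything and point crawlers to your XML sitemap
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml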

Remember, though, that in order to use robots.txt you need access to the root of your domain.
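For reference, crawlers only look for the file at the root of your host, so a robots.txt placed inside a subfolder is ignored (example.com is just a placeholder here):

http://www.example.com/robots.txt        #read by crawlers
http://www.example.com/blog/robots.txt   #ignored by crawlers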

Here are some of the search engine bots:

#Microsoft Search Engine Robot
User-agent: msnbot

#Yandex Search Engine Robot
User-agent: Yandex

#Yahoo! Search Engine Robot
User-agent: Slurp

#Google Search Engine Robot
User-agent: Googlebot

And here’s how you do it:

#Allow indexing of everything

User-agent: *
Disallow:

“User-agent: *” means the section applies to all robots, and an empty “Disallow:” tells them they may visit every page. In the next example, “Disallow: /” tells every robot not to visit any page on the site, which comes in especially handy when you have just uploaded a new site and are still testing it online.

#Disallow indexing of everything

User-agent: *
Disallow: /

#Allow a single robot (and disallow all others)

User-agent: Slurp
Disallow:
User-agent: *
Disallow: /
#This will allow only the Yahoo! bot (Slurp) to crawl your site

#Disallow indexing of a specific folder
User-agent: *
Disallow: /folder/

#Disallow a specific crawler bot from indexing a folder, except for one file in that folder

User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html

Robots.txt Wildcard Matching

Google and Microsoft’s Bing allow the use of wildcards in robots.txt files. To block access to all URLs that include a question mark (?), you could use the following entry:
User-agent: *
Disallow: /*?

You can use the $ character to match the end of a URL. For instance, to block any URLs that end with .asp, you could use the following entry:
User-agent: Googlebot
Disallow: /*.asp$

This wildcard matching is especially handy when you use UTM parameters to create campaigns for your web pages. It prevents canonical issues by keeping bots from indexing the same page under two different URLs.
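For instance, a rule along these lines (a rough sketch that assumes your campaign URLs carry the standard utm_source parameter) would keep bots away from the tagged duplicates while the clean URL stays crawlable:

#Block crawling of campaign-tagged URLs (assumes standard utm_ parameters)
User-agent: *
Disallow: /*utm_source=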

*Update after comment:

To remove a specific image from getting crawled and indexed, add the following:
User-agent: Googlebot
Disallow: /images/dontshowme.jpg
#This will disallow Google from crawling and indexing only the dontshowme.jpg image

If you don’t want a specific file type getting indexed in Google (for example, .gif), use the following:
User-agent: Googlebot
Disallow: /*.gif$
#This will disallow Google from crawling any .gif images

To remove all images on your site from Google Images:
User-agent: Googlebot-Image
Disallow: /
#This will disallow Google from crawling and indexing any images on your site, removing them from Google Images


Some important points to consider while using robots.txt:

  • Robots can ignore your robots.txt. All respectable crawlers read it first, but malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
  • Also, the robots.txt file is publicly available: anyone can see which sections of your server you don’t want robots to visit.
  • You only need a robots.txt file if your site includes content that you don’t want search engines to crawl and index. If you want search engines to crawl and index all of your site’s content, you don’t need a robots.txt file at all.