Nov 03 2009
A quick recap on robots.txt
Matt Cutts, who works for the Search Quality group at Google, recently published a video demonstrating how a robots file should be used effectively. With that in mind, I have written a short introduction to the robots file itself, including a quick look at some of the more advanced features beyond simply disallowing pages.
For those who don’t know, the purpose of the robots.txt file is to stop search engine spiders from crawling the sections of your website you don’t want crawled.
Adding the lines below allows every crawler, Google included, to crawl all elements of your website:
User-agent: *
Disallow:
To disallow sections of your website, use the following, replacing the folder I’ve included with your own website’s folder:
User-agent: *
Disallow: /test/
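As a quick way to check the two files above before uploading them, here is a minimal sketch using Python’s standard urllib.robotparser module. The crawler name ExampleBot and the page paths are made up for illustration; the rules themselves are the ones shown above.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# An empty Disallow value blocks nothing, so everything may be crawled.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# Everything under /test/ is blocked; the rest of the site stays open.
BLOCK_TEST = """\
User-agent: *
Disallow: /test/
"""

allow_all = build_parser(ALLOW_ALL)
block_test = build_parser(BLOCK_TEST)

# "ExampleBot" is a hypothetical crawler name; "User-agent: *" covers it.
print(allow_all.can_fetch("ExampleBot", "/test/page.html"))   # True
print(block_test.can_fetch("ExampleBot", "/test/page.html"))  # False
print(block_test.can_fetch("ExampleBot", "/index.html"))      # True
```

Parsing from a string keeps the check offline, which is handy when testing a draft file that is not live yet.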
The follow-up to this is to keep the test folder blocked, but to add some exceptions to the rule by allowing single files from the folder:
User-agent: *
Disallow: /test/
Allow: /test/allow.html
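How a crawler settles the conflict between the Disallow and Allow lines depends on the crawler. The sketch below assumes the "most specific rule wins" behaviour that Google documents, where the longest matching path takes precedence and a tie goes to Allow. The function name is_allowed and the test paths are my own, and this is not how every parser works: Python’s built-in urllib.robotparser, for instance, applies the first matching line in file order, so for such parsers it is safer to list the Allow line first.

```python
def is_allowed(rules, path):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) pairs. The longest matching
    prefix wins; on a tie, Allow beats Disallow. Paths that match no rule
    are allowed by default.
    """
    best_length = -1
    allowed = True
    for directive, prefix in rules:
        # An empty prefix matches nothing here (an empty Disallow blocks nothing).
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed


# The rules from the example above.
RULES = [
    ("Disallow", "/test/"),
    ("Allow", "/test/allow.html"),
]

print(is_allowed(RULES, "/test/allow.html"))  # True: the longer Allow rule wins
print(is_allowed(RULES, "/test/other.html"))  # False: only the Disallow matches
print(is_allowed(RULES, "/index.html"))       # True: no rule matches
```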
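The next step is to look into the more advanced use of wildcards (*). These let you exclude and include pages on a much broader scale, just as you would use a wildcard in any other search. Wildcards are an extension to the original robots.txt convention, so check that the crawlers you care about support them.
For example, to block ASP and ASPX pages while explicitly allowing PHP pages, you would do the following, where the trailing $ anchors each pattern to the end of the URL:
User-agent: *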
Disallow: /*.aspx$
Disallow: /*.asp$
Allow: /*.php$
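To make the pattern syntax concrete, here is a rough sketch that translates a robots.txt pattern into a regular expression, treating * as "any run of characters" and a trailing $ as "the URL must end here". The helper names pattern_to_regex and matches, and the sample URLs, are assumptions of mine; real crawlers may differ in the details.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the
    pattern to the end of the URL. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    """True if `path` matches `pattern`, starting from the beginning of the path."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.asp$", "/contact.asp"))     # True
print(matches("/*.asp$", "/contact.aspx"))    # False: the '$' stops it matching .aspx
print(matches("/*.aspx$", "/contact.aspx"))   # True
print(matches("/*.php$", "/index.php"))       # True
print(matches("/*.php$", "/index.php?id=1"))  # False: the URL does not end in .php
```

Note that /index.php?id=1 is not matched by the Allow line, but it is still crawlable because no Disallow line matches it either; anything that matches no rule is allowed by default.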
Hopefully, this gives you a better insight into how powerful the robots file can be when you are fighting with the search engines over which website content should be included or excluded.
Chris Hutchison
SEO Programmer
This SEO news has been brought to you by Just Search; Experts in internet marketing and PPC







