The robots.txt file on a website tells Google and other search engines which of your pages they may crawl. If you have a small site (like this one), it’s likely that you’ll want Google to crawl and index all the pages – but larger sites with a different preferred distribution of equity may require a more advanced robots.txt file.
If you don’t want certain pages crawled, or entire directories, you can quite simply stipulate this in the robots.txt. It’s not as complicated as it might sound – it’s just a text file that you need to place in the root folder of your site.
How Important is a Robots.txt File?
One of the first things a search engine spider peeks at when it comes to a site is the robots.txt file. You can see the robots.txt file of any site simply by adding “/robots.txt” to the end of the domain URL:
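For example:

https://www.example.com/robots.txt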
My robots.txt isn’t especially exciting at the moment, because I don’t actually have much content with which to justify any disallowing. This is a better example of a robots.txt:
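(An illustrative file – the disallowed paths will vary from site to site:)

User-agent: *
Disallow: /wp-admin/
Disallow: /search/
Disallow: /checkout/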
(Wikipedia’s robots.txt is actually very interesting, and will give you an idea of the kind of web crawlers which you might not want peeking at your site.)
A search engine crawler will, at a basic level, crawl your site from the homepage downwards. If you stipulate that there are certain things you don’t want indexed, it won’t index them. Simple as that.
If you don’t want any pages to be missed on your site, you don’t need to add a robots.txt file. Equally, if you already have a robots.txt file and want all pages to be read, your robots.txt file should simply read:
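User-agent: *
Disallow:

The empty Disallow means nothing is blocked, so everything can be crawled.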
In this instance, the ‘*’ denotes any type of search engine spider.
Disallow denotes the pages and folders which should not be crawled by Google, and thus not indexed. You can prevent entire directories from being indexed in this way.
For example, if I had a directory named ‘kittens’ which I did not want to be indexed, my robots.txt file may look like this:
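User-agent: *
Disallow: /kittens/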
It is possible to allow certain items within said folder as long as this is specified. If there were one single URL which I wanted indexed within that directory, my robots.txt file may now look like this:
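(Using a placeholder file name for that one URL:)

User-agent: *
Disallow: /kittens/
Allow: /kittens/fluffy-kitten.html

Google honours the most specific (longest) matching rule, so the Allow wins for that particular URL.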
Not sure why you would want to disallow any directory involving kittens (because that’s linkbait if there ever was any), but that’s how you would do it.
You need to be certain that you put your robots.txt file in the top directory of your server or it won’t work. If you want to check if you have disallowed pages properly, you can use the ‘fetch as Google’ tool in Webmaster Tools.
Why Disallow Robots?
E-commerce sites in particular may have a number of directories or pages which they do not wish to be crawled. This is a common practice for filtered product pages, to avoid the risk of content duplication.
For example, if you had a product page for a dress and filtering it by colour creates an entirely different URL, you might want to disallow the variants of that page. Thus the static content is crawled only once, and the core product page gets the full distribution of equity.
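Assuming the colour filter is appended as a URL parameter (the exact parameter name will depend on your platform), the rule might look like this:

User-agent: *
Disallow: /*?colour=

The * wildcard used here is covered further down.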
Retired pages can also be disallowed in the robots.txt, which means that they will (eventually) stop appearing in the SERPs – you obviously still need to 301 redirect any retired pages to the homepage or a relevant page.
It’s worth noting that anything after a hash (#) within a URL string – the fragment – is not indexed anyway.
Different Rules for Different Robots
When you use ‘User-agent: *’ within your robots file, you’re giving instructions to EVERY crawler.
However, if you want to give directions to specific search engines, the file is a little different.
Different user agents apply to different search engines – and it’s not just for your standard search engines such as Google and Bing. User agents can apply to other web crawlers, like Xenu.
Website crawlers like Screaming Frog will, by default, crawl any page according to the rules you have stipulated for Googlebot, unless you specify that they shouldn’t, like so:

User-agent: Screaming Frog SEO Spider
Disallow: /
When stipulating a number of different directions for different robots, your file won’t look that much different.
Here’s an example of a file which would stipulate certain conditions for different robots (the page name is a placeholder):

User-agent: Googlebot
Disallow: /kittens/

User-agent: Screaming Frog SEO Spider
Disallow: /kittens/fluffy-kitten.html

User-agent: Bingbot
Disallow: /kittens/fluffy-kitten.html
In this instance, the pages and directories listed under ‘Googlebot’ may be crawled by the other engines, as it was not stipulated that they shouldn’t be. The robots.txt file works under the assumption that if you haven’t flagged something for disallow, then it gets crawled.
In the above example, while the entire directory of /kittens/ is disallowed for Google only, Screaming Frog can still crawl the directory but not the specified page. This is the same for Bingbot in this example.
Google has different user agents which let you control Googlebot’s crawling at a granular level. You can see them all here; they’re especially useful if you don’t want things like images crawled, or Google’s advertising bot to see your site.
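For example, to stop your images being crawled, you could block Googlebot-Image (Google’s image crawler) entirely:

User-agent: Googlebot-Image
Disallow: /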
Setting Crawl Delays
Again, with a site as small as seokitty.net, I really don’t need to put in a crawl delay. The reason is that the site is so small, it’s unlikely to crash when someone decides to give it a crawl or two.
However, big sites – in my experience, e-commerce sites – can end up with crashed servers if multiple crawls are completed. If you crawl a big site repeatedly with your spider tool, you’re likely to get your IP blocked (I’ve seen this happen plenty of times; not ideal when you’ve got a competitor research deadline).
A crawl delay in the robots.txt will save your site from the risk of this.
The crawl-delay parameter will apply to each user agent separately. It sets the number of seconds to wait between requests to the server. Of course, this makes crawling slower.
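For example, to ask crawlers to wait ten seconds between requests:

User-agent: *
Crawl-delay: 10

It’s worth knowing that Googlebot ignores the crawl-delay directive – hence the Webmaster Tools option below.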
You can also set a crawl rate in Webmaster Tools, although it only applies to Google.
Go to your Webmaster Tools account, and click the picture of the cog in the top right.
Choose ‘site settings’, and you’ll be taken to the page with the crawl rate options.
You can then use the slider to slow the crawl rate. This applies to Googlebot, and thus to all of Googlebot’s user agents.
Robots.txt Pattern Matching
Google – and, as far as I’m aware, Yahoo – use pattern matching in the robots.txt with a couple of ‘wildcard’ characters.
The asterisk will allow you to block URLs which start with a certain word (with * denoting the inclusion of anything after it), like so:

User-agent: *
Disallow: /kitten*

This would mean that directories like these would not be indexed:

/kittens/
/kitten-pictures/
/kitten-gifs/
For the opposite function, the $ sign comes in. Now you can block any URL which ends in a certain way. So if you wanted to block everything that ends with ‘.php’, you can.
You can do this for specific filetypes, like gifs or pngs.
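So blocking everything that ends in ‘.php’, and likewise all gifs, would look like this:

User-agent: *
Disallow: /*.php$
Disallow: /*.gif$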
In order to block URLs which contain a question mark, you’ll need to use the asterisk again. You can use this to block the indexing of queries on your site.
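For example:

User-agent: *
Disallow: /*?

This blocks any URL containing a question mark.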
You’ll see the hash (#) in robots.txt files too. When a bot reads the file, it will ignore anything on a single line after a hash. Thus, it’s used for writing notes.
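For example:

# keep the kittens out of the index
User-agent: *
Disallow: /kittens/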
Allow in Robots.txt
Just as you can disallow, you can allow too. This is often a good idea if you’ve spent a good amount of time with certain parts of the site disallowed and you want to get them indexed again.
Just as above, you can block robots’ access to your entire site aside from the parts you allow. You can also use it to hone the allowance of certain URL patterns:
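For example:

User-agent: *
Disallow: /*sloth
Allow: /*sloth$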
In the above, we have disallowed any URL which includes ‘sloth’ EXCEPT for the URLs which END in ‘sloth’.
Robots Meta Tags and Robots.txt
If you’ve got robots meta tags on your pages which give different directions to the robots.txt file of your site, crawlers will read things a bit differently.
If you’ve blocked a page in your robots.txt, any robots meta tags on that page will be ignored, because the page will never be crawled.
However, if you’ve got that page allowed in your robots.txt and your robots meta tag gives the direction not to index it, the page still won’t be indexed.
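The meta tag in question sits in the page’s <head> and looks like this:

<meta name="robots" content="noindex">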
Finally, you can include directions to your sitemap in your robots.txt using ‘Sitemap:’.
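For example (with a placeholder domain):

Sitemap: https://www.example.com/sitemap.xml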