Can Robots.txt hurt rankings?
notset4life
11-5-07, 12:09 PM
I have a fairly large robots.txt file. I created it a few years ago and never really looked at it until recently.
It has a lot of lines like:
------------------------------------------
User-agent: URL_Spider_Pro
Disallow: /
User-agent: CherryPicker
Disallow: /
User-agent: EmailCollector
Disallow: /
User-agent: EmailSiphon
Disallow: /
User-agent: WebBandit
Disallow: /
----------------------------------------------------------------------
etc.
I've read that a better way to block bots would be through htaccess.
But my question is this: can a large robots.txt file like this hurt rankings?
Google Webmaster Tools won't analyze this file because it says it is over 5k (or something like that). I wonder if the inability to analyze it might actually be hurting my ranking.
Thanks in advance for any help and advice.
Vin
Croc Hunter
11-5-07, 10:51 PM
It's a grey area. Some suggest that a bot hitting a large robots.txt file will, depending on the bot, red-flag, ignore, or just leave your site. Certainly it will slow the bot traversing your site. Fact is, .htaccess is the best method. The code example you gave is meant to block 'bad' bots from your site, but do you think a bad bot is going to even look for your robots.txt, let alone obey it? Keep your basic robots.txt and use .htaccess to block bad bots, website downloaders, etc.
The code goes like this:
# mod_rewrite has to be switched on once before any rules
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
# the last condition must not carry [OR], or the rule would match every visitor
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla
RewriteRule ^.* - [F,L]
notset4life
11-5-07, 11:19 PM
Thank you for the reply.
I saw some other code that went like this:
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
SetEnvIfNoCase User-Agent "^NICErsPRO" bad_bot
SetEnvIfNoCase User-Agent "^Teleport" bad_bot
SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
Would you say the code you provided is better, would either work just fine, or both?
thanks again
Croc Hunter
11-6-07, 12:06 AM
Either is fine; that list only has bad_bot because it was compiled with a .pl script. That's OK, you can use that one. Just be sure to include this at the end of your list:
<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
I wouldn't just jam the two together, but yes, you could compile the ^BotName parts into one list using either code.
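If you'd rather not merge the two lists by hand, something like the following rough Python sketch could do it. It is only an illustration: the function names (escape_for_rewritecond, merge_bot_names, build_rewrite_block) and the short sample lists are made up for this example. It assumes the RewriteCond style, drops case-insensitive duplicates, escapes spaces, and puts [OR] on every condition except the last.

```python
def escape_for_rewritecond(name):
    """Backslash-escape spaces and regex metacharacters in a bot name."""
    special = set("\\.^$*+?()[]{}| ")
    return "".join("\\" + ch if ch in special else ch for ch in name)


def merge_bot_names(*lists):
    """Merge several lists of bot names, dropping case-insensitive duplicates."""
    seen = set()
    merged = []
    for names in lists:
        for name in names:
            key = name.lower()
            if key not in seen:
                seen.add(key)
                merged.append(name)
    return merged


def build_rewrite_block(names):
    """Return .htaccess lines that answer 403 to the given user-agent prefixes."""
    if not names:
        # With no conditions, a bare RewriteRule would forbid every request.
        return []
    lines = ["RewriteEngine On"]
    for i, name in enumerate(names):
        # [OR] goes on every condition except the last one.
        flag = " [OR]" if i < len(names) - 1 else ""
        lines.append(
            "RewriteCond %{HTTP_USER_AGENT} ^" + escape_for_rewritecond(name) + flag
        )
    lines.append("RewriteRule ^.* - [F,L]")
    return lines


# Hypothetical sample input: a few names from each style of list.
rewrite_list = ["EmailSiphon", "EmailWolf", "Express WebPictures"]
setenvif_list = ["EmailSiphon", "CherryPicker", "Teleport"]

block = build_rewrite_block(merge_bot_names(rewrite_list, setenvif_list))
print("\n".join(block))
```

Paste the printed lines into .htaccess and test with a browser before relying on them.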
notset4life
11-6-07, 12:11 AM
thanks again
Croc Hunter
11-6-07, 01:04 AM
Here you go.. 230 (only 2 more) in the combined list using the RewriteCond method.
http://www.streetsie.com/webmasters/BadBot.txt
Croc, I'm curious... does having all the rewrites add much overhead to the .htaccess file processing?
THX
symo
Croc Hunter
11-6-07, 11:29 PM
Yeah it does. Any directive (line) added to your .htaccess has to be processed by Apache whenever an HTTP request hits your site. If you run a huge website, well, you would be on a dedicated server. On a shared host like Powweb, if every customer added a 40k .htaccess file overnight you'd notice a performance drop. RewriteCond scans the list looking for a user-agent match. If a match is found (someone or a bad bot tries to access the website with one of these website downloaders), they get a forbidden message. If there is no match, access is allowed. A very basic .htaccess vs a long one: we're talking about 2ms vs 30ms.
Still, it's pointless to waste server resources checking every visitor against bots most of us will never see, or filling your .htaccess with repeated directives (pet hate). You only need one RewriteEngine On!! The full list is around 10k; that's not what I'd call large. I usually cull the list, leaving the most popular offenders.
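To picture what that check amounts to, here is a small Python sketch of the same idea. It is not how Apache implements it, just the logic: each visitor's user-agent is compared against every listed prefix in turn. The name status_for and the short BAD_PREFIXES list are made up for this example, and the match is case-sensitive, as a RewriteCond without [NC] would be.

```python
# Hypothetical short list; a real .htaccess list may hold a couple of hundred entries.
BAD_PREFIXES = ["EmailSiphon", "EmailWolf", "ExtractorPro", "GetRight"]


def status_for(user_agent, prefixes=BAD_PREFIXES):
    """Return 403 if the user-agent starts with a listed prefix, else 200."""
    for prefix in prefixes:  # one comparison per listed bot, for every request
        if user_agent.startswith(prefix):
            return 403
    return 200


print(status_for("EmailSiphon/1.0"))
print(status_for("Mozilla/5.0 (compatible; Googlebot/2.1)"))
```

A normal visitor never matches, so they pay for the whole list on every request, which is why culling it down to the offenders you actually see makes sense.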
THX for the info Croc
symo
notset4life
12-5-07, 01:04 PM
I have one more question about robots.txt.
After using a tool to analyze the file, there were lines in it that I just removed:
User-agent: sitecheck.internetseer.com
User-agent: Googlebot
User-agent: MSNBot
From what I understand, the robots.txt is used to "disallow" certain robots or access to certain directories, etc.
Do the lines above really need to be "ADDED" to ALLOW a robot, such as Googlebot?
Or would those lines just be necessary if I'm disallowing all robots, with the exception of those listed? It seems to me that in my case the three lines above would not be necessary at all, so I removed them. Correct?
thanks.
Croc Hunter
12-6-07, 04:48 AM
It depends on the line after the User-agent: you're calling. For instance, to block Googlebot entirely, you can use the following syntax:
User-agent: Googlebot
Disallow: /
Allowing Googlebot: If you want to block access to all bots other than Googlebot, you can use the following syntax:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
Googlebot follows the line directed at it, rather than the line directed at everyone. Not all bots can read Allow: /, so you can use Disallow: with nothing after it instead if you wish. If a bot is not listed at all, chances are it will trawl your site anyway, but why not put out the welcome mat?
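If you want to check rules like these before uploading them, one option is Python's standard urllib.robotparser module. The sketch below is only a rough check under a few assumptions: the helper parse_robots, the example.com URL and the "SomeOtherBot" name are made up for illustration, and this parser is not guaranteed to behave exactly like Googlebot or any other real crawler.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text):
    """Parse robots.txt content supplied as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Block everyone except Googlebot, using Allow: /
with_allow = parse_robots(
    "User-agent: *\n"
    "Disallow: /\n"
    "User-agent: Googlebot\n"
    "Allow: /\n"
)

# Same intent, using an empty Disallow: for bots that do not understand Allow
with_empty_disallow = parse_robots(
    "User-agent: *\n"
    "Disallow: /\n"
    "User-agent: Googlebot\n"
    "Disallow:\n"
)

url = "http://www.example.com/page.html"  # hypothetical URL
for label, rules in [("Allow: /", with_allow), ("Disallow:", with_empty_disallow)]:
    print(label, "Googlebot:", rules.can_fetch("Googlebot", url))
    print(label, "SomeOtherBot:", rules.can_fetch("SomeOtherBot", url))
```

Keep in mind this only tells you what a well-behaved bot should do; the bad bots discussed earlier in the thread will not read the file at all.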