robots.txt


rbradscott
6-3-05, 03:04 AM
I want my main site page at domain.com/index.html to be the only page that Google (et al.) crawls. My robots.txt is:

User-agent: *
Disallow: /

Will this prevent search engines like Google from listing my site entirely (not my goal), or just prevent them from crawling anything past the first/root page, which is my goal?

B&T
6-3-05, 03:08 AM
Use an empty robots.txt file and put this in the index page

<meta name="ROBOTS" content="INDEX,NOFOLLOW">
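
For anyone who wants to see how a crawler might read that tag, here is a minimal Python sketch. It assumes a well-behaved robot that treats the directive names case-insensitively; the class name RobotsMetaParser and the helper robots_directives are made up for illustration and are not part of any search engine's code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values keep their case.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for part in attr_map.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html):
    """Hypothetical helper: return the set of robots directives in a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="ROBOTS" content="INDEX,NOFOLLOW"></head></html>'
directives = robots_directives(page)

may_index = "noindex" not in directives    # page itself may be listed
may_follow = "nofollow" not in directives  # links on the page are not followed
print(may_index, may_follow)  # prints: True False
```

So a robot that honors the tag may list index.html but should not follow its links. It does nothing for robots that reach the other pages by a different route.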

rbradscott
6-3-05, 03:12 AM
Wow, thanks for your immediate reply! Can I not achieve my goal by using robots.txt at all? I like the idea of keeping the "do not enter" sign as a separate file I can edit to selectively allow access (at a future date) without having to edit my index.html. I do understand how your suggestion would work, regardless - thanks.

B&T
6-3-05, 03:16 AM
I do not believe you can do that with the robots.txt file. But someone else may know more about it than I do.

B&T
6-3-05, 06:44 AM
Quote: "Using B & T's suggestion you would need to also install this on every webpage that you don't want indexed since by default robots will index all pages"

Not true. The bot starts with the default page and goes from there. That tells it to stop. Look at what I posted again.

Croc Hunter
6-3-05, 08:41 AM
Personally, I think you should decide whether you want your site indexed or not.
Then use a robots.txt and .htaccess to block them all.
Or password-protect the areas where you don't want them.
If you don't want it seen, don't put it on the net.

stevel
6-3-05, 09:57 AM
I agree with keyplyr - you can't assume that every bot will start indexing at your home page.

If you have "Disallow: /" in your robots.txt, that will prevent indexing of any part of your site. Unfortunately, there is no way I can think of in a robots.txt to disallow all but one page of a site. See http://www.robotstxt.org/wc/norobots.html for documentation on robots.txt.

The meta tag, applied to every page you don't want indexed, is your best bet. Most spiders will honor it.
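
A quick way to check the "Disallow: /" behaviour yourself is Python's standard urllib.robotparser module. This is only a sketch of how one compliant parser reads the file; real spiders may differ, and the photos URL below is a made-up example.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the first post.
rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

urls = (
    "http://domain.com/",
    "http://domain.com/index.html",
    "http://domain.com/photos/a.jpg",  # hypothetical deeper page
)
for url in urls:
    # can_fetch() returns False when the rules forbid the URL.
    print(url, parser.can_fetch("Googlebot", url))
```

Every URL comes back False, the root page included, because "/" is a prefix of every path on the site.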

YvetteKuhns
6-3-05, 10:35 AM
Your site can be indexed by search engine robots without meta tags or robots.txt. Google does not read meta tags. If you have a link from a site like mine that is visible on other search engines, the robots will follow the link from my site to yours.

Including the meta tags is good for meta search engines such as MSN. MSN search also LOVES robots.txt. It is a good idea to use both meta tags and robots.txt. You can specify what should or should not be indexed. Google has a habit of indexing anything, though, so try using password protection for web stats and other things that should not be indexed.

B&T
6-3-05, 11:31 AM
If you take that approach, you cannot be sure a bot will pay attention to what you say at all.

It has worked as I said for me where I have used it. But you can do something else if you want :rolleyes:

rbradscott
6-3-05, 01:19 PM
Thanks so much for the discussion folks! This is exactly how we (all) get the "best" answer. I can read computer geek chat all day long, too! It's like we're all solving a puzzle together.

rbradscott
6-3-05, 03:11 PM
Just received this advice from a friend, whaddya think?

Although it's not great, it sounds like the best thing you could do
would be to explicitly disallow all the top-level files and folders. You
could do this by just denying all names that start with any character
except 'i', and then deal with blocking out all the other files and
directories that start with 'i'. So, something like this:

User-agent: *
Disallow: /a
Disallow: /A
Disallow: /b
Disallow: /B
Disallow: /c
Disallow: /C
... (up to but not including 'i')
Disallow: /I
Disallow: /j
Disallow: /J
... (up to zZ)
Disallow: /0
Disallow: /1
Disallow: /2
Disallow: /3
... (up to 9)
(then disallow all files and directories that start with 'i' but not
index.html)
Disallow: /iguana
Disallow: /istanbul
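
Rather than typing that list by hand, it could be generated and tested. The sketch below is one way to do it in Python, assuming prefix matching as implemented by the standard urllib.robotparser module; build_rules is a made-up helper name, and iguana and istanbul are just the placeholder names from the advice above.

```python
import string
from urllib.robotparser import RobotFileParser


def build_rules(extra_i_names):
    """Hypothetical helper: disallow every top-level name except those starting with 'i'."""
    lines = ["User-agent: *"]
    for ch in string.ascii_letters + string.digits:
        if ch != "i":  # lowercase 'i' is left open so /index.html stays crawlable
            lines.append(f"Disallow: /{ch}")
    # Names starting with 'i' (other than index.html) must be listed one by one.
    for name in extra_i_names:
        lines.append(f"Disallow: /{name}")
    return lines


rules = build_rules(["iguana", "istanbul"])
parser = RobotFileParser()
parser.parse(rules)

for path in ("/", "/index.html", "/about.html", "/iguana/pic.jpg", "/images/x.gif"):
    print(path, parser.can_fetch("Googlebot", "http://domain.com" + path))
```

With this parser, / and /index.html are allowed while /about.html and /iguana/pic.jpg are blocked. Note that /images/x.gif is still allowed: any new name starting with 'i' (or with a character that is not a letter or digit, such as an underscore) slips through until it is added to the list.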

YvetteKuhns
6-7-05, 02:28 PM
How many folders do you plan on having? It would be better to list the actual folder names as they are created, so you don't get confused later. Anything that should NOT be indexed should be stored in a folder to be protected.

I try to protect specific file extensions from being searched.
IndexIgnore *.gif *.jpg *.ico *.zip *.swf *.mp3 *.m3u *.avi *.ra
(This is in my .htaccess file.)