
spider visits


tcjay
3-13-02, 05:58 AM
How do I tell if a search engine spider has visited my site? Is there someplace I can look that will show me whether the site has been spidered?

Tom

bettyfordclinic
3-13-02, 08:22 PM
Check your raw access logs or other traffic monitoring systems. When a spider visits, it should first request your robots.txt file and then go through every page you have (sometimes in bursts). Look at the hostname or IP address of each request. The Google spider is called Googlebot; the AltaVista one is trek.av or something like that.

Here are some entries from my site:
crawl2.googlebot.com - - [06/Mar/2002:11:25:40 -0600] "GET /fjp/photos/people/0code/couple-elderlyjpg.html HTTP/1.0" 200 3490
crawl2.googlebot.com - - [06/Mar/2002:11:26:05 -0600] "GET /fjp/photos/country/0code/kilkenny-castle.html HTTP/1.0" 200 3674
crawl3.googlebot.com - - [06/Mar/2002:11:26:23 -0600] "GET /fjp/photos/shops/0code/blue-lights-3.html HTTP/1.0" 200 3337
crawl5.googlebot.com - - [06/Mar/2002:11:26:41 -0600] "GET /fjp/photos/shops/0code/dummies-plastic.html HTTP/1.0" 200 3432
crawl2.googlebot.com - - [06/Mar/2002:11:27:06 -0600] "GET /fjp/photos/shops/0code/bank-of-ireland-corridor.html HTTP/1.0" 200 3606
crawl4.googlebot.com - - [06/Mar/2002:11:27:29 -0600] "GET /fjp/photos/misc/0code/airplane-010730-3.html HTTP/1.0" 304 -
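
If you'd rather not eyeball the log, here is a minimal Python sketch that picks spider hits out of lines like the ones above. It assumes your server writes Common Log Format with resolved hostnames (as in this log); if yours records bare IP addresses, matching on the hostname won't work. The names `spider_hits` and `SPIDER_MARKERS` are made up for this example, and the last sample line is an invented non-spider visitor.

```python
import re

# Common Log Format: host ident user [time] "request" status size
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical marker list: substrings to look for in the requesting host.
SPIDER_MARKERS = ("googlebot",)


def spider_hits(lines, markers=SPIDER_MARKERS):
    """Return (host, path, status) for every line whose host looks like a spider."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        host = match.group("host")
        if any(marker in host.lower() for marker in markers):
            hits.append((host, match.group("path"), int(match.group("status"))))
    return hits


sample = [
    'crawl2.googlebot.com - - [06/Mar/2002:11:25:40 -0600] '
    '"GET /fjp/photos/people/0code/couple-elderlyjpg.html HTTP/1.0" 200 3490',
    'crawl4.googlebot.com - - [06/Mar/2002:11:27:29 -0600] '
    '"GET /fjp/photos/misc/0code/airplane-010730-3.html HTTP/1.0" 304 -',
    # Invented line standing in for an ordinary visitor:
    'dialup-42.example.net - - [06/Mar/2002:11:28:00 -0600] '
    '"GET /index.html HTTP/1.0" 200 1024',
]

hits = spider_hits(sample)
for host, path, status in hits:
    print(host, status, path)
```

To run it against a real log, pass an open file instead of `sample`, for example `spider_hits(open("access_log"))`, where the file name is whatever your host uses.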

Hope this helps....

bfc

Jess
3-13-02, 10:49 PM
What is the "robots.txt" file?

I've noticed in my error log that there are frequent entries saying,
"...... file does not exist.....htdocs/robots.txt"

MannInc
3-13-02, 11:47 PM
The robots.txt file is a text file placed in the htdocs folder (your site's document root) that tells spiders where they can and cannot go on your site. Here's an example of the text you'd place in this file:

User-agent: *
Disallow: /cgi-bin

This will tell spiders to stay out of the cgi-bin folder of your site.
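
If you want to check what those two lines actually block before uploading the file, here is a rough sketch using Python's standard `urllib.robotparser`. The helper `is_allowed` and the two test paths are made up for illustration; only the rules themselves come from the example above.

```python
from urllib.robotparser import RobotFileParser

# The same two rules as in the example robots.txt above.
RULES = [
    "User-agent: *",
    "Disallow: /cgi-bin",
]


def is_allowed(path, user_agent="*", rules=RULES):
    """Return True if a well-behaved spider may fetch `path` under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse the rules from memory, no network access
    return parser.can_fetch(user_agent, path)


# Hypothetical paths, just to exercise the rules.
cgi_allowed = is_allowed("/cgi-bin/search.cgi")
home_allowed = is_allowed("/index.html")
print("cgi-bin script allowed:", cgi_allowed)
print("home page allowed:", home_allowed)
```

Keep in mind that Disallow is a prefix match, so `/cgi-bin` also covers any other path that starts with those characters.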

Jess
3-14-02, 12:31 AM
Ah, thanks!

Methinks I should get busy and create a robots.txt file.

MarkHutch
3-14-02, 01:18 AM
If you need any help writing and testing your robots.txt file, you might want to visit this site...

http://www.searchtools.com/robots/robots-txt.html

MarkHutch

alphadesk
3-14-02, 01:27 AM
You can also add this to the head section of any page you do not want indexed:

<meta name="robots" content="NOINDEX,NOFOLLOW">

If you want the page indexed:

<meta name="robots" content="INDEX">

If you want the robot to index the page and also follow any links on it:

<meta name="robots" content="INDEX,FOLLOW">

You get the general idea.
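
To double-check which of these tags a page actually carries, here is a small sketch using Python's built-in `html.parser`. The class name `RobotsMetaParser`, the function `robots_directives`, and the sample page are assumptions made for this example; it only reads the tag and does not tell you how any particular spider will treat it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma separated and case-insensitive.
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of lowercase robots directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Made-up page using the first tag from the post above.
page = (
    "<html><head>"
    '<meta name="robots" content="NOINDEX,NOFOLLOW">'
    "</head><body>Hello</body></html>"
)

directives = robots_directives(page)
print(sorted(directives))
```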

The thing about robots is that some do what they want and pay no attention to any of this or to robots.txt, but all the major search engines will honor it.

MarkHutch
3-14-02, 01:31 AM
The only major one I have found that will fetch the robots.txt file and not follow it is Inktomi. They do get a copy of the file each time they visit, but there must be some kind of backup caching system that makes the file take forever to start working with Ink. Google, AltaVista, Ask Jeeves, etc. check the file and follow its instructions right away. Kind of strange behavior for Inktomi; I just noticed it a few weeks ago.