Crawlers do not take the robots.txt file from the website root but take it from the web root

April 15, 2010

I've blocked crawlers from crawling my web root (/var/www/ in my case) using robots.txt. I have a robots.txt in /var/www/, and it has the line below in it: Disallow: /

Now I need one subdirectory of the web root (/var/www/mysite.com) to be crawled by crawlers. I've added a robots.txt in that directory and added a virtual host in Apache to allow mysite.com to be crawled. But crawlers still take the robots.txt from the web root (/var/www) instead of from /var/www/mysite.com.

Thanks in advance for your help.

You can specify only one robots.txt, and it goes in the root directory.

More information can be found in the official documentation:

Where to put it

The short answer: in the top-level directory of your web server.

The longer answer:

When a robot looks for the "/robots.txt" file for a URL, it strips the path component from the URL (everything from the first single slash) and puts "/robots.txt" in its place.

For example, for "http://www.example.com/shop/index.html", it will remove "/shop/index.html", replace it with "/robots.txt", and end up with "http://www.example.com/robots.txt".
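The rule quoted above can be sketched in a few lines of Python. This is only an illustration of the URL rewriting, not any particular crawler's code; the helper name robots_txt_url is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL a crawler would request for page_url.

    Hypothetical helper: keeps the scheme and host, and replaces the
    path, query, and fragment with "/robots.txt".
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/shop/index.html"))
# prints http://www.example.com/robots.txt
```

Note that the lookup depends only on the host name in the URL, not on where the files live on disk. Assuming the Apache virtual host for mysite.com uses /var/www/mysite.com as its document root, a crawler visiting that host would request http://mysite.com/robots.txt, which that directory's own file would answer.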

Also, the same page (at the bottom) gives an example of allowing a single web page:

To exclude all files except one

This is a bit awkward, as there is no "Allow" field.

The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:

User-agent: *
Disallow: /~joe/stuff/

Alternatively, you can explicitly disallow all disallowed pages:

User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
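One way to check rules like these before deploying them is Python's standard urllib.robotparser module. The sketch below parses the "separate directory" rules from a string instead of fetching them; the page names index.html and secret.html and the helper allowed are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Rules from the "separate directory" example above.
RULES = """\
User-agent: *
Disallow: /~joe/stuff/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path: str) -> bool:
    """Hypothetical helper: may a generic crawler fetch this path?"""
    return parser.can_fetch("*", "http://www.example.com" + path)


# The one file left above the "stuff" directory stays crawlable.
print(allowed("/~joe/index.html"))
# Everything under the disallowed directory is blocked.
print(allowed("/~joe/stuff/secret.html"))
```

Real crawlers may interpret edge cases differently from this parser, so treat the result as a sanity check, not a guarantee.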
