Crawlers take robots.txt from the web root instead of the website root
I have blocked crawlers from crawling the web root (/var/www/ in my case) using robots.txt. The robots.txt in /var/www/ has the line below in it: Disallow: /
Now I need one subdirectory of the web root (/var/www/mysite.com) to be crawled. I have added a robots.txt in that directory and added a virtual host in Apache to allow mysite.com to be crawled. However, crawlers still take the robots.txt from the web root (/var/www) instead of the one in /var/www/mysite.com.
Thanks in advance for any help.
You can specify only one robots.txt, and it goes in the root directory.
More information can be found in the official documentation:
Where to put it
The short answer: in the top-level directory of your web server.
The longer answer:
When a robot looks for the "/robots.txt" file for a URL, it strips the path component from the URL (everything from the first single slash) and puts "/robots.txt" in its place.
For example, for "http://www.example.com/shop/index.html", it will remove "/shop/index.html", replace it with "/robots.txt", and end up with "http://www.example.com/robots.txt".
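The rule quoted above can be sketched in a few lines of Python. This is only an illustration of the URL rewriting, not how any particular crawler is implemented; the function name robots_url is made up for this example, and only the standard-library urllib.parse calls are real.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL a crawler would request for page_url.

    Hypothetical helper: keeps the scheme and host, drops everything else.
    """
    parts = urlsplit(page_url)
    # Replace the path with /robots.txt and discard query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/shop/index.html"))
# prints http://www.example.com/robots.txt
```

Because the lookup is based on the URL's host, a crawler visiting mysite.com asks for http://mysite.com/robots.txt. Which file on disk answers that request depends on the document root configured for that hostname, not on where the file sits relative to /var/www.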
Also, the same page (at the bottom) gives an example of allowing a single webpage:
To exclude all files except one
This is currently a bit awkward, as there is no "Allow" field.
The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:
    User-agent: *
    Disallow: /~joe/stuff/

Alternatively, you can explicitly disallow all disallowed pages:

    User-agent: *
    Disallow: /~joe/junk.html
    Disallow: /~joe/foo.html
    Disallow: /~joe/bar.html
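One way to check rules like these before deploying them is Python's standard urllib.robotparser module. The sketch below is a minimal check under stated assumptions: the helper name allowed, the agent name "ExampleBot", and the example URLs are all hypothetical, and real crawlers may interpret rules somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Rules from the "separate directory" example above.
RULES = """\
User-agent: *
Disallow: /~joe/stuff/
"""


def allowed(rules: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    Hypothetical helper; "ExampleBot" is a placeholder user agent.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


print(allowed(RULES, "http://www.example.com/~joe/stuff/junk.html"))  # False
print(allowed(RULES, "http://www.example.com/~joe/index.html"))       # True
```

The file one level above the "stuff" directory stays fetchable because no Disallow prefix matches its path.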