A Tech Log

October 13, 2008

Avoid search spidering via a robots.txt file

Filed under: Development — adallow @ 10:27 am

From Google:

To prevent robots from crawling your site, place the following robots.txt file in your server root:

User-agent: *
Disallow: /

To remove your site from Google only and prevent just Googlebot from crawling your site in the future, place the following robots.txt file in your server root:

User-agent: Googlebot
Disallow: /
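Before deploying either file, it can help to check how a parser reads the rules. The snippet below is a minimal sketch using Python's standard urllib.robotparser module. The build_parser helper, the "OtherBot" user agent and the page URL are made up for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The two rule sets quoted above.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_GOOGLEBOT = "User-agent: Googlebot\nDisallow: /\n"

block_all = build_parser(BLOCK_ALL)
block_googlebot = build_parser(BLOCK_GOOGLEBOT)

# Illustrative page URL; any path on the site gives the same answers.
PAGE = "http://yourserver.com/some/page.html"

# First file: every robot is refused.
print(block_all.can_fetch("Googlebot", PAGE))        # False
print(block_all.can_fetch("OtherBot", PAGE))         # False

# Second file: only Googlebot is refused.
print(block_googlebot.can_fetch("Googlebot", PAGE))  # False
print(block_googlebot.can_fetch("OtherBot", PAGE))   # True
```

Note that robots.txt is only a request: well-behaved crawlers honor it, but nothing in the file enforces it.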

Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you’ll need a separate robots.txt file for each of these protocols. For example, to allow Googlebot to index all http pages but no https pages, you’d use the robots.txt files below.

For the http protocol (http://yourserver.com/robots.txt):

User-agent: *
Allow: /

For the https protocol (https://yourserver.com/robots.txt):

User-agent: *
Disallow: /
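To make the "one robots.txt per protocol and port" point concrete, here is a rough Python sketch of how a crawler might pick which robots.txt applies to a given page. The function names and the ROBOTS_FILES table are hypothetical; the table stands in for files a real crawler would download, and this is not a description of how Googlebot itself is implemented.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the scheme, host and port of page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def make_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Stand-ins for the two files above, keyed by where they would be served.
ROBOTS_FILES = {
    "http://yourserver.com/robots.txt": make_parser("User-agent: *\nAllow: /\n"),
    "https://yourserver.com/robots.txt": make_parser("User-agent: *\nDisallow: /\n"),
}


def may_crawl(user_agent: str, page_url: str) -> bool:
    """Look up the robots.txt for the page's own scheme and port, then ask it."""
    parser = ROBOTS_FILES[robots_txt_url(page_url)]
    return parser.can_fetch(user_agent, page_url)


print(robots_txt_url("http://yourserver.com:8080/a"))
# http://yourserver.com:8080/robots.txt
print(may_crawl("Googlebot", "http://yourserver.com/page.html"))   # True
print(may_crawl("Googlebot", "https://yourserver.com/page.html"))  # False
```

The same page path gets a different answer depending on the scheme, because each scheme (and each port) is looked up under its own robots.txt URL.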

NB: if a robot discovers your site by other means – for example, by following a link to your URL from another site – your content may still appear in our index and our search results. To entirely prevent a page from being added to the Google index even if other sites link to it, use a noindex meta tag.
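The noindex meta tag goes in the page's head and looks like <meta name="robots" content="noindex">. As a quick way to confirm a page actually carries it, here is a small Python sketch built on the standard html.parser module. The class and function names are my own, and it only inspects meta tags named "robots" or "googlebot"; it does not look at HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Record whether a robots meta tag with a noindex directive was seen."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name in ("robots", "googlebot"):
            # The content attribute may hold several comma-separated directives.
            directives = {d.strip() for d in content.split(",")}
            if "noindex" in directives:
                self.noindex = True


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


blocked = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
plain = "<html><head><title>Hello</title></head></html>"

print(has_noindex(blocked))  # True
print(has_noindex(plain))    # False
```

One thing to keep in mind: a crawler can only see the tag if it is allowed to fetch the page, so a page that relies on noindex should not also be disallowed in robots.txt.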
