A Tech Log

October 13, 2008

Avoid search spidering via a robots.txt file

Filed under: Development — adallow @ 10:27 am

From Google:

To prevent all robots from crawling your site, place the following robots.txt file in your server root:

User-agent: *
Disallow: /

To remove your site from Google only and prevent just Googlebot from crawling your site in the future, place the following robots.txt file in your server root:

User-agent: Googlebot
Disallow: /

Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you’ll need a separate robots.txt file for each of these protocols. For example, to allow Googlebot to index all http pages but no https pages, you’d use the robots.txt files below.

For the http protocol (http://yourserver.com/robots.txt):

User-agent: *
Allow: /

For the https protocol (https://yourserver.com/robots.txt):

User-agent: *
Disallow: /
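
A quick way to sanity-check that the two files above behave as intended is Python's standard-library robots.txt parser (urllib.robotparser in Python 3; it was the robotparser module back in 2008). This is just an illustrative sketch, not from Google's docs, and yourserver.com is the same placeholder host used above:

from urllib.robotparser import RobotFileParser

# The https robots.txt disallows everything:
rp = RobotFileParser("https://yourserver.com/robots.txt")
rp.read()  # fetch and parse the file
print(rp.can_fetch("Googlebot", "https://yourserver.com/secure.html"))  # False

# The http robots.txt allows everything:
rp = RobotFileParser("http://yourserver.com/robots.txt")
rp.read()
print(rp.can_fetch("Googlebot", "http://yourserver.com/index.html"))  # True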

NB: if a robot discovers your site by other means (for example, by following a link to your URL from another site), your content may still appear in our index and our search results. To entirely prevent a page from being added to the Google index, even if other sites link to it, use a noindex meta tag.
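
The noindex meta tag goes in the <head> of each page you want kept out of the index; its standard form is:

<meta name="robots" content="noindex">

To apply it to Google's crawler alone, the same tag works with name="googlebot" instead of name="robots".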

