It is possible to keep any web page out of Google and other well-behaved search engines. The better search engines all respect an informal "robot exclusion standard" that provides two different ways to keep users from finding your site, or any part of your site, in searches. The newer approach relies on a meta element in the head element of each page; this method provides the most flexibility.
The older approach involves a text file called robots.txt in the document root of your website. Although robots.txt is less flexible overall, it is still the easiest choice if you simply want to lock all search engines out of your entire site.
The modern way: noindex and nofollow

The newfangled approach calls for a special meta element inside the head element of each page:
<meta name="robots" content="noindex, nofollow"/>
This element does two things:
1. noindex prevents Google from indexing the content of the page. So searches for text found in that page will not lead users to it, at least not via Google and other well-behaved search engines that honor this standard.
2. nofollow means that Google and other correctly coded search engines should not follow any links found in the page to discover new pages.
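In context, the element sits alongside the rest of your page metadata. Here is a minimal sketch (the page title is hypothetical):

```html
<html>
<head>
<title>Family photo album</title>
<!-- Tell well-behaved search engines not to index this page
     and not to follow any links it contains -->
<meta name="robots" content="noindex, nofollow"/>
</head>
<body>
...
</body>
</html>
```

You must repeat the element in every page you want excluded; unlike robots.txt, it applies only to the page it appears in.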
The old-school way: robots.txt

The old-fashioned method requires that you create a text file, robots.txt, and upload it to the document root of your website (where the home page of the entire site lives).
A robots.txt file that locks out all search engines is very simple:

User-agent: *
Disallow: /
When we specify / for Disallow:, we are indicating that search engines must not index any path that begins with a /. Since the path portion of every URL on the site begins with a /, this blocks the entire site.
Note that Disallow: works by simple prefix matching: it blocks access to any path on the site that begins with the specified text, so Disallow: /private blocks /private2.html as well as /private/index.html. If you need more precise control, you should use the meta element method instead.
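If you want to check how a crawler will interpret your rules before uploading the file, you can sketch it with Python's standard urllib.robotparser module (the rules and paths below are hypothetical):

```python
import urllib.robotparser

# A hypothetical robots.txt that blocks everything under /drafts/
# for all crawlers, while leaving the rest of the site open.
rules = """
User-agent: *
Disallow: /drafts/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# Prefix matching: anything under /drafts/ is blocked...
print(rp.can_fetch("*", "https://example.com/drafts/secret.html"))  # False

# ...but other paths are still allowed.
print(rp.can_fetch("*", "https://example.com/about.html"))  # True
```

This only predicts what a well-behaved crawler will do; it does not enforce anything on your server.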
What this technique is good for... and bad for

What are you really trying to accomplish?
If you want to keep search engines from indexing parts of your site that are dynamically generated and appear to contain an infinite number of pages, this is a perfect solution for you.
If you would prefer that some of your pages not be instantly discoverable in Google, but you can live with the fact that some search engines (and human beings!) might still link to them, this technique will work well for you too.
But if it is important to you that strangers not be able to find the pages at all, this is not the right solution for you! Badly behaved search engines are out there. And fellow webmasters, bloggers, and any kid with a myspace page might still link to your pages manually, bringing unwanted eyeballs to your site.
If your information is private, a public web page is not the place for it. See my article "How do I password-protect my web page?" for effective ways to keep unauthorized persons from seeing your web pages. Please do not rely on the robot exclusion standard for this purpose.
See Also...

Brown University's search engine exclusion page is an excellent resource with even more details on both approaches.