The Robots Exclusion Standard

The robots.txt protocol is a way of asking web crawlers not to access all or part of a website that is otherwise publicly viewable.

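The file lives at the root of the host (for example, https://example.com/robots.txt), and compliant crawlers fetch it before crawling anything else. As a minimal sketch, with a hypothetical /private/ path:

    # Applies to every crawler
    User-agent: *
    # Ask that nothing under /private/ be crawled
    Disallow: /private/

Note that the rules are advisory: the standard relies on crawlers choosing to honor them, not on any technical enforcement.
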
Google's Don'ts

  • Allowing search-results-like pages to be crawled. (Users dislike leaving one search results page only to land on another that adds little value for them.)

  • Allowing URLs created by proxy services to be crawled; see the sketch after this list.
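
Both advisories translate into Disallow rules. The sketch below assumes the site serves its internal search at /search and proxied content under /proxy/; both paths are assumptions, and real sites will differ:

    User-agent: *
    # Keep crawlers off internal search results pages (assumed path)
    Disallow: /search
    # Keep crawlers off proxy-generated URLs (assumed path)
    Disallow: /proxy/

Multiple Disallow lines under a single User-agent group apply together, so one group can cover both cases.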