popurls

Robots.txt and the .gov TLD

SOURCE: O'REILLY RADAR • 2009-11-20
TOPICS: Web Crawlers • World Wide Web • Robots Exclusion Standard • Google • Googlebot • Internet Search Engines

I'm on the board of CommonCrawl.org, a nonprofit corporation that is attempting to provide a web crawl for use by all. An interesting report was just sent to us about the use of robots.txt files, the mechanism defined by the Robots Exclusion Standard, within the .gov top-level domain. An examination of about 32,000 subdomains in .gov found that at least 1,188 of them have a robots.txt file with a global disallow, meaning robots are excluded from indexing this content. Even more curious, on 175 of these...
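To make "global disallow" concrete, here is a minimal sketch of how one might test a robots.txt body for it, using Python's standard `urllib.robotparser`. This is not the method used in the report; the function name `has_global_disallow` and the probe URL `https://example.gov/` are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser


def has_global_disallow(robots_txt: str,
                        probe_url: str = "https://example.gov/") -> bool:
    """Return True if the rules bar a generic robot ("*") from the site root.

    `probe_url` is a hypothetical placeholder; only its path ("/") matters.
    """
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    # A global disallow means even the root path is off limits to "*".
    return not parser.can_fetch("*", probe_url)


# The classic global disallow: every robot, every path.
blocked = "User-agent: *\nDisallow: /\n"
# A partial rule that still leaves the root crawlable.
partial = "User-agent: *\nDisallow: /private/\n"

print(has_global_disallow(blocked))
print(has_global_disallow(partial))
```

Checking only the root path is a simplification: a file that blocks every robot from `/` is treated as a global disallow, while a file that singles out one named crawler is not.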
