Robots.txt and the .gov TLD
SOURCE: OREILLY RADAR • 2009-11-20
TOPICS:
Web Crawlers • World Wide Web • Robots Exclusion Standard • Google • Googlebot • Internet Search Engines
I'm on the board of CommonCrawl.org, a nonprofit corporation that is attempting to provide a web crawl for use by all. An interesting report was just sent to us about the use of robots.txt files, the mechanism defined by the Robots Exclusion Standard, within the .gov top-level domain. In an examination of about 32,000 subdomains in .gov, it turns out that at least 1,188 have a robots.txt file with a global disallow, meaning robots are excluded from indexing this content. Even more curious, on 175 of these...
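For readers unfamiliar with the term, a "global disallow" is a robots.txt file that tells every crawler to stay away from the entire site. The sketch below shows one way such a file could be detected, using Python's standard `urllib.robotparser`. The function name `has_global_disallow` and the sample file contents are illustrative assumptions, not the method used in the report.

```python
from urllib.robotparser import RobotFileParser


def has_global_disallow(robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text bars user_agent from the site root.

    Hypothetical helper for illustration; the report's own method is not known.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    # If the root path "/" cannot be fetched, the whole site is off limits.
    return not parser.can_fetch(user_agent, "/")


# A global disallow: all agents, everything under "/".
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# An empty Disallow value means nothing is excluded.
ALLOW_ALL = "User-agent: *\nDisallow:\n"

# Only one directory is excluded, so this is not a global disallow.
BLOCK_ONE_DIR = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    for name, text in [("block all", BLOCK_ALL),
                       ("allow all", ALLOW_ALL),
                       ("block one dir", BLOCK_ONE_DIR)]:
        print(name, has_global_disallow(text))
```

Checking only the root path keeps the test simple: a rule of `Disallow: /` matches every URL on the host, while narrower rules leave the root reachable.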