Public Terabyte Dataset Project

This page has details on the Public Terabyte Dataset project, which was a test crawl we ran in 2009-2010.

This project was subsumed by the Common Crawl project, which (re)started in November of 2011.

If you’re interested in other large public datasets, take a look at our list of public datasets.

Below are some details of how we did the crawl…

  • We crawled approximately 100M pages from the top 1 million domains (ranked by US-based traffic).
  • The crawl was done by a custom Bixo workflow created by Scale Unlimited, built on top of Cascading/Hadoop and running in EC2 using Amazon’s Elastic MapReduce service (a rough sketch of this kind of Cascading flow appears after this list).
  • We tried hard to avoid spam/adult content, though getting totally clean results is of course impossible.
  • We honored the robots "nofollow" and "noarchive" HTML meta tags, and we complied promptly with requests from site owners not to crawl their sites. Note that the rel="nofollow" attribute on an individual link does not mean that the link shouldn't be followed, but rather that the link shouldn't be counted when calculating a "PageRank" score (see the second sketch at the end of this page).
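
To give a concrete feel for the Cascading/Hadoop plumbing mentioned above, here is a minimal, hypothetical Cascading 2.x flow (in Java) that simply copies text records from one HDFS/S3 path to another. It is only a sketch of the kind of flow a Bixo workflow is built from, not the actual crawl workflow; the class name, paths, and the use of the Cascading 2.x API are all assumptions.

    // Hypothetical sketch: a trivial Cascading 2.x flow on Hadoop.
    // Not the actual Bixo crawl workflow; a real workflow chains fetch,
    // parse, and filter pipes instead of a single pass-through pipe.
    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    public class CopyPagesFlow {
        public static void main(String[] args) {
            String inputPath = args[0];   // e.g. an s3n:// or hdfs:// path of fetched pages
            String outputPath = args[1];  // where the copied records are written

            // Source and sink taps read/write plain text lines on HDFS or S3.
            Tap source = new Hfs(new TextLine(), inputPath);
            Tap sink = new Hfs(new TextLine(), outputPath, SinkMode.REPLACE);

            // A single pass-through pipe; real crawl workflows add operations here.
            Pipe pipe = new Pipe("copy-pages");

            Properties properties = new Properties();
            AppProps.setApplicationJarClass(properties, CopyPagesFlow.class);

            // Plan and run the flow as one or more Hadoop MapReduce jobs, which is
            // what lets the same code run unchanged on Elastic MapReduce.
            Flow flow = new HadoopFlowConnector(properties).connect(source, sink, pipe);
            flow.complete();
        }
    }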
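
The distinction in the last bullet between page-level robots meta tags and per-link rel="nofollow" can be illustrated with a short, hypothetical Java snippet. It uses the jsoup HTML parser and made-up class and method names; it is not Bixo's actual tag handling, just a sketch of the policy described above.

    // Hypothetical illustration of the crawl policy described above:
    // page-level robots meta directives ("noarchive", "nofollow") are honored,
    // while rel="nofollow" on a link is only a ranking hint, not a fetch ban.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class MetaTagPolicy {

        // True if any robots meta tag on the page contains the given directive.
        static boolean hasRobotsDirective(Document doc, String directive) {
            for (Element meta : doc.select("meta[name=robots]")) {
                if (meta.attr("content").toLowerCase().contains(directive)) {
                    return true;
                }
            }
            return false;
        }

        public static void main(String[] args) {
            String html =
                "<html><head><meta name=\"robots\" content=\"noarchive\"></head>"
              + "<body><a href=\"http://example.com/\" rel=\"nofollow\">a link</a></body></html>";

            Document doc = Jsoup.parse(html);

            // Page-level directives decide whether we archive the content and
            // whether we extract outlinks from the page at all.
            System.out.println("archive page:     " + !hasRobotsDirective(doc, "noarchive"));
            System.out.println("extract outlinks: " + !hasRobotsDirective(doc, "nofollow"));

            // Per-link rel="nofollow": the link may still be fetched; it just
            // shouldn't contribute to a PageRank-style score.
            for (Element link : doc.select("a[href]")) {
                boolean passesRank = !link.attr("rel").toLowerCase().contains("nofollow");
                System.out.println(link.attr("href") + " passes rank: " + passesRank);
            }
        }
    }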