Public Terabyte Dataset Project
This page has details on the Public Terabyte Dataset project, which was a test crawl we ran in 2009-2010.
This project was subsumed by the Common Crawl project, which (re)started in November of 2011.
If you’re interested in other large public datasets, take a look at our list of public datasets.
Below are some details of how we did the crawl…
- We crawled approximately 100M pages from the top 100K–1 million domains (ranked by US-based traffic).
- The crawl was done by a custom Bixo workflow created by Scale Unlimited, built on top of Cascading/Hadoop and running in EC2 using Amazon’s Elastic MapReduce service.
- We tried hard to avoid spam/adult content, though getting totally clean results is of course impossible.
- We honored the robots “nofollow” and various “noarchive” HTML meta tags, and we complied promptly with requests from website operators not to crawl their sites. Note that the rel=”nofollow” attribute on an individual link does not mean that the link shouldn’t be followed, but rather that the link shouldn’t be counted when calculating a “PageRank”-style score.
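The distinction above — page-level robots meta directives versus per-link rel=”nofollow” hints — is easy to get wrong in a crawler. This is a minimal sketch of how a crawler might separate the two, using Python’s standard-library HTML parser; it is not the actual Bixo workflow, and the class and field names here are our own illustration:

```python
from html.parser import HTMLParser

class RobotsPolicyParser(HTMLParser):
    """Collect page-level robots directives and per-link nofollow hints.

    Page-level <meta name="robots"> directives such as "nofollow" or
    "noarchive" apply to the whole page; rel="nofollow" on a single <a>
    tag only asks that the link not be credited for ranking purposes.
    """

    def __init__(self):
        super().__init__()
        self.page_directives = set()  # e.g. {"nofollow", "noarchive"}
        self.links = []               # every href seen on the page
        self.nofollow_links = []      # hrefs carrying rel="nofollow"

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            # Directives are comma-separated, e.g. "noindex, nofollow"
            content = a.get("content") or ""
            self.page_directives.update(
                d.strip().lower() for d in content.split(",") if d.strip())
        elif tag == "a" and a.get("href"):
            self.links.append(a["href"])
            rels = (a.get("rel") or "").lower().split()
            if "nofollow" in rels:
                self.nofollow_links.append(a["href"])

# Example page: archived copies are forbidden, but links may still be
# followed; only the second link opts out of passing ranking credit.
html = """
<html><head><meta name="robots" content="noarchive"></head>
<body>
  <a href="https://example.com/a">plain link</a>
  <a href="https://example.com/b" rel="nofollow">no ranking credit</a>
</body></html>
"""
p = RobotsPolicyParser()
p.feed(html)
print(p.page_directives)   # page must not be archived
print(p.nofollow_links)    # link may be fetched, just not credited
```

A crawler honoring the policy described above would skip archiving any page whose `page_directives` contains `noarchive`, skip extracting outlinks when it contains `nofollow`, and exclude `nofollow_links` only from link-graph scoring, not from fetching.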