Public Terabyte Dataset Project

This page has details on the Public Terabyte Dataset project, which was a test crawl we ran in 2009-2010.

This project was subsumed by the Common Crawl project, which (re)started in November of 2011.

If you’re interested in other large public datasets, take a look at our list of public datasets.

Below are some details of how we did the crawl…

We crawled approximately 100M pages from the top 1 million domains (ranked by US-based traffic).
The crawl was done by a custom Bixo workflow created by Scale Unlimited, built on top of Cascading/Hadoop and running in EC2 using Amazon’s Elastic MapReduce service.
We tried hard to avoid spam/adult content, though getting totally clean results is of course impossible.
We honored the robots “nofollow” and various “no archive” HTML meta tags, and we complied promptly with requests from website operators not to crawl their sites. Note that the rel="nofollow" attribute on an individual link does not mean that the link shouldn’t be followed; rather, it means that the link shouldn’t be counted when calculating a “PageRank” score.
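To illustrate the distinction between the page-level robots meta tag and the link-level rel="nofollow" attribute, here is a minimal Python sketch. It is not the code used in the crawl (that was a Bixo workflow); the class name RobotsMetaParser, the function robots_meta_policy, and the choice to treat only the "noarchive" token as a "no archive" signal are assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumption: which tokens count as "no archive" is our choice for this sketch.
NO_ARCHIVE_TOKENS = {"noarchive"}
# "none" is commonly treated as shorthand for "noindex, nofollow".
NO_FOLLOW_TOKENS = {"nofollow", "none"}


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags matter here; rel="nofollow" on <a> tags is
        # deliberately ignored, since it is a ranking hint and not a
        # crawl restriction.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta_policy(html):
    """Return whether a crawler may follow this page's links and archive it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    return {
        "follow_links": not (found & NO_FOLLOW_TOKENS),
        "archive": not (found & NO_ARCHIVE_TOKENS),
    }
```

Calling robots_meta_policy on a page whose head contains a robots meta tag with content "noarchive, nofollow" yields a policy that neither follows links nor archives the page, while a page that only has rel="nofollow" on some of its links is left unrestricted.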
