Public Terabyte Dataset

The Public Terabyte Dataset project was a large-scale crawl of top domains, run with Scale Unlimited’s elastic web mining platform, Amazon’s Elastic MapReduce (EMR) web service, and Concurrent’s Cascading workflow API.

This project was subsumed by the Common Crawl project, which (re)started in November 2011.

If you’re interested in other large public datasets, take a look at our list of public datasets.