Public Terabyte Dataset Project

This page has more details on the Public Terabyte Dataset project, which was recently announced at the ACM data mining unconference.

  • The data comes from a crawl of 50-200M pages from the top 1 million domains (by US-based traffic).
  • The crawl is done by a custom Bixo workflow created by Scale Unlimited, built on top of Cascading/Hadoop and running in EC2 using Amazon’s Elastic MapReduce service.
  • We’ll be trying hard to avoid spam/adult content, though getting totally clean results is of course impossible.
  • We honor the robots “nofollow” and various “no archive” HTML meta tags, and we’ll comply promptly with requests from web site owners to remove any of their content from the datasets. Note that the rel="nofollow" attribute on an individual link does not mean that the link shouldn’t be followed, but rather that the link shouldn’t be counted when calculating a “PageRank” score.
  • The resulting data will be stored as compressed Avro files in S3.
  • Hosting for the dataset is being provided by Amazon.
  • Access to the data is free, assuming you’re running code in EC2.
  • The code used to run the crawl, as well as code to access the crawl data, will be available at http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263.
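
As an illustration of the robots meta tag handling described in the list above, here is a minimal sketch in Python. It is not the Bixo implementation. The names RobotsMetaParser and crawl_policy are hypothetical, and the sketch assumes that only the "nofollow", "noarchive", and "none" directives matter. It deliberately ignores rel="nofollow" on individual links, since that attribute affects link scoring and not whether a link is followed.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags.

    Hypothetical helper for illustration; not part of the crawler's code.
    """

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Only page-level meta tags are inspected; rel="nofollow" on <a>
        # elements is intentionally not treated as a crawl restriction.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def crawl_policy(html):
    """Return whether a page's links may be followed and the page archived.

    Assumption: "none" is treated as implying "nofollow", and only
    "noarchive" blocks archiving.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = parser.directives
    return {
        "follow_links": not ({"nofollow", "none"} & directives),
        "archive": "noarchive" not in directives,
    }


if __name__ == "__main__":
    page = (
        '<html><head><meta name="robots" content="NOFOLLOW, noarchive">'
        '</head><body><a href="/x">x</a></body></html>'
    )
    print(crawl_policy(page))
```

A page with no robots meta tag yields a policy that allows both following and archiving, even if some of its links carry rel="nofollow".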

There’s a form where you can request information and provide input on the crawl.