Announcing the Public Terabyte Dataset project

November 1, 2009

We’re very excited to announce the Public Terabyte Dataset project.

This is a high-quality crawl of top web sites, using AWS’s Elastic MapReduce, Concurrent’s Cascading workflow API, and Bixolabs’ elastic web mining platform.

Amazon will host the resulting dataset in S3, where it will be freely available to all EC2 users.

In addition, the code used to create and process the dataset will be available for download from http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263

Questions and input on the project can be submitted at https://scaleunlimited.com/PTD/

9 Responses to “Announcing the Public Terabyte Dataset project”

  1. […] Announcing the Public Terabyte Dataset project « Elastic Web Mining | Bixolabs "We’re very excited to announce the Public Terabyte Dataset project." (tags: data datasets mapreduce S3) […]

  2. Ken, I’m looking at http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263 but I don’t see PTD there.
    What exactly is in the dataset? A bunch of raw, unparsed HTML pages, right?

  3. Hi Otis,

    The dataset will be pushed to S3 as a set of compressed WARC (web archive) files. We’re still working out how much additional data to include. The full parse output would be too big, but we might include compressed term vectors for the Mahout-ers out there.

    The link you reference is where the sample code will be posted, for both the generation of the data (the crawl code) and examples of how to process the data files.

    — Ken

  4. […] The Public Terabyte Dataset project « Elastic Web Mining | Bixolabs […]

  5. Thank you for doing this crawl!

    The Internet Archive would be happy to host this collection for free public access. You can push it to the Internet Archive given our implementation of S3.

    If this is interesting to you, please contact Alexis Rossi (alexis at archive).

    -brewster

  6. Hi Brewster,

    Thanks for the offer to host the crawl at the Internet Archive.

    I’ll take you up on that offer, and follow up with Alexis.

    — Ken

  7. […] Announcing the Public Terabyte Dataset project « Elastic Web Mining | Bixolabs – This is a high quality crawl of top web sites, using AWS’s Elastic Map Reduce, Concurrent’s Cascading workflow API, and Bixolab’s elastic web mining platform. […]

  8. Any news on the status of this crawl? Any update on when it might be available? Thanks.

  9. Hi Ian,

    Unfortunately, the crawl is still tied to a pending release of some new functionality by Amazon, so we’re still in a (very long) holding pattern. But there’s some light at the end of the tunnel…

    — Ken