Announcing the Public Terabyte Dataset project

November 1, 2009

Tags: amazon, ec2, emr, public terabyte dataset, web mining

We’re very excited to announce the Public Terabyte Dataset project.

This is a high-quality crawl of top web sites, performed using AWS’s Elastic MapReduce, Concurrent’s Cascading workflow API, and Bixo Labs’ elastic web mining platform.

Amazon will host the resulting dataset in S3, where it will be freely available to all EC2 users.

In addition, the code used to create and process the dataset will be available for download from http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263

Questions and input on the project can be submitted at https://scaleunlimited.com/PTD/

Filed under:
Uncategorized by kkrugler

9 Responses to “Announcing the Public Terabyte Dataset project”

Notional Slurry » links for 2009-11-02

November 2nd, 2009 at 11:03 pm

[…] Announcing the Public Terabyte Dataset project « Elastic Web Mining | Bixolabs "We’re very excited to announce the Public Terabyte Dataset project." (tags: data datasets mapreduce S3) […]
Otis Gospodnetic

November 3rd, 2009 at 8:19 am

Ken, I’m looking at http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263 but I don’t see PTD there.
What exactly is in the dataset? A bunch of raw, unparsed HTML pages, right?
kkrugler

November 3rd, 2009 at 4:08 pm

Hi Otis,

The dataset will be pushed to S3 as a set of compressed WARC (web archive) files. We’re still working out how much additional data to include; the parsed output would be too big, but we might include compressed term vectors for the Mahout-ers out there.

The link you reference is where the sample code will be posted, for both the generation of the data (the crawl code) and examples of how to process the data files.

— Ken
Michael Nielsen » Biweekly links for 11/06/2009

November 6th, 2009 at 3:53 am

[…] The Public Terabyte Dataset project « Elastic Web Mining | Bixolabs […]
Brewster Kahle

November 7th, 2009 at 9:43 pm

Thank you for doing this crawl!

The Internet Archive would be happy to host this collection for free public access. You can push it to the Internet Archive given our implementation of S3.

If this is interesting to you, please contact Alexis Rossi (alexis at archive).

-brewster
kkrugler

November 10th, 2009 at 1:39 pm

Hi Brewster,

Thanks for the offer to host the crawl at the Internet Archive.

I’ll take you up on that offer, and follow up with Alexis.

— Ken
BotchagalupeMarks for November 13th - 11:23 | IT Management and Cloud Blog

November 13th, 2009 at 11:26 pm

[…] Announcing the Public Terabyte Dataset project « Elastic Web Mining | Bixolabs – This is a high quality crawl of top web sites, using AWS’s Elastic Map Reduce, Concurrent’s Cascading workflow API, and Bixolab’s elastic web mining platform. […]
Ian Upright

March 11th, 2011 at 12:27 am

Any news on the status of this crawl? Any update on when it might be available? Thanks.
kkrugler

March 16th, 2011 at 5:12 pm

Hi Ian,

Unfortunately, the crawl is still tied to a pending release of some new functionality by Amazon, so we’re still in a (very long) holding pattern. But there’s some light at the end of the tunnel…

— Ken
