SimpleDB Tap for Cascading

March 16, 2010

Recently we’ve been running a number of large, multi-phase web mining applications in Amazon’s EC2 & Elastic MapReduce (EMR), and we needed a better way to maintain state than pushing sequence files back and forth between HDFS and S3. One option was to set up an HBase cluster, but then we’d be paying 24×7 for servers that we’d only need for a few minutes each day. We could also set more…

Announcing the Public Terabyte Dataset project

November 1, 2009

We’re very excited to announce the Public Terabyte Dataset project. This is a high-quality crawl of top web sites, using AWS’s Elastic MapReduce, Concurrent’s Cascading workflow API, and Bixo Labs’ elastic web mining platform. Hosting for the resulting dataset will be provided by Amazon in S3, and the dataset will be freely available to all EC2 users. In addition, the code used to create and process the dataset will be available for download more…